
Robust Principal Component Analysis?

EMMANUEL J. CANDÈS and XIAODONG LI, Stanford University


YI MA, University of Illinois at Urbana-Champaign, Microsoft Research Asia
JOHN WRIGHT, Microsoft Research Asia

This article is about a curious phenomenon. Suppose we have a data matrix, which is the superposition of a
low-rank component and a sparse component. Can we recover each component individually? We prove that
under some suitable assumptions, it is possible to recover both the low-rank and the sparse components
exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible
decompositions, simply minimize a weighted combination of the nuclear norm and of the ℓ1 norm. This sug-
gests the possibility of a principled approach to robust principal component analysis since our methodology
and results assert that one can recover the principal components of a data matrix even though a positive
fraction of its entries are arbitrarily corrupted. This extends to the situation where a fraction of the entries
are missing as well. We discuss an algorithm for solving this optimization problem, and present applications
in the area of video surveillance, where our methodology allows for the detection of objects in a cluttered
background, and in the area of face recognition, where it offers a principled way of removing shadows and
specularities in images of faces.
Categories and Subject Descriptors: G.1.6 [Mathematics of Computing]: Numerical Analysis—Convex
optimization; G.3 [Mathematics of Computing]: Probability and Statistics—Robust regression
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Principal components, robustness vis-a-vis outliers, nuclear-norm mini-
mization, ℓ1-norm minimization, duality, low-rank matrices, sparsity, video surveillance
ACM Reference Format:
Candès, E. J., Li, X., Ma, Y., and Wright, J. 2011. Robust principal component analysis? J. ACM 58, 3,
Article 11 (May 2011), 37 pages.
DOI = 10.1145/1970392.1970395 http://doi.acm.org/10.1145/1970392.1970395

1. INTRODUCTION
1.1. Motivation
Suppose we are given a large data matrix M, and know that it may be decomposed as

M = L0 + S0 ,

E. J. Candès was supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749 and by the Waterman
Award from NSF. Y. Ma was partially supported by the grants NSF IIS 08-49292, NSF ECCS 07-01676, and
ONR N00014-09-1-0230.
Authors’ addresses: E. J. Candès and X. Li, Departments of Mathematics and Statistics, Stanford University,
450 Serra Mall, Building 380, Stanford, CA 94305; email: {candes, xdil1985}@stanford.edu; Y. Ma, Depart-
ment of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 145 Coordinated
Science Laboratory, 1308 West Main Street, Urbana, IL61801, and Visual Computing Group, Microsoft Re-
search Asia, Building 2, No. 5 Dan Ling Street, Beijing, China 100080; email: [email protected]; J. Wright,
Visual Computing Group, Microsoft Research Asia, Building 2, No. 5 Dan Ling Street, Beijing, China 100080;
email: [email protected].


where L0 has low rank and S0 is sparse; here, both components are of arbitrary magni-
tude. We do not know the low-dimensional column and row space of L0 , not even their
dimension. Similarly, we do not know the locations of the nonzero entries of S0 , not
even how many there are. Can we hope to recover the low-rank and sparse components
both accurately (perhaps even exactly) and efficiently?
A provably correct and scalable solution to the above problem would presumably
have an impact on today’s data-intensive process of scientific discovery.1 The recent
explosion of massive amounts of high-dimensional data in science, engineering, and
society presents a challenge as well as an opportunity to many areas such as image,
video, multimedia processing, web relevancy data analysis, search, biomedical imaging
and bioinformatics. In such application domains, data now routinely lie in thousands
or even billions of dimensions, with a number of samples sometimes of the same order
of magnitude.
To alleviate the curse of dimensionality and scale,2 we must leverage the fact
that such data have low intrinsic dimensionality, for example, that they lie on some
low-dimensional subspace [Eckart and Young 1936], are sparse in some basis [Chen
et al. 2001], or lie on some low-dimensional manifold [Tenenbaum et al. 2000; Belkin
and Niyogi 2003]. Perhaps the simplest and most useful assumption is that the data
all lie near some low-dimensional subspace. More precisely, this says that if we stack
all the data points as column vectors of a matrix M, the matrix should (approximately)
have low rank: mathematically,
M = L0 + N0 ,
where L0 has low-rank and N0 is a small perturbation matrix. Classical Principal
Component Analysis (PCA) [Hotelling 1933; Eckart and Young 1936; Jolliffe 1986]
seeks the best (in an ℓ2 sense) rank-k estimate of L0 by solving

    minimize    ‖M − L‖
    subject to  rank(L) ≤ k.

(Throughout this article, ‖M‖ denotes the 2-norm; that is, the largest singular value
of M.) This problem can be efficiently solved via the singular value decomposition
(SVD) and enjoys a number of optimality properties when the noise N0 is small and
independent and identically distributed Gaussian.
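As an aside (not part of the original text), the classical rank-k estimate above is simply a truncated SVD. A minimal numpy sketch follows; the function name and the synthetic test data are our own illustration.

```python
import numpy as np

def best_rank_k_approximation(M, k):
    """Best rank-k estimate of M in the 2-norm (and Frobenius) sense via the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only the k largest singular values/vectors (Eckart-Young).
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Tiny usage example: a low-rank matrix plus a small Gaussian perturbation N0.
rng = np.random.default_rng(0)
n, r = 50, 3
L0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
M = L0 + 0.01 * rng.standard_normal((n, n))
L_hat = best_rank_k_approximation(M, r)
print(np.linalg.norm(L_hat - L0, 2) / np.linalg.norm(L0, 2))  # small relative error
```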
Robust PCA. PCA is arguably the most widely used statistical tool for data analysis
and dimensionality reduction today. However, its brittleness with respect to grossly
corrupted observations often puts its validity in jeopardy—a single grossly corrupted
entry in M could render the estimated L̂ arbitrarily far from the true L0 . Unfortunately,
gross errors are now ubiquitous in modern applications such as image processing, web
data analysis, and bioinformatics, where some measurements may be arbitrarily cor-
rupted (due to occlusions, malicious tampering, or sensor failures) or simply irrelevant
to the low-dimensional structure we seek to identify. A number of natural approaches
to robustifying PCA have been explored and proposed in the literature over several
decades. The representative approaches include influence function techniques [Huber
1981; Torre and Black 2003], multivariate trimming [Gnanadesikan and Kettenring
1972], alternating minimization [Ke and Kanade 2005], and random sampling tech-
niques [Fischler and Bolles 1981]. Unfortunately, none of these approaches yields a

1 Data-intensive computing is advocated by Jim Gray as the fourth paradigm for scientific discovery [Hey
et al. 2009].
2 We refer either to the complexity of algorithms, which increases drastically as dimension increases, or to
their performance, which decreases sharply as scale goes up.


polynomial-time algorithm with strong performance guarantees under broad conditions.3
The problem we study here can be considered an idealized version of Robust
PCA, in which we aim to recover a low-rank matrix L0 from highly corrupted measure-
ments M = L0 + S0 . Unlike the small noise term N0 in classical PCA, the entries in S0
can have arbitrarily large magnitude, and their support is assumed to be sparse but
unknown.4 Motivated by a different set of applications, this problem was also investi-
gated by Chandrasekaran et al. [2009], who also base their work on the formulation
(1.1). Their work was carried out independently of ours, while this manuscript was in
preparation. Their results are of a somewhat different nature; see Section 1.5 for a
detailed explanation.

Applications. There are many important applications in which the data under study
can naturally be modeled as a low-rank plus a sparse contribution. All the statisti-
cal applications, in which robust principal components are sought, of course fit our
model. We give examples inspired by contemporary challenges in computer science,
and note that depending on the applications, either the low-rank component or the
sparse component could be the object of interest.

—Video Surveillance. Given a sequence of surveillance video frames, we often need to
identify activities that stand out from the background. If we stack the video frames
as columns of a matrix M, then the low-rank component L0 naturally corresponds to
the stationary background and the sparse component S0 captures the moving objects
in the foreground. However, each image frame has thousands or tens of thousands of
pixels, and each video fragment contains hundreds or thousands of frames. It would
be impossible to decompose M in such a way unless we have a truly scalable solution
to this problem. In Section 4, we will show the results of our algorithm on video
decomposition.
—Face Recognition. It is well known that images of a convex, Lambertian surface un-
der varying illuminations span a low-dimensional subspace [Basri and Jacobs 2003].
This fact has been a main reason why low-dimensional models are mostly effective
for imagery data. In particular, images of a human’s face can be well-approximated
by a low-dimensional subspace. Being able to correctly retrieve this subspace is cru-
cial in many applications such as face recognition and alignment. However, realistic
face images often suffer from self-shadowing, specularities, or saturations in bright-
ness, which make this a difficult task and subsequently compromise the recognition
performance. In Section 4, we will show how our method is able to effectively remove
such defects in face images.
—Latent Semantic Indexing. Web search engines often need to analyze and index
the content of an enormous corpus of documents. A popular scheme is Latent
Semantic Indexing (LSI) [Deerwester et al. 1990; Papadimitriou et al. 2000]. The basic
idea is to gather a document-versus-term matrix M whose entries typically encode
the relevance of a term (or a word) to a document, such as the frequency with which it
appears in the document (e.g., TF/IDF). PCA (or SVD) has traditionally been used to
decompose the matrix as a low-rank part plus a residual, which is not necessarily
sparse (as we would like). If we were able to decompose M as a sum of a low-rank
component L0 and a sparse component S0 , then L0 could capture common words used

3 Random sampling approaches guarantee near-optimal estimates, but have complexity exponential in the
rank of the matrix L0 . Trimming algorithms have comparatively lower computational complexity, but guar-
antee only locally optimal solutions.
4 The unknown support of the errors makes the problem more difficult than the matrix completion problem
that has been recently much studied.


in all the documents while S0 captures the few key words that best distinguish each
document from others.
—Ranking and Collaborative Filtering. The problem of anticipating user tastes is gain-
ing increasing importance in online commerce and advertisement. Companies now
routinely collect user rankings for various products, for example, movies, books,
games, or web tools, among which the Netflix Prize for movie ranking is the best
known [Netflix, Inc.]. The problem is to use incomplete rankings provided by the
users on some of the products to predict the preference of any given user on any of
the products. This problem is typically cast as a low-rank matrix completion prob-
lem. However, as the data collection process often lacks control or is sometimes even
ad hoc—a small portion of the available rankings could be noisy and even tampered
with. The problem is more challenging since we need to simultaneously complete the
matrix and correct the errors. That is, we need to infer a low-rank matrix L0 from a
set of incomplete and corrupted entries. In Section 1.6, we will see how our results
can be extended to this situation.
Similar problems also arise in many other applications such as graphical model
learning, linear system identification, and coherence decomposition in optical systems,
as discussed in Chandrasekaran et al. [2009]. All in all, the new applications we have
listed above require solving the low-rank and sparse decomposition problem for matri-
ces of extremely high dimension and under broad conditions, a goal this article aims to
achieve.
1.2. Our Message
At first sight, the separation problem seems impossible to solve since the number
of unknowns to infer for L0 and S0 is twice the number of given measurements in
M ∈ Rn1×n2. Furthermore, it seems even more daunting that we expect to reliably
obtain the low-rank matrix L0 with errors in S0 of arbitrarily large magnitude.
In this article, we are going to see that not only can this problem be solved, it can be
solved by tractable convex optimization. Let ‖M‖∗ := ∑i σi(M) denote the nuclear norm
of the matrix M, that is, the sum of the singular values of M, and let ‖M‖1 = ∑ij |Mij|
denote the ℓ1-norm of M seen as a long vector in Rn1×n2. Then we will show that under
rather weak assumptions, the Principal Component Pursuit (PCP) estimate solving5
    minimize    ‖L‖∗ + λ‖S‖1
    subject to  L + S = M                                               (1.1)
exactly recovers the low-rank L0 and the sparse S0 . Theoretically, this is guaranteed
to work even if the rank of L0 grows almost linearly in the dimension of the matrix,
and the errors in S0 are up to a constant fraction of all entries. Algorithmically, we
will see that this problem can be solved by efficient and scalable algorithms, at a
cost not so much higher than the classical PCA. Empirically, our simulations and
experiments suggest this works under surprisingly broad conditions for many types
of real data. The approach (1.1) was first studied by Chandrasekaran et al. [2009]
who released their findings during the preparation of this article. Their work was
motivated by applications in system identification and learning of graphical models. In
contrast, this article is motivated by robust principal component computations in high-
dimensional settings when there are erroneous and missing entries; missing entries
are not considered in Chandrasekaran et al. [2009]. Hence, the assumptions and results
of this article will be significantly different from those in Chandrasekaran et al. [2009].
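To make the program (1.1) concrete, here is a small sketch (ours, not part of the original paper) of how it can be posed with an off-the-shelf convex solver; this direct formulation is only suitable for modest sizes, and Section 5 discusses the scalable augmented Lagrangian method actually used in the experiments.

```python
import cvxpy as cp
import numpy as np

def pcp(M, lam=None):
    """Principal Component Pursuit (1.1) posed directly as a convex program (a sketch)."""
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))        # the universal choice discussed above
    L = cp.Variable((n1, n2))
    S = cp.Variable((n1, n2))
    objective = cp.Minimize(cp.norm(L, "nuc") + lam * cp.sum(cp.abs(S)))
    constraints = [L + S == M]
    cp.Problem(objective, constraints).solve()
    return L.value, S.value
```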

5 Although the name naturally suggests an emphasis on the recovery of the low-rank component, we reiterate
that in some applications, the sparse component truly is the object of interest.


We will comment more thoroughly on the results of Chandrasekaran et al. [2009] in


Section 1.5 after introducing ours in Section 1.3.
1.3. When Does Separation Make Sense?
A normal reaction is that the objectives of this article cannot be met. Indeed, there
seems to not be enough information to perfectly disentangle the low-rank and the
sparse components. And indeed, there is some truth to this, since there obviously is an
identifiability issue. For instance, suppose the matrix M is equal to e1 e1∗ (this matrix
has a one in the top left corner and zeros everywhere else). Then since M is both sparse
and low-rank, how can we decide whether it is low-rank or sparse? To make the problem
meaningful, we need to impose that the low-rank component L0 is not sparse. In this
article, we will borrow the general notion of incoherence introduced in Candès and
Recht [2009] for the matrix completion problem; this is an assumption concerning the
singular vectors of the low-rank component. Write the singular value decomposition of
L0 ∈ Rn1×n2 as

    L0 = UΣV∗ = ∑i=1..r σi ui vi∗,

where r is the rank of the matrix, σ1 , . . . , σr are the positive singular values, and
U = [u1 , . . . , ur ], V = [v1 , . . . , vr ] are the matrices of left- and right-singular vectors.
Then, the incoherence condition with parameter μ states that
    maxi ‖U∗ei‖² ≤ μr/n1,    maxi ‖V∗ei‖² ≤ μr/n2,                      (1.2)

and

    ‖UV∗‖∞ ≤ √(μr/(n1 n2)).                                             (1.3)
Here, ‖M‖∞ = maxi,j |Mij| is the ℓ∞ norm of M seen as a long vector. Note that
since the orthogonal projection PU onto the column space of U is given by PU = UU∗,
(1.2) is equivalent to maxi ‖PU ei‖² ≤ μr/n1, and similarly for PV. As discussed in
Candès and Recht [2009], Candès and Tao [2010], and Gross [2011], the incoherence
condition asserts that for small values of μ, the singular vectors are reasonably spread
out—in other words, not sparse.
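For readers who want to inspect this condition numerically, a minimal numpy sketch (ours) of the smallest μ satisfying (1.2) is given below; the joint condition (1.3) can be checked analogously through ‖UV∗‖∞. The function name and tolerance are our own choices.

```python
import numpy as np

def incoherence_mu(L, tol=1e-10):
    """Smallest mu satisfying (1.2) for a given low-rank matrix L (a sketch)."""
    n1, n2 = L.shape
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))                  # numerical rank
    U, V = U[:, :r], Vt[:r, :].T
    # max_i ||U* e_i||^2 is the largest squared row norm of U, similarly for V.
    mu_U = n1 / r * np.max(np.sum(U**2, axis=1))
    mu_V = n2 / r * np.max(np.sum(V**2, axis=1))
    return max(mu_U, mu_V)
```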
Another identifiability issue arises if the sparse matrix has low rank. This will occur
if, say, all the nonzero entries of S occur in a column or in a few columns. Suppose for
instance, that the first column of S0 is the opposite of that of L0 , and that all the other
columns of S0 vanish. Then it is clear that we would not be able to recover L0 and S0
by any method whatsoever since M = L0 + S0 would have a column space equal to, or
included in that of L0 . To avoid such meaningless situations, we will assume that the
sparsity pattern of the sparse component is selected uniformly at random.
1.4. Main Result
The surprise is that, under these minimal assumptions, the simple PCP solution per-
fectly recovers the low-rank and the sparse components, provided, of course, that the
rank of the low-rank component is not too large, and that the sparse component is
reasonably sparse. Throughout this article, n(1) = max(n1 , n2 ) and n(2) = min(n1 , n2 ).
THEOREM 1.1. Suppose L0 is n × n and obeys (1.2)–(1.3). Fix any n × n matrix Σ of
signs. Suppose that the support set Ω of S0 is uniformly distributed among all sets of
cardinality m, and that sgn([S0]ij) = Σij for all (i, j) ∈ Ω. Then, there is a numerical
constant c such that with probability at least 1 − cn−10 (over the choice of support of S0),
Principal Component Pursuit (1.1) with λ = 1/√n is exact, that is, L̂ = L0 and Ŝ = S0,
provided that

    rank(L0) ≤ ρr n μ−1 (log n)−2   and   m ≤ ρs n².                    (1.4)

In this equation, ρr and ρs are positive numerical constants. In the general rectangular
case, where L0 is n1 × n2, PCP with λ = 1/√n(1) succeeds with probability at least
1 − cn(1)−10, provided that rank(L0) ≤ ρr n(2) μ−1 (log n(1))−2 and m ≤ ρs n1 n2.

In other words, matrices L0 whose singular vectors—or principal components—are
reasonably spread can be recovered with probability nearly one from arbitrary and
completely unknown corruption patterns (as long as these are randomly distributed).
In fact, this works for large values of the rank, that is, on the order of n/(log n)2 when
μ is not too large. We would like to emphasize that the only “piece of randomness” in
our assumptions concerns the locations of the nonzero entries of S0 ; everything else
is deterministic. In particular, all we require about L0 is that its singular vectors are
not spiky. Also, we make no assumption about the magnitudes or signs of the nonzero
entries of S0. To avoid any ambiguity, our model for S0 is this: take an arbitrary matrix
S and set to zero its entries on the random set Ωc; this gives S0.
A rather remarkable fact is that there is no tuning parameter in our algorithm.
Under the assumption of the theorem, minimizing
    ‖L‖∗ + (1/√n(1)) ‖S‖1,    n(1) = max(n1, n2),

always returns the correct answer. This is surprising because one might have expected
that one would have to choose the right scalar λ to balance the two terms in ‖L‖∗ + λ‖S‖1
appropriately (perhaps depending on their relative size). This is, however, clearly not
the case. In this sense, the choice λ = 1/√n(1) is universal. Further, it is not a priori
very clear why λ = 1/√n(1) is a correct choice no matter what L0 and S0 are. It is the
mathematical analysis which reveals the correctness of this value. In fact, the proof of
the theorem gives a whole range of correct values, and we have selected a sufficiently
simple value in that range.
Another comment is that one can obtain results with larger probabilities of success,
that is, of the form 1 − O(n−β) (or 1 − O(n(1)−β)) for β > 0, at the expense of reducing the
value of ρr.

1.5. Connections with Prior Work and Innovations


The last year or two have seen the rapid development of a scientific literature con-
cerned with the matrix completion problem introduced in Candès and Recht [2009],
see also Candès and Tao [2010], Candès and Plan [2010], Keshavan et al. [2010], Gross
et al. [2010], and Gross [2011] and the references therein. In a nutshell, the matrix
completion problem is that of recovering a low-rank matrix from only a small fraction
of its entries, and by extension, from a small number of linear functionals. Although
other methods have been proposed [Keshavan et al. 2010], the method of choice is to
use convex optimization [Candès and Tao 2010; Candès and Plan 2010; Gross et al.
2010; Gross 2011; Recht et al. 2010]: among all the matrices consistent with the data,
simply find that with minimum nuclear norm. These articles all prove the mathemat-
ical validity of this approach, and our mathematical analysis borrows ideas from this
literature, and especially from those pioneered in Candès and Recht [2009]. Our meth-
ods also much rely on the powerful ideas and elegant techniques introduced by David
Gross in the context of quantum-state tomography [Gross et al. 2010; Gross 2011]. In


particular, the clever golfing scheme [Gross 2011] plays a crucial role in our analysis,
and we introduce two novel modifications to this scheme.
Despite these similarities, our ideas depart from the literature on matrix completion
on several fronts. First, our results obviously are of a different nature. Second, we
could think of our separation problem, and the recovery of the low-rank component, as
a matrix completion problem. Indeed, instead of having a fraction of observed entries
available and the other missing, we have a fraction available, but do not know which
one, while the other is not missing but entirely corrupted altogether. Although this
is a harder problem, one way to think of our algorithm is that it simultaneously de-
tects the corrupted entries, and perfectly fits the low-rank component to the remaining
entries that are deemed reliable. In this sense, our methodology and results go be-
yond matrix completion. Third, we introduce a novel derandomization argument that
allows us to fix the signs of the nonzero entries of the sparse component. We believe
that this technique will have many applications. One such application is in the area
of compressive sensing, where assumptions about the randomness of the signs of a
signal are common, and merely made out of convenience rather than necessity; this is
important because assuming independent signal signs may not make much sense for
many practical applications when the involved signals can all be nonnegative (such as
images).
We mentioned earlier the related work [Chandrasekaran et al. 2009], which also
considers the problem of decomposing a given data matrix into sparse and low-rank
components, and gives sufficient conditions for convex programming to succeed. These
conditions are phrased in terms of two quantities. The first is the maximum ratio
between the ℓ∞ norm and the operator norm, restricted to the subspace generated
by matrices whose row or column spaces agree with those of L0. The second is the
maximum ratio between the operator norm and the ℓ∞ norm, restricted to the subspace
of matrices that vanish off the support of S0 . Chandrasekaran et al. [2009] show that
when the product of these two quantities is small, then the recovery is exact for a
certain interval of the regularization parameter.
One very appealing aspect of this condition is that it is completely deterministic: it
does not depend on any random model for L0 or S0 . It yields a corollary that can be
easily compared to our result: suppose n1 = n2 = n for simplicity, and let μ0 be the
smallest quantity satisfying (1.2), then correct recovery occurs whenever

    maxj #{i : [S0]ij ≠ 0} × √(μ0 r/n) < 1/12.

The left-hand side is at least as large as ρs √(μ0 r n), where ρs is the fraction of entries
of S0 that are nonzero. Since μ0 ≥ 1 always, this statement only guarantees recovery
if ρs = O((nr)−1/2); that is, even when rank(L0) = O(1), only vanishing fractions of the
entries in S0 can be nonzero.
In contrast, our result shows that for incoherent L0 , correct recovery occurs with
high probability for rank(L0 ) on the order of n/[μ log2 n] and a number of nonzero
entries in S0 on the order of n2 . That is, matrices of large rank can be recovered from
non-vanishing fractions of sparse errors. This improvement comes at the expense of
introducing one piece of randomness: a uniform model on the error support.
A difference with the results in Chandrasekaran et al. [2009] is that our analysis
leads to the conclusion that a single universal value of λ, namely λ = 1/√n, works with
high probability for recovering any low-rank, incoherent matrix. In Chandrasekaran
et al. [2009], the parameter λ is data-dependent, and may have to be selected by solving
a number of convex programs. The distinction between our results and Chandrasekaran
et al. [2009] is a consequence of differing assumptions about the origin of the data
matrix M. We regard the universality of λ in our analysis as an advantage, since it may


provide useful guidance in practical circumstances where the generative model for M
is not completely known.
1.6. Implications for Matrix Completion from Grossly Corrupted Data
We have seen that our main result asserts that it is possible to recover a low-rank
matrix even though a significant fraction of its entries are corrupted. In some applica-
tions, however, some of the entries may be missing as well, and this section addresses
this situation. Let PΩ be the orthogonal projection onto the linear space of matrices
supported on Ω ⊂ [n1] × [n2],

    [PΩ X]ij = Xij   if (i, j) ∈ Ω,
               0     if (i, j) ∉ Ω.
Then, imagine we only have available a few entries of L0 + S0, which we conveniently
write as

    Y = PΩobs(L0 + S0) = PΩobs L0 + PΩobs S0;

that is, we see only those entries (i, j) ∈ Ωobs ⊂ [n1] × [n2]. This models the following
problem: we wish to recover L0 but only see a few entries of L0, and among those a
fraction happens to be corrupted, and we of course do not know which ones. As is easily
seen, this is a significant extension of the matrix completion problem, which seeks to
recover L0 from undersampled but otherwise perfect data Pobs L0 .
We propose recovering L0 by solving the following problem:
    minimize    ‖L‖∗ + λ‖S‖1
    subject to  PΩobs(L + S) = Y.                                       (1.5)
In words, among all decompositions matching the available data, Principal Component
Pursuit finds the one that minimizes the weighted combination of the nuclear norm and
of the ℓ1 norm. Our observation is that under some conditions, this simple approach
recovers the low-rank component exactly. In fact, the techniques developed in this
article establish this result:
THEOREM 1.2. Suppose L0 is n × n and obeys the conditions (1.2)–(1.3), and that Ωobs
is uniformly distributed among all sets of cardinality m obeying m = 0.1n². Suppose
for simplicity that each observed entry is corrupted with probability τ independently
of the others. Then, there is a numerical constant c such that with probability at least
1 − cn−10, Principal Component Pursuit (1.5) with λ = 1/√(0.1n) is exact, that is, L̂ = L0,
provided that

    rank(L0) ≤ ρr n μ−1 (log n)−2   and   τ ≤ τs.                       (1.6)

In this equation, ρr and τs are positive numerical constants. For general n1 × n2 rectan-
gular matrices, PCP with λ = 1/√(0.1 n(1)) succeeds from m = 0.1 n1 n2 corrupted entries
with probability at least 1 − cn(1)−10, provided that rank(L0) ≤ ρr n(2) μ−1 (log n(1))−2.

In short, perfect recovery from incomplete and corrupted entries is possible by convex
optimization.
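For concreteness, here is a small sketch (ours) of how the partially observed program (1.5) could be posed with a convex solver; the mask construction and function name are our own illustration, again suitable only for small problems.

```python
import cvxpy as cp
import numpy as np

def pcp_from_partial_data(Y, obs_mask, lam):
    """PCP with missing data, program (1.5): a small-scale sketch (ours).

    Y        : matrix holding the observed values of L0 + S0 (other entries ignored),
    obs_mask : boolean array, True on Omega_obs.
    """
    n1, n2 = Y.shape
    mask = obs_mask.astype(float)
    L = cp.Variable((n1, n2))
    S = cp.Variable((n1, n2))
    objective = cp.Minimize(cp.norm(L, "nuc") + lam * cp.sum(cp.abs(S)))
    # Enforce agreement with the data only on the observed set Omega_obs.
    constraints = [cp.multiply(mask, L + S) == cp.multiply(mask, Y)]
    cp.Problem(objective, constraints).solve()
    return L.value, S.value
```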
On the one hand, this result extends our previous result in the following way. If
all the entries are available, that is, m = n1 n2 , then this is Theorem 1.1. On the
other hand, it extends matrix completion results. Indeed, if τ = 0, we have a pure
matrix completion problem from about a fraction of the total number of entries, and
our theorem guarantees perfect recovery as long as r obeys (1.6), which, for large
values of r, matches the strongest results available. We remark that the recovery is
exact, however, via a different algorithm. To be sure, in matrix completion one typically


minimizes the nuclear norm ‖L‖∗ subject to the constraint PΩobs L = PΩobs L0. Here, our
program would solve

    minimize    ‖L‖∗ + λ‖S‖1
    subject to  PΩobs(L + S) = PΩobs L0,                                (1.7)
and return L̂ = L0 , Ŝ = 0! In this context, Theorem 1.2 proves that matrix completion
is stable vis-a-vis gross errors.
Remark. We have stated Theorem 1.2 merely to explain how our ideas can easily
be adapted to deal with low-rank matrix recovery problems from undersampled and
possibly grossly corrupted data. In our statement, we have chosen to see 10% of the
entries but, naturally, similar results hold for all other positive fractions provided that
they are large enough. We would like to make it clear that a more careful study is likely
to lead to a stronger version of Theorem 1.2. In particular, for very low rank matrices,
we expect to see similar results holding with far fewer observations; that is, in the limit
of large matrices, from a decreasing fraction of entries. In fact, our techniques would
already establish such sharper results but we prefer not to dwell on such refinements
at the moment, and leave this up for future work.
1.7. Notation
We provide a brief summary of the notations used throughout this article. We shall
use five norms of a matrix. The first three are functions of the singular values and
they are: (1) the operator norm or 2-norm denoted by ‖X‖; (2) the Frobenius norm
denoted by ‖X‖F; and (3) the nuclear norm denoted by ‖X‖∗. The last two are the
ℓ1 and ℓ∞ norms of a matrix seen as a long vector, and are denoted by ‖X‖1 and
‖X‖∞ respectively. The Euclidean inner product between two matrices is defined by the
formula ⟨X, Y⟩ := trace(X∗Y), so that ‖X‖F² = ⟨X, X⟩.
Further, we will also manipulate linear transformations that act on the space of
matrices, and we will use calligraphic letters for these operators as in PΩ X. We shall
also abuse notation by also letting Ω be the linear space of matrices supported on Ω.
Then, PΩ⊥ denotes the projection onto the space of matrices supported on Ωc so that
I = PΩ + PΩ⊥, where I is the identity operator. We will consider a single norm for these,
namely, the operator norm (the top singular value) denoted by ‖A‖, which we may want
to think of as ‖A‖ = sup{‖X‖F=1} ‖AX‖F; for instance, ‖PΩ‖ = 1 whenever Ω ≠ ∅.
1.8. Organization of the Article
The article is organized as follows. In Section 2, we provide the key steps in the proof
of Theorem 1.1. This proof depends on two critical properties of dual certificates,
which are established in the separate Section 3. The reason why this is separate is
that in a first reading, the reader might want to jump to Section 4, which presents ap-
plications to video surveillance and computer vision. Section 5 introduces algorithmic
ideas to find the Principal Component Pursuit solution when M is of very large scale.
We conclude this article with a discussion about future research directions in Section 6.
Finally, the proof of Theorem 1.2 is in Appendix A together with those of intermediate
results.
2. ARCHITECTURE OF THE PROOF
This section introduces the key steps underlying the proof of our main result, The-
orem 1.1. We will prove the result for square matrices for simplicity, and write
n = n1 = n2 . Of course, we shall indicate where the argument needs to be modified
to handle the general case. Before we start, it is helpful to review some basic concepts
and introduce additional notation that shall be used throughout. For a given scalar x,


we denote by sgn(x) the sign of x, which we take to be zero if x = 0. By extension, sgn(S)
is the matrix whose entries are the signs of those of S. We recall that any subgradient
of the ℓ1 norm at S0, supported on Ω, is of the form

    sgn(S0) + F,

where F vanishes on Ω, that is, PΩ F = 0, and obeys ‖F‖∞ ≤ 1.
We will also manipulate the set of subgradients of the nuclear norm. From now on,
we will assume that L0 of rank r has the singular value decomposition UΣV∗, where
U, V ∈ Rn×r just as in Section 1.3. Then, any subgradient of the nuclear norm at L0 is
of the form

    UV∗ + W,

where U∗W = 0, WV = 0 and ‖W‖ ≤ 1. Denote by T the linear space of matrices

    T := {UX∗ + YV∗, X, Y ∈ Rn×r},                                      (2.1)
and by T⊥ its orthogonal complement. It is not hard to see that taken together, U∗W = 0
and WV = 0 are equivalent to PT W = 0, where PT is the orthogonal projection onto
T. Another way to put this is PT⊥ W = W. In passing, note that for any matrix M,
PT⊥ M = (I − UU∗)M(I − VV∗), where we recognize that I − UU∗ is the projection
onto the orthogonal complement of the linear space spanned by the columns of U,
and likewise for I − VV∗. A consequence of this simple observation is that for any
matrix M, ‖PT⊥ M‖ ≤ ‖M‖, a fact that we will use several times in the sequel. Another
consequence is that for any matrix of the form ei ej∗,

    ‖PT⊥ ei ej∗‖F² = ‖(I − UU∗)ei‖² ‖(I − VV∗)ej‖² ≥ (1 − μr/n)²,

where we have assumed μr/n ≤ 1. Since ‖PT ei ej∗‖F² + ‖PT⊥ ei ej∗‖F² = 1, this gives

    ‖PT ei ej∗‖F ≤ √(2μr/n).                                            (2.2)

For rectangular matrices, the estimate is ‖PT ei ej∗‖F ≤ √(2μr/min(n1, n2)).
Finally, in the sequel we will write that an event holds with high or large probability
whenever it holds with probability at least 1 − O(n−10 ) (with n(1) in place of n for
rectangular matrices).
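Since the projections PΩ, PT and PT⊥ are used repeatedly in what follows, here is a small numpy sketch (ours) of these operators, assuming U and V have orthonormal columns; it simply encodes the identities stated above, PT⊥ M = (I − UU∗)M(I − VV∗) and PT = I − PT⊥.

```python
import numpy as np

def P_Omega(X, mask):
    """Projection onto matrices supported on Omega (mask is a boolean array)."""
    return np.where(mask, X, 0.0)

def P_T(M, U, V):
    """Projection onto T = {U X* + Y V*}: P_T M = UU*M + M VV* - UU*M VV*."""
    PU, PV = U @ U.T, V @ V.T
    return PU @ M + M @ PV - PU @ M @ PV

def P_Tperp(M, U, V):
    """Projection onto the orthogonal complement of T: (I - UU*) M (I - VV*)."""
    n1, n2 = M.shape
    return (np.eye(n1) - U @ U.T) @ M @ (np.eye(n2) - V @ V.T)
```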

2.1. An Elimination Theorem


We begin with a useful definition and an elementary result we shall use a few times.
Definition 2.1. We will say that S' is a trimmed version of S if supp(S') ⊂ supp(S)
and S'ij = Sij whenever S'ij ≠ 0.
In other words, a trimmed version of S is obtained by setting some of the entries of
S to zero. Having said this, the following intuitive theorem asserts that if Principal
Component Pursuit correctly recovers the low-rank and sparse components of M0 =
L0 + S0, it also correctly recovers the components of a matrix M0' = L0 + S0', where S0' is
a trimmed version of S0. This is intuitive since the problem is somehow easier as there
are fewer things to recover.
THEOREM 2.2. Suppose the solution to (1.1) with input data M0 = L0 + S0 is unique
and exact, and consider M0' = L0 + S0', where S0' is a trimmed version of S0. Then, the
solution to (1.1) with input M0' is exact as well.


PROOF. Write S0' = PΩ0 S0 for some Ω0 ⊂ [n] × [n] and let (L̂, Ŝ) be the solution of (1.1)
with input L0 + S0'. Then

    ‖L̂‖∗ + λ‖Ŝ‖1 ≤ ‖L0‖∗ + λ‖PΩ0 S0‖1

and, therefore,

    ‖L̂‖∗ + λ‖Ŝ‖1 + λ‖PΩ0⊥ S0‖1 ≤ ‖L0‖∗ + λ‖S0‖1.

Note that (L̂, Ŝ + PΩ0⊥ S0) is feasible for the problem with input data L0 + S0, and since
‖Ŝ + PΩ0⊥ S0‖1 ≤ ‖Ŝ‖1 + ‖PΩ0⊥ S0‖1, we have

    ‖L̂‖∗ + λ‖Ŝ + PΩ0⊥ S0‖1 ≤ ‖L0‖∗ + λ‖S0‖1.

The right-hand side, however, is the optimal value, and by unicity of the optimal
solution, we must have L̂ = L0, and Ŝ + PΩ0⊥ S0 = S0 or Ŝ = PΩ0 S0 = S0'. This proves
the claim.
The Bernoulli Model. In Theorem 1.1, probability is taken with respect to the uni-
formly random subset Ω = {(i, j) : [S0]ij ≠ 0} of cardinality m. In practice, it is a little
more convenient to work with the Bernoulli model Ω = {(i, j) : δij = 1}, where the δij's
are independent and identically distributed Bernoulli variables taking value one with
probability ρ and zero with probability 1 − ρ, so that the expected cardinality of Ω is
ρn². From now on, we will write Ω ∼ Ber(ρ) as a shorthand for Ω is sampled from the
Bernoulli model with parameter ρ.
Since by Theorem 2.2, the success of the algorithm is monotone in |Ω|, any guarantee
proved for the Bernoulli model holds for the uniform model as well, and vice versa,
if we allow for a vanishing shift in ρ around m/n². The arguments underlying this
equivalence are standard, see Candès et al. [2006] and Candès and Tao [2010], and
may be found in the appendix for completeness.
2.2. Derandomization
In Theorem 1.1, the values of the nonzero entries of S0 are fixed. It turns out that it
is easier to prove the theorem under a stronger assumption, which assumes that the
signs of the nonzero entries are independent symmetric Bernoulli variables, that
is, take the value ±1 with probability 1/2 (independently of the choice of the support
set). The convenient theorem below shows that establishing the result for random
signs is sufficient to claim a similar result for fixed signs.
THEOREM 2.3. Suppose L0 obeys the conditions of Theorem 1.1 and that the locations
of the nonzero entries of S0 follow the Bernoulli model with parameter 2ρs , and the
signs of S0 are independent and identically distributed ±1 as previously stated (and
independent from the locations). Then, if the PCP solution is exact with high probability,
then it is also exact with at least the same probability for the model in which the signs
are fixed and the locations are sampled from the Bernoulli model with parameter ρs .
This theorem is convenient because to prove our main result, we only need to show
that it is true in the case where the signs of the sparse component are random.
PROOF. Consider the model in which the signs are fixed. In this model, it is convenient
to think of S0 as PΩ S, for some fixed matrix S, where Ω is sampled from the Bernoulli
model with parameter ρs. Therefore, S0 has independent components distributed as

    (S0)ij = Sij   with probability ρs,
             0     with probability 1 − ρs.


Consider now a random sign matrix E with independent and identically distributed en-
tries distributed as

    Eij =  1    with probability ρs,
           0    with probability 1 − 2ρs,
          −1    with probability ρs,

and an “elimination” matrix Δ with entries defined by

    Δij = 0   if Eij [sgn(S)]ij = −1,
          1   otherwise.
Note that the entries of Δ are independent since they are functions of independent
variables.
Consider now S0' = Δ ◦ (|S| ◦ E), where ◦ denotes the Hadamard or componentwise
product, so that [S0']ij = Δij (|Sij| Eij). Then, we claim that S0' and S0 have the same
distribution. To see why this is true, it suffices by independence to check that the
marginals match. For Sij ≠ 0, we have

    P([S0']ij = Sij) = P(Δij = 1 and Eij = [sgn(S)]ij)
                     = P(Eij [sgn(S)]ij ≠ −1 and Eij = [sgn(S)]ij)
                     = P(Eij = [sgn(S)]ij) = ρs,

which establishes the claim.
This construction allows us to prove the theorem. Indeed, |S| ◦ E now obeys the
random sign model, and by assumption, PCP recovers |S| ◦ E with high probability. By
the elimination theorem, this program also recovers S0' = Δ ◦ (|S| ◦ E). Since S0' and S0
have the same distribution, the theorem follows.
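As an aside, this equality of marginals is easy to check numerically; below is a small Monte Carlo sketch (ours) for a single fixed entry, with the sample size and the fixed value chosen arbitrarily for illustration.

```python
import numpy as np

# Quick sanity check (ours) that S0 and S0' share the same marginal distribution.
rng = np.random.default_rng(0)
rho_s, trials = 0.2, 200_000
s_fixed = 3.0                                   # a fixed nonzero entry S_ij

# Model 1: fixed value, Bernoulli(rho_s) support.
S0 = np.where(rng.random(trials) < rho_s, s_fixed, 0.0)

# Model 2: E_ij in {+1, 0, -1} w.p. (rho_s, 1 - 2*rho_s, rho_s), then eliminate.
E = rng.choice([1.0, 0.0, -1.0], size=trials, p=[rho_s, 1 - 2 * rho_s, rho_s])
Delta = np.where(E * np.sign(s_fixed) == -1, 0.0, 1.0)
S0_prime = Delta * (abs(s_fixed) * E)

print((S0 == s_fixed).mean(), (S0_prime == s_fixed).mean())   # both approx rho_s
```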

2.3. Dual Certificates


We introduce a simple condition for the pair (L0 , S0 ) to be the unique optimal solution
to Principal Component Pursuit. These conditions, given in the following lemma, are
stated in terms of a dual vector, the existence of which certifies optimality. This lemma
is equivalent to Proposition 2 of Chandrasekaran et al. [2009]; for completeness we
record its statement and proof here. (Recall that Ω is the space of matrices with the
same support as the sparse component S0, and that T is the space defined via the
column and row spaces of the low-rank component L0 (2.1).)
LEMMA 2.4 [CHANDRASEKARAN ET AL. 2009, PROPOSITION 2]. Assume that ‖PΩ PT‖ < 1.
With the standard notations, (L0, S0) is the unique solution if there is a pair (W, F)
obeying

    UV∗ + W = λ(sgn(S0) + F),

with PT W = 0, ‖W‖ < 1, PΩ F = 0 and ‖F‖∞ < 1.
Note that the condition ‖PΩ PT‖ < 1 is equivalent to saying that Ω ∩ T = {0}.
PROOF. We consider a feasible perturbation (L0 + H, S0 − H) and show that the
objective increases whenever H ≠ 0, hence proving that (L0, S0) is the unique solution.
To do this, let UV∗ + W0 be an arbitrary subgradient of the nuclear norm at L0,
and sgn(S0) + F0 be an arbitrary subgradient of the ℓ1-norm at S0. By definition of
subgradients,

    ‖L0 + H‖∗ + λ‖S0 − H‖1 ≥ ‖L0‖∗ + λ‖S0‖1 + ⟨UV∗ + W0, H⟩ − λ⟨sgn(S0) + F0, H⟩.


Now pick W0 such that ⟨W0, H⟩ = ‖PT⊥ H‖∗ and F0 such that ⟨F0, H⟩ = −‖PΩ⊥ H‖1.6
We have

    ‖L0 + H‖∗ + λ‖S0 − H‖1 ≥ ‖L0‖∗ + λ‖S0‖1 + ‖PT⊥ H‖∗ + λ‖PΩ⊥ H‖1 + ⟨UV∗ − λsgn(S0), H⟩.

By assumption,

    |⟨UV∗ − λsgn(S0), H⟩| ≤ |⟨W, H⟩| + λ|⟨F, H⟩| ≤ β(‖PT⊥ H‖∗ + λ‖PΩ⊥ H‖1)

for β = max(‖W‖, ‖F‖∞) < 1 and, thus,

    ‖L0 + H‖∗ + λ‖S0 − H‖1 ≥ ‖L0‖∗ + λ‖S0‖1 + (1 − β)(‖PT⊥ H‖∗ + λ‖PΩ⊥ H‖1).

Since by assumption Ω ∩ T = {0}, we have ‖PT⊥ H‖∗ + λ‖PΩ⊥ H‖1 > 0 unless H = 0.
Hence, we see that to prove exact recovery, it is sufficient to produce a “dual certifi-
cate” W obeying

    W ∈ T⊥,
    ‖W‖ < 1,
    PΩ(UV∗ + W) = λ sgn(S0),                                            (2.3)
    ‖PΩ⊥(UV∗ + W)‖∞ < λ.
Our method, however, will produce with high probability a slightly different certificate.
The idea is to slightly relax the constraint PΩ(UV∗ + W) = λ sgn(S0), a relaxation that
has been introduced in Gross [2011] in a different context. We prove the following
lemma.
LEMMA 2.5. Assume ‖PΩ PT‖ ≤ 1/2 and λ < 1. Then with the same notation, (L0, S0)
is the unique solution if there is a pair (W, F) obeying

    UV∗ + W = λ(sgn(S0) + F + PΩ D)

with PT W = 0 and ‖W‖ ≤ 1/2, PΩ F = 0 and ‖F‖∞ ≤ 1/2, and ‖PΩ D‖F ≤ 1/4.
PROOF. Following the proof of Lemma 2.4, we have

    ‖L0 + H‖∗ + λ‖S0 − H‖1 ≥ ‖L0‖∗ + λ‖S0‖1 + (1/2)(‖PT⊥ H‖∗ + λ‖PΩ⊥ H‖1) − λ⟨PΩ D, H⟩
                           ≥ ‖L0‖∗ + λ‖S0‖1 + (1/2)(‖PT⊥ H‖∗ + λ‖PΩ⊥ H‖1) − (λ/4)‖PΩ H‖F.

Observe now that

    ‖PΩ H‖F ≤ ‖PΩ PT H‖F + ‖PΩ PT⊥ H‖F
            ≤ (1/2)‖H‖F + ‖PT⊥ H‖F
            ≤ (1/2)‖PΩ H‖F + (1/2)‖PΩ⊥ H‖F + ‖PT⊥ H‖F

and, therefore,

    ‖PΩ H‖F ≤ ‖PΩ⊥ H‖F + 2‖PT⊥ H‖F.

In conclusion,

    ‖L0 + H‖∗ + λ‖S0 − H‖1 ≥ ‖L0‖∗ + λ‖S0‖1 + (1/2)((1 − λ)‖PT⊥ H‖∗ + (λ/2)‖PΩ⊥ H‖1),

and the term between parentheses is strictly positive when H ≠ 0.

6 For instance, F0 = −sgn(PΩ⊥ H) is such a matrix. Also, by duality between the nuclear and the operator
norm, there is a matrix W' obeying ‖W'‖ = 1 such that ⟨W', PT⊥ H⟩ = ‖PT⊥ H‖∗, and we just take W0 = PT⊥(W').


As a consequence of Lemma 2.5, it now suffices to produce a dual certificate W
obeying

    W ∈ T⊥,
    ‖W‖ < 1/2,
    ‖PΩ(UV∗ − λ sgn(S0) + W)‖F ≤ λ/4,                                   (2.4)
    ‖PΩ⊥(UV∗ + W)‖∞ < λ/2.
Further, we would like to note that the existing literature on matrix completion [Candès
and Recht 2009] gives good bounds on ‖PΩ PT‖ (see Theorem 2.6 in Section 2.5).

2.4. Dual Certification via the Golfing Scheme


Gross [Gross 2011; Gross et al. 2010] introduces a new scheme, termed the golfing
scheme, to construct a dual certificate for the matrix completion problem, that is, the
problem of reconstructing a low-rank matrix from a subset of its entries. In this section,
we will adapt this clever golfing scheme, with two important modifications, to our
separation problem.
Before we introduce our construction, our model assumes that Ω ∼ Ber(ρ), or equiv-
alently that Ωc ∼ Ber(1 − ρ). Now the distribution of Ωc is the same as that of
Ωc = Ω1 ∪ Ω2 ∪ · · · ∪ Ωj0, where each Ωj follows the Bernoulli model with parame-
ter q, which has an explicit expression. To see this, observe that by independence, we
just need to make sure that any entry (i, j) is selected with the right probability. We
have

    P((i, j) ∈ Ω) = P(Bin(j0, q) = 0) = (1 − q)^j0,

so that the two models are the same if

    ρ = (1 − q)^j0,

hence justifying our assertion. Note that because of overlaps between the Ωj's, q ≥
(1 − ρ)/j0.
We now propose constructing a dual certificate

    W = W^L + W^S,

where each component is as follows:
(1) Construction of W^L via the Golfing Scheme. Fix an integer j0 ≥ 1 whose value shall
be discussed later, and let Ωj, 1 ≤ j ≤ j0, be defined as previously described so that
Ωc = ∪1≤j≤j0 Ωj. Then, starting with Y0 = 0, inductively define

    Yj = Yj−1 + q−1 PΩj PT(UV∗ − Yj−1),

and set

    W^L = PT⊥ Yj0.                                                      (2.5)

This is a variation on the golfing scheme discussed in Gross [2011], which assumes
that the Ωj's are sampled with replacement, and does not use the projector PΩj
but something more complicated taking into account the number of times a specific
entry has been sampled.
(2) Construction of W^S via the Method of Least Squares. Assume that ‖PΩ PT‖ < 1/2.
Then, ‖PΩ PT PΩ‖ < 1/4 and, thus, the operator PΩ − PΩ PT PΩ mapping Ω onto itself
is invertible; we denote its inverse by (PΩ − PΩ PT PΩ)−1. We then set

    W^S = λ PT⊥(PΩ − PΩ PT PΩ)−1 sgn(S0).                               (2.6)

Clearly, an equivalent definition is via the convergent Neumann series



W S = λPT ⊥ (P PT P )ksgn(S0 ). (2.7)
k≥0

Note that P W S = λP (I − PT )(P − P PT P )−1 sgn(S0 ) = λsgn(S0 ). With this, the
construction has a natural interpretation: one can verify that among all matrices
W ∈ T ⊥ obeying P W = λsgn(S0 ), W S is that with minimum Frobenius norm.
Since both W L and W S belong to T ⊥ and P W S = λsgn(S0 ), we will establish that
W L + W S is a valid dual certificate if it obeys

⎨W + W  < 1/2,
L S

P (U V + W L) F ≤ λ/4, (2.8)
⎩ 
P⊥ (U V ∗ + W L + W S )∞ < λ/2.
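To make the two constructions concrete, here is a toy numpy sketch (ours) of the golfing iterations (2.5) and the truncated Neumann series (2.7), using dense projections as in the small sketch of Section 2; the helper names, the explicit masks for Ω and the Ωj's, and the truncation length are our own assumptions, not part of the paper.

```python
import numpy as np

def make_projectors(U, V, mask):
    """Dense P_T and P_Omega built from U, V (orthonormal columns) and a boolean mask."""
    PU, PV = U @ U.T, V @ V.T
    P_T = lambda M: PU @ M + M @ PV - PU @ M @ PV
    P_Om = lambda M: np.where(mask, M, 0.0)
    return P_T, P_Om

def golfing_WL(UV, P_T, masks_j, q):
    """W^L = P_Tperp(Y_j0), Y_j = Y_{j-1} + q^{-1} P_{Omega_j} P_T(UV* - Y_{j-1})."""
    Y = np.zeros_like(UV)
    for mask_j in masks_j:
        Y = Y + np.where(mask_j, P_T(UV - Y), 0.0) / q
    return Y - P_T(Y)                       # apply P_Tperp = I - P_T to Y_j0

def least_squares_WS(sgn_S0, P_T, P_Om, lam, n_terms=50):
    """W^S = lam * P_Tperp( sum_k (P_Om P_T P_Om)^k sgn(S0) ), truncated series."""
    term = P_Om(sgn_S0)                     # k = 0 term (sgn(S0) is supported on Omega)
    total = term.copy()
    for _ in range(n_terms):
        term = P_Om(P_T(P_Om(term)))        # apply P_Om P_T P_Om once more
        total = total + term
    return lam * (total - P_T(total))       # apply P_Tperp
```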

2.5. Key Lemmas


We now state three lemmas, which taken collectively, establish our main theorem. The
first may be found in Candès and Recht [2009].
THEOREM 2.6 [CANDÈS AND RECHT 2009, THEOREM 4.1]. Suppose Ω0 is sampled from
the Bernoulli model with parameter ρ0. Then with high probability,

    ‖PT − ρ0−1 PT PΩ0 PT‖ ≤ ε,                                          (2.9)

provided that ρ0 ≥ C0 ε−2 (μr log n)/n for some numerical constant C0 > 0 (μ is the
incoherence parameter). For rectangular matrices, we need ρ0 ≥ C0 ε−2 (μr log n(1))/n(2).
Among other things, this lemma is important because it shows that ‖PΩ PT‖ ≤ 1/2,
provided |Ω| is not too large. Indeed, if Ω ∼ Ber(ρ), we have

    ‖PT − (1 − ρ)−1 PT PΩ⊥ PT‖ ≤ ε,

with the proviso that 1 − ρ ≥ C0 ε−2 (μr log n)/n. Note, however, that since I = PΩ + PΩ⊥,

    PT − (1 − ρ)−1 PT PΩ⊥ PT = (1 − ρ)−1 (PT PΩ PT − ρPT)

and, therefore, by the triangular inequality,

    ‖PT PΩ PT‖ ≤ (1 − ρ)ε + ρ‖PT‖ = ρ + (1 − ρ)ε.

Since ‖PΩ PT‖² = ‖PT PΩ PT‖, we have established the following:
COROLLARY 2.7. Assume that Ω ∼ Ber(ρ); then ‖PΩ PT‖² ≤ ρ + ε, provided that
1 − ρ ≥ C0 ε−2 (μr log n)/n, where C0 is as in Theorem 2.6. For rectangular matrices, the
modification is as in Theorem 2.6.
The lemma below is proved in Section 3.
LEMMA 2.8. Assume that Ω ∼ Ber(ρ) with parameter ρ ≤ ρs for some ρs > 0. Set
j0 = 2⌈log n⌉ (use ⌈log n(1)⌉ for rectangular matrices). Then, under the other assumptions
of Theorem 1.1, the matrix W^L (2.5) obeys
(a) ‖W^L‖ < 1/4,
(b) ‖PΩ(UV∗ + W^L)‖F < λ/4,
(c) ‖PΩ⊥(UV∗ + W^L)‖∞ < λ/4.
Since ‖PΩ PT‖ < 1 with large probability, W^S is well defined and the following holds:


LEMMA 2.9. Assume that S0 is supported on a set Ω sampled as in Lemma 2.8,
and that the signs of S0 are independent and identically distributed symmetric (and
independent of Ω). Then, under the other assumptions of Theorem 1.1, the matrix W^S
(2.6) obeys
(a) ‖W^S‖ < 1/4,
(b) ‖PΩ⊥ W^S‖∞ < λ/4.
The proof is also in Section 3. Clearly, W^L and W^S obey (2.8), hence certifying that
Principal Component Pursuit correctly recovers the low-rank and sparse components
with high probability when the signs of S0 are random. The earlier “derandomization”
argument then establishes Theorem 1.1.

3. PROOFS OF DUAL CERTIFICATION


This section proves the two crucial estimates, namely, Lemma 2.8 and Lemma 2.9.

3.1. Preliminaries
We begin by recording two results that will be useful in proving Lemma 2.8. While
Theorem 2.6 asserts that, with large probability,

    ‖Z − ρ0−1 PT PΩ0 Z‖F ≤ ε‖Z‖F

for all Z ∈ T, the next lemma shows that for a fixed Z, the sup-norm of Z − ρ0−1 PT PΩ0(Z)
also does not increase (also with large probability).
LEMMA 3.1. Suppose Z ∈ T is a fixed matrix, and Ω0 ∼ Ber(ρ0). Then, with high
probability,

    ‖Z − ρ0−1 PT PΩ0 Z‖∞ ≤ ε‖Z‖∞                                        (3.1)

provided that ρ0 ≥ C0 ε−2 (μr log n)/n (for rectangular matrices, ρ0 ≥ C0 ε−2
(μr log n(1))/n(2)) for some numerical constant C0 > 0.
The proof is an application of Bernstein’s inequality and may be found in Appendix
A. A similar but somewhat different version of (3.1) appears in Recht [2009].
The second result was proved in Candès and Recht [2009].
LEMMA 3.2 [CANDÈS AND RECHT 2009, THEOREM 6.3]. Suppose Z is fixed, and Ω0 ∼
Ber(ρ0). Then, with high probability,

    ‖(I − ρ0−1 PΩ0) Z‖ ≤ C0' √(n log n / ρ0) ‖Z‖∞                       (3.2)

for some small numerical constant C0' > 0, provided that ρ0 ≥ C0 (μ log n)/n (or
ρ0 ≥ C0 (μ log n(1))/n(2) for rectangular matrices, in which case n(1) log n(1) replaces n log n
in (3.2)).
As a remark, Lemmas 3.1 and 3.2, and Theorem 2.6 all hold with probability at least
1 − O(n−β ), β > 2, if C0 is replaced by Cβ for some numerical constant C > 0.

3.2. Proof of Lemma 2.8


We begin by introducing a piece of notation and set Zj = UV∗ − PT Yj, obeying

    Zj = (PT − q−1 PT PΩj PT) Zj−1.


Obviously Zj ∈ T for all j ≥ 0. First, note that when

    q ≥ C0 ε−2 (μr log n)/n,                                            (3.3)

or for rectangular matrices

    q ≥ C0 ε−2 (μr log n(1))/n(2),

we have

    ‖Zj‖∞ ≤ ε‖Zj−1‖∞                                                    (3.4)

by Lemma 3.1. (This holds with high probability because Ωj and Zj−1 are independent,
and this is why the golfing scheme is easy to use.) In particular, this gives that with
high probability

    ‖Zj‖∞ ≤ ε^j ‖UV∗‖∞.

When q obeys the same estimate,

    ‖Zj‖F ≤ ε‖Zj−1‖F                                                    (3.5)

by Theorem 2.6. In particular, this gives that with high probability

    ‖Zj‖F ≤ ε^j ‖UV∗‖F = ε^j √r.                                        (3.6)

We will assume ε ≤ e−1.
PROOF OF (a). We prove the first part of the lemma and the argument parallels that
in Gross [2011], see also Recht [2009]. From

    Yj0 = ∑j q−1 PΩj Zj−1,

we deduce

    ‖W^L‖ = ‖PT⊥ Yj0‖ ≤ ∑j q−1 ‖PT⊥ PΩj Zj−1‖
          = ∑j ‖PT⊥(q−1 PΩj Zj−1 − Zj−1)‖
          ≤ ∑j ‖q−1 PΩj Zj−1 − Zj−1‖
          ≤ C0' √(n log n / q) ∑j ‖Zj−1‖∞
          ≤ C0' √(n log n / q) ∑j ε^{j−1} ‖UV∗‖∞
          ≤ C0' √(n log n / q) (1 − ε)−1 ‖UV∗‖∞.

The fourth step follows from Lemma 3.2 and the fifth from (3.4). Since ‖UV∗‖∞ ≤ √(μr)/n,
this gives

    ‖W^L‖ ≤ C'ε

for some numerical constant C' whenever q obeys (3.3).


PROOF OF (b). Since PΩ Yj0 = 0,

    PΩ(UV∗ + PT⊥ Yj0) = PΩ(UV∗ − PT Yj0) = PΩ(Zj0),

and it follows from (3.6) that

    ‖Zj0‖F ≤ ε^{j0} ‖UV∗‖F = ε^{j0} √r.

Since ε ≤ e−1 and j0 ≥ 2 log n, ε^{j0} ≤ 1/n², and this proves the claim.
PROOF OF (c). We have UV∗ + W^L = Zj0 + Yj0 and know that Yj0 is supported on Ωc.
Therefore, since ‖Zj0‖F ≤ λ/8, it suffices to show that ‖Yj0‖∞ ≤ λ/8. We have

    ‖Yj0‖∞ ≤ ∑j q−1 ‖PΩj Zj−1‖∞
           ≤ q−1 ∑j ‖Zj−1‖∞
           ≤ q−1 ∑j ε^{j−1} ‖UV∗‖∞.

Since ‖UV∗‖∞ ≤ √(μr)/n, this gives

    ‖Yj0‖∞ ≤ C ε² / √(μr (log n)²)

for some numerical constant C whenever q obeys (3.3). Since λ = 1/√n, ‖Yj0‖∞ ≤ λ/8
if

    ε ≤ C (μr (log n)² / n)^{1/4}.
Summary. We have seen that (a) and (b) are satisfied if ε is sufficiently small and
j0 ≥ 2 log n. For (c), we can take ε on the order of (μr(log n)²/n)^{1/4}, which will be
sufficiently small as well, provided that ρr in (1.4) is sufficiently small. Note that
everything is consistent since C0 ε−2 (μr log n)/n < 1. This concludes the proof of
Lemma 2.8.
3.3. Proof of Lemma 2.9
It is convenient to introduce the sign matrix E = sgn(S0), distributed as

    Eij =  1    with probability ρ/2,
           0    with probability 1 − ρ,                                 (3.7)
          −1    with probability ρ/2.

We shall be interested in the event {‖PΩ PT‖ ≤ σ}, which holds with large probability
when σ = √(ρ + ε), see Corollary 2.7. In particular, for any σ > 0, {‖PΩ PT‖ ≤ σ} holds
with high probability provided ρ is sufficiently small.
PROOF OF (a). By construction,

    W^S = λ PT⊥ E + λ PT⊥ ∑k≥1 (PΩ PT PΩ)^k E
        := PT⊥ W0^S + PT⊥ W1^S.

For the first term, we have ‖PT⊥ W0^S‖ ≤ ‖W0^S‖ = λ‖E‖. Then standard arguments
about the norm of a matrix with independent and identically distributed entries
give [Vershynin 2011]:

    ‖E‖ ≤ 4√(nρ)

with large probability. Since λ = 1/√n, this gives ‖W0^S‖ ≤ 4√ρ. When the matrix is
rectangular, we have

    ‖E‖ ≤ 4√(n(1) ρ)

with high probability. Since λ = 1/√n(1) in this case, ‖W0^S‖ ≤ 4√ρ as well.

Set R = k≥1 (P PT P )k and observe that R is self-adjoint. For the second term,
PT ⊥ W1S  ≤ W1S , where W1S = λR(E). We need to bound the operator norm of the
matrix R(E), and use a standard covering argument to do this. Throughout, N denotes
an 1/2-net for Sn−1 of size at most 6n (such a net exists, see Ledoux [2001, Theorem
4.16]). Then, a standard argument [Vershynin 2011] shows that
R(E) = sup y, R(E)x ≤ 4 sup y, R(E)x .
x,y∈Sn−1 x,y∈N

For a fixed pair (x, y) of unit-normed vectors in N × N, define the random variable
X(x, y) := y, R(E)x = R(yx ∗ ), E .
Conditional on  = supp(E), the signs of E are independent and identically dis-
tributed symmetric and Hoeffding’s inequality gives
 
2t2
P(|X(x, y)| > t | ) ≤ 2 exp − .
R(xy∗ )2F
Now since yx ∗  F = 1, the matrix R(yx ∗ ) obeys R(yx ∗ ) F ≤ R and, therefore,
   
2t2
P sup |X(x, y)| > t |  ≤ 2|N| exp −
2
.
x,y∈N R2
Hence,
 
t2
P(R(E) > t | ) ≤ 2|N|2 exp − .
8R2
On the event {P PT  ≤ σ },
 σ2
R ≤ σ 2k =
1 − σ2
k≥1

and, therefore, unconditionally,


 
γ 2 t2 1 − σ2
P(R(E) > t) ≤ 2|N| exp − 2
+ P(P PT  ≥ σ ), γ = .
2 2σ 2
This gives
 
γ 2 t2
P(λR(E) > t) ≤ 2 × 6 exp − 2 2n
+ P(P PT  ≥ σ ).


With λ = 1/ n,
W S  ≤ 1/4,
with large probability, provided that σ , or equivalently ρ, is small enough.


PROOF OF (b). Observe that

    PΩ⊥ W^S = −λ PΩ⊥ PT (PΩ − PΩ PT PΩ)−1 E.

Now for (i, j) ∈ Ωc, W^S_ij = ⟨ei, W^S ej⟩ = ⟨ei ej∗, W^S⟩, and we have

    W^S_ij = λ ⟨X(i, j), E⟩,

where X(i, j) is the matrix −(PΩ − PΩ PT PΩ)−1 PΩ PT(ei ej∗). Conditional on Ω = supp(E),
the signs of E are independent and identically distributed symmetric, and Hoeffding's
inequality gives

    P(|W^S_ij| > tλ | Ω) ≤ 2 exp(−2t² / ‖X(i, j)‖F²),

and, thus,

    P(sup{i,j} |W^S_ij| > tλ | Ω) ≤ 2n² exp(−2t² / sup{i,j} ‖X(i, j)‖F²).

Since (2.2) holds, we have

    ‖PΩ PT(ei ej∗)‖F = ‖PΩ PT PT(ei ej∗)‖F ≤ σ √(2μr/n)

on the event {‖PΩ PT‖ ≤ σ}. On the same event, ‖(PΩ − PΩ PT PΩ)−1‖ ≤ (1 − σ²)−1 and,
therefore,

    ‖X(i, j)‖F² ≤ 2σ²μr / ((1 − σ²)² n).

Then, unconditionally,

    P(sup{i,j} |W^S_ij| > tλ) ≤ 2n² exp(−2γ n t²/(μr)) + P(‖PΩ PT‖ ≥ σ),    γ = (1 − σ²)²/(2σ²).

This proves the claim when μr < ρr n(log n)−1 and ρr is sufficiently small.
4. NUMERICAL EXPERIMENTS
In this section, we perform numerical experiments corroborating our main results and
suggesting their many applications in image and video analysis. We first investigate
Principal Component Pursuit’s ability to correctly recover matrices of various rank
from errors of various density. We then sketch applications in background modeling
from video and removing shadows and specularities from face images. It is important to
emphasize that in both of these real data examples, the support of the sparse error term
may not follow a uniform or Bernoulli model, and so our theorems do not guarantee
that the algorithm will succeed. Nevertheless, we will see that Principal Component
Pursuit does give quite appealing results. We also emphasize that our goal in these
examples is not to engineer complete systems for either of these applications, but
rather to demonstrate the potential applicability of this model and algorithm to robustly
compute “principal components” of real data such as images and videos.
While the exact recovery guarantee provided by Theorem 1.1 is independent of
the particular algorithm used to solve Principal Component Pursuit, its applicabil-
ity to large scale problems depends on the availability of scalable algorithms for
nonsmooth convex optimization. For the experiments in this section, we use the augmented Lagrange multiplier algorithm introduced in Lin et al. [2009a] and


Table I. Correct Recovery for Random Problems of Varying Size

Dimension n | rank(L0) | ‖S0‖_0  | rank(L̂) | ‖Ŝ‖_0   | ‖L̂ − L0‖_F/‖L0‖_F | # SVD | Time(s)
500         |    25    |  12,500 |    25    |  12,500 | 1.1 × 10⁻⁶         |  16   |   2.9
1,000       |    50    |  50,000 |    50    |  50,000 | 1.2 × 10⁻⁶         |  16   |  12.4
2,000       |   100    | 200,000 |   100    | 200,000 | 1.2 × 10⁻⁶         |  16   |  61.8
3,000       |   250    | 450,000 |   250    | 450,000 | 2.3 × 10⁻⁶         |  15   | 185.2
rank(L0) = 0.05 × n, ‖S0‖_0 = 0.05 × n².

Dimension n | rank(L0) | ‖S0‖_0  | rank(L̂) | ‖Ŝ‖_0   | ‖L̂ − L0‖_F/‖L0‖_F | # SVD | Time(s)
500         |    25    |  25,000 |    25    |  25,000 | 1.2 × 10⁻⁶         |  17   |   4.0
1,000       |    50    | 100,000 |    50    | 100,000 | 2.4 × 10⁻⁶         |  16   |  13.7
2,000       |   100    | 400,000 |   100    | 400,000 | 2.4 × 10⁻⁶         |  16   |  64.5
3,000       |   150    | 900,000 |   150    | 900,000 | 2.5 × 10⁻⁶         |  16   | 191.0
rank(L0) = 0.05 × n, ‖S0‖_0 = 0.10 × n².

Here, L0 = XY* ∈ R^(n×n) with X, Y ∈ R^(n×r); X, Y have entries independent and identically distributed N(0, 1/n). S0 ∈ {−1, 0, 1}^(n×n) has support chosen uniformly at random and independent random signs; ‖S0‖_0 is the number of nonzero entries in S0. Top: recovering matrices of rank 0.05 × n from 5% gross errors. Bottom: recovering matrices of rank 0.05 × n from 10% gross errors. In all cases, the rank of L0 and the ℓ0-norm of S0 are correctly estimated. Moreover, the number of partial singular value decompositions (# SVD) required to solve PCP is almost constant.

Yuan and Yang [2009].7 In Section 5, we describe this algorithm in more detail, and
explain why it is our algorithm of choice for sparse and low-rank separation.
One important implementation detail in our approach is the choice of λ. Our analysis identifies one choice, λ = 1/√max(n1, n2), which works well for incoherent matrices. In order to illustrate the theory, throughout this section we will always choose λ = 1/√max(n1, n2). For practical problems, however, it is often possible to improve performance by choosing λ in accordance with prior knowledge about the solution. For example, if we know that S is very sparse, increasing λ will allow us to recover matrices L of larger rank. For practical problems, we recommend λ = 1/√max(n1, n2) as a good rule of thumb, which can then be adjusted slightly to obtain the best possible result.
4.1. Exact Recovery from Varying Fractions of Error
We first verify the correct recovery phenomenon of Theorem 1.1 on randomly generated
problems. We consider square matrices of varying dimension n = 500, . . . , 3000. We
generate a rank-r matrix L0 as a product L0 = XY ∗ where X and Y are n × r matrices
with entries independently sampled from a N (0, 1/n) distribution. S0 is generated by
choosing a support set Ω of size k uniformly at random, and setting S0 = P_Ω(E), where
E is a matrix with independent Bernoulli ±1 entries.
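In code, this test-problem generation might look as follows. The sketch below is ours (NumPy, with illustrative names), not the Matlab code used for the experiments reported here:

```python
import numpy as np

def make_test_problem(n, r, k, seed=0):
    """Sample L0 = X Y* with i.i.d. N(0, 1/n) factors and a sparse S0 with
    k nonzeros at uniformly random locations and independent +/-1 signs."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, r))
    Y = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, r))
    L0 = X @ Y.T                                         # rank-r component
    S0 = np.zeros((n, n))
    support = rng.choice(n * n, size=k, replace=False)   # uniformly random support
    S0.flat[support] = rng.choice([-1.0, 1.0], size=k)   # independent random signs
    return L0, S0

# Example corresponding to the first row of Table I (top):
# L0, S0 = make_test_problem(n=500, r=25, k=12500)
# M = L0 + S0
# lam = 1.0 / np.sqrt(500)   # the lambda used throughout this section
```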
Table I (top) reports the results with r = rank(L0) = 0.05 × n and k = ‖S0‖_0 = 0.05 × n². Table I (bottom) reports the results for a more challenging scenario, rank(L0) = 0.05 × n and k = 0.10 × n². In all cases, we set λ = 1/√n. Notice that in all cases, solving the convex PCP gives a result (L̂, Ŝ) with the correct rank and sparsity. Moreover, the relative error ‖L̂ − L0‖_F/‖L0‖_F is small, less than 10⁻⁵ in all examples considered.8
The last two columns of Table I give the number of partial singular value decom-
positions computed in the course of the optimization (# SVD) as well as the total
computation time. This experiment was performed in Matlab on a Mac Pro with dual

7 Both Lin et al. [2009a] and Yuan and Yang [2009] have posted a version of their code online.
8 We measure relative error in terms of L only, since, in this article we view the sparse and low-rank
decomposition as recovering a low-rank matrix L0 from gross errors. S0 is, of course, also well recovered: in
this example, the relative error in S is actually smaller than that in L.


Fig. 1. Correct recovery for varying rank and sparsity. Fraction of correct recoveries across 10 trials, as a function of rank(L0) (x-axis) and sparsity of S0 (y-axis). Here, n1 = n2 = 400. In all cases, L0 = XY* is a product of independent n × r i.i.d. N(0, 1/n) matrices. Trials are considered successful if ‖L̂ − L0‖_F/‖L0‖_F < 10⁻³. Left: low-rank and sparse decomposition, sgn(S0) random. Middle: low-rank and sparse decomposition, S0 = P_Ω sgn(L0). Right: matrix completion. For matrix completion, ρs is the probability that an entry is omitted from the observation.

quad-core 2.66-GHz Intel Xeon processors and 16 GB RAM. As we will discuss in Section 5, the dominant cost in solving the convex program comes from computing one
partial SVD per iteration. Strikingly, in Table I, the number of SVD computations is
nearly constant regardless of dimension, and in all cases less than 17.9 This suggests
that in addition to being theoretically well founded, the recovery procedure advocated
in this article is also reasonably practical.
4.2. Phase Transition in Rank and Sparsity
Theorem 1.1 shows that convex programming correctly recovers an incoherent low-
rank matrix from a constant fraction ρs of errors. We next empirically investigate the
algorithm’s ability to recover matrices of varying rank from errors of varying sparsity.
We consider square matrices of dimension n1 = n2 = 400. We generate low-rank
matrices L0 = XY ∗ with X and Y independently chosen n×r matrices with independent
and identically distributed Gaussian entries of mean zero and variance 1/n. For our
first experiment, we assume a Bernoulli model for the support of the sparse term S0 ,
with random signs: each entry of S0 takes on value 0 with probability 1 − ρ, and values
±1 each with probability ρ/2. For each (r, ρ) pair, we generate 10 random problems,
each of which is solved via the algorithm of Section 5. We declare a trial to be successful
if the recovered L̂ satisfies ‖L̂ − L0‖_F/‖L0‖_F ≤ 10⁻³. Figure 1 (left) plots the fraction of correct recoveries for each pair (r, ρ). Notice that there is a large region in which the recovery is exact. This highlights an interesting aspect of our result: the recovery is correct even though in some cases ‖S0‖_F ≫ ‖L0‖_F (e.g., for r/n = ρ, ‖S0‖_F is √n = 20 times larger!). This is to be expected from Lemma 2.4: the existence (or nonexistence)
of a dual certificate depends only on the signs and support of S0 and the orientation of
the singular spaces of L0 .
However, for incoherent L0 , our main result goes one step further and asserts that
the signs of S0 are also not important: recovery can be guaranteed as long as its support
is chosen uniformly at random. We verify this by again sampling L0 as a product of
Gaussian matrices and choosing the support Ω in accordance with the Bernoulli model, but this time setting S0 = P_Ω sgn(L0). One might expect such S0 to be more difficult to
distinguish from L0 . Nevertheless, our analysis showed that the number of errors that

9 One might reasonably ask whether this near constant number of iterations is due to the fact that random
problems are in some sense well conditioned. There is some validity to this concern, as we will see in our
real data examples. Lin et al. [2009a] suggests a continuation strategy (there termed “Inexact ALM”) that
produces qualitatively similar solutions with a similarly small number of iterations. However, to the best of
our knowledge, its convergence is not guaranteed.


can be corrected drops by at most 1/2 when moving to this more difficult model. Figure 1
(middle) plots the fraction of correct recoveries over 10 trials, again varying r and ρ.
Interestingly, the region of correct recovery in Figure 1 (middle) actually appears to
be broader than that in Figure 1 (left). Admittedly, the shape of the region in the
upper-left corner is puzzling, but has been corroborated by several distinct simulation
experiments (using different solvers).
Finally, inspired by the connection between matrix completion and robust PCA,
we compare the breakdown point for the low-rank and sparse separation problem
to the breakdown behavior of the nuclear-norm heuristic for matrix completion. By
comparing the two heuristics, we can begin to answer the question: how much is gained by knowing the location Ω of the corrupted entries? Here, we again generate L0 as a product of Gaussian matrices. However, we now provide the algorithm with only an incomplete subset $M = P_{\Omega^\perp} L_0$ of its entries. Each (i, j) is included in $\Omega^\perp$ independently with probability 1 − ρ, so rather than a probability of error, here ρ stands for the probability that an entry is omitted. We solve the nuclear norm minimization problem
$$\text{minimize} \quad \|L\|_* \qquad \text{subject to} \quad P_{\Omega^\perp} L = P_{\Omega^\perp} M$$
using an augmented Lagrange multiplier algorithm very similar to the one discussed in Section 5. We again declare L0 to be successfully recovered if ‖L̂ − L0‖_F/‖L0‖_F < 10⁻³.
Figure 1 (right) plots the fraction of correct recoveries for varying r, ρ. Notice that
nuclear norm minimization successfully recovers L0 over a much wider range of (r, ρ).
This is interesting because in the regime of large k, k = Θ(n²), the best performance
guarantees for each heuristic agree in their order of growth—both guarantee correct
recovery for rank(L0) = O(n/log² n). Fully explaining the difference in performance
between the two problems may require a sharper analysis of the breakdown behavior
of each.
4.3. Real Data Example: Background Modeling from Surveillance Video
Video is a natural candidate for low-rank modeling, due to the correlation between
frames. One of the most basic algorithmic tasks in video surveillance is to estimate a
good model for the background variations in a scene. This task is complicated by the
presence of foreground objects: in busy scenes, every frame may contain some anomaly.
Moreover, the background model needs to be flexible enough to accommodate changes
in the scene, for example due to varying illumination. In such situations, it is natural
to model the background variations as approximately low rank. Foreground objects,
such as cars or pedestrians, generally occupy only a fraction of the image pixels and
hence can be treated as sparse errors.
We investigate whether convex optimization can separate these sparse errors from
the low-rank background. Here, it is important to note that the error support may not be
well-modeled as Bernoulli: errors tend to be spatially coherent, and more complicated
models such as Markov random fields may be more appropriate [Cevher et al. 2009;
Zhou et al. 2009]. Hence, our theorems do not necessarily guarantee the algorithm will
succeed with high probability. The results in Chandrasekaran et al. [2009] apply to
any sufficiently sparse error term, and hence might lead to analytic results for spa-
tially coherent sparse errors. However, there does not appear to be any straightforward
way to estimate the quantity μ(S0 ) in Chandrasekaran et al. [2009] for realistic spa-
tially coherent sparse errors S0 arising in this application. Nevertheless, as we will see,
Principal Component Pursuit still gives visually appealing solutions to this practical
low-rank and sparse separation problem, without using any additional information
about the spatial structure of the error. We emphasize that our goal in this example is
not to engineer a complete system for visual surveillance. Such a system would have
to cope with real-time processing requirements [Stauffer and Grimson 1999], and also


Fig. 2. Background modeling from video. Three frames from a 200-frame video sequence taken in an airport
[Li et al. 2004]. (a) Frames of original video M. (b)-(c) Low-rank L̂ and sparse components Ŝ obtained by PCP,
(d)-(e) competing approach based on alternating minimization of an m-estimator [Torre and Black 2003].
PCP yields a much more appealing result despite using less prior knowledge.

perform nontrivial post-processing for object detection, tracking, and so on. Our goal
here is simply to demonstrate the potential real-world applicability of the theory and
approaches of this article.
We consider two example videos introduced in Li et al. [2004]. The first is a sequence
of 200 grayscale frames taken in an airport. This video has a relatively static back-
ground, but significant foreground variations. The frames have resolution 176 × 144;
we stack each frame as a column of our matrix M ∈ R^(25,344×200). We decompose M into a low-rank term and a sparse term by solving the convex PCP problem (1.1) with λ = 1/√n₁. On a desktop PC with a 2.33 GHz Core2 Duo processor and 2 GB RAM, our
Matlab implementation requires 806 iterations, and roughly 43 minutes to converge.10
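To make the data layout concrete, the following sketch (our own illustrative code, not the implementation used for these experiments) shows how the frames would be stacked and how a PCP solver such as the ALM sketch in Section 5 would be invoked:

```python
import numpy as np

def frames_to_matrix(frames):
    """Stack each grayscale frame (an h-by-w array) as one column of M."""
    return np.column_stack([np.asarray(f, dtype=float).ravel() for f in frames])

# frames: a list of 200 arrays of shape (144, 176) read from the airport sequence
# M = frames_to_matrix(frames)               # M has shape (25344, 200)
# lam = 1.0 / np.sqrt(M.shape[0])            # lambda = 1/sqrt(n1)
# L, S = pcp_alm(M, lam)                     # see the ALM sketch in Section 5
# background = L[:, 0].reshape(144, 176)     # column k of L: background of frame k
# foreground = np.abs(S[:, 0]).reshape(144, 176)
```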
Figure 2(a) shows three frames from the video; (b) and (c) show the corresponding
columns of the low rank matrix L̂ and sparse matrix Ŝ (its absolute value is shown
here). Notice that L̂ correctly recovers the background, while Ŝ correctly identifies the
moving pedestrians. The person appearing in the images in L̂ does not move throughout
the video.
Figure 2 (d) and (e) compares the result obtained by Principal Component Pursuit to
a very closely related technique from the computer vision literature [Torre and Black
2003].11 That approach also aims at robustly recovering a good low-rank approxima-
tion, but uses a more complicated, nonconvex m-estimator, which incorporates a local
scale estimate that implicitly exploits the spatial characteristics of natural images.
This leads to a highly nonconvex optimization, which is solved locally via alternat-
ing minimization. Interestingly, despite using more prior information about the signal

10 Linet al. [2009a] suggests a variant of ALM optimization procedure, there termed the “Inexact ALM” that
finds a visually similar decomposition in far fewer iterations (less than 50). However, since the convergence
guarantee for that variant is weak, we choose to present the slower, exact result here.
11 We use the code package downloaded from http://www.salleurl.edu/˜ftorre/papers/rpca/rpca.zip, modified
to choose the rank of the approximation as suggested in de la Torre and Black [2003].


Fig. 3. Background modeling from video. Three frames from a 250 frame sequence taken in a lobby, with
varying illumination [Li et al. 2004]. (a) Original video M. (b)-(c) Low-rank L̂ and sparse Ŝ obtained by
PCP. (d)-(e) Low-rank and sparse components obtained by a competing approach based on alternating
minimization of an m-estimator [Torre and Black 2003]. Again, convex programming yields a more appealing
result despite using less prior information.

to be recovered, this approach does not perform as well as the convex programming
heuristic: notice the large artifacts in the top and bottom rows of Figure 2(d).
In Figure 3, we consider 250 frames of a sequence with several drastic illumination
changes. Here, the resolution is 168 × 120, and so M is a 20,160 × 250 matrix. For simplicity, and to illustrate the theoretical results obtained above, we again choose λ = 1/√n₁.12 For this example, on the same 2.66-GHz Core 2 Duo machine, the algo-
rithm requires a total of 561 iterations and 36 minutes to converge.
Figure 3(a) shows three frames taken from the original video, while (b) and (c) show
the recovered low-rank and sparse components, respectively. Notice that the low-rank
component correctly identifies the main illuminations as background, while the sparse
part corresponds to the motion in the scene. On the other hand, the result produced
by the algorithm of Torre and Black [2003] treats some of the first illumination as
foreground. PCP again outperforms the competing approach, despite using less prior
information. These results suggest the potential power for convex programming as a
tool for video analysis.
Notice that the number of iterations for the real data is typically higher than that
of the simulations with random matrices given in Table I. The reason for this discrep-
ancy might be that the structures of real data could slightly deviate from the idealistic
low-rank and sparse model. Nevertheless, it is important to realize that practical ap-
plications such as video surveillance often provide additional information about the
signals of interest, for example, the support of the sparse foreground is spatially piece-
wise contiguous, or even impose additional requirements, for example, the recovered
background needs to be non-negative etc. We note that the simplicity of our objec-
tive and solution suggests that one can easily incorporate additional constraints and
more accurate models of the signals so as to obtain much more efficient and accurate
solutions in the future.

12 For this example, slightly more appealing results can actually be obtained by choosing a larger λ (say, 2/√n₁).


Fig. 4. Removing shadows, specularities, and saturations from face images. (a) Cropped and aligned images
of a person’s face under different illuminations from the Extended Yale B database. The size of each image is
192 × 168 pixels, a total of 58 different illuminations were used for each person. (b) Low-rank approximation
L̂ recovered by convex programming. (c) Sparse error Ŝ corresponding to specularities in the eyes, shadows
around the nose region, or brightness saturations on the face. Notice in the bottom left that the sparse term
also compensates for errors in image acquisition.

4.4. Real Data Example: Removing Shadows and Specularities from Face Images
Face recognition is another problem domain in computer vision where low-dimensional
linear models have received a great deal of attention. This is mostly due to the work
of Basri and Jacobs [2003], who showed that, for convex, Lambertian objects, images
taken under distant illumination lie near an approximately nine-dimensional linear
subspace known as the harmonic plane. However, since faces are neither perfectly
convex nor Lambertian, real face images often violate this low-rank model, due to
cast shadows and specularities. These errors are large in magnitude, but sparse in
the spatial domain. It is reasonable to believe that if we have enough images of the
same face, Principal Component Pursuit will be able to remove these errors. As with the
previous example, some caveats apply: the theoretical result suggests the performance
should be good, but does not guarantee it, since again the error support does not follow
a Bernoulli model. Nevertheless, as we will see, the results are visually striking.
Figure 4 shows two examples with face images taken from the Yale B face database
[Georghiades et al. 2001]. Here, each image has resolution 192 × 168; there are a
total of 58 illuminations per subject, which we stack as the columns of our matrix M ∈ R^(32,256×58). We again solve PCP with λ = 1/√n₁. In this case, the algorithm
requires 642 iterations to converge, and the total computation time on the same Core
2 Duo machine is 685 seconds.
Figure 4 plots the low rank term L̂ and the magnitude of the sparse term Ŝ obtained
as the solution to the convex program. The sparse term Ŝ compensates for cast shadows
and specular regions. In one example (bottom row of Figure 4 left), this term also com-
pensates for errors in image acquisition. These results may be useful for conditioning
the training data for face recognition, as well as face alignment and tracking under
illumination variations.


5. ALGORITHMS
Theorem 1.1 shows that incoherent low-rank matrices can be recovered from nonvan-
ishing fractions of gross errors in polynomial time. Moreover, as the experiments in the
previous section attest, this low computational cost is not only a theoretical guarantee: the approach is becoming practical for real imaging problems. This practicality is mainly
due to the rapid recent progress in scalable algorithms for nonsmooth convex optimiza-
tion, in particular for minimizing the ℓ1 and nuclear norms. In this section, we briefly
review this progress, and discuss our algorithm of choice for this problem.
For small problem sizes, Principal Component Pursuit
$$\text{minimize} \quad \|L\|_* + \lambda\|S\|_1 \qquad \text{subject to} \quad L + S = M$$
can be performed using off-the-shelf tools such as interior point methods [Grant and
Boyd 2009]. This was suggested for rank minimization in Fazel et al. [2003] and Recht
et al. [2010] and for low-rank and sparse decomposition [Chandrasekaran et al. 2009]
(see also Liu and Vandenberghe [2009]). However, despite their superior convergence
rates, interior point methods are typically limited to small problems, say n < 100, due
to the O(n⁶) complexity of computing a step direction.
The limited scalability of interior point methods has inspired a recent flurry of work
on first-order methods. Exploiting an analogy with iterative thresholding algorithms
for ℓ1-minimization [Yin et al. 2008a, 2008b], Cai et al. [2010] developed an algorithm
that performs nuclear-norm minimization by repeatedly shrinking the singular values
of an appropriate matrix, essentially reducing the complexity of each iteration to the
cost of an SVD. However, for our low-rank and sparse decomposition problem, this form
of iterative thresholding converges slowly, requiring up to 10⁴ iterations. Goldfarb and
Ma [2009] and Ma et al. [2009] suggest improving convergence using continuation
techniques, and also demonstrate how Bregman iterations [Osher et al. 2005] can be
applied to nuclear norm minimization.
The convergence of iterative thresholding has also been greatly improved using
ideas from Nesterov’s optimal first-order algorithm for smooth minimization [Nesterov
1983], which was extended to nonsmooth optimization in Nesterov [2005] and Beck and
Teboulle [2009], and applied to ℓ1-minimization in Nesterov [2007], Beck and Teboulle
[2009], and Becker et al. [2011]. Based on Beck and Teboulle [2009], Toh and Yun
[2010] developed a proximal gradient algorithm for matrix completion that they termed
Accelerated Proximal Gradient (APG). A very similar APG algorithm was suggested
for low-rank and sparse decomposition in Lin et al. [2009b]. That algorithm inherits
the optimal O(1/k²) convergence rate for this class of problems. Empirical evidence
suggests that these algorithms can solve the convex PCP problem at least 50 times
faster than straightforward iterative thresholding (for more details and comparisons,
see Lin et al. [2009b]).
However, despite its good convergence guarantees, the practical performance of APG
depends strongly on the design of good continuation schemes. Generic continuation
does not guarantee good accuracy and convergence across a wide range of problem
settings.13 In this article we have chosen to instead solve the convex PCP problem (1.1)
using an augmented Lagrange multiplier (ALM) algorithm introduced in Lin et al.
[2009a] and Yuan and Yang [2009]. In our experience, ALM achieves much higher
accuracy than APG, in fewer iterations. It works stably across a wide range of problem
settings with no tuning of parameters. Moreover, we observe an appealing (empirical)
property: the rank of the iterates often remains bounded by rank(L0 ) throughout the

13 In our experience, the optimal choice may depend on the relative magnitudes of the L and S terms and the sparsity of the corruption.


optimization, allowing them to be computed especially efficiently. APG, on the other hand, does not have this property.
The ALM method operates on the augmented Lagrangian
$$l(L, S, Y) = \|L\|_* + \lambda\|S\|_1 + \langle Y, M - L - S\rangle + \frac{\mu}{2}\|M - L - S\|_F^2. \qquad (5.1)$$
A generic Lagrange multiplier algorithm [Bertsekas 1982] would solve PCP by repeatedly setting $(L_k, S_k) = \arg\min_{L,S} l(L, S, Y_k)$, and then updating the Lagrange multiplier matrix via $Y_{k+1} = Y_k + \mu(M - L_k - S_k)$.
For our low-rank and sparse decomposition problem, we can avoid having to solve
a sequence of convex programs by recognizing that min L l(L, S, Y ) and min S l(L, S, Y )
both have very simple and efficient solutions. Let Sτ : R → R denote the shrinkage
operator Sτ [x] = sgn(x) max(|x| − τ, 0), and extend it to matrices by applying it to each
element. It is easy to show that
$$\arg\min_S\, l(L, S, Y) = S_{\lambda/\mu}(M - L + \mu^{-1} Y). \qquad (5.2)$$

Similarly, for matrices X, let $D_\tau(X)$ denote the singular value thresholding operator given by $D_\tau(X) = U S_\tau(\Sigma) V^*$, where $X = U\Sigma V^*$ is any singular value decomposition. It is not difficult to show that
$$\arg\min_L\, l(L, S, Y) = D_{1/\mu}(M - S + \mu^{-1} Y). \qquad (5.3)$$

Thus, a more practical strategy is to first minimize l with respect to L (fixing S), then
minimize l with respect to S (fixing L), and then finally update the Lagrange multiplier
matrix Y based on the residual M− L−S, a strategy that is summarized as Algorithm 1.

ALGORITHM 1: (Principal Component Pursuit by Alternating Directions [Lin et al. 2009a; Yuan and Yang 2009])
1: initialize: S0 = Y0 = 0, μ > 0.
2: while not converged do
3: compute Lk+1 = D1/μ (M − Sk + μ−1 Yk);
4: compute Sk+1 = Sλ/μ (M − Lk+1 + μ−1 Yk);
5: compute Yk+1 = Yk + μ(M − Lk+1 − Sk+1 );
6: end while
7: output: L, S.

Algorithm 1 is a special case of a more general class of augmented Lagrange multiplier algorithms known as alternating directions methods [Yuan and Yang 2009]. The
convergence of these algorithms has been well-studied (see, e.g., Lions and Mercier
[1979] and Kontogiorgis and Meyer [1989] and the many references therein, as well
as discussion in Lin et al. [2009a] and Yuan and Yang [2009]). Algorithm 1 performs
excellently on a wide range of problems: as we saw in Section 4, relatively small num-
bers of iterations suffice to achieve good relative accuracy. The dominant cost of each
iteration is computing Lk+1 via singular value thresholding. This requires us to com-
pute those singular vectors of M − Sk + μ−1 Yk whose corresponding singular values
exceed the threshold 1/μ. Empirically, we have observed that the number of such large
singular values is often bounded by rank(L0 ), allowing the next iterate to be computed
efficiently via a partial SVD.14 The most important implementation details for this
algorithm are the choice of μ and the stopping criterion. In this work, we simply choose

14 Further performance gains might be possible by replacing this partial SVD with an approximate SVD, as
suggested in Goldfarb and Ma [2009] for nuclear norm minimization.


$\mu = n_1 n_2/(4\|M\|_1)$, as suggested in Yuan and Yang [2009]. We terminate the algorithm when $\|M - L - S\|_F \le \delta\|M\|_F$, with $\delta = 10^{-7}$.
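For reference, the following is a minimal NumPy sketch of Algorithm 1 with these parameter choices. It is not the authors' Matlab implementation: it uses a full SVD in place of a partial SVD, and the function names are ours.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding operator S_tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding D_tau(X) = U S_tau(Sigma) V*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp_alm(M, lam=None, delta=1e-7, max_iter=1000):
    """Principal Component Pursuit by alternating directions (Algorithm 1)."""
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    mu = n1 * n2 / (4.0 * np.abs(M).sum())    # mu = n1*n2 / (4 * ||M||_1)
    S = np.zeros_like(M, dtype=float)
    Y = np.zeros_like(M, dtype=float)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)     # step 3
        S = shrink(M - L + Y / mu, lam / mu)  # step 4
        Y = Y + mu * (M - L - S)              # step 5
        if np.linalg.norm(M - L - S) <= delta * np.linalg.norm(M):
            break
    return L, S
```

As in Table I, on well-conditioned random problems one would expect this loop to exit after a few tens of iterations, with the SVD inside svt dominating the per-iteration cost.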
Very similar ideas can be used to develop simple and effective augmented Lagrange
multiplier algorithms for matrix completion Lin et al. [2009a], and for the robust
matrix completion problem (1.5) discussed in Section 1.6, with similarly good per-
formance. In the preceding section, all simulations and experiments are therefore
conducted using ALM-based algorithms. For a more thorough discussion, implemen-
tation details and comparisons with other algorithms, please see Lin et al. [2009a]
and Yuan and Yang [2009].
6. DISCUSSION
This article delivers some encouraging news: one can disentangle the low-rank and
sparse components exactly by convex programming, and this provably works under
quite broad conditions. Further, our analysis has revealed rather close relationships
between matrix completion and matrix recovery (from sparse errors) and our results
even generalize to the case when there are both incomplete and corrupted entries (i.e.,
Theorem 1.2). In addition, Principal Component Pursuit does not have any free param-
eter and can be solved by simple optimization algorithms with remarkable efficiency
and accuracy. More importantly, our results may point to a very wide spectrum of new
theoretical and algorithmic issues together with new practical applications that can
now be studied systematically.
Our study so far is limited to the low-rank component being exactly low-rank, and
the sparse component being exactly sparse. It would be interesting to investigate when
either or both of these assumptions are relaxed. One way to think of this is via the
new observation model M = L0 + S0 + N0 , where N0 is a dense, small perturbation
accounting for the fact that the low-rank component is only approximately low-rank
and that small errors can be added to all the entries (in some sense, this model unifies
the classical PCA and the robust PCA by combining both sparse gross errors and dense
small noise). The ideas developed in Candès and Plan [2010] in connection with the
stability of matrix completion under small perturbations may be useful here. Even more
generally, the problems of sparse signal recovery, low-rank matrix completion, classical
PCA, and robust PCA can all be considered as special cases of a general measurement
model of the form
M = A(L0 ) + B(S0 ) + C(N0 ),
where A, B, C are known linear maps. An ambitious goal might be to understand exactly
under what conditions, one can effectively retrieve or decompose L0 and S0 from such
noisy linear measurements via convex programming.
The remarkable ability of convex optimization to recover low-rank matrices and sparse signals in high-dimensional spaces suggests that it will be a powerful tool for
processing massive data sets that arise in image/video processing, web data analysis,
and bioinformatics. Such data are often of millions or even billions of dimensions so
the computational and memory cost can be far beyond that of a typical PC. Thus,
one important direction for future investigation is to develop algorithms that have
even better scalability, and can be easily implemented on the emerging parallel and
distributed computing infrastructures.
APPENDIX A
A.1. Equivalence of Sampling Models
We begin by arguing that a recovery result under the Bernoulli model automatically
implies a corresponding result for the uniform model. Denote by PUnif(m) and PBer( p)
probabilities calculated under the uniform and Bernoulli models and let “Success” be


the event that the algorithm succeeds. We have
$$P_{\mathrm{Ber}(p)}(\text{Success}) = \sum_{k=0}^{n^2} P_{\mathrm{Ber}(p)}(\text{Success} \mid |\Omega| = k)\, P_{\mathrm{Ber}(p)}(|\Omega| = k)$$
$$\le \sum_{k=0}^{m-1} P_{\mathrm{Ber}(p)}(|\Omega| = k) + \sum_{k=m}^{n^2} P_{\mathrm{Unif}(k)}(\text{Success})\, P_{\mathrm{Ber}(p)}(|\Omega| = k) \le P_{\mathrm{Ber}(p)}(|\Omega| < m) + P_{\mathrm{Unif}(m)}(\text{Success}),$$
where we have used the fact that for $k \ge m$, $P_{\mathrm{Unif}(k)}(\text{Success}) \le P_{\mathrm{Unif}(m)}(\text{Success})$, and that the conditional distribution of Ω given its cardinality is uniform. Thus,
$$P_{\mathrm{Unif}(m)}(\text{Success}) \ge P_{\mathrm{Ber}(p)}(\text{Success}) - P_{\mathrm{Ber}(p)}(|\Omega| < m).$$
Take $p = m/n^2 + \epsilon$, where $\epsilon > 0$. The conclusion follows from $P_{\mathrm{Ber}(p)}(|\Omega| < m) \le e^{-\epsilon^2 n^2/(2p)}$.
In the other direction, the same reasoning gives
$$P_{\mathrm{Ber}(p)}(\text{Success}) \ge \sum_{k=0}^{m} P_{\mathrm{Ber}(p)}(\text{Success} \mid |\Omega| = k)\, P_{\mathrm{Ber}(p)}(|\Omega| = k) \ge P_{\mathrm{Unif}(m)}(\text{Success}) \sum_{k=0}^{m} P_{\mathrm{Ber}(p)}(|\Omega| = k) = P_{\mathrm{Unif}(m)}(\text{Success})\, P(|\Omega| \le m),$$
and choosing m such that $P(|\Omega| > m)$ is exponentially small establishes the claim.

A.2. Proof of Lemma 3.1


The proof is essentially an application of Bernstein's inequality, which states that for a sum of uniformly bounded independent random variables with $|Y_k - \mathbb{E}\,Y_k| < c$,
$$P\left(\left|\sum_{k=1}^n (Y_k - \mathbb{E}\,Y_k)\right| > t\right) \le 2\exp\left(-\frac{t^2}{2\sigma^2 + 2ct/3}\right), \qquad (A.1)$$
where $\sigma^2$ is the sum of the variances, $\sigma^2 \equiv \sum_{k=1}^n \mathrm{Var}(Y_k)$.
Define $\Omega_0$ via $\Omega_0 = \{(i, j) : \delta_{ij} = 1\}$, where $\{\delta_{ij}\}$ is an independent sequence of Bernoulli variables with parameter $\rho_0$. With this notation, $Z' = Z - \rho_0^{-1} P_T P_{\Omega_0} Z$ is given by
$$Z' = \sum_{ij} \bigl(1 - \rho_0^{-1}\delta_{ij}\bigr) Z_{ij}\, P_T(e_i e_j^*)$$
so that $Z'_{i_0 j_0}$ is a sum of independent random variables,
$$Z'_{i_0 j_0} = \sum_{ij} Y_{ij}, \qquad Y_{ij} = \bigl(1 - \rho_0^{-1}\delta_{ij}\bigr) Z_{ij}\, \bigl\langle P_T(e_i e_j^*), e_{i_0} e_{j_0}^*\bigr\rangle.$$
We have
$$\sum_{ij} \mathrm{Var}(Y_{ij}) = (1-\rho_0)\rho_0^{-1} \sum_{ij} |Z_{ij}|^2\, \bigl|\bigl\langle P_T(e_i e_j^*), e_{i_0} e_{j_0}^*\bigr\rangle\bigr|^2 \le (1-\rho_0)\rho_0^{-1} \|Z\|_\infty^2 \sum_{ij} \bigl|\bigl\langle e_i e_j^*, P_T(e_{i_0} e_{j_0}^*)\bigr\rangle\bigr|^2 = (1-\rho_0)\rho_0^{-1} \|Z\|_\infty^2\, \|P_T(e_{i_0} e_{j_0}^*)\|_F^2 \le (1-\rho_0)\rho_0^{-1} \|Z\|_\infty^2\, \frac{2\mu r}{n},$$
where the last inequality holds because of (2.2). Also, it follows from (1.2) that $|\langle P_T(e_i e_j^*), e_{i_0} e_{j_0}^*\rangle| \le \|P_T(e_i e_j^*)\|_F\, \|P_T(e_{i_0} e_{j_0}^*)\|_F \le 2\mu r/n$, so that $|Y_{ij}| \le 2\rho_0^{-1} \|Z\|_\infty\, \mu r/n$. Then, Bernstein's inequality gives
$$P\bigl(|Z'_{ij}| > \epsilon \|Z\|_\infty\bigr) \le 2\exp\left(-\frac{3}{16}\, \frac{\epsilon^2\, n\rho_0}{\mu r}\right).$$
If $\rho_0$ is as in Lemma 3.1, the union bound proves the claim.
A.3. Proof of Theorem 1.2
This section presents a proof of Theorem 1.2, which resembles that of Theorem 1.1.
Here and in what follows, $S_0 = P_{\Omega_{\mathrm{obs}}} S_0$, so that the available data are of the form $Y = P_{\Omega_{\mathrm{obs}}} L_0 + S_0$. We make three observations:
—If PCP correctly recovers L0 from the input data $P_{\Omega_{\mathrm{obs}}} L_0 + S_0$ (note that this means that L̂ = L0 and Ŝ = S0), then it must correctly recover L0 from $P_{\Omega_{\mathrm{obs}}} L_0 + S_0'$, where $S_0'$ is a trimmed version of S0. The proof is identical to that of our elimination
result, namely, Theorem 2.2. The derandomization argument then applies and it
suffices to consider the case where the signs of S0 are independent and identically
distributed symmetric Bernoulli variables.
—It is of course sufficient to prove the theorem when each entry in $\Omega_{\mathrm{obs}}$ is revealed with probability $p_0 := 0.1$, that is, when $\Omega_{\mathrm{obs}} \sim \mathrm{Ber}(p_0)$.
—We establish the theorem in the case where n1 = n2 = n as slight modifications would
give the general case.
Further, there are now three index sets of interest:
—$\Omega_{\mathrm{obs}}$ are those locations where data are available.
—$\Gamma \subset \Omega_{\mathrm{obs}}$ are those locations where data are available and clean; that is, $P_\Gamma Y = P_\Gamma L_0$.
—$\Omega = \Omega_{\mathrm{obs}} \setminus \Gamma$ are those locations where data are available but totally unreliable.
The matrix S0 is thus supported on Ω. If $\Omega_{\mathrm{obs}} \sim \mathrm{Ber}(p_0)$, then by definition, $\Omega \sim \mathrm{Ber}(p_0\tau)$.
Dual Certification. We begin with two lemmas concerning dual certification.
LEMMA A.1. Assume $\|P_{\Gamma^\perp} P_T\| < 1$. Then (L0, S0) is the unique solution if there is a pair (W, F) obeying
$$UV^* + W = \lambda(\mathrm{sgn}(S_0) + F),$$
with $P_T W = 0$, $\|W\| < 1$, $P_{\Gamma^\perp} F = 0$ and $\|F\|_\infty < 1$.
The proof is about the same as that of Lemma 2.4, and is discussed in very brief terms. The idea is to consider a feasible perturbation of the form $(L_0 + H_L, S_0 - H_S)$ obeying $P_{\Omega_{\mathrm{obs}}} H_L = P_{\Omega_{\mathrm{obs}}} H_S$, and show that this increases the objective functional unless $H_L = H_S = 0$. Then, a sequence of steps similar to that in the proof of Lemma 2.4 establishes
$$\|L_0 + H_L\|_* + \lambda\|S_0 - H_S\|_1 \ge \|L_0\|_* + \lambda\|S_0\|_1 + (1-\beta)\bigl(\|P_{T^\perp} H_L\|_* + \lambda\|P_\Gamma H_L\|_1\bigr), \qquad (A.2)$$
where $\beta = \max(\|W\|, \|F\|_\infty)$. Finally, $\|P_{T^\perp} H_L\|_* + \lambda\|P_\Gamma H_L\|_1$ vanishes if and only if $H_L \in \Gamma^\perp \cap T = \{0\}$.
LEMMA A.2. Assume that for any matrix M, $\|P_T P_{\Gamma^\perp} M\|_F \le n\|P_{T^\perp} P_{\Gamma^\perp} M\|_F$, and take λ > 4/n. Then (L0, S0) is the unique solution if there is a pair (W, F) obeying
$$UV^* + W + P_T D = \lambda(\mathrm{sgn}(S_0) + F),$$
with $P_T W = 0$, $\|W\| < 1/2$, $P_{\Gamma^\perp} F = 0$ and $\|F\|_\infty < 1/2$, and $\|P_T D\|_F \le n^{-2}$.
Note that $\|P_T P_{\Gamma^\perp} M\|_F \le n\|P_{T^\perp} P_{\Gamma^\perp} M\|_F$ implies $\Gamma^\perp \cap T = \{0\}$, or equivalently $\|P_{\Gamma^\perp} P_T\| < 1$. Indeed, if $M \in \Gamma^\perp \cap T$, $P_T P_{\Gamma^\perp} M = M$ while $P_{T^\perp} P_{\Gamma^\perp} M = 0$, and thus M = 0.
PROOF. It follows from (A.2) together with the same argument as in the proof of Lemma A.1 that
$$\|L_0 + H_L\|_* + \lambda\|S_0 - H_S\|_1 \ge \|L_0\|_* + \lambda\|S_0\|_1 + \frac12\bigl(\|P_{T^\perp} H_L\|_* + \lambda\|P_\Gamma H_L\|_1\bigr) - \frac{1}{n^2}\|P_T H_L\|_F.$$
Observe now that
$$\|P_T H_L\|_F \le \|P_T P_\Gamma H_L\|_F + \|P_T P_{\Gamma^\perp} H_L\|_F \le \|P_T P_\Gamma H_L\|_F + n\|P_{T^\perp} P_{\Gamma^\perp} H_L\|_F \le \|P_T P_\Gamma H_L\|_F + n\bigl(\|P_{T^\perp} P_\Gamma H_L\|_F + \|P_{T^\perp} H_L\|_F\bigr) \le (n+1)\|P_\Gamma H_L\|_F + n\|P_{T^\perp} H_L\|_F.$$
Using both $\|P_\Gamma H_L\|_F \le \|P_\Gamma H_L\|_1$ and $\|P_{T^\perp} H_L\|_F \le \|P_{T^\perp} H_L\|_*$, we obtain
$$\|L_0 + H_L\|_* + \lambda\|S_0 - H_S\|_1 \ge \|L_0\|_* + \lambda\|S_0\|_1 + \left(\frac12 - \frac1n\right)\|P_{T^\perp} H_L\|_* + \left(\frac{\lambda}{2} - \frac{n+1}{n^2}\right)\|P_\Gamma H_L\|_1.$$
The claim follows from $\Gamma^\perp \cap T = \{0\}$.
LEMMA A.3. Under the assumptions of Theorem 1.2, the assumption of Lemma A.2 is satisfied with high probability. That is, $\|P_T P_{\Gamma^\perp} M\|_F \le n\|P_{T^\perp} P_{\Gamma^\perp} M\|_F$ for all M.
PROOF. Set $\rho_0 = p_0(1-\tau)$ and $M' = P_{\Gamma^\perp} M$. Since $\Gamma \sim \mathrm{Ber}(\rho_0)$, Theorem 2.6 gives $\|P_T - \rho_0^{-1} P_T P_\Gamma P_T\| \le 1/2$ with high probability. Further, because $\|P_\Gamma P_T M'\|_F = \|P_\Gamma P_{T^\perp} M'\|_F$, we have
$$\|P_\Gamma P_T M'\|_F \le \|P_{T^\perp} M'\|_F.$$
In the other direction,
$$\rho_0^{-1}\|P_\Gamma P_T M'\|_F^2 = \rho_0^{-1}\langle P_T M', P_T P_\Gamma P_T M'\rangle = \langle P_T M', P_T M'\rangle + \langle P_T M', (\rho_0^{-1} P_T P_\Gamma P_T - P_T) M'\rangle \ge \|P_T M'\|_F^2 - \frac12\|P_T M'\|_F^2 = \frac12\|P_T M'\|_F^2.$$
In conclusion, $\|P_{T^\perp} M'\|_F \ge \|P_\Gamma P_T M'\|_F \ge \sqrt{\rho_0/2}\,\|P_T M'\|_F$, and the claim follows since $\sqrt{\rho_0/2} \ge 1/n$.


Thus far, our analysis shows that to establish our theorem, it suffices to construct a pair $(Y^L, W^S)$ obeying
$$\begin{cases} \|P_{T^\perp} Y^L\| < 1/4, \\ \|P_T Y^L - UV^*\|_F \le n^{-2}, \\ P_{\Gamma^\perp} Y^L = 0, \\ \|P_\Gamma Y^L\|_\infty < \lambda/4, \end{cases} \qquad\text{and}\qquad \begin{cases} P_T W^S = 0, \\ \|W^S\| \le 1/4, \\ P_\Omega W^S = \lambda\,\mathrm{sgn}(S_0), \\ P_{\Omega_{\mathrm{obs}}^\perp} W^S = 0, \\ \|P_\Gamma W^S\|_\infty \le \lambda/4. \end{cases} \qquad (A.3)$$
Indeed, by definition, $Y^L + W^S$ obeys
$$Y^L + W^S = \lambda(\mathrm{sgn}(S_0) + F),$$
where F is as in Lemma A.2, and it can also be expressed as
$$Y^L + W^S = UV^* + W + P_T D,$$
where W and $P_T D$ are as in this lemma as well.
Construction of the Dual Certificate $Y^L$. We use the golfing scheme to construct $Y^L$. Think of $\Gamma \sim \mathrm{Ber}(\rho_0)$ with $\rho_0 = p_0(1-\tau)$ as $\cup_{1\le j\le j_0}\Gamma_j$, where the sets $\Gamma_j \sim \mathrm{Ber}(q)$ are independent, and q obeys $\rho_0 = 1 - (1-q)^{j_0}$. Here, we take $j_0 = 3\log n$, and observe that $q \ge \rho_0/j_0$ as before. Then, starting with $Y_0 = 0$, inductively define
$$Y_j = Y_{j-1} + q^{-1} P_{\Gamma_j} P_T(UV^* - Y_{j-1}),$$
and set
$$Y^L = Y_{j_0} = \sum_j q^{-1} P_{\Gamma_j} Z_{j-1}, \qquad Z_j = (P_T - q^{-1} P_T P_{\Gamma_j} P_T) Z_{j-1}. \qquad (A.4)$$
By construction, $P_{\Gamma^\perp} Y^L = 0$. Now, just as in Section 3.2, because q is sufficiently large, $\|Z_j\|_\infty \le e^{-j}\|UV^*\|_\infty$ and $\|Z_j\|_F \le e^{-j}\sqrt{r}$, both inequalities holding with large probability.
The proof is now identical to that in (2.5). First, the same steps show that
$$\|P_{T^\perp} Y^L\| \le C\sqrt{\frac{n\log n}{q}}\,\|UV^*\|_\infty \le C'\sqrt{\frac{\mu r(\log n)^2}{n\rho_0}}.$$
Whenever $\rho_0 \ge C_0\,\mu r(\log n)^2/n$ for a sufficiently large value of the constant $C_0$ (which is possible provided that $\rho_r$ in (1.6) is sufficiently small), this term obeys $\|P_{T^\perp} Y^L\| \le 1/4$ as required. Second,
$$\|P_T Y^L - UV^*\|_F = \|Z_{j_0}\|_F \le e^{-3\log n}\sqrt{r} \le n^{-2}.$$
And third, the same steps give
$$\|Y^L\|_\infty \le q^{-1}\|UV^*\|_\infty \sum_j e^{-j} \le 3(1 - e^{-1})^{-1}\sqrt{\frac{\mu r(\log n)^2}{\rho_0^2 n^2}}.$$
Now it suffices to bound the right-hand side by $\lambda/4 = \frac14\sqrt{\frac{1-\tau}{n\rho_0}}$. This is automatic when $\rho_0 \ge C_0\,\mu r(\log n)^2/n$ whenever $C_0$ is sufficiently large and, thus, the situation is as before.
In conclusion, we have established that $Y^L$ obeys (A.3) with high probability.


Construction of the Dual Certificate $W^S$. We first establish that with high probability,
$$\|P_T P_\Omega\| \le \sqrt{\tau' p_0}, \qquad \tau' = \tau + \tau_0, \qquad (A.5)$$
where $\tau_0(\tau)$ is a continuous function of τ approaching zero when τ approaches zero. In other words, the parameter τ′ may be made an arbitrarily small constant by selecting τ small enough. This claim is a straightforward application of Corollary 2.7. We also have
$$\|P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega\| \le 2\tau' \qquad (A.6)$$
with high probability. This second claim uses the identity
$$P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega = P_\Omega P_T (P_T P_{\Omega_{\mathrm{obs}}} P_T)^{-1} P_T P_\Omega.$$
This is well defined since the restriction of $P_T P_{\Omega_{\mathrm{obs}}} P_T$ to T is invertible. Indeed, Theorem 2.6 gives $P_T P_{\Omega_{\mathrm{obs}}} P_T \ge \frac{p_0}{2} P_T$ and, therefore, $\|(P_T P_{\Omega_{\mathrm{obs}}} P_T)^{-1}\| \le 2p_0^{-1}$. Hence,
$$\|P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega\| \le 2p_0^{-1}\|P_\Omega P_T\|^2,$$
and (A.6) follows from (A.5).
Setting $E = \mathrm{sgn}(S_0)$, this allows us to define $W^S$ via
$$W^S = \lambda(I - P_{(T+\Omega_{\mathrm{obs}}^\perp)})(P_\Omega - P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega)^{-1} E := (I - P_{(T+\Omega_{\mathrm{obs}}^\perp)})(W_0^S + W_1^S),$$
where $W_0^S = \lambda E$ and $W_1^S = \lambda R(E)$ with $R = \sum_{k\ge 1}(P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega)^k$. The operator R is self-adjoint and obeys $\|R\| \le \frac{2\tau'}{1-2\tau'}$ with high probability. By construction, $P_T W^S = P_{\Omega_{\mathrm{obs}}^\perp} W^S = 0$ and $P_\Omega W^S = \lambda\,\mathrm{sgn}(S_0)$. It remains to check that both events $\|W^S\| \le 1/4$ and $\|P_\Gamma W^S\|_\infty \le \lambda/4$ hold with high probability.
Control of $\|W^S\|$. For the first term, we have $\|(I - P_{(T+\Omega_{\mathrm{obs}}^\perp)})W_0^S\| \le \|W_0^S\| = \lambda\|E\|$. Because the entries of E are i.i.d. and take the value ±1 each with probability $p_0\tau/2$, and the value 0 with probability $1 - p_0\tau$, standard arguments give
$$\|E\| \le 4\sqrt{np_0(\tau+\tau_0)}$$
with large probability. Since $\lambda = 1/\sqrt{p_0 n}$, $\|W_0^S\| \le 4\sqrt{\tau+\tau_0} < 1/8$ with high probability, provided τ is small enough.
For the second term, $\|(I - P_{(T+\Omega_{\mathrm{obs}}^\perp)})W_1^S\| \le \lambda\|R(E)\|$, and the same covering argument as before gives
$$P\bigl(\lambda\|R(E)\| > t\bigr) \le 2\times 6^{2n}\exp\left(-\frac{t^2}{8\lambda^2\sigma^2}\right) + P(\|R\| \ge \sigma).$$
Since $\lambda = 1/\sqrt{np_0}$, this shows that $\|W^S\| \le 1/4$ with high probability, since one can always choose σ, or equivalently $\tau' = \tau + \tau_0$, sufficiently small.
Control of $\|P_\Gamma W^S\|_\infty$. For $(i, j) \in \Gamma$, we have
$$W^S_{ij} = \langle e_i e_j^*, W^S\rangle = \lambda\langle X(i, j), E\rangle,$$
where
$$X(i, j) = (P_\Omega - P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega)^{-1} P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)^\perp}(e_i e_j^*).$$

The same strategy as before gives
$$P\Bigl(\sup_{(i,j)\in\Gamma}|W^S_{ij}| > \frac{\lambda}{4}\Bigr) \le 2n^2\exp\left(-\frac{1}{8\sigma^2}\right) + P\Bigl(\sup_{(i,j)\in\Gamma}\|X(i, j)\|_F > \sigma\Bigr).$$
It remains to control the Frobenius norm of X(i, j). To do this, we use the identity
$$P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)^\perp}(e_i e_j^*) = P_\Omega P_T(P_T P_{\Omega_{\mathrm{obs}}} P_T)^{-1} P_T(e_i e_j^*),$$
which gives
$$\|P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)^\perp}(e_i e_j^*)\|_F \le \sqrt{\frac{4\tau'}{p_0}}\,\|P_T(e_i e_j^*)\|_F \le \sqrt{\frac{8\mu r\tau'}{np_0}}$$
with high probability. This follows from the fact that $\|(P_T P_{\Omega_{\mathrm{obs}}} P_T)^{-1}\| \le 2p_0^{-1}$ and $\|P_\Omega P_T\| \le \sqrt{p_0\tau'}$ as we have already seen. Since we also have $\|(P_\Omega - P_\Omega P_{(T+\Omega_{\mathrm{obs}}^\perp)} P_\Omega)^{-1}\| \le \frac{1}{1-2\tau'}$ with high probability,
$$\sup_{(i,j)\in\Gamma}\|X(i, j)\|_F \le \frac{1}{1-2\tau'}\sqrt{\frac{8\mu r\tau'}{np_0}}.$$
This shows that $\|P_\Gamma W^S\|_\infty \le \lambda/4$ if τ′, or equivalently τ, is sufficiently small.

ACKNOWLEDGMENTS
E. C. would like to thank Deanna Needell for comments on an earlier version of this manuscript. We would
also like to thank Zhouchen Lin (MSRA) and Xiaoming Yuan (Hong Kong Baptist University) for their help
with the ALM algorithm, and Hossein Mobahi (UIUC) for his help with some of the simulations. Finally, we
would like to thank one anonymous reviewer for very useful suggestions on improving the presentation.

REFERENCES
BASRI, R., AND JACOBS, D. 2003. Lambertian reflectance and linear subspaces. IEEE Trans. Patt. Anal. Mach.
Intel. 25, 2, 218–233.
BECK, A., AND TEBOULLE, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse prob-
lems. SIAM J. Imag. Sci. 2, 1, 183–202.
BECKER, S., BOBIN, J., AND CANDÈS, E. J. 2011. NESTA: A fast and accurate first-order method for sparse
recovery. SIAM J. Imag. Sci. 4, 1, 1–39.
BELKIN, M., AND NIYOGI, P. 2003. Laplacian eigenmaps for dimensionality reduction and data representation.
Neural Comput. 15, 6, 1373–1396.
BERTSEKAS, D. 1982. Constrained Optimization and Lagrange Multiplier Method. Academic Press.
CAI, J., CANDÈS, E. J., AND SHEN, Z. 2010. A singular value thresholding algorithm for matrix completion.
SIAM J. Optimiz. 20, 4, 1956–1982.
CANDÈS, E. J., AND PLAN, Y. 2010. Matrix completion with noise. Proc. IEEE 98, 6, 925–936.
CANDÈS, E. J., AND RECHT, B. 2009. Exact matrix completion via convex optimzation. Found. Comput. Math. 9,
717–772.
CANDÈS, E. J., ROMBERG, J., AND TAO, T. 2006. Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inform. Theory 52, 2, 489–509.
CANDÈS, E. J., AND TAO, T. 2010. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans.
Inf. Theory 56, 5, 2053–2080.
CEVHER, V., SANKARANARAYANAN, A., DUARTE, M., REDDY, D., BARANIUK, R., AND CHELLAPPA, R. 2009. Compressive
sensing for background subtraction. In Proceedings of the European Conference on Computer Vision
(ECCV).
CHANDRASEKARAN, V., SANGHAVI, S., PARRILO, P., AND WILLSKY, A. 2009. Rank-sparsity incoherence for matrix
decomposition. Siam J. Optim., to appear http://arxiv.org/abs/0906.2220.
CHEN, S., DONOHO, D., AND SAUNDERS, M. 2001. Atomic decomposition by basis pursuit. SIAM Rev. 43, 1,
129–159.


DEWESTER, S., DUMAINS, S., LANDAUER, T., FURNAS, G., AND HARSHMAN, R. 1990. Indexing by latent semantic
analysis. J. Soc. Inf. Sci. 41, 6, 391–407.
ECKART, C., AND YOUNG, G. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1,
211–218.
FAZEL, M., HINDI, H., AND BOYD, S. 2003. Log-det heuristic for matrix rank minimization with applications to
Hankel and Euclidean distance matrices. In Proceedings of the American Control Conference 2156–2162.
FISCHLER, M., AND BOLLES, R. 1981. Random sample consensus: A paradigm for model fitting with applications
to image analysis and automated cartography. Comm. ACM 24, 381–385.
GEORGHIADES, A., BELHUMEUR, P., AND KRIEGMAN, D. 2001. From few to many: Illumination cone models for face
recognition under variable lighting and pose. IEEE Trans. Patt. Anal. Mach. Intel. 23, 6.
GNANADESIKAN, R., AND KETTENRING, J. 1972. Robust estimates, residuals, and outlier detection with multire-
sponse data. Biometrics 28, 81–124.
GOLDFARB, D., AND MA, S. 2009. Convergence of fixed point continuation algorithms for matrix rank mini-
mization. http://arxiv.org/abs/0906.3499.
GRANT, M., AND BOYD, S. 2009. CVX: Matlab software for disciplined convex programming (web page and
software). http://stanford.edu/∼boyd/cvx.
GROSS, D. 2011. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57, 3,
1548–1566.
GROSS, D., LIU, Y.-K., FLAMMIA, S. T., BECKER, S., AND EISERT, J. 2010. Quantum state tomography via compressed
sensing. Phys. Rev. Lett. 105, 15.
HEY, T., TANSLEY, S., AND TOLLE, K. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft
Research.
HOTELLING, H. 1933. Analysis of a complex of statistical variables into principal components. J. Educat.
Psych. 24, 417–441.
HUBER, P. 1981. Robust Statistics. Wiley.
JOLLIFFE, I. 1986. Principal Component Analysis. Springer-Verlag.
KE, Q., AND KANADE, T. 2005. Robust ℓ1-norm factorization in the presence of outliers and missing data. In
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition.
KESHAVAN, R. H., MONTANARI, A., AND OH, S. 2010. Matrix completion from a few entries. IEEE Trans. Inf.
Theory 56, 6, 2980–2998.
KONTOGIORGIS, S., AND MEYER, R. 1989. A variable-penalty alternating direction method for convex optimiza-
tion. Math. Prog. 83, 29–53.
LEDOUX, M. 2001. The Concentration of Measure Phenomenon. American Mathematical Society.
LI, L., HUANG, W., GU, I., AND TIAN, Q. 2004. Statistical modeling of complex backgrounds for foreground object
detection. IEEE Trans. Image Proc. 13, 11, 1459–1472.
LIN, Z., CHEN, M., AND MA, Y. 2009a. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. http://arxiv.org/abs/1009.5055.
LIN, Z., GANESH, A., WRIGHT, J., WU, L., CHEN, M., AND MA, Y. 2009b. Fast convex optimization algorithms
for exact recovery of a corrupted low-rank matrix. In Proceedings of the Symposium on Computational
Advances in Multi-Sensor Adaptive Processing (CAMSAP).
LIONS, P., AND MERCIER, B. 1979. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer.
Anal. 16, 6, 964–979.
LIU, Z., AND VANDENBERGHE, L. 2009. Interior-point method for nuclear norm approximation with application
to system identification. SIAM J. Matrix Anal. Appl. 31, 3, 1235–1256.
MA, S., GOLDFARB, D., AND CHEN, L. 2009. Fixed point and Bregman iterative methods for matrix rank
minimization. Math. Prog. Ser. A. DOI 10.1007/s10107-009-0306-5.
NESTEROV, Y. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet
Math. Dokl. 27, 2, 372–376.
NESTEROV, Y. 2005. Smooth minimization of non-smooth functions. Math. Prog. 103, 1.
NESTEROV, Y. 2007. Gradient methods for minimizing composite objective functions. Tech. rep. - CORE -
Universite Catholique de Louvain.
NETFLIX, INC. The Netflix prize. http://www.netflixprize.com/.
OSHER, S., BURGER, M., GOLDFARB, D., XU, J., AND YIN, W. 2005. An iterative regularization method for total
variation-based image restoration. Multi. Model. Simul. 4, 460–489.
PAPADIMITRIOU, C., RAGHAVAN, P., TAMAKI, H., AND VEMPALA, S. 2000. Latent semantic indexing, a probabilistic
analysis. J. Comput. Syst. Sci. 61, 2, 217–235.


RECHT, B. 2009. A simpler approach to matrix completion. CoRR abs/0910.0651.


RECHT, B., FAZEL, M., AND PARILLO, P. 2010. Guaranteed minimum rank solution of matrix equations via
nuclear norm minimization. SIAM Rev. 52, 471–501.
STAUFFER, C., AND GRIMSON, E. 1999. Adaptive background mixture models for real-time tracking. In Proceed-
ings of the IEEE International Conference on Computer Vision and Pattern Recognition.
TENENBAUM, J., DE SILVA, V., AND LANGFORD, J. 2000. A global geometric framework for nonlinear dimensionality
reduction. Science 290, 5500, 2319–2323.
TOH, K. C., AND YUN, S. 2010. An accelerated proximal gradient algorithm for nuclear norm regularized least
squares problems. Pac. J. Optim. 6, 615–640.
TORRE, F. D. L., AND BLACK, M. 2003. A framework for robust subspace learning. Int. J. Comput. Vis. 54,
117–142.
VERSHYNIN, R. 2011. Introduction to the non-asymptotic analysis of random matrices. http://www-
personal.umich.edu/˜romanv/papers/non-asymptotic-rmt-plain.pdf.
YIN, W., HALE, E., AND ZHANG, Y. 2008a. Fixed-point continuation for ℓ1-minimization: Methodology and
convergence. SIAM J. Optimiz. 19, 3, 1107–1130.
YIN, W., OSHER, S., GOLDFARB, D., AND DARBON, J. 2008b. Bregman iterative algorithms for ℓ1-minimization
with applications to compressed sensing. SIAM J. Imag. Sci. 1, 1, 143–168.
YUAN, X., AND YANG, J. 2009. Sparse and low-rank matrix decomposition via alternating direction methods.
http://www.optimization-online.org/08 HTML/2009/11/2447.html.
ZHOU, Z., WAGNER, A., MOBAHI, H., WRIGHT, J., AND MA, Y. 2009. Face recognition with contiguous occlusion
using Markov random fields. In Proceedings of the International Conference on Computer Vision (ICCV).

Received January 2010; revised February 2011; accepted February 2011

