
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 12, DECEMBER 2014

Bayesian Nonparametric Dictionary Learning for Compressed Sensing MRI
Yue Huang, John Paisley, Qin Lin, Xinghao Ding, Xueyang Fu, and Xiao-Ping Zhang, Senior Member, IEEE

Abstract—We develop a Bayesian nonparametric model for reconstructing magnetic resonance images (MRIs) from highly undersampled k-space data. We perform dictionary learning as part of the image reconstruction process. To this end, we use the beta process as a nonparametric dictionary learning prior for representing an image patch as a sparse combination of dictionary elements. The size of the dictionary and patch-specific sparsity pattern are inferred from the data, in addition to other dictionary learning variables. Dictionary learning is performed directly on the compressed image, and so is tailored to the MRI being considered. In addition, we investigate a total variation penalty term in combination with the dictionary learning model, and show how the denoising property of dictionary learning removes dependence on regularization parameters in the noisy setting. We derive a stochastic optimization algorithm based on Markov chain Monte Carlo for the Bayesian model, and use the alternating direction method of multipliers for efficiently performing total variation minimization. We present empirical results on several MRI, which show that the proposed regularization framework can improve reconstruction accuracy over other methods.

Index Terms—Compressed sensing, magnetic resonance imaging, Bayesian nonparametrics, dictionary learning.

Manuscript received August 21, 2013; revised May 2, 2014 and July 25, 2014; accepted September 18, 2014. Date of publication September 24, 2014; date of current version October 13, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 30900328, Grant 61172179, Grant 61103121, and Grant 81301278, in part by the Fundamental Research Funds for the Central Universities under Grant 2011121051 and Grant 2013121023, and in part by the Natural Science Foundation of Fujian Province, China, under Grant 2012J05160. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ali Bilgin. (Corresponding author: Xinghao Ding.)
Y. Huang, Q. Lin, X. Ding, and X. Fu are with the Department of Communications Engineering, Xiamen University, Xiamen 361005, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
J. Paisley is with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA (e-mail: [email protected]).
X.-P. Zhang is with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2360122

I. INTRODUCTION

MAGNETIC resonance imaging (MRI) is a widely used technique for visualizing the structure and functioning of the body. A limitation of MRI is its slow scan speed during data acquisition. Therefore, methods for accelerating the MRI process have been heavily researched. Recent advances in signal reconstruction from measurements sampled below the Nyquist rate, called compressed sensing (CS) [1], [2], have had a major impact on MRI [3]. CS-MRI allows for significant undersampling in the Fourier measurement domain of MR images (called k-space), while still outputting a high-quality image reconstruction. While image reconstruction using this undersampled data is a case of an ill-posed inverse problem, compressed sensing theory has shown that it is possible to reconstruct a signal from significantly fewer measurements than mandated by traditional Nyquist sampling if the signal is sparse in a particular transform domain.

Motivated by the need to find a sparse domain for representing the MR signal, a large body of literature now exists on reconstructing MRI from significantly undersampled k-space data. Existing improvements in CS-MRI mostly focus on (i) seeking sparse domains for the image, such as contourlets [4], [5]; (ii) using approximations of the ℓ0 norm for better reconstruction performance with fewer measurements, for example ℓ1, FOCUSS, ℓp quasi-norms with 0 < p < 1, or using smooth functions to approximate the ℓ0 norm [6], [7]; and (iii) accelerating image reconstruction through more efficient optimization techniques [8], [10], [29]. In this paper we present a modeling framework that is similarly motivated.

CS-MRI reconstruction algorithms tend to fall into two categories: Those which enforce sparsity directly within some image transform domain [3]–[8], [10]–[12], and those which enforce sparsity in some underlying latent representation of the image, such as an adaptive dictionary-based representation [9], [14]. Most CS-MRI reconstruction algorithms belong to the first category. For example Sparse MRI [3], the leading study in CS-MRI, performs MR image reconstruction by enforcing sparsity in both the wavelet domain and the total variation (TV) of the reconstructed image. Algorithms with image-level sparsity constraints such as Sparse MRI typically employ an off-the-shelf basis, which can usually capture only one feature of the image. For example, wavelets recover point-like features, while contourlets recover curve-like features. Since MR images contain a variety of underlying features, such as edges and textures, using a basis not adapted to the image can be considered a drawback of these algorithms.

Finding a sparse basis that is suited to the image at hand can benefit MR image reconstruction, since CS theory shows that the required number of measurements is linked to the sparsity of the signal in the selected transform domain. Using a standard basis not adapted to the image under consideration will likely not provide a representation that can compete in sparsity with an adapted basis. To this end, dictionary learning,

which falls in the second group of algorithms, learns a sparse basis on image subregions called patches that is adapted to the image class of interest. Recent studies in the image processing literature have shown that dictionary learning is an effective means for finding a sparse, patch-level representation of an image [19], [20], [25]. These algorithms learn a patch-level dictionary by exploiting structural similarities between patches extracted from images within a class of interest. Among these approaches, adaptive dictionary learning—where the dictionary is learned directly from the image being considered—based on patch-level sparsity constraints usually outperforms analytical dictionary approaches in denoising, super-resolution reconstruction, interpolation, inpainting, classification and other applications, since the adaptively learned dictionary suits the signal of interest [19]–[22].

Dictionary learning has previously been applied to CS-MRI to learn a sparse basis for reconstruction, see [14]. With these methods, parameters such as the dictionary size and patch sparsity are preset, and algorithms are considered that are non-Bayesian. In this paper, we consider a new dictionary learning algorithm for CS-MRI that is motivated by Bayesian nonparametric statistics. Specifically, we consider a nonparametric dictionary learning model called BPFA [23] that uses the beta process to learn the sparse representation necessary for CS-MRI reconstruction. The beta process is an effective prior for nonparametric learning of latent factor models; in this case the latent factors correspond to dictionary elements. While the dictionary size is therefore infinite in principle, through posterior inference the beta process learns a suitably compact dictionary in which the signal can be sparsely represented.

We organize the paper as follows. In Section II we review CS-MRI inversion methods and the beta process for dictionary learning. In Section III, we describe the proposed regularization framework and algorithm. We derive a Markov Chain Monte Carlo (MCMC) sampling algorithm for stochastic optimization of the dictionary variables in the objective function. In addition, we consider including a sparse total variation (TV) penalty, for which we perform efficient optimization using the alternating direction method of multipliers (ADMM). We then show the advantages of the proposed Bayesian nonparametric regularization framework on several CS-MRI problems in Section IV.

II. BACKGROUND AND RELATED WORK

We use the following notation: Let x ∈ C^N be a √N × √N MR image in vectorized form. Let F_u ∈ C^{u×N}, u ≪ N, be the undersampled Fourier encoding matrix and y = F_u x ∈ C^u represent the sub-sampled set of k-space measurements. The goal is to estimate x from the small fraction of k-space measurements y. For dictionary learning, let R_i be the i-th patch extraction matrix. That is, R_i is a P × N matrix of all zeros except for a one in each row that extracts a vectorized √P × √P patch from the image, R_i x ∈ C^P for i = 1, ..., N. We use overlapping image patches with a shift of one pixel and allow a patch to wrap around the image at the boundaries for mathematical convenience [15], [22]. All norms are extended to complex vectors when necessary, ‖a‖_p = (Σ_i |a_i|^p)^{1/p}, where |a_i| is the modulus of the complex number a_i.

A. Two Approaches to CS-MRI Inversion

We focus on single-channel CS-MRI inversion via optimizing an unconstrained function of the form

    arg min_x  h(x) + (λ/2) ‖F_u x − y‖_2^2,      (1)

where ‖F_u x − y‖_2^2 is a data fidelity term, λ > 0 is a parameter and h(x) is a regularization function that controls properties of the image we want to reconstruct. As discussed in the introduction, the function h can take several forms, but tends to fall into one of two categories according to whether image-level or patch-level information is considered. We next review these two approaches.

1) Image-Level Sparse Regularization: CS-MRI with an image-level, or global regularization function h_g(x) is one in which sparsity is enforced within a transform domain defined on the entire image. For example, in Sparse MRI [3] the regularization function is

    h_g(x) = ‖Wx‖_1 + μ TV(x),      (2)

where W is the wavelet basis and TV(x) is the total variation (spatial finite differences) of the image. Regularizing with this function requires that the image be sparse in the wavelet domain, as measured by the ℓ1 norm of the wavelet coefficients ‖Wx‖_1, which acts as a surrogate for ℓ0 [1], [2]. The total variation term enforces homogeneity within the image by encouraging neighboring pixels to have similar values while allowing for sudden high frequency jumps at edges. The parameter μ > 0 controls the trade-off between the two terms. A variety of other image-level regularization approaches have been proposed along these lines, see [4], [5], [7].

2) Patch-Level Sparse Regularization: An alternative to the image-level sparsity constraint h_g(x) is a patch-level, or local regularization function h_l(x), which enforces that patches (square sub-regions of the image) have a sparse representation according to a dictionary. One possible general form of such a regularization function is

    h_l(x) = Σ_{i=1}^N (γ/2) ‖R_i x − Dα_i‖_2^2 + f(α_i, D),      (3)

where the dictionary matrix is D ∈ C^{P×K} and α_i is a K-dimensional vector in R^K. An important difference between h_l(x) and h_g(x) is the additional function f(α_i, D). While image-level sparsity constraints fall within a predefined transform domain, such as the wavelet basis, the sparse dictionary domain can be unknown for patch-level regularization and learned from data. The function f enforces sparsity by learning a D for which α_i is sparse.¹ For example, [9] uses K-SVD to learn D off-line, and then approximately optimize the objective function

    arg min_{α_{1:N}}  Σ_{i=1}^N ‖R_i x − Dα_i‖_2^2   subject to  ‖α_i‖_0 ≤ T, ∀i,      (4)

using orthogonal matching pursuits (OMP) [21].

¹ The dependence of h_l(x) on α and D is implied in our notation.
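To make the notation above concrete, the following NumPy sketch shows the two linear operators that appear throughout: the undersampled Fourier encoding F_u, implemented as a 2D FFT followed by a binary k-space mask, and the patch extraction R_i with the one-pixel shift and wrap-around boundary described in Section II. The image, mask, and function names are illustrative placeholders and not the authors' implementation.

```python
import numpy as np

def fourier_undersample(x_img, mask):
    """F_u x: 2D FFT of the image, keeping only the masked k-space locations."""
    k_full = np.fft.fft2(x_img, norm="ortho")   # full k-space
    return k_full[mask]                         # sub-sampled measurements y

def extract_patch(x_img, i, j, p=6):
    """R_i x: vectorized p-by-p patch whose upper-left corner is pixel (i, j),
    wrapping around the image boundaries (periodic extension)."""
    rows = np.arange(i, i + p) % x_img.shape[0]
    cols = np.arange(j, j + p) % x_img.shape[1]
    return x_img[np.ix_(rows, cols)].reshape(-1)

# toy example (illustrative sizes only)
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))               # stand-in for a real MR image
mask = rng.random((64, 64)) < 0.25              # roughly 25% random k-space sampling
y = fourier_undersample(x, mask)                # y = F_u x
patch = extract_patch(x, 10, 20)                # one 36-dimensional patch R_i x
print(y.shape, patch.shape)
```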
HUANG et al.: BAYESIAN NONPARAMETRIC DICTIONARY 5009

In this case, the ℓ0 penalty on the additional parameters α_i makes this a non-convex problem. Using this definition of h_l(x) in (1), a local optimal solution can be found by an alternating minimization procedure [32]: First solve the least squares solution for x using the current values of α_i and D, and then update α_i and D, or only α_i if D is learned off-line.

B. Dictionary Learning With Beta Process Factor Analysis

Typical dictionary learning approaches require a predefined dictionary size and, for each patch, the setting of either a sparsity level T, or an error threshold ε to determine how many dictionary elements are used. In both cases, if the settings do not agree with ground truth, the performance can significantly degrade. Instead, we consider a Bayesian nonparametric method called beta process factor analysis (BPFA) [23], which has been shown to successfully infer both of these values, as well as have competitive performance with algorithms in several application areas [23]–[26], and see [33]–[36] for related algorithms. The beta process is driven by an underlying Poisson process, and so its properties as a Bayesian nonparametric prior are well understood [27]. Originally used for survival analysis in the statistics literature, its use for latent factor modeling has been significantly increasing within the machine learning community [23]–[26], [28], [33]–[36].

Algorithm 1: Dictionary Learning With BPFA (hierarchical prior structure of the BPFA model).

1) Generative Model: We give the original hierarchical prior structure of the BPFA model in Algorithm 1, extending this to complex-valued dictionaries in Section III-A. With this approach, the model constructs a dictionary matrix D ∈ R^{P×K} (C^{P×K} below) of i.i.d. random variables, and assigns probability π_k to vector d_k. The parameters for these probabilities are set such that most of the π_k are expected to be small, with a few large. In Algorithm 1 we use an approximation to the beta process.² Under this parameterization, each patch R_i x extracted from the image x is modeled as a sparse weighted combination of the dictionary elements, as determined by the element-wise product of z_i ∈ {0, 1}^K with the Gaussian vector s_i. What makes the model nonparametric is that for many values of k, the values of z_ik will equal zero for all i since π_k will be very small; the model learns the number of these unused dictionary elements and their index values from the data. Therefore, the value of K should be set to a large number that is more than the expected size of the dictionary. It can be shown that, under the assumptions of this prior, in the limit K → ∞, the number of dictionary elements used by a patch is Poisson(γ) distributed and the total number of dictionary elements used by the data grows like cγ ln N, where N is the number of patches [28]. The parameters of the model include c, γ, e_0, f_0, g_0, h_0 and K; we discuss setting these values in Section IV.

² For a finite c > 0 and γ > 0, the random measure H = Σ_{k=1}^K π_k δ_{d_k} converges weakly to a beta process as K → ∞ [24], [27].

2) Relationship to K-SVD: Another widely used dictionary learning method is K-SVD [20]. Though they are models for the same problem, BPFA and K-SVD have some significant differences that we briefly discuss. K-SVD learns the sparsity pattern of the coding vector α_i using the OMP algorithm [21] for each i. Holding the sparsity pattern fixed, it then updates each dictionary element and dimension of α jointly by a rank one approximation to the residual. Unlike BPFA, it learns as many dictionary elements as are given to it, so K should be set wisely. BPFA on the other hand automatically prunes unneeded elements, and updates the sparsity pattern by using the posterior distribution of a Bernoulli process, which is significantly different from OMP. It updates the weights and the dictionary from their Gaussian posteriors as well. Because of this probabilistic structure, we derive a sampling algorithm for these variables that takes advantage of marginalization, and naturally learns the auxiliary variables γ_ε and γ_s.

3) Example Denoising Problem: As we will see, the relationship of dictionary learning to CS-MRI is essentially as a denoising step. To this end, we briefly illustrate BPFA on a denoising problem. Denoising of an image using dictionary learning proceeds by first learning the dictionary representation of each patch, R_i x ≈ Dα_i. The denoised reconstruction of x using BPFA is then x_BPFA = (1/P) Σ_i R_i^T Dα_i.

We show an example using 6 × 6 patches extracted from the noisy 512 × 512 image shown in Fig. 1(a). In Fig. 1(b) we show the resulting denoised image. For this problem we truncated the dictionary size to K = 108 and set all other model parameters to one. In Figs. 1(c) and 1(d) we show some statistics from dictionary learning. For example, Fig. 1(c) shows the values of π_k sorted, where we see that fewer than 100 elements are used by the data, many of which are very sparsely used. Fig. 1(d) shows the empirical distribution of the number of elements used per patch. We see the ability of the model to adapt the sparsity to the complexity of the patch. In Table I we show PSNR results for three noise variance levels. For K-SVD, we consider the case when the error parameter matches the ground truth, and when it mismatches it by a magnitude of five. As expected, when K-SVD does not have an appropriate parameter setting the performance suffers. BPFA on the other hand adaptively infers this value, which helps improve the denoising.
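To make the generative structure of Section II-B concrete, here is a minimal NumPy sketch of a finite approximation to the beta process prior and the patch-wise likelihood, together with the averaged patch reconstruction x_BPFA used for denoising. The dictionary prior N(0, P^{-1}I), the Beta(cγ/K, c(1−γ/K)) parameterization, and all variable names are our own illustrative assumptions consistent with footnote 2 and the posterior updates in Section III; this is not a verbatim reproduction of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, N = 36, 108, 500          # patch dim (6x6), dictionary truncation, number of patches
c, gamma = 1.0, 1.0             # beta process parameters (set to 1 as in Sec. IV)
e0 = f0 = g0 = h0 = 1.0         # gamma prior hyperparameters

# draw from the (assumed) finite approximation to the beta process prior
d = rng.normal(0.0, 1.0 / np.sqrt(P), size=(P, K))           # dictionary columns d_k
pi = rng.beta(c * gamma / K, c * (1.0 - gamma / K), size=K)  # usage probabilities pi_k
gamma_s = rng.gamma(e0, 1.0 / f0)                            # weight precision
gamma_eps = rng.gamma(g0, 1.0 / h0)                          # noise precision

z = rng.random((K, N)) < pi[:, None]                         # z_ik ~ Bernoulli(pi_k)
s = rng.normal(0.0, 1.0 / np.sqrt(gamma_s), size=(K, N))     # s_ik ~ N(0, 1/gamma_s)
alpha = s * z                                                # alpha_i = s_i o z_i
noise = rng.normal(0.0, 1.0 / np.sqrt(gamma_eps), size=(P, N))
X = d @ alpha + noise                                        # each patch R_i x ~ D alpha_i + noise

# denoised patches; a full implementation scatters D @ alpha_i back into the
# image with R_i^T and divides by P, giving x_BPFA = (1/P) sum_i R_i^T D alpha_i
X_denoised = d @ alpha
print(X.shape, X_denoised.shape, z.sum(axis=0).mean(), "active elements per patch on average")
```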

Fig. 1. (a)-(b) An example of denoising by BPFA (image scaled to [0,1]). (c) Shows the final probabilities of the dictionary elements and (d) shows a distribution on the number of dictionary elements used per patch.

TABLE I: Peak signal-to-noise ratio (PSNR) for image denoised by BPFA, compared with K-SVD using correct (match) and incorrect (mismatch) noise parameter.

III. CS-MRI WITH BPFA AND TV PENALTY

We next present our approach for reconstructing single-channel MR images from highly undersampled k-space data. In reference to the discussion in Section II, we consider a sparsity constraint of the form

    arg min_{x,ϕ}  λ_g h_g(x) + h_l(x) + (λ/2)‖F_u x − y‖_2^2,
    h_g(x) := TV(x),   h_l(x) := Σ_{i=1}^N (γ_ε/2)‖R_i x − Dα_i‖_2^2 + f(ϕ_i).      (5)

For the local regularization function h_l(x) we use BPFA as given in Algorithm 1 in Section II-B. The parameters to be optimized for this penalty are contained in the set ϕ_i = {D, s_i, z_i, γ_ε, γ_s, π}, and are defined in Algorithm 1. We note that only s_i and z_i vary in i, while the rest are shared by all patches. The regularization term γ_ε is a model variable that corresponds to an inverse variance parameter of the multivariate Gaussian likelihood. This likelihood is equivalently viewed as the squared error penalty term in h_l(x) in (5). This term acts as the sparse basis for the image and also aids in producing a denoised reconstruction, as discussed in Sections II-B, III-B and IV-B. For the global regularization function h_g(x) we use the total variation of the image. This term encourages homogeneity within contiguous regions of the image, while still allowing for sharp jumps in pixel value at edges due to the underlying ℓ1 penalty. The regularization parameters λ_g, γ_ε and λ control the trade-off between the terms in this optimization. Since we sample a new value of γ_ε with each iteration of the algorithm discussed shortly, this trade-off is adaptively changing.

For the total variation penalty TV(x) we use the isotropic TV model. Let ψ_i be the 2 × N difference operator for pixel i. Each row of ψ_i contains a 1 centered on pixel i; the first row also has a −1 on the pixel directly above pixel i, while the second has a −1 corresponding to the pixel to the right, and zeros elsewhere. Let Ψ = [ψ_1^T, ..., ψ_N^T]^T be the resulting 2N × N difference matrix for the entire image. The TV coefficients are β = Ψx ∈ C^{2N}, and the isotropic TV penalty is TV(x) = Σ_i ‖ψ_i x‖_2 = Σ_i (|β_{2i−1}|^2 + |β_{2i}|^2)^{1/2}, where i ranges over the pixels in the MR image. For optimization we use the alternating direction method of multipliers (ADMM) [30], [31]. ADMM works by performing dual ascent on the augmented Lagrangian objective function introduced for the total variation coefficients. For completeness, we give a brief review of ADMM in the appendix.

A. Algorithm

We present an algorithm for finding a local optimal solution to the non-convex objective function given in (5). We can write this objective as

    L(x, ϕ) = λ_g Σ_i ‖ψ_i x‖_2 + Σ_i (γ_ε/2)‖R_i x − Dα_i‖_2^2 + Σ_i f(ϕ_i) + (λ/2)‖F_u x − y‖_2^2.      (6)

We seek to minimize this function with respect to x and the dictionary learning variables ϕ_i = {D, s_i, z_i, γ_ε, γ_s, π}.

Our first step is to put the objective into a more suitable form. We begin by defining the TV coefficients for the i-th pixel as β_i := [β_{2i−1}, β_{2i}]^T = ψ_i x. We introduce the vector of Lagrange multipliers η_i, and then split β_i from ψ_i x by relaxing the equality via an augmented Lagrangian. This results in the objective function

    L(x, β, η, ϕ) = Σ_{i=1}^N [ λ_g‖β_i‖_2 + η_i^T(ψ_i x − β_i) + (ρ/2)‖ψ_i x − β_i‖_2^2 ]
                    + Σ_{i=1}^N [ (γ_ε/2)‖R_i x − Dα_i‖_2^2 + f(ϕ_i) ] + (λ/2)‖F_u x − y‖_2^2.      (7)

From the ADMM theory [32], this objective will have (local) optimal values β_i^* and x^* with β_i^* = ψ_i x^*, and so the equality constraints will be satisfied [31].³

³ For a fixed D, α_{1:N} and x the solution is also globally optimal.

Optimizing this function can be split into three separate sub-problems: one for TV, one for BPFA and one for updating the reconstruction x. Following the discussion of ADMM in the appendix, we define u_i = (1/ρ)η_i and complete the square in the first line of (7). We then cycle through the following three sub-problems,

    (P1)  β_i = arg min_β  λ_g‖β‖_2 + (ρ/2)‖ψ_i x − β + u_i‖_2^2,
          u_i ← u_i + ψ_i x − β_i,   i = 1, ..., N,

    (P2)  ϕ = arg min_ϕ  Σ_i (γ_ε/2)‖R_i x − Dα_i‖_2^2 + f(ϕ_i),

    (P3)  x = arg min_x  Σ_i (ρ/2)‖ψ_i x − β_i + u_i‖_2^2 + Σ_i (γ_ε/2)‖R_i x − Dα_i‖_2^2 + (λ/2)‖F_u x − y‖_2^2.

Solutions for sub-problems P1 and P3 are globally optimal (conditioned on the most recent values of all other parameters). We cannot solve P2 analytically since the optimal values for the set of all BPFA variables do not have a closed form solution. Our approach for P2 is to use stochastic optimization by Gibbs sampling each variable of BPFA conditioned on current values of all other variables. We next present the updates for each sub-problem. We give an outline in Algorithm 2.

Algorithm 2: Outline of Algorithm (cycle through the sub-problems P1, P2 and P3 described below).

1) Algorithm for P1 (Total Variation): We can solve for β_i exactly for each pixel i = 1, ..., N by using a generalized shrinkage operation [31],

    β_i = max( ‖ψ_i x + u_i‖_2 − λ_g/ρ, 0 ) · (ψ_i x + u_i)/‖ψ_i x + u_i‖_2.      (8)

We recall that β_i corresponds to the 2D TV coefficients for pixel i, with differences in one direction vertically and horizontally. We then update the corresponding Lagrange multiplier, u_i ← u_i + ψ_i x − β_i.

2) Algorithm for P2 (BPFA): We update the parameters of BPFA using Gibbs sampling. We are therefore stochastically optimizing (7), but only for this sub-problem. With reference to Algorithm 1, the P2 sub-problem entails sampling new values for the complex dictionary D, the binary vectors z_i and real-valued weights s_i (with which we construct α_i = s_i ◦ z_i through the element-wise product), the precisions γ_ε and γ_s, and the probabilities π_{1:K}, with π_k giving the probability that z_ik = 1. In principle, there is no limit to the number of samples that can be made, with the final sample giving the updates used in the other sub-problems. We found that a single sample is sufficient in practice and leads to a faster algorithm. We describe the sampling procedure below.

a) Sample dictionary D: We define the P × N matrix X = [R_1 x, ..., R_N x], which is a complex matrix of all vectorized patches extracted from the image x. We also define the K × N matrix α = [α_1, ..., α_N] containing the dictionary weight coefficients for the corresponding columns in X such that Dα is an approximation of X to which we add noise from a circularly-symmetric complex normal distribution. The update for the dictionary D is

    D = Xα^T (αα^T + (P/γ_ε)I_K)^{−1} + E,      (9)
    E_{p,:} ~ CN(0, (γ_ε αα^T + P I_K)^{−1}),  independently for p = 1, ..., P,

where E_{p,:} is the p-th row of E. To sample this, we can first draw E_{p,:} from a multivariate Gaussian distribution with this covariance structure, followed by an i.i.d. uniform rotation of each value in the complex plane. We note that the first term in Equation (9) is the ℓ2-regularized least squares solution for D. The addition of correlated Gaussian noise in the complex plane generates the sample from the conditional posterior of D. Since both the number of pixels and γ_ε will tend to be very large, the variance of the noise is small and the mean term dominates the update for D.

b) Sample sparse coding α_i: Sampling α_i entails sampling s_ik and z_ik for each k. We sample these values using block sampling. We recall that to block sample two variables from their joint conditional posterior distribution, (s, z) ~ p(s, z | −), one can first sample z from the marginal distribution, z ~ p(z | −), and then sample s | z ~ p(s | z, −) from the conditional distribution. The other sampling direction is possible as well, but for our problem sampling z → s | z is more efficient for finding a mode of the objective function. We define r_{i,−k} to be the residual error in approximating the i-th patch with the current values from BPFA minus the k-th dictionary element, r_{i,−k} = R_i x − Σ_{j≠k} (s_ij z_ij) d_j. We then sample z_ik from its conditional posterior Bernoulli distribution z_ik ~ p_ik δ_1 + q_ik δ_0, where following a simplification,

    p_ik ∝ π_k (1 + (γ_ε/γ_s) d_k^H d_k)^{−1/2} exp( (γ_ε/2) Re(d_k^H r_{i,−k})^2 / (γ_s/γ_ε + d_k^H d_k) ),      (10)
    q_ik ∝ 1 − π_k.      (11)

The symbol H denotes the conjugate transpose. The probabilities can be obtained by dividing both of these terms by their sum. We observe that the probability that z_ik = 1 takes into account how well dictionary element d_k correlates with the residual r_{i,−k}. After sampling z_ik we sample the corresponding weight s_ik from its conditional posterior Gaussian distribution,

    s_ik | z_ik ~ N( z_ik Re(d_k^H r_{i,−k}) / (γ_s/γ_ε + d_k^H d_k),  1/(γ_s + γ_ε z_ik d_k^H d_k) ).      (12)

When z_ik = 1, the mean of s_ik is the regularized least squares solution and the variance will be small if γ_ε is large. When z_ik = 0, s_ik is sampled from the prior, but does not factor in the model in this case.

c) Sample γ_ε and γ_s: We next sample from the conditional gamma posterior distribution of the noise precision and weight precision,

    γ_ε ~ Gam( g_0 + (1/2)PN,  h_0 + (1/2) Σ_i ‖R_i x − Dα_i‖_2^2 ),      (13)
    γ_s ~ Gam( e_0 + (1/2) Σ_{i,k} z_ik,  f_0 + (1/2) Σ_{i,k} z_ik s_ik^2 ).      (14)

The expected value of each variable is the first term of the distribution divided by the second, which is close to the inverse of the average empirical error for γ_ε.
from a circularly-symmetric complex normal distribution. of the average empirical error for γε .

d) Sample π_k: Sample each π_k from its conditional beta posterior distribution,

    π_k ~ Beta( a_0 + Σ_{i=1}^N z_ik,  b_0 + Σ_{i=1}^N (1 − z_ik) ).      (15)

The parameters to the beta distribution include counts of how many times dictionary element d_k was used by a patch.
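The remaining conjugate updates (13)-(15) are one-liners, sketched below with self-contained toy inputs. Note that NumPy's gamma sampler is parameterized by shape and scale, so the rate parameters of Gam(·, ·) above appear inverted, and the Beta prior parameters a_0, b_0 are assumed here to be cγ/K and c(1 − γ/K); these conventions are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
P, K, N = 36, 108, 500
X = rng.standard_normal((P, N))
D = rng.standard_normal((P, K)) / np.sqrt(P)
Z = (rng.random((K, N)) < 0.05).astype(float)
S = rng.standard_normal((K, N))
alpha = S * Z
e0 = f0 = g0 = h0 = 1.0
a0, b0 = 1.0 / K, 1.0 - 1.0 / K      # assumed Beta prior parameters with c = gamma = 1

resid2 = np.sum((X - D @ alpha) ** 2)                                        # sum_i ||R_i x - D alpha_i||^2
gamma_eps = rng.gamma(g0 + 0.5 * P * N, 1.0 / (h0 + 0.5 * resid2))           # Eq. (13)
gamma_s = rng.gamma(e0 + 0.5 * Z.sum(), 1.0 / (f0 + 0.5 * np.sum(Z * S**2))) # Eq. (14)
pi = rng.beta(a0 + Z.sum(axis=1), b0 + (1.0 - Z).sum(axis=1))                # Eq. (15), one draw per element
print(gamma_eps, gamma_s, pi.mean())
```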
3) Algorithm for P3 (MRI Reconstruction): The final sub-problem is to reconstruct the image x. Our approach takes advantage of the Fourier domain similar to other methods, see [14], [30]. The corresponding objective function is

    x = arg min_x  Σ_{i=1}^N (ρ/2)‖ψ_i x − β_i + u_i‖_2^2 + Σ_{i=1}^N (γ_ε/2)‖R_i x − Dα_i‖_2^2 + (λ/2)‖F_u x − y‖_2^2.

Since this is a least squares problem, x has a closed form solution that satisfies

    ( ρΨ^TΨ + γ_ε Σ_i R_i^T R_i + λF_u^H F_u ) x = ρΨ^T(β − u) + γ_ε P x_BPFA + λF_u^H y.      (16)

We recall that Ψ is the matrix of stacked ψ_i. The vector β is also obtained by stacking each β_i and u is the vector formed by stacking u_i. The vector x_BPFA is the denoised reconstruction from BPFA using the current D and α_{1:N}, which results from the definition x_BPFA = (1/P) Σ_i R_i^T Dα_i.

We observe that inverting the left N × N matrix is computationally prohibitive since N is the number of pixels in the image. Fortunately, given the form of the matrix in Equation (16) we can use the procedure described in [14] and simplify the problem by working in the Fourier domain. This allows for element-wise updates in k-space, followed by an inverse Fourier transform. We represent x as x = F^H θ, where θ is the Fourier transform of x. We then take the Fourier transform of each side of Equation (16) to give

    F( ρΨ^TΨ + γ_ε Σ_i R_i^T R_i + λF_u^H F_u )F^H θ = ρFΨ^T(β − u) + γ_ε P F x_BPFA + λF F_u^H y.      (17)

The left-hand matrix simplifies to a diagonal matrix,

    F( ρΨ^TΨ + γ_ε Σ_i R_i^T R_i + λF_u^H F_u )F^H = ρΛ + γ_ε P I_N + λ I_{N_u}.      (18)

Term-by-term this results as follows: The product of the finite difference operator matrix with itself yields a circulant matrix, which has the rows of the Fourier matrix F as its eigenvectors and eigenvalues equal to Λ = FΨ^TΨF^H. The matrix R_i^T R_i is a matrix of all zeros, except for ones on the diagonal entries that correspond to the indices of x associated with the i-th patch. Since each pixel appears in P patches, the sum over i gives P I_N, and the Fourier product cancels. The final diagonal matrix I_{N_u} also contains all zeros, except for ones along the diagonal corresponding to the indices in k-space that are measured, which results from F F_u^H F_u F^H. Since the left matrix is diagonal we can perform element-wise updating of the Fourier coefficients θ,

    θ_i = ( ρF_i Ψ^T(β − u) + γ_ε P F_i x_BPFA + λF_i F_u^H y ) / ( ρΛ_ii + γ_ε P + λF_i F_u^H 1 ).      (19)

We observe that the rightmost term in the numerator and denominator equals zero if i is not a measured k-space location. We invert θ via the inverse Fourier transform F^H to obtain the reconstructed MR image x.

B. Discussion on λ

In noise-free compressed sensing, the fidelity term λ can tend to infinity giving an equality constraint for the measured k-space values [1]. However, when y is noisy the setting of λ is critical for most CS-MRI algorithms since this parameter controls the level of denoising in the reconstructed image. We note that a feature of dictionary learning CS-MRI approaches is that λ can still be set to a very large value, and so parameter selection isn't necessary here. This is because a denoised version of the image is obtained through dictionary learning (x_BPFA in this paper) and can be taken as the denoised reconstruction. In Equation (19), we observe that by setting λ to a large value, we are effectively fixing the measured k-space values and using the k-space projection of BPFA and TV to fill in the missing values. The reconstruction x will be noisy, but have artifacts due to sub-sampling removed. The output image x_BPFA is a denoised version of x using BPFA in essentially the same manner as in Section II-B3. Therefore, the quality of our algorithm depends largely on the quality of BPFA as an image denoising algorithm [25]. We show examples of this using synthetic and clinical data in Sections IV-B and IV-E.

IV. EXPERIMENTAL RESULTS

Fig. 2. The three masks considered for a given sampling percentage. (a) Random 25%. (b) Cartesian 30%. (c) Radial 25%.

We evaluate the proposed algorithm on real-valued and complex-valued MRI, and on a synthetic phantom. We consider three sampling masks: 2D random sampling, Cartesian sampling with random phase encodes (1D random), and pseudo radial sampling.⁴ We show an example of each mask in Fig. 2. We consider a variety of sampling rates for each mask. As a performance measure we use PSNR, and also consider SSIM [37].

⁴ We used codes referenced in [3], [8], and [10] to generate these masks.

We compare with three other algorithms: Sparse MRI [3],⁵ which as discussed above is a combination of wavelets and total variation; DLMRI [14],⁶ which is a dictionary learning method based on K-SVD; and PBDW [15],⁷ which is a patch-based method that uses directional wavelets and therefore places greater restrictions on the dictionary. We use the publicly available code for these algorithms indicated above and used the built-in parameter settings, or those indicated in the relevant papers. We also compare with the BPFA algorithm without using total variation by setting λ_g = 0.

⁵ http://www.eecs.berkeley.edu/~mlustig/Software.html
⁶ http://www.ifp.illinois.edu/~yoram/DLMRI-Lab/Documentation.html
⁷ http://www.quxiaobo.org/index_publications.html

A. Set-Up

For all images, we extract 6 × 6 patches where each pixel defines the upper left corner of a patch and wrap around the image at the boundaries; we investigate different patch sizes later to show that this is a reasonable size. We initialize x by zero-filling in k-space. We use a dictionary with K = 108 initial dictionary elements, recalling that the final number of dictionary elements will be smaller due to the sparse BPFA prior. If 108 is found to be too small, K can be increased with the result being a slower inference algorithm.⁸ We ran 1000 iterations and use the results of the last iteration.

⁸ As discussed in Section II-B, in theory K can be infinitely large.

For regularization parameters, we set the data fidelity term λ = 10^100. We are therefore effectively requiring equality with the measured values of k-space and allowing BPFA to fill in the missing values, as well as give a denoised reconstruction, as discussed in Section III-B and highlighted below in Sections IV-B and IV-E. After trying several values, we also found λ_g = 10 and ρ = 1000 to give good results. We set the BPFA hyperparameters as c = γ = e_0 = f_0 = g_0 = h_0 = 1. These settings result in a relatively non-informative prior given the amount of data we have. However, we note that our algorithm was robust to these values, since the data overwhelms these prior values when calculating posterior distributions.

B. Experiments on a GE Phantom

Fig. 3. GE data with noise (σ = 0.1) and 30% Cartesian sampling. BPFA (b) reconstructs the original noisy image, and (c) denoises the reconstruction simultaneously. (d) TV denoises as part of the reconstruction. Also shown are the dictionary learning variables sorted by π_k: (e) the dictionary, (f) the distribution on the dictionary, π_k, (g) the normalized histogram of the number of dictionary elements used per patch.

We consider a noisy synthetic example to highlight the advantage of dictionary learning for CS-MRI. In Fig. 3 we show results on a 256 × 256 GE phantom with additive noise having standard deviation σ = 0.1. In this experiment we use BPFA without TV to reconstruct the original image using 30% Cartesian sampling. We show the reconstruction using zero-filling in Fig. 3(a). Since λ = 10^100, we see in Fig. 3(b) that BPFA essentially helps reconstruct the underlying noisy image for x. However, using the denoising property of the BPFA model shown in Fig. 1, we obtain the denoised reconstruction of Fig. 3(c) by focusing on x_BPFA from Equation (16). This is in contrast with the best result we could obtain with TV in Fig. 3(d), which places the TV penalty on the reconstructed image. As discussed, for TV the setting of λ relative to λ_g is important. We set λ = 1 and swept through λ_g ∈ (0, 0.15), showing the result with highest PSNR in Fig. 3(d). Similar to Fig. 1 we show statistics from the BPFA model in Figs. 3(e)-(g). We see that roughly 80 dictionary elements were used (the unused noisy elements in Fig. 3(e) are draws from the prior). We note that 2.28 elements were used on average by a patch given that at least one was used, which discounts the black regions.

C. Experiments on Real-Valued (Synthetic) MRI

For our synthetic MRI experiments, we consider two publicly available real-valued 512 × 512 MRI⁹ of a shoulder and lumbar.

⁹ www3.americanradiology.com/pls/web1/wwimggal.vmg/wwimggal.vmg

We construct these problems by applying the relevant sampling mask to the projection of real-valued MRI

TABLE II: PSNR results for real-valued lumbar MRI as a function of sampling percentage and mask (Cartesian with random phase encodes, 2D random and pseudo radial).

TABLE III: PSNR results for real-valued shoulder MRI as a function of sampling percentage and mask (Cartesian with random phase encodes, 2D random and pseudo radial).

Fig. 4. Absolute errors for 30% Cartesian sampling of synthetic lumbar MRI. (a) BPFA+TV. (b) PBDW. (c) DLMRI. (d) Sparse MRI.

Fig. 5. Absolute errors for 20% radial sampling of the shoulder MRI. (a) BPFA+TV. (b) PBDW. (c) DLMRI. (d) Sparse MRI.

into k-space. Though using such real-valued MRI data may not reflect clinical reality, we include this idealized setting to provide a complete set of experiments similar to other papers [3], [14], [15]. We evaluate the performance of our algorithm using PSNR and compare with Sparse MRI [3], DLMRI [14] and PBDW [15]. Although the original data is real-valued, we learn complex dictionaries since the reconstructions are complex. We consider our algorithm with and without the total variation penalty, denoted BPFA+TV and BPFA, respectively.

We present the PSNR results for all sampling masks and rates in Tables II and III. From these values we see the competitive performance of the proposed dictionary learning algorithm. We also see a slight improvement by the addition of the TV penalty. As expected, we observe that 2D random sampling produced the best results, followed by pseudo-radial sampling and Cartesian sampling, which is due to their decreasing level of incoherence, with greater incoherence producing artifacts that are more noise-like [3]. Since BPFA is good at denoising images, the algorithm naturally performs well in this setting. In Figs. 4 and 5 we show the absolute value of the residuals of different algorithms using one experiment from each MRI. We see an improvement using the proposed method, which has more noise-like errors.

D. Experiments on Complex-Valued MRI

We also consider two clinically obtained complex-valued MRI: We use the T2-weighted brain MRI from [4], which is a 256 × 256 MRI of a healthy volunteer from a 3T Siemens Trio Tim MRI scanner using the T2-weighted turbo spin echo

TABLE IV: PSNR/SSIM results for complex-valued brain MRI as a function of sampling percentage. Sampling masks include Cartesian sampling with random phase encodes, 2D random sampling and pseudo radial sampling.

Fig. 6. Reconstruction results for 25% pseudo radial sampling of a complex-valued MRI of the brain. (a) Original. (b) BPFA+TV. (c) BPFA. (d) PBDW. (e) DLMRI. (f) PSNR vs iteration. (g) BPFA+TV error. (h) BPFA error. (i) PBDW error. (j) DLMRI error.

sequence (TR/TE = 6100/99 ms, 220 × 220 mm field of view, 3 mm slice thickness). We also use an MRI scan of a lemon obtained from the Research Center of Magnetic Resonance and Medical Imaging at Xiamen University (TE = 32 ms, size = 256 × 256, spin echo sequence, TR/TE = 10000/32 ms, FOV = 70 × 70 mm², 2-mm slice thickness). This MRI is from a 7T/160mm bore Varian MRI system (Agilent Technologies, Santa Clara, CA, USA) using a quadrature-coil probe.

For the brain MRI experiment we use both PSNR and SSIM as performance measures. We show these values in Table IV for each algorithm, sampling mask and sampling rate. As with the synthetic MRI, we see that our algorithm performs competitively with the state-of-the-art. We also see the significant improvement of all algorithms over zero-filling. Example reconstructions are shown for each MRI dataset in Figs. 6 and 7. Also in Fig. 7 are PSNR values for the lemon MRI. We see from the absolute error residuals for these experiments that the BPFA algorithm learns a slightly finer detail structure compared with other algorithms, with the errors being more noise-like. We also show the PSNR of BPFA+TV and BPFA as a function of iteration. As is evident, the algorithm does not necessarily need all 1000 iterations, but performs competitively even in half that number.

E. Experiments in the Noisy Setting

The MRI we have considered thus far have been essentially noiseless. For some MRI machines this may be an unrealistic assumption. We continue our evaluation of noisy MRI begun with the toy GE phantom in Section IV-B by evaluating how our model performs on clinically obtained MRI with additive noise. We show BPFA results without TV to highlight the dictionary learning features, but note that results with TV provide a slight improvement in terms of PSNR and

Fig. 7. Reconstruction results for 35% 2D random sampling of a complex-valued MRI of a lemon. (a) Original. (b) BPFA+TV: PSNR = 39.64. (c) BPFA: PSNR = 38.21. (d) PBDW: PSNR = 37.89. (e) DLMRI: PSNR = 35.05. (f) PSNR vs iteration. (g) BPFA+TV error. (h) BPFA error. (i) PBDW error. (j) DLMRI error.

Fig. 8. PSNR vs λ in the noisy setting (σ = 0.03) for the complex-valued brain MRI with 30% 2D random sampling.

Fig. 9. The denoising properties of dictionary learning on noisy complex-valued MRI with 35% Cartesian sampling and σ = 0.03. (a) Zero filling. (b) BPFA reconstruction (x). (c) BPFA denoising (x_BPFA). (d) DLMRI.

TABLE V: PSNR for 35% Cartesian sampling of complex-valued brain MRI for various noise standard deviations (λ = 10^100).

SSIM. We again consider the brain MRI and use additive complex white Gaussian noise having standard deviation σ = 0.01, 0.02, 0.03. For all experiments we use the original noise-free MRI as the ground truth.

As discussed in Section III-B and illustrated in Section IV-B, dictionary learning allows us to consider two possible reconstructions: the actual reconstruction x, and the denoised BPFA reconstruction x_BPFA = (1/P) Σ_i R_i^T Dα_i. As detailed in these sections, as λ becomes larger the reconstruction will be noisier, but with the artifacts from sub-sampling removed. However, for all values of λ, x_BPFA produces a denoised version that essentially doesn't change. We see this clearly in Fig. 8, where we show the PSNR of each reconstruction as a function of λ. When λ is small, the performance degrades for both algorithms since too much smoothing is done by dictionary learning on x. As λ increases, both improve, but eventually the reconstruction of x degrades again because near equality to the noisy y is being more strictly enforced. The denoised reconstruction however levels off and does not degrade. We show PSNR values in Table V as a function of noise level.¹⁰

¹⁰ We are working with a different scaling of the MRI than in [14] and made the appropriate modifications. Also, since DLMRI is a dictionary learning method it can output "x_KSVD", though it was not originally motivated this way. Issues discussed in Sections II-B.2 and II-B.3 apply in this case.

Example reconstructions that parallel those given in Fig. 3 are also shown in Fig. 9. These results highlight the robustness of our approach to λ in the noisy setting, and we note that we

TABLE VI: PSNR as a function of patch size for a real-valued and a complex-valued brain MRI with Cartesian sampling.

TABLE VII: Total runtime in minutes (seconds/iteration). We ran 1000 iterations of BPFA, 100 of DLMRI and 10 of Sparse MRI.

Fig. 10. Radial sampling for the brain MRI. (a)-(c) The learned dictionary for various sampling rates. The noisy elements towards the end of each were unused and are samples from the prior. (d) The cumulative function of the sorted π_k from BPFA for each sampling rate. This gives information on sparsity and average usage of the dictionary. (e) The distribution on the number of elements used per patch for each sampling rate.

encountered no stability issues using extremely large values of λ.

F. Dictionary Learning and Further Discussion

We investigate the model learned by BPFA. In Fig. 10 we show dictionary learning results learned by BPFA+TV for radial sampling of the complex brain MRI. In the top portion, we show the dictionaries learned for 10%, 20% and 30% sampling. We see that they are similar in their shape, but the number of elements increases as the sampling percentage increases since more complex information about the image is contained in the k-space measurements. We again note that unused elements are represented by draws from the prior. In Fig. 10(d) we show the cumulative sum of the ordered π_k from BPFA. We can read off the average number of elements used per patch by looking at the right-most value. We see that more elements are used per patch as the fraction of observed k-space increases. We also see that for 10%, 20% and 30% sampling, roughly 70, 80 and 95, respectively, of the 108 total dictionary elements were significantly used, as indicated by the leveling off of these functions. This highlights the adaptive property of the nonparametric beta process prior. In Fig. 10(e) we show the empirical distribution on the number of dictionary elements used per patch for each sampling rate. We see that there are two modes, one for the empty background and one for the foreground, and the second mode tends to increase as the sampling rate increases. The adaptability of this value to each patch is another characteristic of the beta process model.

We also performed an experiment with varying patch sizes and show our results in Table VI. We see that the results are not very sensitive to this setting and that comparisons using 6 × 6 patches are meaningful. We also compare the runtime for different algorithms in Table VII, showing both the total runtime of each algorithm and the per-iteration times using an Intel Xeon CPU E5-1620 at 3.60 GHz with 16.0 GB of RAM. However, we note that we arguably ran more iterations than necessary for these algorithms; the BPFA algorithms generally produced high quality results in half the number of iterations, as did DLMRI (the authors of [14] recommend 20 iterations), while Sparse MRI uses 5 iterations as default and the performance didn't improve beyond 10 iterations. We note that the speed-up over DLMRI arises from the lack of the OMP algorithm, which in Matlab is much slower than our sparse coding update.¹¹ We note that inference for the BPFA model is easily parallelizable—as are the other dictionary learning algorithms—which can speed up processing time.

¹¹ BPFA is significantly faster than K-SVD in Matlab because it requires fewer loops. This difference may not be as large with other coding languages.

The proposed method has several advantages, which we believe lead to the improvement in performance. A significant advantage is the adaptive learning of the dictionary size and per-patch sparsity level using a nonparametric stochastic process that is naturally suited for this problem. Several other dictionary learning parameters such as the noise variance and the variances of the score weights are adjusted as well through a natural MCMC sampling approach. These benefits have been investigated in other applications of this model [25], and naturally translate here since CS-MRI with BPFA is closely related to image denoising as we have shown.

Another advantage of our model is the Markov Chain Monte Carlo inference algorithm itself. In highly non-convex Bayesian models (or similar models with a Bayesian interpretation), it is often observed by the statistics community that

MCMC sampling can outperform deterministic methods, and rarely performs worse [38]. Given that BPFA is a Bayesian model, such sampling techniques are readily derived, as we showed in Section III-A.

V. CONCLUSION

We have presented an algorithm for CS-MRI reconstruction that uses Bayesian nonparametric dictionary learning. Our Bayesian approach uses a model called beta process factor analysis (BPFA) for in situ dictionary learning. Through this hierarchical generative structure, we can learn the dictionary size, sparsity pattern and additional regularization parameters. We also considered a total variation penalty term for additional constraints on image smoothness. We presented an optimization algorithm using the alternating direction method of multipliers (ADMM) and MCMC Gibbs sampling for all BPFA variables. Experimental results on real and complex-valued MRI showed that our proposed regularization framework compares favorably with other algorithms for various sampling trajectories and rates. We also showed the natural ability of dictionary learning to handle noisy MRI without dependence on the measurement fidelity parameter λ. To this end, we showed that the model can enforce a near equality constraint to the noisy measurements and use the dictionary learning result as a denoised output of the noisy MRI.

APPENDIX

We give a brief review of the ADMM algorithm [32]. We start with the convex optimization problem

    min_x  ‖Ax − b‖_2^2 + h(x),      (20)

where h is a non-smooth convex function, such as an ℓ1 penalty. ADMM decouples the smooth squared error term from this penalty by introducing a second vector v such that

    min_x  ‖Ax − b‖_2^2 + h(v)   subject to  v = x.      (21)

This is followed by a relaxation of the equality v = x via an augmented Lagrangian term

    L(x, v, η) = ‖Ax − b‖_2^2 + h(v) + η^T(x − v) + (ρ/2)‖x − v‖_2^2.      (22)

A minimax saddle point is found with the minimization taking place over both x and v and dual ascent for η. Another way to write the objective in (22) is to define u = (1/ρ)η and combine the last two terms. The result is an objective that can be optimized by cycling through the following updates for x, v and u,

    x' = arg min_x  ‖Ax − b‖_2^2 + (ρ/2)‖x − v + u‖_2^2,      (23)
    v' = arg min_v  h(v) + (ρ/2)‖x' − v + u‖_2^2,      (24)
    u' = u + x' − v'.      (25)

This algorithm simplifies the optimization since the objective for x is quadratic and thus has a simple analytic solution, while the update for v is a proximity operator of h with penalty ρ, the difference being that v is not pre-multiplied by a matrix as x is in (20). Such objective functions tend to be easier to optimize. For example when h is the TV penalty the solution for v is analytical.

ACKNOWLEDGMENT

Y. Huang and J. Paisley contributed equally to this work.

REFERENCES

[1] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[2] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[3] M. Lustig, D. Donoho, and J. M. Pauly, "Sparse MRI: The application of compressed sensing for rapid MR imaging," Magn. Reson. Med., vol. 58, no. 6, pp. 1182–1195, 2007.
[4] X. Qu, W. Zhang, D. Guo, C. Cai, S. Cai, and Z. Chen, "Iterative thresholding compressed sensing MRI based on contourlet transform," Inverse Problems Sci. Eng., vol. 18, no. 6, pp. 737–758, Jun. 2009.
[5] X. Qu, X. Cao, D. Guo, C. Hu, and Z. Chen, "Combined sparsifying transforms for compressed sensing MRI," Electron. Lett., vol. 46, no. 2, pp. 121–123, Jan. 2010.
[6] J. Trzasko and A. Manduca, "Highly undersampled magnetic resonance image reconstruction via homotopic ℓ0-minimization," IEEE Trans. Med. Imag., vol. 28, no. 1, pp. 106–121, Jan. 2009.
[7] H. Jung, K. Sung, K. S. Nayak, E. Y. Kim, and J. C. Ye, "k-t FOCUSS: A general compressed sensing framework for high resolution dynamic MRI," Magn. Reson. Med., vol. 61, no. 1, pp. 103–116, 2009.
[8] J. Yang, Y. Zhang, and W. Yin, "A fast alternating direction method for TVL1-L2 signal reconstruction from partial Fourier data," IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 288–297, Apr. 2010.
[9] Y. Chen, X. Ye, and F. Huang, "A novel method and fast algorithm for MR image reconstruction with significantly under-sampled data," Inverse Problems Imag., vol. 4, no. 2, pp. 223–240, 2010.
[10] J. Huang, S. Zhang, and D. Metaxas, "Efficient MR image reconstruction for compressed MR imaging," Med. Image Anal., vol. 15, no. 5, pp. 670–679, 2011.
[11] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Trans. Signal Process., vol. 56, no. 6, pp. 2346–2356, Jun. 2008.
[12] X. Ye, Y. Chen, W. Lin, and F. Huang, "Fast MR image reconstruction for partially parallel imaging with arbitrary k-space trajectories," IEEE Trans. Med. Imag., vol. 30, no. 3, pp. 575–585, Mar. 2011.
[13] M. Akçakaya et al., "Low-dimensional-structure self-learning and thresholding: Regularization beyond compressed sensing for MRI reconstruction," Magn. Reson. Med., vol. 66, no. 3, pp. 756–767, 2011.
[14] S. Ravishankar and Y. Bresler, "MR image reconstruction from highly undersampled k-space data by dictionary learning," IEEE Trans. Med. Imag., vol. 30, no. 5, pp. 1028–1041, May 2011.
[15] X. Qu et al., "Undersampled MRI reconstruction with patch-based directional wavelets," Magn. Reson. Imag., vol. 30, no. 7, pp. 964–977, 2012.
[16] Z. Yang and M. Jacob, "Robust non-local regularization framework for motion compensated dynamic imaging without explicit motion estimation," in Proc. 9th IEEE Int. Symp. Biomed. Imag., May 2012, pp. 1056–1059.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Proc. 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 2272–2279.
[18] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, Aug. 2007.
[19] K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 1999, pp. 2443–2446.
[20] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[21] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Proc. 27th Asilomar Conf. Signals, Syst. Comput., Nov. 1993, pp. 40–44.
[22] M. Protter and M. Elad, "Image sequence denoising via sparse and redundant representations," IEEE Trans. Image Process., vol. 18, no. 1, pp. 27–36, Jan. 2009.
[23] J. Paisley and L. Carin, "Nonparametric factor analysis with beta process priors," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 777–784.
[24] J. W. Paisley, D. M. Blei, and M. I. Jordan, "Stick-breaking beta processes and the Poisson process," in Proc. Int. Conf. Artif. Intell. Statist., 2012, pp. 850–858.
[25] M. Zhou et al., "Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images," IEEE Trans. Image Process., vol. 21, no. 1, pp. 130–144, Jan. 2012.
[26] X. Ding, L. He, and L. Carin, "Bayesian robust principal component analysis," IEEE Trans. Image Process., vol. 20, no. 12, pp. 3419–3430, Dec. 2011.
[27] N. L. Hjort, "Nonparametric Bayes estimators based on beta processes in models for life history data," Ann. Statist., vol. 18, no. 3, pp. 1259–1294, 1990.
[28] R. Thibaux and M. I. Jordan, "Hierarchical beta processes and the Indian buffet process," in Proc. Int. Conf. Artif. Intell. Statist., San Juan, PR, USA, 2007, pp. 564–571.
[29] D. Gabay and B. Mercier, "A dual algorithm for the solution of nonlinear variational problems via finite element approximation," Comput. Math. Appl., vol. 2, no. 1, pp. 17–40, 1976.
[30] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, "Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing," SIAM J. Imag. Sci., vol. 1, no. 1, pp. 143–168, 2008.
[31] T. Goldstein and S. Osher, "The split Bregman method for L1-regularized problems," SIAM J. Imag. Sci., vol. 2, no. 2, pp. 323–343, 2009.
[32] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2010.
[33] J. Paisley, L. Carin, and D. Blei, "Variational inference for stick-breaking beta process priors," in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011, pp. 889–896.
[34] D. Knowles and Z. Ghahramani, "Infinite sparse factor analysis and infinite independent components analysis," in Independent Component Analysis and Signal Separation. New York, NY, USA: Springer-Verlag, 2007, pp. 381–388.
[35] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "Sharing features among dynamical systems with beta processes," in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., 2011.
[36] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2006, pp. 475–482.
[37] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[38] C. Faes, J. T. Ormerod, and M. P. Wand, "Variational Bayesian inference for parametric and nonparametric regression with missing data," J. Amer. Statist. Assoc., vol. 106, no. 495, pp. 959–971, 2011.

Yue Huang received the B.S. degree in electrical engineering from Xiamen University, Xiamen, China, in 2005, and the Ph.D. degree in biomedical engineering from Tsinghua University, Beijing, China, in 2010. Since 2010, she has been an Assistant Professor with the School of Information Science and Engineering, Xiamen University. Her main research interests include image processing, machine learning, and biomedical engineering.

John Paisley is currently an Assistant Professor with the Department of Electrical Engineering, Columbia University, New York, NY, USA. Prior to that, he was a Post-Doctoral Researcher with the Departments of Computer Science at the University of California at Berkeley, Berkeley, CA, USA, and Princeton University, Princeton, NJ, USA. He received the B.S., M.S., and Ph.D. degrees in electrical and computer engineering from Duke University, Durham, NC, USA, in 2004, 2007, and 2010, respectively. His research is in the area of statistical machine learning and focuses on probabilistic modeling and inference techniques, Bayesian nonparametric methods, and text and image processing.

Qin Lin is currently pursuing the degree with the Department of Communication Engineering, Xiamen University, Xiamen, China. His research interests include computer vision, machine learning, and data mining.

Xinghao Ding was born in Hefei, China, in 1977. He received the B.S. and Ph.D. degrees from the Department of Precision Instruments, Hefei University of Technology, Hefei, China, in 1998 and 2003, respectively. He was a Post-Doctoral Researcher with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, from 2009 to 2011. Since 2011, he has been a Professor with the School of Information Science and Engineering, Xiamen University, Xiamen, China. His main research interests include image processing, sparse signal representation, and machine learning.

Xueyang Fu is currently pursuing the degree with the Department of Communication Engineering, Xiamen University, Xiamen, China. His research interests include image processing, sparse representation, and machine learning.

Xiao-Ping Zhang (M'97–SM'02) received the B.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1992 and 1996, respectively, and the M.B.A. (Hons.) degree in finance, economics, and entrepreneurship from the Booth School of Business, University of Chicago, Chicago, IL, USA. He has been with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada, since Fall 2000, where he is currently a Professor and the Director of the Communication and Signal Processing Applications Laboratory. He has served as the Program Director of Graduate Studies. He is cross-appointed with the Department of Finance, Ted Rogers School of Management, Ryerson University. Prior to joining Ryerson University, he was a Senior DSP Engineer with SAM Technology, Inc., San Francisco, CA, USA, and a Consultant with the San Francisco Brain Research Institute, San Francisco. He held research and teaching positions at the Communication Research Laboratory, McMaster University, Hamilton, ON, Canada, and was a Post-Doctoral Fellow with the Beckman Institute, University of Illinois at Urbana-Champaign, Champaign, IL, USA, and the University of Texas at San Antonio, San Antonio, TX, USA. His research interests include statistical signal processing, multimedia content analysis, sensor networks and electronic systems, computational intelligence, and applications in bioinformatics, finance, and marketing. He is a frequent consultant for biotech companies and investment firms. He is the Co-Founder and CEO of EidoSearch, an Ontario-based company offering a content-based search and analysis engine for financial data.

Dr. Zhang is a registered Professional Engineer in Ontario, and a member of the Beta Gamma Sigma Honor Society. He is the General Chair of MMSP'15, Publicity Chair of ICME'06, and Program Chair of ICIC'05 and ICIC'10. He served as a Guest Editor of Multimedia Tools and Applications and the International Journal of Semantic Computing. He is a Tutorial Speaker at ACMMM2011, ISCAS2013, ICIP2013, and ICASSP2014. He is currently an Associate Editor of the IEEE Transactions on Signal Processing, the IEEE Transactions on Multimedia, the IEEE Signal Processing Letters, and the Journal of Multimedia.
