Array Signal Processing: An Algebraic Approach

EE 4715
Spring 2022
Preface

1 Introduction
1.1 Applications of array processing
1.2 Approach
1.3 Notes

I DATA MODELS

2 Wave propagation
2.1 The wave equation
2.2 Spatial Fourier transforms
2.3 Spatial sampling
2.4 Correlation processing
2.5 Application: radio astronomy
2.6 Notes
This reader contains the course material for the MSc level course on array signal processing at
TU Delft, ET4 147. Over the past 20 years, this course was presented as “signal processing for
communications”; in 2022 it was combined with another course on speech and audio processing
into one that is more general and focuses on multisensor array processing.
Sensor arrays are present in many applications:
– In wireless communications, multiple antennas at the transmitter and/or the receiver make it
possible to increase data rates and to suppress unwanted interference.
– In radio astronomy, collections of the hallmark telescope dishes have been the workhorse for
many years. The array is called an interferometer. Over the past decade, the dishes have
been upgraded with antenna arrays in the focal plane, or been replaced with massive arrays of
“simple” (non-steerable) antennas, typically arranged in some hierarchy. Using the observed
data of one night (or even many nights), the aim is to create images of the sky, as a function
of frequency.
– In a medical setting, ultrasound transducers are used to create images of organs in the human
body. Such a transducer can consist of a line array of piezo-electric elements, or of a 2D array.
– Still in a medical setting, electrode arrays are used to capture electrical signals from the
skull, i.e. electro-encephalogram (EEG) signals. These are then processed to obtain a crude
3D-localized image of functional regions in the brain.
– Microphone arrays are being used to filter out unwanted interference in noise-cancelling
headphones. In hearing aids, they are used to focus on an intended speaker while suppressing
background noise.
– Other applications are phased array radar, sonar, and seismic exploration.
While these applications are very diverse, the underlying signal processing data models and
mathematical techniques are in fact very similar. The course will focus on these data models,
introduce the appropriate mathematical techniques, then derive generic signal processing algo-
rithms, and relate them to one of the applications as an example. While you have probably
already seen many models and mathematical techniques in other courses, such as Detection and
Estimation, or Machine Learning, or Convex Optimization, the present course will angle towards
matrix models and advanced techniques in linear algebra, such as the singular value decompo-
sition, factor analysis, generalized eigenvalues, and some tensor techniques. This is in line with
the origins of array processing.
However, we will hardly discuss adaptive techniques related to array signal processing, such as
LMS or CMA. An in-depth discussion would need an entire course of its own. Thus, the course
is mostly focused on algebraic techniques for array processing.
It is assumed that the participants already have a fair background in linear algebra, although
one lecture is spent to refresh this knowledge.
Acknowledgements
The course material is derived in part from previously published papers, stemming from joint
work with many colleagues and (former) PhD students over the course of 30 years. In particular
I would like to acknowledge the collaboration with Amir Leshem, Stefan Wijnholds, Millad
Sardarabadi: text from (overview) papers we wrote together has been used, but edited to fit the
course.
INTRODUCTION
Contents
1.1 Applications of array processing
1.2 Approach
1.3 Notes
Signal processing is the theory and engineering art of converting acquired sensor measurements
into “information” (or “useful data”). It starts by deriving data models, or concise abstractions
of the physics behind the observations. Next, methods are developed and algorithms are pro-
posed to extract the “information”. Strongly depending on the application, this could consist
of signal parameters, propagation parameters, reconstructed time domain signals, images, etc.
Finally, part of signal processing is concerned with efficient implementations on computational
platforms.
Array signal processing is the branch of signal processing that considers multiple sensors, or an
array of sensors. This could occur in many applications, e.g., an array of antennas in wireless
communication, or a microphone array inside hearing aids or teleconferencing equipment.
The received data from the multiple sensors are stacked into vectors, and simple data models
express each sensor signal as a linear combination of a stack of transmitted signals s(t) to which
noise n(t) is added, i.e.,
x(t) = As(t) + n(t) . (1.1)
The tools relevant to analyze and process this data are then found in linear algebra: in this
course we will be looking at matrix multiplications and inversion, subspace estimation, eigenvalue
decompositions, and more. Since noise is added and needs to be taken into account, we will also
need tools from statistics, as seen e.g., in a course on Estimation and Detection. These tools are
then generalized to the matrix-vector case.
Thirty years ago, an overview article was published with the title “Two decades of array signal
processing research” [1]. We can thus consider that the area of array signal processing is about 50
years old, although its origins are of course much older (going back, e.g., to optics interferometry).
1.1 APPLICATIONS OF ARRAY PROCESSING

Sensor arrays can be used for many things. This section lists some basic applications.
1.1.1 Diversity

Suppose the same signal s(t), with power σ_s², is received at M antennas, each with additive noise of power σ_n². The SNR at a single antenna is

SNR_m = σ_s² / σ_n² .

If we average the M antenna signals into a single output x(t), then s(t) is unaffected, but the noise is averaged out: the power of the noise present in x(t) is σ_n²/M, and the SNR after averaging is

SNR_out = M σ_s² / σ_n² .
We say that we have an array gain of M . This is easily generalized to the model (1.1), which
for a single signal in noise is
x(t) = as(t) + n(t)
where in the previous example we had a unit-weight vector a = 1, with 1 = [1, 1, · · · , 1]T . The
signal s(t) is recovered by computing the weighted average1
H
ŝ(t) = w x(t) .
We will see later that the optimal weights are w = a/kak2 . The vector w is known as a
beamformer.
This is applied in wireless communication, where multiple antennas are used to provide diversity.
In the presence of multipath reflections, it may happen that a reflection cancels the desired signal
at the location of one antenna. If we have a second antenna at a slightly different location that
therefore captures a different linear combination of these signals, we can still receive the signal.
¹ Superscript T denotes a transpose, and superscript H a complex conjugate transpose.
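As a small numerical illustration of this array gain, the following Python sketch simulates the model x(t) = a s(t) + n(t) with a = 1 and applies the weighted average ŝ(t) = w^H x(t) with w = a/‖a‖². The number of antennas, sample size, and signal and noise powers are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 100_000                   # illustrative: 8 antennas, 100k time samples
sigma_s, sigma_n = 1.0, 1.0

# model x(t) = a s(t) + n(t), with a = 1 (vector of ones)
a = np.ones(M)
s = sigma_s / np.sqrt(2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
n = sigma_n / np.sqrt(2) * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
x = np.outer(a, s) + n

# beamformer w = a / ||a||^2, output s_hat(t) = w^H x(t)
w = a / np.linalg.norm(a) ** 2
s_hat = w.conj() @ x

snr_single = sigma_s**2 / sigma_n**2
snr_out = np.mean(np.abs(s) ** 2) / np.mean(np.abs(s_hat - s) ** 2)
print(snr_single, snr_out)          # the output SNR is approximately M times larger
```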
Figure 1.1. (a) Example of a radio telescope: the Very Large Array, New Mexico; (b) a single-element ultrasound transducer next to a 3D ultrasound array; (c) MiG-35 phased array radar.
An array of sensors is used to sample signals in space. This is useful if the signals have spatial
properties: we consider wavefields, where signals propagate in space. Much of the early research
(1950–1990) is concerned with modeling and estimating the propagation conditions, e.g., direc-
tions of arrival, propagation delays, propagation velocities. If we represent directions of arrival in
two dimensions, then we obtain images, and direction finding is called image formation. Prime
application areas are radar, radio astronomy, ultrasound imaging, underwater acoustics, and
seismic exploration. Fig. 1.1 shows examples of sensor arrays in these applications. In relation
to (1.1), we would say that these applications are interested in estimating parameters of A: a
model for the propagation.
In other applications, we are interested in the transmitted signals s(t). E.g., if the matrix A
in (1.1) is invertible, we can compute the estimate ŝ(t) = A−1 x(t). In this case, the multiple
antennas are combined by A−1 such that interfering signals are cancelled and the desired signal
is found. A common application is MIMO wireless communication (“multiple input multiple
output”, i.e., multiple antennas at the transmitter and at the receiver), where we increase the
total capacity of the system by spatially separating overlapping signals. Using M antennas, we
can expect to separate M overlapping signals and thus to increase our capacity by a factor of M .
Fig. 1.2 shows an example of a MIMO antenna array that is used for this. In Massive MIMO
designs, we have M > 100, leading to huge capacity gains but also hardware complexities: not
every antenna can be equipped with a transmitter or receiver.
Similarly, in microphone array processing, we are interested in the audio signal (e.g., hearing
aids which nowadays employ multiple microphones to enable noise cancellation).
1.2 APPROACH
Signal processing starts with modeling. Given an application, we first construct a forward data
model which shows how the received sensor signals depend on the sources of interest and the
propagation medium. This can take the simple form of (1.1), but very often, more detail is
needed depending on the situation at hand. E.g., antenna gains may be direction dependent,
multipath may be present such that delayed signals s(t − τ ) also enter into the model, etc.
In wireless communication, source signals are often known up to the unknown symbols in the
message that we try to receive: the signals are deterministic with a number of unknown pa-
rameters. In other cases, such as radio astronomy, the source signals are quite random (e.g.,
described by temporally white Gaussian processes) and it may be more appropriate to define a
stochastic data model. We will frequently look at second order correlation models of the form
R_x = A R_s A^H + R_n    (1.2)
where R_x = E[x(t) x^H(t)] is the correlation matrix of the received signals, and similarly for R_s
and R_n.
Several assumptions were already made to arrive at this model, e.g., stationarity, and inde-
pendence of the signals and the noise. In the modeling phase, it is important to specify the
assumptions that were made to arrive at the model. Classical array processing textbooks often
provide a lot of details on translating wave propagation into models [2]. We will cover some of
this in Chap. 2.
Once we have a model for either x(t) or Rx , we can start to look at methods to estimate the
parameters we are interested in. These could be A, or parameters on which A depends, or
the source signals or parameters related to them. Not surprisingly, the methods we consider
are based on linear algebra, and various methods target various structures that may be present
in the model. E.g., in future chapters we will consider the structure that arises if, in (1.2),
R_s is diagonal, or if R_n is diagonal or equal to R_n = σ_n² I. This will then result in eigenvalue
decomposition problems or, more generally, in factor analysis.
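To make the structure of (1.2) concrete, the sketch below builds a covariance matrix R_x = A R_s A^H + σ_n² I for a hypothetical M × d response matrix A and a diagonal R_s, and inspects its eigenvalues; the sizes and powers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 6, 2                                           # illustrative: 6 sensors, 2 sources
A = np.exp(1j * rng.uniform(0, 2 * np.pi, (M, d)))    # hypothetical response matrix
Rs = np.diag([2.0, 1.0])                              # independent sources: diagonal Rs
sigma2 = 0.1

Rx = A @ Rs @ A.conj().T + sigma2 * np.eye(M)         # the model (1.2)

# the eigenvalues separate into d "signal" eigenvalues and M - d noise eigenvalues equal to sigma2
print(np.round(np.linalg.eigvalsh(Rx)[::-1], 3))
```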
Linear algebra was the main workhorse for array signal processing in the period 1990–2010.
Since then, the attention has shifted to methods arising from compressed sensing, resulting in
formulations of problems as constrained optimization problems. These are then solved using
generic optimization techniques. Nonetheless, the focus of the book is on tools from linear
algebra.
1.3 NOTES
“Classical” array processing textbooks are the books by Johnson and Dudgeon [2], and Van
Trees [3]. The state of the art in 1995 is also quite nicely summarized by the Signal Processing
Magazine article of Krim and Viberg [1]. Since then, blind beamforming techniques have given
a major impetus to the field. A few books giving an overview are found under the headings
of blind source separation and independent component analysis [4, 5], although this material is
probably better studied by consulting some of the original overview papers [].
A nice overview of applications is found in Haykin [6], which has extensive chapters on geophysics
exploration, sonar, radar, radio astronomy, and medical tomographic imaging (e.g., MRI and
CT scans). An early introduction to phased array radar is presented in Skolnik [7].
Linear algebra is used throughout the book, and a standard reference to this is Golub and Van
Loan [8].
Bibliography
[1] H. Krim and M. Viberg, “Two decades of array signal processing research: the parametric
approach,” IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67–94, 1996.
[2] D.H. Johnson and D.E. Dudgeon, Array signal processing: concepts and techniques. Prentice
Hall, 1993.
[3] H.L. Van Trees, Optimum array processing: Part IV of detection, estimation, and modulation
theory. Wiley, 2004.
[4] J.V. Stone, Independent component analysis: a tutorial introduction. MIT press, 2004.
[5] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent component
analysis and applications. Academic press, 2010.
[8] G.H. Golub and C.F. Van Loan, Matrix computations. Johns Hopkins University Press, 1996.
DATA MODELS
WAVE PROPAGATION
Contents
2.1 The wave equation
2.2 Spatial Fourier transforms
2.3 Spatial sampling
2.4 Correlation processing
2.5 Application: radio astronomy
2.6 Notes
In signal processing, data models are used as an abstraction of the physics in an application.
The model should be based on reality but not be overly detailed. Often, a variety of data models
are suitable, with different assumptions leading to different algorithms.
In free space, RF signals propagate according to Maxwell's equations. These describe the rela-
tions between the vector electric and magnetic field intensities. If we specialize them to scalar
components (most sensors will not measure vector fields), we arrive at the wave equation:
∇² s(x, t) = (1/c²) ∂²s(x, t)/∂t² ,    (2.1)

where

∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z² .
Here, x is a position in space; assuming 3D space, x = [x, y, z]T . The scalar field s(x, t) is a
function of both space x and time t, and we will call it a signal. The coefficient c in (2.1) will
later be interpreted as the speed of propagation. It depends on the properties of the medium
(specifically, the dielectric permittivity and the magnetic permeability).
In acoustics, a similar equation holds for the acoustic pressure of a sound wave in gas or in a
fluid, and for the longitudinal and transverse waves in solids. In this case, c represents the speed
of sound. In a gas (air), it depends on pressure and temperature.
Different media (or materials) will have different propagation speeds. Interesting effects occur
at the interface between materials, or objects in space, such as reflection and diffraction. It is
also possible for c to vary continuously in space, e.g., due to gradients in salinity or temperature
in ocean water, or due to varying electron densities in the ionosphere.
A basic solution of (2.1) is the function

s(x, t) = e^{j(ωt − k·x)} ,    (2.2)

for a scalar ω and a vector k. If we insert this function into the wave equation, we find the constraint

k = ω/c ,    (2.3)

where k = ‖k‖ is called the wavenumber (or spatial frequency, in analogy to its role next to ω in
(2.2)), with unit radians per meter. The function s(x, t) represents a monochromatic plane wave.
Indeed, ω represents the radial frequency (in rad/s), and the vector k is called the wavenumber
vector.
To interpret this signal, pick a constant C and look at the function argument,

ωt − k·x = C .
Clearly, for each time t, this describes a plane in 3D space where the function is constant, and
this defines a wavefront. The vector k is the normal to the plane, and indicates the direction of
propagation: in directions x parallel to k, the function argument changes fastest.
For the monochromatic wave, the time period of a cycle is

T = 2π/ω .

Over this time, the wavefront moves over a distance

λ = 2π/k

¹ To remain consistent with the literature, we use here a dot, k·x, to represent the inner product between two
vectors. Other notations are ⟨k, x⟩ and k^T x.
Figure 2.1. Propagation of a monochromatic plane wave in 1, 2, and 3 dimensions.
meters, in the direction of k. Combining these expressions with (2.3), we see that the distance
λ covered over time T has the ratio

λ/T = ω/k = c ,

showing that, indeed, c in (2.1) defines the propagation speed. We can interpret λ as the
wavelength in meters, and k/(2π) = λ^{−1} as the number of wave cycles that fit into 1 meter.
Let ζ be a unit-norm vector in the direction of k so that k = kζ, and write
ωt − k·x = ω (t − (1/c) ζ·x) .
This takes care of the constraint (2.3). The factor 1/c represents a delay in the propagation
direction: the time it takes the wave to cover 1 meter.
Fig. 2.1 shows the propagation in 1, 2 and 3 dimensions. Propagation in 1 dimension, e.g., on
a rope, is less relevant for the book, but sometimes provides a nice example as (x, t) can be
visualized in a simple plot. The figure shows s(x, t) = cos(ωt − kx) where the positive part of
the wave is shaded, and the plot shows both the period T and the wavelength λ.
For propagation in 2 dimensions, a 2D plot can show (x, y) but not t, so the meaning of the plot
is in fact quite different. We can parametrize the propagation direction ζ with a single angle θ,
the angle of incidence of the wave:

ζ = − [sin(θ), cos(θ)]^T .    (2.4)

The minus sign in this expression comes from the choice to let the wave propagate towards the
origin. For fixed ω and c, this allows us to parametrize k with a single parameter θ,

k = [k_x, k_y]^T = −(ω/c) [sin(θ), cos(θ)]^T = −(2π/λ) [sin(θ), cos(θ)]^T .    (2.5)
Fig. 2.1(b) shows that the wavefronts are orthogonal to k, and that λ is the (shortest) distance
between wavefronts. If we observe the wavefronts only on the x-axis, the apparent wavelength
λx is longer; similarly, the k-vector projected on the x-axis is shorter. As a result, the apparent
propagation velocity is larger and depends on the direction θ. This provides a means to recover
the direction θ even from 1D observations, for cases where we know the true propagation velocity
c. This plays a role later, when we place our sensors on the x-axis.
Likewise, in 3 dimensions, we will need 2 parameters φ and θ, the azimuth and elevation,
respectively:
ζ = − [sin(θ) cos(φ), sin(θ) sin(φ), cos(θ)]^T .    (2.6)

This leads to

k = [k_x, k_y, k_z]^T = −(ω/c) [sin(θ) cos(φ), sin(θ) sin(φ), cos(θ)]^T = −(2π/λ) [sin(θ) cos(φ), sin(θ) sin(φ), cos(θ)]^T .    (2.7)
In the plot, the wavefronts are shown as planes orthogonal to k.
More generally, the wave equation supports the addition of multiple monochromatic solutions,
e.g., solutions at different frequencies ω. We can also scale these solutions by some S(ω), and
in the limit we find
s(t − (1/c) ζ·x) = (1/2π) ∫_{−∞}^{∞} S(ω) e^{jω(t − (1/c) ζ·x)} dω ,    (2.8)
provided this inverse Fourier transform integral converges. Thus, functions of the form s(t −
(1/c)ζ ·x) are also plane waves and a solution of the wave equation. Note that the corresponding
time-domain signal s(t) could be anything; it could be a sinusoid, but also a short pulse traveling
through space. The reason this pulse does not get distorted on its way is that all frequency
components receive the same delay as function of position, as represented by (1/c)ζ · x. The
frequency components remain coherent because in the present formulation c is not a function
of frequency. More generally, c does depend on frequency and this leads to dispersion: a
distortion of the pulse as different frequency components experience different delays (or phase
shifts) during propagation.
The additivity of the wave equation also supports the superposition of signals coming from
different directions. These can be monochromatic signals at the same frequency, at different
frequencies, or general signals of the form (2.8). E.g., the original signal could have been reflected
in an object, resulting in a copy of the signal traveling in a different direction (multipath), or
we can have multiple sources transmitting from different locations.
The wave equation (2.1) also admits other solutions than plane waves. If we switch from Carte-
sian coordinates to spherical coordinates (r, φ, θ) centered around the origin, and assume spher-
ically symmetric solutions s(r, t), then these will satisfy the spherical wave equation [1]
∇²(rs) = (1/c²) ∂²(rs)/∂t² .
This is the same equation as before, but now in terms of rs(t) instead of s(t). We thus obtain
similar solutions as before, but now as a function of radius r, and scaled by 1/r. E.g., the
monochromatic spherical wave, propagating from the origin, has the form

s(r, t) = (1/r) e^{j(ωt − kr)} ,

and a more general solution has the form s(r, t) = (1/r) s(t − r/c). The scaling by 1/r is related to
the Friis free-space transmission equation used in telecommunication and radar.
Far away from the origin, in the so-called far field, the solution can be approximated by a plane
wave in case we only study a limited part of space.
The wave equation (2.1) describes propagation in a lossless, homogeneous medium with constant
propagation velocity c. However, the situation may be more general.
Other interesting effects occur if the propagation velocity is frequency dependent: in this case
we do not have the linear relation ω = ck, and dispersion occurs. An example is a prism, where,
at the interface with air, different colors of light are bent in slightly different angles. Other ex-
amples are the propagation of RF signals through the atmosphere or ionosphere, or the acoustic
propagation in the ocean, where propagation speed depends on salinity and temperature.
The effect of dispersion is that of a linear filter H(r, ω), which introduces range-dependent,
frequency-dependent effects in the transmitted signal. Sensors placed in the far field (r large)
will measure the same waveforms, but they are not equal to the transmitted waveform because
of the frequency-dependent filtering: pulse distortion has occurred.
2.2 SPATIAL FOURIER TRANSFORMS

The Fourier transform has proven to be an essential tool in the analysis of linear time-invariant
systems: such systems are characterized by an impulse response, an input signal is convolved
by this impulse response, and in the Fourier domain, this convolution becomes a frequency-wise
multiplication with the transfer function of the system. The transfer function is the Fourier
transform of the impulse response.
Viewed in another way, the inverse Fourier transform shows that a signal can be represented as
a sum of sinusoids, e^{jωt}.
In the spatial domain, the same idea leads to the space-time Fourier transform

S(k, ω) = ∫∫ s(x, t) e^{−j(ωt − k·x)} dt dx ,

where, for convenience, the three-dimensional integral over space is represented by a single
integral sign. S(k, ω) is called the wavenumber-frequency representation, and corresponding
plots are called F-K plots (with frequency in Hz). Such plots are used in seismic exploration
(geophysics) and underwater acoustics [2].
The corresponding inverse Fourier transform is
s(x, t) = (1/(2π)^4) ∫_{−∞}^{∞} ∫_{−∞}^{∞} S(k, ω) e^{j(ωt − k·x)} dk dω .    (2.9)
Equation (2.2) showed that a monochromatic plane wave is given by e^{j(ωt−k·x)}. Thus, (2.9)
shows that any space-time signal can be represented by a weighted sum of monochromatic plane
waves.
If s(x, t) is a single monochromatic plane wave with frequency ω0 and wavenumber vector k0 ,
s(x, t) = e^{j(ω_0 t − k_0·x)} ,    (2.10)
then its spectrum is
S(k, ω) = (2π)^4 δ(ω − ω_0) δ(k − k_0) ,    (2.11)
which represents a single point in wavenumber-frequency space. This expression is verified by
inserting (2.11) into (2.9).
Let us extend this to a wideband source with spectrum S(ω). The velocity of propagation was
previously shown to be c = ω/k with k = kkk. Since the velocity of propagation is given by the
medium, k and ω are not completely independent. We can pick the direction of propagation,
ζ_0 (a unit-norm vector), and then k = (ω/c) ζ_0. A single wideband plane wave thus traces a line in
wavenumber-frequency space (i.e., in an F-K plot), and

S(k, ω) = (2π)^3 S(ω) δ(k − k_0) ;    k_0 = (ω/c) ζ_0 .    (2.12)
This generalizes (2.11) to wideband sources. Since they come from a single direction, such
sources will be called point sources (as opposed to spatially extended sources).
Next, we could look at filters. Working by analogy, a space-time filter can be defined by a
frequency response H(k, ω), and the output of the filter is

Y(k, ω) = H(k, ω) S(k, ω) .

Practically speaking, it is not clear how such filters can be realized, as they act over all of space.
2.2.2 Apertures
In the next section we will look at sampling. Obviously, we will not be able to place sensors
anywhere in space: normally they will be placed on a line or within a limited spatial region. In
analogy to optics, the area over which we will sample space is called the aperture, and it acts
as a spatial window w(x):
y(x, t) = w(x)s(x, t) . (2.13)
E.g., for x = [x, y, z]^T, a linear aperture on the x-axis of size D (a “slit”) is defined by

h(x) = 1 for |x| < D/2, and 0 otherwise,    ⇔    w(x) = h(x) δ(y) δ(z) .    (2.14)
For time-domain signals, we know that a product in time domain becomes a convolution in
frequency domain. Thus, by analogy, applying the space-time Fourier transform to (2.13) yields
Y(k, ω) = (1/(2π)^3) W(k) ∗ S(k, ω) = (1/(2π)^3) ∫ W(k − p) S(p, ω) dp    (2.16)

where the aperture smoothing function is

W(k) = ∫ w(x) e^{j k·x} dx .    (2.17)

For the linear aperture (2.14), this evaluates to

W(k) = sin(k_x D/2) / (k_x/2) ,    (2.18)
Figure 2.2. Aperture function in a 2D scenario. (a) A linear aperture (slit); (b) the corresponding W(k), only the k_x component is shown; (c) W(θ), for D = 2λ.
which is a sinc function in kx , and constant in ky , kz . For the plane wave signal s(x, t) with
wavenumber-frequency transform (2.12), the resulting spectrum is
Y(k, ω) = W(k) ∗ S(ω) δ(k − k_0) = S(ω) W(k − k_0) ;    k_0 = (ω/c) ζ_0 .    (2.19)
Thus, the effect of the aperture (window) in spatial domain is a convolution of the signal in
wavenumber-frequency domain, which smears out (smooths) the spatial spectrum. The resulting
signal does not come from a single direction ζ 0 anymore, but appears to come from a range of
directions around ζ 0 . Thus, the effect of the aperture is a limitation on resolution.
For the linear aperture, (2.18) shows that we get some dilution in the kx component, while ky
and kz are completely dropped: the y and z components of the field are not measured. Thus,
by looking through the slit, only a 1D propagation scenario is visible. Signals with the same kx
but different ky , kz are indistinguishable.
Fig. 2.2 shows a linear aperture in a 2D scenario, and the corresponding function W (k). Since
it only depends on kx , only this component is shown. It is seen from (2.18) that the peak of
the sinc function has magnitude D. The first zero crossing occurs at kx = 2π/D, hence the
main lobe width is said to be approximately 2π/D (the exact value depends on the definition of
width). Thus, as D → ∞, the sinc function converges to a delta spike, as expected. Consider
now the parametrization of k as in (2.5). Then
k_x = −(2π/λ) sin(θ) .
Clearly, as θ varies from −π/2 to π/2, k_x ranges between ±2π/λ, and this is the part of the plot
of W(k_x) that is “visible” for fixed λ and varying direction of arrival θ. In this parametrization,
we find (with some abuse of notation)²

W(θ) = D sin((D/λ) π sin(θ)) / ((D/λ) π sin(θ)) .

² Correct notation would define W_k(k) and W_θ(θ), but we would like to avoid such adorned notation.
Figure 2.3. Aperture function in a 3D scenario. (a) A circular aperture; (b) the corresponding W(k), only the (k_x, k_y) components are shown; (c) W(θ), for D = 2λ.
This function is plotted in the right panel, for D = 2λ. The first zero crossing occurs for
(D/λ)π sin(θ) = π, i.e., sin(θ) = λ/D. Thus, we see that the main lobe width in the θ-plot is
approximately λ/D. This will later be interpreted as the angular resolution of this aperture.
Since the maximum kx that can be obtained is (D/λ) π, we also see that only part of the plot
of W (kx ) is visible, as indicated by the dashed box. For a given D, the visible part depends on
λ, and the ratio D/λ determines the number of sidelobes of W (k) that are visible in W (θ).
Note that we defined W(θ), but to compute the response for a source from direction θ_0, we cannot
work with W(θ − θ_0). Instead, starting from (2.19), we can write Y(θ, ω) = S(ω) W(θ; θ_0),
where

W(θ; θ_0) = D sin((D/λ) π [sin(θ) − sin(θ_0)]) / ((D/λ) π [sin(θ) − sin(θ_0)]) .

For θ_0 close to π/2, the beamshape will not only center around θ_0, but be distinctively different,
with a much broader main lobe.
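The behaviour of W(θ; θ_0) is easy to evaluate numerically. The following Python sketch uses the expression above with D = 2λ (as in Fig. 2.2) and normalizes the peak to 1; the source directions are example values.

```python
import numpy as np

lam, D = 1.0, 2.0                      # wavelength and slit size (D = 2*lam, as in Fig. 2.2)

def W(theta, theta0=0.0):
    """Slit aperture beamshape W(theta; theta0), normalized so that the peak equals 1."""
    u = (D / lam) * np.pi * (np.sin(theta) - np.sin(theta0))
    return np.sinc(u / np.pi)          # np.sinc(x) = sin(pi x)/(pi x)

theta = np.deg2rad(np.arange(-90, 91, 15))
print(np.round(W(theta, theta0=0.0), 2))             # broadside source: narrow main lobe
print(np.round(W(theta, theta0=np.deg2rad(80)), 2))  # source near endfire: much broader main lobe
```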
In 3D, if we take a square aperture around the origin in the (x, y)-plane, then w(x) =
h(x)h(y)δ(z), and
W(k) = [sin(k_x D/2) / (k_x/2)] · [sin(k_y D/2) / (k_y/2)] .    (2.20)
Ideally, we design apertures such that W (k) is as close to a delta spike as possible: the width
of the main lobe determines the spatial resolution in applications such as direction finding. On
the other hand, we don’t necessarily need narrow main lobes in all three dimensions of k: for
direction finding, we are interested in the direction vector ζ, and if c is known, there are only 2
independent dimensions to specify ζ.
Circular aperture In 3D, for a circular aperture with diameter D = 2R, one shows that [1]

W(k) = (2πR/k_{xy}) J_1(k_{xy} R) ,    k_{xy} = √(k_x² + k_y²) ,

where J_1(·) is the first-order Bessel function of the first kind. This smoothing function is known in
optics as the Airy disk. It describes the pattern (bright spot and rings around it) that is visible
on a screen placed behind a small uniformly illuminated aperture. Fig. 2.3 shows the aperture,
W(k) and W(θ). The Bessel function is quite similar to a two-dimensional sinc function, but
note that it is circularly symmetric (unlike (2.20)).
The first zero crossing of J_1(x) occurs for x = 3.8317 . . .. Using (2.7), we find

(2π/λ) sin(θ) R = 3.8317    ⇔    sin(θ) = 1.22 λ/D ,    (2.21)
where D = 2R is the diameter of the array. Again, the beamwidth of this aperture is determined
by λ/D.
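The circular-aperture pattern and the 1.22 λ/D null can be verified numerically with the first-order Bessel function available in SciPy; the sketch below uses D = 2λ as in Fig. 2.3.

```python
import numpy as np
from scipy.special import j1           # first-order Bessel function of the first kind
from scipy.optimize import brentq

lam = 1.0
D = 2 * lam                            # aperture diameter, R = D/2
R = D / 2

def W(theta):
    """Normalized circular-aperture pattern 2 J1(x)/x with x = (2*pi/lam) R sin(theta)."""
    x = 2 * np.pi / lam * R * np.sin(theta)
    return 2 * j1(x) / x if x != 0 else 1.0

theta_null = brentq(W, 1e-6, np.pi / 3)          # first null of the pattern
print(np.sin(theta_null), 1.22 * lam / D)        # both are approximately 0.61 here
```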
2.3 SPATIAL SAMPLING

In practice, we can observe a field only over some finite area in space: the aperture. We could
use, e.g., a parabolic dish with diameter D, and then the aperture is the size of the
dish. The dish casts the incoming energy onto (usually) a single sensor, and because of its
directionality, we will have to scan it to cover all directions.
If D is large, then this is not practical. Instead, we can place a number of sensors inside the area
covered by the aperture. This sensor array will spatially sample the wavefield. The sensors could
be simple antennas, or they could be small dishes or arrays themselves, leading to a hierarchy
of arrays. For the moment, we will assume that the sensors are ideal, omnidirectional antennas,
i.e., they simply capture s(x, t) at a specific position x.
At first, the theory of spatial sampling can be presented as a direct extension of the usual tempo-
ral sampling: sampling creates periodicities in the spectrum, leads to aliasing, and bandlimited
signals can be perfectly reconstructed from their samples. Aliasing in this context means that
sources from two different directions will result in the same sampled signal and thus cannot be
distinguished.
Fig. 2.4 shows some of the notation we will be using in this section. We will first sample the
wavefield, and subsequently apply the aperture (= select a finite number of spatial samples)
and also apply weights to the selected samples. We use X(k, t) and Y (k, t) to keep track of the
related spatial spectra.
Figure 2.4. Notation used in this section: the wavefield s(x, t), arriving from direction θ with wavenumber vector k, is sampled into sensor signals x_m(t), which are weighted by coefficients {w_m} and combined into an output y(t).

Assume first that the sensors are placed uniformly along the x-axis, at positions x = md for
some spacing d, so that the sampled sensor signals are

x_m(t) = s(md, t) ,    m = · · · , −1, 0, 1, · · · .
An infinite number of sensors is needed, but this will be managed later. The original “continu-
ous” signal s(x, t) has space-time spectrum
S(k, ω) = ∫∫ s(x, t) e^{−j(ωt − kx)} dx dt = ∫ [ ∫ s(x, t) e^{jkx} dx ] e^{−jωt} dt ,
Note that the Fourier transform over space and time decouples. For simplicity of notation, we
will instead consider here only the spatial spectrum (omitting the transformation in time): let
S(k, t) = ∫ s(x, t) e^{jkx} dx .
This looks like the usual Fourier transform, except for the minus sign in the exponent, but that
is of no consequence. Inversely,
s(x, t) = (1/2π) ∫ S(k, t) e^{−jkx} dk .
Define the spatial sampling frequency

k_s = 2π/d .
Next, we split the integration over k into a fundamental interval (−ks /2, ks /2], plus shifts nks
of this interval, for n = · · · , −1, 0, 1, · · ·. This leads to
s(x, t) = (1/2π) Σ_n ∫_{nk_s − k_s/2}^{nk_s + k_s/2} S(k, t) e^{−jkx} dk
        = (1/2π) Σ_n ∫_{−k_s/2}^{k_s/2} S(k − nk_s, t) e^{−jkx} e^{−jnk_s x} dk .

Evaluating this at the sample positions x = md gives

x_m(t) = s(md, t) = (1/2π) Σ_n ∫_{−k_s/2}^{k_s/2} S(k − nk_s, t) e^{−jkdm} e^{−j2πnm} dk
        = (1/2π) ∫_{−k_s/2}^{k_s/2} [ Σ_n S(k − nk_s, t) ] e^{−jkdm} dk .    (2.22)
Let us compare this to the spectrum that we can define for the sampled signal, in analogy to
the DTFT. Various definitions are possible, and we opt for
X(k, t) = Σ_m x_m(t) e^{jkdm}    ⇔    x_m(t) = (d/2π) ∫_{−k_s/2}^{k_s/2} X(k, t) e^{−jkdm} dk .    (2.23)
This spectrum is defined on the fundamental interval −ks /2 ≤ k ≤ ks /2, and for larger k it is
periodic (since ks = 2π/d). The usual factor 1/2π in the inverse transform is replaced here by
d/(2π) = 1/ks because we have defined the spectrum using k, instead of a normalized frequency
variable kd which would range from −π to π.
Comparing to (2.22), we see that the spectrum of the sampled signal is related to that of the
unsampled signal via
X(k, t) = (1/d) Σ_n S(k − nk_s, t) ,    −k_s/2 ≤ k ≤ k_s/2 .
To avoid overlap of the shifted spectra (aliasing), the support of S(k, t) must be limited to the
fundamental interval, i.e., |k| = ω/c < k_s/2 for all frequencies present in the signal, where
ω = 2πf, with f in Hz. Alternatively, the distance between the sensors has to satisfy

d < c/(2B) ,

where B = f_max is the bandwidth of the signal in Hz. Or, if λ_min = c/B is the smallest
wavelength in the signal,

d < λ_min/2 .
In words: the distance between sensors has to be less than half of the shortest wavelength in
the signal. If this Nyquist condition holds, then Shannon’s sampling theorem states that the
continuous signal can be recovered perfectly by lowpass filtering the periodic spectrum of the
sampled signal, which amounts to sinc interpolation. Consequently, no information is lost if the
sensors are spaced closer than λ_min/2. Otherwise, aliasing will occur, which will be problematic
for direction finding (or imaging) applications.
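As a quick numerical check of this condition, with illustrative propagation speeds and bandwidths:

```python
def max_spacing(c, f_max):
    """Largest sensor spacing that satisfies d < c/(2B) = lambda_min/2."""
    return c / (2 * f_max)

print(max_spacing(c=3e8, f_max=1.5e9))    # RF signal up to 1.5 GHz: d < 0.10 m
print(max_spacing(c=340.0, f_max=4e3))    # audio up to 4 kHz (c = 340 m/s): d < 0.0425 m
```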
These results extend to higher dimensions. For a wavefield in 2D, we use uniform sampling in
2 dimensions, etc. Nonetheless, this theory is not entirely satisfying yet, as we would like (i) to
sample using a finite number of sensors, (ii) to sample 3D space using only a 2D array, (iii) to
consider using a random (non-uniformly spaced) array.
Let X(k, t) be the (periodic) spectrum of the sampled signal using an infinite number of sensors,
m = −∞, · · · , ∞. Using (2.23) gives
Y(k, t) = Σ_{m=−∞}^{∞} w_m e^{jkmd} (d/2π) ∫_{−k_s/2}^{k_s/2} X(p, t) e^{−jpmd} dp
        = (d/2π) ∫_{−k_s/2}^{k_s/2} [ Σ_{m=−∞}^{∞} w_m e^{j(k−p)md} ] X(p, t) dp .    (2.25)

For the rectangular aperture weights (unity for the M sensors inside the aperture, zero otherwise),
the term in brackets equals W(k − p), with

W(k) = Σ_{m=0}^{M−1} e^{jkmd} ,    (2.26)

so that

Y(k, t) = (d/2π) ∫_{−k_s/2}^{k_s/2} W(k − p) X(p, t) dp .    (2.27)
Figure 2.5. Amplitude of the discrete aperture function W(k) for M = 9 sensors. The plot is periodic with period k_s = 2π/d.
This is recognized as a (circular) convolution of W (k) with the discrete spatial spectrum X(p, t),
over one period of the spectrum. This convolution will smooth the spectrum and limit its
resolution. It will also introduce sidelobes, as we will now see. From (2.26), we obtain
W(k) = Σ_{m=0}^{M−1} e^{jkmd} = (1 − e^{jkMd}) / (1 − e^{jkd}) = [sin(kMd/2) / sin(kd/2)] e^{jk(M−1)d/2} .    (2.28)
The factor e^{jk(M−1)d/2} is simply a phase factor that determines the phase center of the array,
and can be ignored here. The remaining factor (sin over sin) can be viewed as a “periodic sinc
function” (it is known as the Dirichlet kernel and occurs in convergence proofs for Fourier series).
The amplitude |W(k)| is periodic with period k_s = 2π/d, as determined by the denominator. It
has zero crossings for k = 2π/(Md) = k_s/M.
A plot that shows this beamshape is shown in Fig. 2.5, for M = 9. The periodicity with k_s
is clearly visible. The peak of |W(k)| is equal to M, called the array gain. The width of the
main lobe is determined by the first zero crossing, i.e., 2π/(Md) = k_s/M. Ideally, for M → ∞,
W(k) converges to a delta spike train, such that the convolution (2.27) does not change the
spectrum: Y(k, t) = X(k, t). For finite M, the convolution with the main lobe will smear out
the spectrum, and reduce its resolution to 2π/(Md). Indeed, suppose the spectrum S(k, t) contains
two point sources (i.e., delta spikes at specific values for k). The convolution with W(k) will
replace the delta spikes by the main lobe of W(k). If the delta spikes are closer to each other
than approximately 2π/(Md), then the main lobes will highly overlap, and appear in the spectrum
as a single point source. This is similar to the discussion in Sec. 2.2.2 where we looked at the
effect of an aperture. Note that D = Md can be interpreted as the spatial coverage (aperture)
of the array.
We also note that, next to the main lobe, there are in total M − 1 side lobes within the
fundamental interval, in between the zero crossings of the sine function in the numerator of
(2.28). The sidelobes in Fig. 2.5 will cause confusion: if in the spectrum Y(k, t) we observe a
small peak, we will not know if it is a weak point source, or if it is a side lobe of another (strong)
source. Thus, the sidelobes limit the sensitivity of the array.
Due to the periodicity, the main lobe is repeated outside the fundamental interval; these lobes
are called grating lobes. They might appear in the spectrum if the visible region is larger than
the fundamental interval.
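The expression (2.28) is easy to evaluate numerically. The sketch below uses M = 9 and an arbitrary unit spacing d, and confirms the peak value M, the first null at k_s/M, and the grating lobes at multiples of k_s.

```python
import numpy as np

M, d = 9, 1.0
ks = 2 * np.pi / d
k = np.linspace(-2 * ks, 2 * ks, 8001)                  # a few periods of W(k)

# W(k) = sum_{m=0}^{M-1} exp(j k m d), evaluated directly
W = np.abs(np.exp(1j * np.outer(k, np.arange(M)) * d).sum(axis=1))

print(W.max())                                          # equals M (the array gain)
print(W[np.argmin(np.abs(k - ks / M))])                 # ~0: first zero crossing at ks/M
print(W[np.argmin(np.abs(k - ks))])                     # ~M: grating lobe one period away
```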
Until now, we selected in (2.24) aperture weights that were either 0 (outside the aperture)
or 1 (for the M sensors inside the aperture). However, we are not bound to take the non-zero
weights equal to 1: we can select other weights. Doing so will allow us to design other smoothing
functions than the Dirichlet kernel which, after all, has quite high sidelobes. Thus we define,
generalizing (2.26),
W(k) = Σ_{m=0}^{M−1} w_m e^{jkmd} .    (2.29)
The nonzero weights are called a shading, or tapering, of the array; in general they could
be complex numbers. W (k) can be interpreted as a (discrete-space) Fourier transform of the
sequence [wm ], similar to the DTFT. Thus, similar design techniques as used for digital filters
can be applied here to design weights that result in a desired “transfer function” |W (k)| with
minimal sidelobe heights or other desired features. For example, instead of the rectangular
window (2.24), we can use a triangular window, a Hann or Hamming window, etc., or apply
other window/filter design techniques such as Parks/McClellan. An extensive overview is given
in [3, Ch. 3].
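A minimal sketch of this idea, comparing the peak sidelobe level of the rectangular window with that of a Hamming taper (the window choice and array size are illustrative):

```python
import numpy as np

M, d = 9, 1.0
k = np.linspace(-np.pi / d, np.pi / d, 4001)             # one period of W(k)
E = np.exp(1j * np.outer(k, np.arange(M)) * d)           # columns exp(j k m d)

def peak_sidelobe_db(w):
    """Peak sidelobe of |W(k)| = |sum_m w_m exp(jkmd)|, in dB relative to the main lobe."""
    W = np.abs(E @ w)
    W /= W.max()
    i = np.argmax(W)                                     # main-lobe peak (k = 0)
    while i + 1 < len(W) and W[i + 1] < W[i]:
        i += 1                                           # walk down to the first null
    return 20 * np.log10(W[i:].max())

print(peak_sidelobe_db(np.ones(M)))                      # rectangular weights: about -13 dB
print(peak_sidelobe_db(np.hamming(M)))                   # Hamming taper: considerably lower
```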
Figure 2.6. Uniform linear array with M = 9 sensors in a 2D scenario. (a) Configuration; (b) the corresponding W(k_x); (c) W(θ), for d = λ/2.
Consider now a uniform linear array of M sensors in 2D. Thus, we observe only k_x, and drop k_y.
Then, using (2.28),

W(k) = W(k_x) = sin(k_x M d/2) / sin(k_x d/2)

(the phase offset due to the non-zero phase center of the array was dropped for simplicity: the
array is centered around the origin), and with k_x = −(2π/λ) sin(θ) we obtain

W(θ) = sin(M (πd/λ) sin(θ)) / sin((πd/λ) sin(θ)) .
We saw in Sec. 2.2.2 that the relation between D and λ determined the part of W(k) that is
visible in case we fix λ and scan θ: the visible part is the interval [−2π/λ, 2π/λ]. Comparing to Fig.
2.5, we see that if d < λ/2, then the visible part is within one period of W(k_x). Using similar
arguments as before, we estimate the angular resolution as λ/(Md). There are M − 1 zero
crossings, which corresponds to the number of sidelobes.
Fig. 2.6 shows W(k_x) and W(θ), for M = 9 and d = λ/2. For this choice of d, exactly one
period of W(k_x) is visible. Only the visible part is in W(θ), where we see that the horizontal
axis is stretched at the edges (around ±π/2) compared to W(k_x). If we take d > λ/2, then we do
not satisfy the Nyquist criterion, and the resulting aliasing may result in visible grating lobes:
secondary main lobes of sources may appear, especially for sources with angles close to ±π/2.
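The appearance of grating lobes can be checked numerically. The sketch below counts the lobes of |W(θ)| that reach (nearly) the full array gain, for d = λ/2 and for d = 1.5λ (an arbitrary spacing chosen to violate the Nyquist condition).

```python
import numpy as np

M, lam = 9, 1.0
theta = np.linspace(-np.pi / 2, np.pi / 2, 4001)

def W_abs(d):
    kx = -2 * np.pi / lam * np.sin(theta)
    return np.abs(np.exp(1j * np.outer(kx, np.arange(M)) * d).sum(axis=1))

for d in (0.5 * lam, 1.5 * lam):
    W = W_abs(d)
    peaks = (W[1:-1] > W[:-2]) & (W[1:-1] > W[2:]) & (W[1:-1] > 0.9 * M)
    print(d, peaks.sum())   # d = lam/2: one main lobe; d = 1.5*lam: grating lobes at sin(theta) = ±lam/d
```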
More generally, we can consider an irregular array, where M sensors are placed “randomly”, at
locations x_m in 3D. If S(k, t) is the wavefield, then at a location x,

s(x, t) = (1/(2π)^3) ∫ S(k, t) e^{−jk·x} dk .
Figure 2.7. Amplitude of the discrete aperture function W(k_x) for M = 9 non-uniformly spaced sensors. The smallest spacing is d. Two designs are shown: a random one with an aperture slightly larger than the uniform design, and a sparse nonredundant array with a much larger aperture. The dotted lines correspond to a uniform array with M sensors spaced at d.
Taking a finite number of samples at locations x_m and weighting them by a taper w_m gives

y(x, t) = w(x) s(x, t) ,    with    w(x) = Σ_{m=0}^{M−1} w_m δ(x − x_m) .
Let the (continuous) spatial Fourier transform of y(x, t) be Y (k, t). Using a similar derivation
as before, we obtain
Y(k, t) = (1/(2π)^3) ∫ W(k − p) S(p, t) dp = (1/(2π)^3) W(k) ∗ S(k, t) ,

where

W(k) = Σ_{m=0}^{M−1} w_m e^{jk·x_m}

is the Fourier transform of w(x). This generalizes the previous definition (2.29) of W(k)
to the non-uniform case. Again, the sampled spectrum Y (k, t) is the convolution of the original
spectrum with a smoothing function W (k). However, if the sensor locations are not uniformly
spaced, then W (k) and Y (k, t) will not be periodic, and it will be hard to analyze theoretically.
Two examples are shown in Fig. 2.7. The first array is mildly irregular, with a sensor at x = 0,
one at x = d, and each subsequent one placed at a random distance, uniformly distributed between
d and 1.5d, from the previous one. In total M = 9 sensors are used, all with equal weights. Since the array is
irregular, it is seen that the plot of |W (kx )| is non-periodic. The aperture of this array is not
much larger than M d, and the main lobe width is not much narrower than that for a uniform
linear array of M sensors spaced at d (the red dotted line). Grating lobes are present, but not at
the full height of the main lobe, and a bit closer than would be expected for a minimum spacing
of d. Moreover, these plots greatly vary if another, similar, random design is selected. It is clear
that without a design process such random arrays will not have desired properties.
The second array in Fig. 2.7 is based on a uniform distance d. It can be viewed as an M = 45
uniform array that is subsequently thinned to M = 9; this is called a sparse linear array. The
spacings between the sensors are [1]
[1, 4, 7, 13, 2, 8, 6, 3] · d .
Because of the underlying uniformity (all sensor spacings are a multiple of d), the spectrum
W (k) is periodic with period ks = 2π/d. The main lobe is much narrower than before, as the
aperture is D = 45d. As a penalty, the side lobes are now much stronger and appear noise-like.
It could be argued that the effective array gain is only a factor 3 rather than 9.
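The sketch below evaluates W(k_x) for the sparse design mentioned above (sensor spacings [1, 4, 7, 13, 2, 8, 6, 3]·d), confirming the aperture of 44d and the peak value M = 9.

```python
import numpy as np

d = 1.0
spacings = np.array([1, 4, 7, 13, 2, 8, 6, 3]) * d
pos = np.concatenate(([0.0], np.cumsum(spacings)))       # 9 sensor positions, aperture 44 d
M = len(pos)

k = np.linspace(-np.pi / d, np.pi / d, 8001)
W = np.abs(np.exp(1j * np.outer(k, pos)).sum(axis=1))

print(M, pos[-1])              # 9 sensors spanning 44 d
print(W.max())                 # peak M = 9 at k = 0
print(2 * np.pi / pos[-1])     # main-lobe scale ~ 2*pi/(44 d), much narrower than for aperture M d
```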
To analyze array designs, let

c(x) = w(x) ∗ w(−x)    (2.30)

denote the co-array of the sensor configuration; one can show that |W(k)|² is the Fourier transform of c(x). Thus, the co-array determines the magnitude of the spectrum
smoothing function.
Filter design techniques can be used to determine, starting from a desired |W (k)|2 , the corre-
sponding c(x), and subsequently a set of positions and sensor weights that approximate this
c(x). Generally, however, array design comes with many constraints and is not an easy art.
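For a discrete array, the co-array reduces to the collection of pairwise sensor separations. The following sketch computes it for the sparse design above and shows that, apart from the zero lag, every separation occurs only once, i.e., the design is nonredundant.

```python
import numpy as np

d = 1.0
spacings = np.array([1, 4, 7, 13, 2, 8, 6, 3]) * d
pos = np.concatenate(([0.0], np.cumsum(spacings)))

# difference co-array: all pairwise separations x_m - x_n, in units of d
lags = np.subtract.outer(pos, pos).ravel() / d
values, counts = np.unique(np.round(lags).astype(int), return_counts=True)

print(len(values))                          # number of distinct lags obtained from only 9 sensors
print(counts[values != 0].max())            # 1: every nonzero lag occurs exactly once
```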
2.4 CORRELATION PROCESSING

In telecommunication, we use array processing to spatially separate sources and receive one
signal of interest. In many other applications, we are not so much interested in the source signal
s(t), but more in the propagation parameters, i.e., k or the unit-norm direction vector ζ. In the
2D case, ζ is specified by the direction of arrival θ.
In these cases, we can do away with the temporal dimension and work with correlation models.
Generally, we look at second-order correlations between the sensor signals. For non-Gaussian
sources, we might also consider higher-order statistics, cf. Chap. 11.
Thus, in this section we will consider signals s(t) as random processes. If we limit ourselves to
descriptions by second-order statistics, we look at the mean and the variance,
E[s(t)] ,    E[ |s(t) − E[s(t)]|² ] ,
and more in general the autocorrelation function, rs (t, t0 ) = E[s(t)s∗ (t0 )]. (For generality, com-
plex signals are assumed, and the superscript ∗ denotes the complex conjugate.) Usually we
immediately make several simplifying assumptions: we consider that the signal is wide sense
stationary, so that the mean is constant over time, the autocorrelation function only depends
on the time difference τ = t′ − t, and the variance is finite. We can then write

r_s(τ) = E[s(t + τ) s^*(t)] .

Moreover, we usually consider the mean to be zero, E[s(t)] = 0, so that r_s(τ) equals the autocovariance function.
In any case, rs (0) represents the power of the random signal.
Recall from a course on random processes that the power spectral density is defined as the
Fourier transform of the autocorrelation function:
R_s(ω) = ∫ r_s(τ) e^{−jωτ} dτ .
This can be related to the spectrum S(ω) of s(t), but we have to be careful as for a random
process, the energy in s(t) is infinite: we have to look at the energy per unit time. Therefore,
let s_T(t) be equal to s(t) on the interval (−T/2, T/2] and zero otherwise, and let S_T(ω) be the
Fourier transform of s_T(t), then one can show that

R_s(ω) = lim_{T→∞} (1/T) E[ |S_T(ω)|² ] .
For white noise, rs (τ ) is a delta spike, and Rs (ω) is a constant. (Its power, rs (0), is actually
infinite, so truly white noise does not exist.)
As an example, consider a monochromatic plane wave with a random complex amplitude α,

s(x, t) = α e^{j(ω_0 t − k_0·x)} ,    E[α] = 0 ,    E[ |α|² ] = P .

Its temporal autocorrelation function is

r_s(τ) = E[s(x, t + τ) s^*(x, t)] = P e^{jω_0 τ}

(the source is stationary in time and the result does not depend on the position x). Now, in
analogy, consider a sensor at location x0 and one at location x1 . The spatial cross-correlation
function is
r_s(x_0, x_1, τ) = E[s(x_1, t + τ) s^*(x_0, t)] = E[ |α|² ] e^{j(ω_0 τ − k_0·(x_1 − x_0))} .
Note that this depends only on the baseline b = x1 − x0 , i.e., the vector pointing from x0 to x1 .
Such a random field is called homogeneous, and we can write

r_s(b, τ) = P e^{jω_0 (τ − τ_g)} ,    (2.31)

where

τ_g = ζ·b / c .
See Fig. 2.8. The figure shows that τg is the geometric delay, the delay of the wavefront in
propagating from x0 to x1 . In the figure, the signal arrives at x1 first, and therefore the delay
is actually an advance, and therefore negative. If d = |b| and θ is the angle between the source
direction and the direction orthogonal to the baseline (broadside), then
τ_g = −(d/c) sin(θ) .
(The minus sign is due to the orientation of ζ, and indeed, for positive θ the delay is negative.)
Thus, τg is related to the direction of arrival (DOA). Often, DOA estimation algorithms estimate
τg , or the phase delay e−jω0 τg , and determine the DOA from this.
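A small numerical example of this idea, with arbitrary values for the propagation speed, baseline, and source: the DOA is recovered from the phase ω_0 τ_g of the cross-correlation between the two sensors.

```python
import numpy as np

c = 340.0                        # propagation speed (sound in air, m/s) -- illustrative
d = 0.05                         # baseline length |b| in meters
f0 = 1000.0                      # narrowband source frequency in Hz
theta_true = np.deg2rad(25.0)

tau_g = -d / c * np.sin(theta_true)          # geometric delay
phase = -2 * np.pi * f0 * tau_g              # phase of the factor e^{-j omega_0 tau_g}

# invert: from the measured phase back to the delay and the DOA
tau_hat = -phase / (2 * np.pi * f0)
theta_hat = np.arcsin(-c * tau_hat / d)
print(np.rad2deg(theta_hat))                 # 25 degrees (unambiguous since d < lambda/2 here)
```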
Figure 2.8. Geometric delay: a plane wave from direction ζ arrives at sensor x_1 before it arrives at x_0; the baseline vector is b = x_1 − x_0.

Taking the temporal Fourier transform of (2.31) gives the cross power spectral density. For a
general wideband plane wave source with power spectral density R_s(ω), the same reasoning gives

R_s(b, ω) = R_s(ω) e^{−jωτ_g} ,    (2.35)
i.e., the crosscorrelation between two sensors (spaced by b) is the autocorrelation of the source,
convolved with a delay. Of course, this result could have been obtained also directly!
2.5 APPLICATION: RADIO ASTRONOMY

2.5.1 Interferometry
An interferometer measures the correlation of the signals received by two antennas spaced at a
certain distance. After a number of successful experiments in the 1950s and 1960s, two arrays of
25-m dishes were built in the 1970s: the 3 km Westerbork Synthesis Radio Telescope (WSRT, 14
dishes, see Fig. 2.10) in Westerbork, The Netherlands and the 36 km Very Large Array (VLA,
27 movable dishes) in Socorro, New Mexico, USA (Fig. 1.1). These telescopes use Earth rotation
to obtain a sequence of correlations for varying antenna baselines, resulting in high-resolution
images via synthesis mapping. A more extensive historical overview is presented in [5].
The radio astronomy community has recently commissioned a new generation of radio telescopes
for low frequency observations, including the Murchison Widefield Array (MWA) [6] in Western
Australia and the Low Frequency Array (LOFAR) [7] in Europe. These telescopes exploit phased
array technology to form a large collecting area with ∼1000 to ∼50,000 receiving elements. The
community is also making detailed plans for the Square Kilometre Array (SKA), a future radio
telescope that should be one to two orders of magnitude more sensitive than any radio telescope
built to date [8]. This will require millions of elements to provide the desired collecting area of
order one square kilometer.

Figure 2.9. Radio image of Cygnus A observed at 240 MHz with the Low Frequency Array (showing mostly the lobes left and right), overlaid over an X-ray image of the same source observed by the Chandra satellite (the fainter central cloud). (Courtesy of Michael Wise and John McKean.)

Figure 2.11. The concept of interferometry: antenna signals x̃_1(t), . . . , x̃_J(t) (received with gains g_1, . . . , g_J) are correlated pairwise; each pair defines a baseline and a geometric delay.
The concept of interferometry is illustrated in Fig. 2.11. An interferometer measures the spatial
coherency of the incoming electromagnetic field. This is done by correlating the signals from the
individual receivers with each other. The correlation of each pair of receiver outputs provides
the amplitude and phase of the spatial coherence function for the baseline defined by the vector
pointing from the first to the second receiver in a pair. In radio astronomy, these correlations
are called the visibilities.
Obviously, Fig. 2.11 is directly tied to Fig. 2.8. For a wideband plane wave source s propagating
in the direction ζ, the power spectral density Rs (ω) of this source is called the brightness, and
denoted by I(ω, ζ). (Actually, −ζ is used: this is the unit vector pointing towards the source.)
We also defined the observed cross power spectral density due to this source in (2.35) as Rs (b, ω),
and this is the visibility V(ω, b). Thus, for a single source,

V(ω, b) = I(ω, ζ) e^{−j(ω/c) ζ·b} ,

and for a collection of sources, superposition gives

V(ω, b) = ∫ I(ω, ζ) e^{−j(ω/c) ζ·b} dζ .    (2.36)
This relation is called the Van Cittert-Zernike theorem [5, 9]. For each ω, I(ω, ζ) is viewed as an
image (called the map), by parametrizing ζ in two coordinates or two angles, and the objective
in radio astronomy is to obtain this map.
The relation (2.36) has the form of a Fourier transform; if V(ω, b) were available for all baselines
b, the map could be recovered by the corresponding inverse transform,

I(ω, ζ) = (1/(2π)^3) ∫ V(ω, b) e^{j(ω/c) ζ·b} db ,    (2.37)

which is computed for each ω separately. However, in practice, we can estimate V(ω, b) for only
a discrete set of baselines {bk }: every telescope pair provides one baseline, and as the earth
rotates, this baseline rotates and traces an arc in 3D space. We can obtain samples along this
arc.
Thus, we cannot directly implement (2.37). Instead, we can compute an estimate
I_D(ω, ζ) = (1/(2π)^3) Σ_k V(ω, b_k) e^{j(ω/c) ζ·b_k} ,    (2.38)
which is called the dirty map. It is not equal to the desired map I(ω, ζ). Indeed, by substituting
(2.36), we find
I_D(ω, ζ) = (1/(2π)^3) ∫ I(ω, n) Σ_k e^{j(ω/c)(n − ζ)·b_k} dn .    (2.39)
This follows a similar derivation as we saw in Sec. 2.3.5, but now using baselines instead of direct
location samples. We can write (2.39) as
I_D(ω, ζ) = (1/(2π)^3) ∫ W(n − ζ) I(ω, n) dn = (1/(2π)^3) W(ζ) ∗ I(ω, ζ) ,    (2.40)
where

W(ζ) = Σ_k e^{j(ω/c) ζ·b_k} .    (2.41)
Thus, the obtained dirty map is a convolution of the desired “true” map with a smoothing
function W (ζ). This W (ζ) is called the dirty beam. Since, generally, the baseline sampling is
quite irregular, the dirty beam also looks quite random.
An example of a set of antenna coordinates and the corresponding dirty beam is shown in Fig.
2.12. This is for a single low-band LOFAR station and a single 10 second integration interval
and frequency bin. The dirty beam has heavy sidelobes, as high as −10 dB. To make this plot,
the unit-norm direction vector ζ is parametrized as ζ = [ℓ, m, n]^T, where [ℓ, m] are plotted and
n = √(1 − ℓ² − m²) is not shown.
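A minimal sketch of the dirty beam (2.41), using hypothetical antenna positions and an illustrative observing frequency; the beam is evaluated on an (ℓ, m) grid and peaks at ζ = 0, where all baseline phases add coherently.

```python
import numpy as np

rng = np.random.default_rng(2)
c, freq = 3e8, 60e6                         # illustrative low-band observing frequency
lam = c / freq

ant = rng.uniform(-30, 30, size=(16, 2))    # hypothetical antenna positions (meters) in a plane
i, j = np.triu_indices(len(ant), k=1)
b = ant[i] - ant[j]                         # baselines b_k = x_i - x_j for all antenna pairs

l = np.linspace(-1, 1, 101)
L, Mm = np.meshgrid(l, l)                   # (l, m) grid of direction cosines
phase = 2 * np.pi / lam * (L[..., None] * b[:, 0] + Mm[..., None] * b[:, 1])
W = np.abs(np.exp(1j * phase).sum(axis=-1)) # dirty beam, cf. (2.41) restricted to the plane

print(W[50, 50], W.max())                   # both equal the number of baselines (peak at l = m = 0)
print(len(b))
```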
A resulting dirty image is shown in Fig. 2.13. The image shows the complete sky, in (`, m)
coordinates, where the reference direction is pointing towards zenith. The strong visible sources
are Cassiopeia A and Cygnus A, also visible is the milky way, ending in the north polar spur
(NPS) and, weaker, Virgo A. In the South, the Sun is visible as well. The image was obtained by
averaging 25 integration intervals, each consisting of 10 s data in 25 frequency channels of 156
kHz wide, taken from the band 45–67 MHz, avoiding the locally present radio interference. As
this shows data from a single LOFAR station, with a relatively small maximal baseline (65 m),
the resolution is limited and certainly not representative of the capabilities of the full LOFAR
array.

Figure 2.12. (a) Coordinates of the antennas in a LOFAR station, which defines the spatial sampling function, and (b) the resulting dirty beam, plotted in dB.

Figure 2.13. Dirty image following (2.40), using LOFAR station data.
The dirty beam is essentially a non-ideal point spread function due to finite and non-uniform
spatial sampling: we only have a limited set of baselines. The dirty beam has a main lobe
centered at ζ = 0, and many side lobes. If we would have a large number of telescopes positioned
in a uniform rectangular grid, the dirty beam would be a 2D sinc-function. The resulting beam
size is inversely proportional to the aperture (diameter) of the array. This determines the
resolution in the dirty image. The sidelobes of the beam give rise to confusion between sources:
it is unclear whether a small peak in the image is caused by the main lobe of a weak source,
or the sidelobe of a strong source. Therefore, attempts are made to design the array such that
the sidelobes are low. As mentioned in Sec. 2.3.3, it is also possible to introduce weighting
coefficients (“tapers”) in (2.38) to obtain an acceptable beamshape.
As mentioned, an antenna array generates a set of baselines, and as the Earth rotates, these
baselines also rotate and generate many of such sets. In the definition of W (ζ), the effect of
summing over these sets is that the sidelobes tend to get averaged out, to some extent. Many
images are also formed by averaging over a small number of frequency bins (assuming the source
powers Rs (ω) are constant over these frequency bins), which enters into the equations in exactly
the same way.
Since W (ζ) is data-independent, it can be nearly perfectly predicted after careful calibration
of the instrument. Thus, we can try to estimate I(ω, ζ) from ID (ω, ζ) using deconvolution
techniques.
There are many issues that we ignored in the discussion, such as coordinate systems, approxi-
mation of the 3D integral in (2.40) by a 2D integral (on the assumption that either the field of
interest is small, or that the antenna array sits on a flat plane), and how the V (ω, bk ) can be
estimated from received telescope signals. Also, there are directional disturbances due to non-
isotropic antennas, unequal antenna gains, and disturbances due to atmospheric effects. Some
of these questions are covered in future chapters.
2.6 NOTES
The discussion in Sec. 2.1 summarizes the presentation in [1]. Much more can be said about
these topics, in particular for applications where the propagation speed c is position dependent
(as for geophysics exploration or underwater acoustics). A range of applications and related
wave models is found in [2].
A more extensive introduction to wavefield propagation and its role in image formation is offered
in the course EE4595 Wavefield imaging.
The text on radio astronomy signal processing in Sec. 2.5 is based on Van der Veen et al. [4]. This
paper gives a short introduction. A classical introduction from 1985 is in [2, Ch. 5]. Well-known
reference textbooks are [5, 9].
Bibliography
[1] D.H. Johnson and D.E. Dudgeon, Array signal processing: concepts and techniques. Prentice
Hall, 1993.
[3] H.L. Van Trees, Optimum array processing: Part IV of detection, estimation, and modulation
theory. Wiley, 2004.
[4] A.J. van der Veen, S.J. Wijnholds, and A.M. Sardarabadi, “Signal processing for radio
astronomy,” in Handbook of Signal Processing Systems, 3rd ed., Springer, November 2018.
ISBN 978-3-319-91734-4.
[5] A.R. Thompson, J.M. Moran, and G.W. Swenson, Interferometry and Synthesis in Radio
Astronomy. New York: Wiley, 2nd ed., 2001.
[6] C. Lonsdale et al., “The Murchison Widefield Array: Design overview,” Proceedings of the
IEEE, vol. 97, pp. 1497–1506, Aug. 2009.
[7] M. de Vos, A.W. Gunst, and R. Nijboer, “The LOFAR telescope: System architecture and
signal processing,” Proceedings of the IEEE, vol. 97, pp. 1431–1437, Aug. 2009.
[8] P.E. Dewdney, P.J. Hall, R.T. Schilizzi, and T.J. Lazio, “The square kilometre array,” Pro-
ceedings of the IEEE, vol. 97, pp. 1482–1496, Aug. 2009.
[9] R.A. Perley, F.R. Schwab, and A.H. Bridle, Synthesis Imaging in Radio Astronomy, vol. 6
of Astronomical Society of the Pacific Conference Series. BookCrafters Inc., 1994.
Contents
3.1 Antenna array receiver model . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
In the previous chapter, we have seen the basics of wave propagation in relation to array pro-
cessing. Our goal in this book is to give an overview of the basic signal processing algorithms
that are used in this area, and that form the basis of more complicated algorithms in real appli-
cations. An important first step in the derivation of any algorithm should be a good description
of the application scenario and a statement of the basic assumptions that can be made, such
that a clear data model that captures the scenario can be stated. The model will then determine
the type of algorithm that is appropriate. Different assumptions lead to different models and to
different algorithms.
The model should be based on reality but not be overly detailed: if we want to estimate model
parameters, their number should not be too large! The purpose of this chapter and the next is
to present models for a number of prototype scenarios. Depending on the assumptions that are
made, simple models with few parameters result, or more accurate models with more parameters.
It ultimately depends on the requirements of the application which model is preferred.
In this chapter, we start the modeling by focusing on the reception of signals on an antenna
array, under narrowband conditions. If the narrowband condition is satisfied, a delay can be
translated to a phase shift: convolutions simplify to scalar products. This greatly simplifies the
modeling (and subsequent processing), and allows us to develop spatial beamforming without
caring much about time domain aspects.
Figure 3.1. Coherent adding of signals. A parabolic dish physically ensures the correct delays
for coherently adding signals that come from the same look direction. A phased
array has to electronically insert the correct delays.
3.1.1 Introduction
Beamforming An antenna array may be employed for several reasons. A traditional one is
signal enhancement. If the same signal is received at multiple antennas and can be coherently
added, then incoherent additive noise is averaged out. For example, suppose we have a signal
s(t) received at M antennas,
$$x_m(t) = s(t) + n_m(t)\,, \qquad m = 0, \cdots, M-1\,,$$
where s(t) is the desired signal and $n_m(t)$ is noise. Let us suppose that the noise variance is
E[|nm (t)|2 ] = σ 2 . If the noise is uncorrelated from each antenna to the others, then by averaging
we obtain
$$y(t) = \frac{1}{M}\sum_{m=0}^{M-1} x_m(t) = s(t) + \frac{1}{M}\sum_{m=0}^{M-1} n_m(t)\,.$$
The variance of the noise term in $y(t)$ is $\frac{1}{M}\sigma^2$. We thus see that there is an array gain equal to a factor M, the number of antennas.
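A quick numerical check of this array gain (a sketch with arbitrary simulation parameters, not part of the derivation itself):

```python
import numpy as np

# Minimal check of the array gain: averaging M coherent copies of s(t) in
# independent noise reduces the noise variance by a factor M (illustrative sketch).
rng = np.random.default_rng(0)
M, N = 8, 100_000
s = np.exp(2j * np.pi * 0.05 * np.arange(N))          # unit-power narrowband signal
noise = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
x = s[None, :] + noise                                 # x_m(t) = s(t) + n_m(t), sigma^2 = 1
y = x.mean(axis=0)                                     # y(t) = (1/M) sum_m x_m(t)

print("noise variance per antenna  :", np.var(noise[0]))    # ~1
print("noise variance after summing:", np.var(y - s))       # ~1/M
```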
The reason that we could simply average, or add up the received signals xm (t), is that the
desired signal entered coherently, with the same delay at each antenna. More in general, the
[Figure 3.2: two antennas receive s(t) as x0(t) and x1(t); the outputs are weighted by w0 and w1 and summed to y(t).]
desired signal is received at unequal delays and we have to introduce compensating delays to be
able to coherently add them. This requires knowledge on these delays, or the direction at which
the signal was received. The operation of delay-and-sum is known as beamforming, since it can
be regarded as forming a beam into the direction of the source. The delay-and-sum beamformer
acts like an equivalent of a parabolic dish, which physically inserts the correct delays to look in
the desired direction. See Fig. 3.1.
Spatial filtering A second reason to use an antenna array is to introduce a form of spatial
filtering. Filtering can be done in the frequency domain —very familiar—, but similarly in the
spatial domain. Spatial filtering can just be regarded as taking (often linear) combinations of
the antenna outputs, and perhaps delays of them, to reach a desired spatial response.
A prime application of spatial filtering is null steering: the linear combinations are chosen such
that a signal (interferer) is completely cancelled out. Suppose a signal s(t) is received at the
first antenna directly, but at the second antenna with a delay τ , see Fig. 3.2. It is easy to see
how the received signals can be combined to produce a zero output, by inserting a proper delay
and taking the difference. However, even without a delay we can do something. By weighting
and adding the antenna outputs, we obtain a signal y(t) at the output of the beamformer,
$$y(t) = w_0\, x_0(t) + w_1\, x_1(t) = w_0\, s(t) + w_1\, s(t-\tau)\,, \qquad Y(\omega) = (w_0 + w_1 e^{-j\omega\tau})\, S(\omega)\,.$$
Thus, we can make sure that the signal is cancelled, $Y(\omega) = 0$, at a certain frequency $\omega_0$, if we select the weights such that
$$w_1 = -w_0\, e^{j\omega_0\tau}\,.$$
Figure 3.3. (a) Narrowband beamformer (spatial filter); (b) broadband beamformer (spa-
tial/temporal filter).
Figure 3.4. Transmitted real signal z(t) and complex baseband signal s(t).
Note that (i) if we do not delay the antenna outputs but only scale them before adding, then
we need complex weights; (ii) with an implementation using weights, we can cancel the signal
only at a specific frequency, but not at all frequencies.
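The following small sketch verifies this behaviour for two antennas, with illustrative values of $f_0$ and τ: the weight choice above nulls a tone at $\omega_0 = 2\pi f_0$ exactly, but not at neighbouring frequencies.

```python
import numpy as np

# Sketch of two-antenna null steering with complex weights (no delays).
# The second antenna receives the signal with a delay tau; choosing
# w1 = -w0 * exp(j*omega0*tau) cancels the signal at frequency omega0 only.
f0, tau = 1.0e9, 0.3e-9          # illustrative carrier and inter-antenna delay
w0 = 1.0
w1 = -w0 * np.exp(2j * np.pi * f0 * tau)

def residual(f):
    # Output magnitude |w0 + w1 e^{-j 2 pi f tau}| for a unit tone at frequency f.
    return abs(w0 + w1 * np.exp(-2j * np.pi * f * tau))

print("response at f0         :", residual(f0))          # ~0: perfect null
print("response at f0 + 10 MHz:", residual(f0 + 10e6))   # nonzero: null holds only at f0
```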
Thus, for signals that consist of a single frequency, or a narrow band around a carrier frequency,
we can do null steering by means of a phased array (i.e., summing after multiplications by
complex weights). In more general situations, with broadband signals, we need a beamformer
structure consisting of weights and delays. How narrow is narrow-band depends on the maximal
delay across the antenna array, as is discussed next.
Let us recall the following facts. In signal processing, signals are usually represented by their
lowpass equivalents, see e.g., [1]. This is a suitable representation for narrowband signals in a
digital communication system. A real valued bandpass signal with center frequency ω0 may be
written as
$$z(t) = \mathrm{real}\{s(t)\, e^{j\omega_0 t}\} = x(t)\cos(\omega_0 t) - y(t)\sin(\omega_0 t) \qquad (3.1)$$
where s(t) = x(t) + jy(t) is the complex envelope of the signal z(t), also called the baseband
signal. The real and imaginary parts, x(t) and y(t), are called the in-phase and quadrature
components of the signal z(t). In practice, they are generated by multiplying the received signal
with cos(ω0 t) and sin(ω0 t) followed by low-pass filtering. (An alternative is to apply a Hilbert
transformation.)
Suppose that the bandpass signal z(t) is delayed by a time τ. This can be written as
$$z_\tau(t) = z(t-\tau) = \mathrm{real}\{s(t-\tau)\, e^{j\omega_0(t-\tau)}\} = \mathrm{real}\{[s(t-\tau)\, e^{-j\omega_0\tau}]\, e^{j\omega_0 t}\}\,.$$
The complex envelope of the delayed signal is thus $s_\tau(t) = s(t-\tau)\, e^{-j\omega_0\tau}$. Let B be the
bandwidth of the complex envelope (the baseband signal) and let S(ω) be its Fourier transform.
We then have
$$s(t-\tau) = \frac{1}{2\pi}\int_{-B/2}^{B/2} S(\omega)\, e^{-j\omega\tau}\, e^{j\omega t}\, d\omega\,.$$
If $|\omega\tau| \ll 2\pi$ for all frequencies $|\omega| \le \frac{B}{2}$, we can approximate $e^{-j\omega\tau} \approx 1$ for ω within the band, and get
$$s(t-\tau) \approx \frac{1}{2\pi}\int_{-B/2}^{B/2} S(\omega)\, e^{j\omega t}\, d\omega = s(t)\,.$$
Thus, we have for the complex envelope $s_\tau(t)$ of the delayed bandpass signal $z_\tau(t)$ that
$$s_\tau(t) = s(t-\tau)\, e^{-j\omega_0\tau} \approx s(t)\, e^{-j\omega_0\tau}\,.$$
$B\tau \ll 2\pi$ is called the narrowband condition. The conclusion is that, for narrowband signals,
time delays smaller than the inverse bandwidth may be represented as phase shifts of the complex
envelope. This is fundamental in direction estimation using phased antenna arrays.
For propagation across an antenna array, the maximal delay depends on the maximal distance
across the antenna array: the aperture. Let us work with frequencies f = ω/(2π) in Hz, and
corresponding bandwidths W = B/(2π) Hz. If the wavelength is λ = c/f0 and the aperture is ∆
wavelengths, then the maximal delay is τ = ∆λ/c = ∆/f0 . In this context, narrowband means
$$B\tau \ll 2\pi \quad\Leftrightarrow\quad W\,\frac{\Delta}{f_0} \ll 1 \quad\Leftrightarrow\quad W \ll \frac{f_0}{\Delta}\,. \qquad (3.2)$$
For mobile communications, the wavelength around f0 = 1 GHz is about 30 cm. For practical
purposes, ∆ is small, say ∆ < 5 wavelengths, and then narrowband means W ≪ 30 MHz.
This condition is satisfied for most communication systems around 1 GHz. Bluetooth operates
with channels that have a bandwidth of 1 MHz at 2.4 GHz, and the narrowband assumption is
satisfied. Ultrawideband (UWB) systems in the IEEE 802.15.4 standard operate in 500 MHz
bands at 3.1 GHz to 10.6 GHz, and the narrowband assumption does not hold. In low-frequency
radio astronomy, we could have a center frequency at 100 MHz (wavelength 3 m), and a telescope
array with a diameter of 100 km (33,000 wavelengths), so that the maximal bandwidth is in the
order of 3 kHz. This is implemented by splitting the received signals into narrow subbands.
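A one-line calculation along the lines of (3.2); the helper function and the chosen numbers are only for illustration:

```python
# Quick check of the narrowband condition (3.2): the bandwidth W must satisfy
# W << f0 / Delta, with Delta the aperture measured in wavelengths.
c = 3e8

def narrowband_bound(f0_hz, aperture_m):
    delta = aperture_m / (c / f0_hz)      # aperture in wavelengths
    return f0_hz / delta                  # W must stay well below this value

# Low-frequency radio astronomy example from the text: 100 km array at 100 MHz.
print(narrowband_bound(100e6, 100e3))     # ~3 kHz, matching the value quoted above
```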
Figure 3.5. A uniform linear array receiving a far field point source.
The above considerations would be for a single plane wave traveling across the array. For outdoor
propagation, the situation may be different. If there is a reflection on a distant object, there
may be path length differences of a few km, or ∆ = O(1000) wavelengths at 1 GHz. In this
context, narrow band means W ≪ 1 MHz, and for many communication signals (e.g., GSM,
UMTS), this is not really satisfied. In this case, delays of the signal cannot be represented by
mere phase shifts, and we need to do broadband beamforming, i.e., space-time processing.
Usually we sample a signal at Nyquist, i.e., fs = W (assuming complex baseband samples), or
Ts = 1/W . The narrowband condition translates to $\tau_{\max} \ll T_s$: the maximal delay for propa-
gation across the array is less than the sampling period. This will give rise to an instantaneous
data model, discussed later in Sec. 4.3.1.
where ∗ denotes convolution and $T_m$ is the time it takes the signal to travel from the source to the mth antenna.
A uniform array has identical elements, i.e., all antennas have the same response a(t, θ). It is
reasonable to assume separability into a(t, θ) = a0 (θ)g(t), where a0 (θ) is the antenna gain pattern
in the direction θ, and g(t) is its temporal response. If the antennas are onmidirectional and
the frequency response is flat over the band of interest, as is often assumed, we have a0 (θ) = a0
and g(t) = δ(t).
Define by
$$s(t) = g(t) \ast s_0(t - T_0)\, e^{-j\omega_0 T_0}$$
the signal received by the first antenna element, save for the array gain, and let $\tau_m = T_m - T_0$
be the time difference of arrivals (the geometric delays). If the τm are small compared to the
inverse bandwidth of s(t), we may set sm (t) = s(t)e−jω0 τm , which is the signal received at time
t at the mth element of the array.
Collecting the signals received by the individual elements into a vector x(t), we obtain from
(3.3)
$$x(t) = a_0(\theta)\begin{bmatrix} s(t)\\ s_{\tau_1}(t)\\ \vdots\\ s_{\tau_{M-1}}(t)\end{bmatrix} = \begin{bmatrix} 1\\ e^{-j\omega_0\tau_1}\\ \vdots\\ e^{-j\omega_0\tau_{M-1}}\end{bmatrix} a_0(\theta)\, s(t)\,.$$
For a uniform linear array, we have the same distance d between the antenna elements, so that
all delays between two consecutive array elements are the same: τm = mτ . We can also relate
the time difference (or phase shift) to the angle of arrival θ:
$$\omega_0\tau = -\omega_0\,\frac{d\sin(\theta)}{c} = -\frac{2\pi}{\lambda}\, d\sin(\theta) = -2\pi\Delta\sin(\theta)$$
where ∆ = d/λ is the spacing between antenna elements measured in wavelengths (corresponding
to the center frequency ω0 ) so that
$$x(t) = \begin{bmatrix} 1\\ e^{j2\pi\Delta\sin(\theta)}\\ \vdots\\ e^{j2\pi(M-1)\Delta\sin(\theta)}\end{bmatrix} a_0(\theta)\, s(t) =: a(\theta)\, s(t)\,, \qquad (3.4)$$
where the array response vector a(θ) is the response of the array to a plane wave with DOA
θ. The array manifold $\mathcal{A}$ is the curve that a(θ) describes in the M-dimensional complex vector space $\mathbb{C}^M$ when θ is varied over the domain of interest:
$$\mathcal{A} = \{\,a(\theta)\,:\, \theta \in \Theta\,\}\,, \qquad \text{e.g. } \Theta = [-90^\circ, 90^\circ)\,.$$
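For later experiments it is convenient to have (3.4) available as a small function; the sketch below assumes omnidirectional elements ($a_0(\theta) = 1$) and element 0 as phase reference:

```python
import numpy as np

# Array response vector of a uniform linear array, cf. (3.4); theta in radians,
# Delta = d / lambda is the element spacing in wavelengths.
def ula_response(theta, M, delta):
    m = np.arange(M)
    return np.exp(2j * np.pi * m * delta * np.sin(theta))

a = ula_response(np.deg2rad(30), M=7, delta=0.5)
print(a.shape, a[0])      # (7,), first entry is 1 (phase reference at element 0)
```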
In (3.4), the array response vector a(θ) has a very regular form, due to the uniform linear
array structure. More in general, assume that in a 2D scenario we have an irregular array with
elements at positions xm . Further assume x0 = 0: this sets the phase reference at element 0.
Then, following Chap. 2, we find
$$a(\theta) = a_0(\theta)\begin{bmatrix} 1\\ e^{j\phi_1}\\ \vdots\\ e^{j\phi_{M-1}}\end{bmatrix} \qquad (3.5)$$
Figure 3.6. Direction finding means intersecting the array manifold with the line or plane
spanned by the antenna output vectors.
or " #
s (t)
x(t) = As(t) , A = A(θ1 , θ2 ) = [a(θ1 ) a(θ2 )] , s(t) = 1 .
s2 (t)
When s1 (t) and s2 (t) both vary with t, x(t) is confined to a plane. Direction finding now
amounts to intersecting this plane with the array manifold, see Fig. 3.6.
With multipath, we obtain a linear combination of the same source via two different paths. If
the relative delay between the two paths is small compared to the inverse bandwidth, it can be
represented by a phase shift. Thus, the data model is
x(t) = a(θ1 )s(t) + a(θ2 )βs(t)
= {a(θ1 ) + βa(θ2 )} s(t) = a s(t) .
In this case, the combined vector a is not on the array manifold and direction finding is more
complicated. At any rate, x(t) contains an instantaneous multiple a of s(t). In many applications
β is fluctuating relatively quickly, so that a is time varying on short time scales (the coherence
time).
Figure 3.7. Spatial responses to a beamformer w = [1, · · · , 1]T as a function of the incoming
source direction θ.
To separate the sources, we look for a beamforming matrix W such that
$$W^H x(t) = s(t) \quad\Leftrightarrow\quad W^H A = I \quad\Leftrightarrow\quad W = A(A^H A)^{-1}\,.$$
Thus, we have to obtain an estimate of the mixing matrix A and find a left inverse to separate
the sources. We assumed that A is such that AH A is invertible; this requires A to be tall: at
least as many antennas as sources.
There are several ways to estimate A. One we have seen before: if there is no multipath, then
A = [a(θ1 ) a(θ2 )]. By estimating the directions of the sources, we find estimates of θ1 and θ2 ,
and hence A becomes known and can be inverted.
In other situations, in wireless communications, we may know the values of s1 (t) and s2 (t) for a
short time interval t = [0, T ]: the data contains a “training period”. We thus have a data model
$$X = A S\,, \qquad \text{with } S \text{ known,}$$
and the mixing matrix can be estimated in the least squares sense,
$$\hat{A} = \arg\min_A \| X - A S \|_F^2 = X S^H (S S^H)^{-1}\,.$$
[Figure 3.8. Spatial response for fixed w, M = 7 antennas, for ∆ = 0.5, 1 and 2, as a function of angle [deg].]
Array response Let us now examine the response of the array to a fixed beamformer w. Choose
$$w = \begin{bmatrix} 1\\ \vdots\\ 1\end{bmatrix},$$
i.e., we simply sum the outputs of the antennas. The response of the array to a unit-amplitude
signal from direction θ is characterized by
$$|y(t)| = |w^H a(\theta)|\,.$$
Graphs of this response for M = 2, 3, 7 antennas are shown in Fig. 3.7, as a function of θ. Note
that the response is maximal for a signal from the direction 0◦ , or broadside from the array.
This is natural since a signal from this direction is summed coherently, as we have seen in the
beginning of the section. The gain in this direction is equal to M , the array gain. From all other
directions, the signal is not summed coherently. For some directions, the response is even zero,
where the delayed signals add destructively. We saw in Chap. 2 that the number of zeros is
equal to M − 1. In between the zeros, sidelobes occur. The width of the main beam is also
related to the number of antennas, and in Chapter 2 we estimated it at about 180◦ /M . With
more antennas, the beamwidth gets smaller.
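The curves of Fig. 3.7 are easily reproduced numerically; the sketch below uses M = 7 and ∆ = 0.5 and checks the broadside gain and the first null:

```python
import numpy as np

# Spatial response |w^H a(theta)| of the simple "sum" beamformer w = [1,...,1]^T,
# reproducing the shape of the curves in Fig. 3.7 (M = 7 antennas, Delta = 0.5).
def ula_response(theta, M, delta=0.5):
    return np.exp(2j * np.pi * np.arange(M) * delta * np.sin(theta))

M, delta = 7, 0.5
w = np.ones(M)
theta = np.deg2rad(np.linspace(-90, 90, 721))
response = np.abs([w.conj() @ ula_response(t, M, delta) for t in theta])
# plotting response versus theta reproduces the right-hand curve of Fig. 3.7

print("response at broadside :", np.abs(w.conj() @ ula_response(0.0, M, delta)))  # = M
first_null = np.arcsin(1 / (M * delta))          # first null: M * Delta * sin(theta) = 1
print("response at first null:", np.abs(w.conj() @ ula_response(first_null, M, delta)))  # ~0
```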
Ambiguity and grating lobes Let us now consider what happens if the antenna spacing in-
creases beyond d = λ/2. As before, let ∆ = d/λ. We have an array response vector
$$a(\theta) = \begin{bmatrix} 1\\ e^{j\phi}\\ \vdots\\ e^{j(M-1)\phi}\end{bmatrix}, \qquad \phi = 2\pi\Delta\sin(\theta)\,.$$
[Figure 3.9 panels: M = 7, ∆ = 0.5; (a) spatial response for fixed w; (b), (c) response for scanning w, with sources at [0°, 30°] and [0°, 12°].]
Figure 3.9. Beam steering. (a) response to w = a(30◦ ); (b) response for scanning w = a(θ),
in a scenario with two sources, well separated, and (c) separated less than a beam
width.
Since sin(θ) ∈ [−1, 1], we have that 2π∆ sin(θ) ∈ [−2π∆, 2π∆]. If ∆ > 0.5, then this interval
extends beyond [−π, π]. In that case, there are several values of θ that give rise to the same
argument of the exponent, or to the same φ. The effect is two-fold:
• In the array response graph, grating lobes occur, see Fig. 3.8. This is because coherent addition is now possible for several values of θ.
• In direction finding, several distinct angles θ give rise to the same array response vector a(θ), so that the direction of a source cannot be determined unambiguously (spatial aliasing).
Grating lobes prevent a unique estimation of θ. However, we can still estimate A and it does
not prevent the possibility of null steering or source separation. Sometimes, grating lobes can
be suppressed by using directional antennas rather than omnidirectional ones (e.g., parabolic
dishes): the spatial response is then multiplied with the directional response of the antenna
a0 (θ) as in (3.4), and if it is sufficiently narrow, only a single lobe is left.
Beam steering Finally, let us consider what happens when we change the beamforming vector
w. Although we are free to choose anything, let us choose a structured vector, e.g., w = a(30◦ ).
Fig. 3.9 shows the response to this beamformer. Note that now the main peak shifts to 30◦ ,
signals from this direction are coherently added. By scanning w = a(θ), we can place the peak
at any desired θ. This is called classical beamforming.
This also provides a simple way to do direction estimation. Suppose we have a single unit-norm
source, arriving from broadside (0◦ ),
$$x(t) = a(0)\, s(t) = \begin{bmatrix} 1\\ \vdots\\ 1\end{bmatrix} s(t) = \mathbf{1}\, s(t)\,.$$
If we compute $y(t) = w^H x(t)$ and scan $w(\theta) = a(\theta)$ over all values of θ and monitor the output power of the beamformer,
$$P(\theta) = E[\,|y(t)|^2\,] = |w(\theta)^H a(0)|^2\, P_s\,,$$
where $P_s$ is the source power, then (except for the square) we obtain essentially the same array
graph as in Fig. 3.7 before (it is the same functional). Thus, there will be a main peak at 0◦ ,
the direction of arrival, and the beam width is related to the number of antennas. In general, if
the source is coming from direction θ0 , then the graph will have a peak at θ0 .
With two sources, x(t) = a(θ1 )s1 (t) + a(θ2 )s2 (t), the array graph will show two peaks, at θ1 and
θ2 , at least if the two sources are well separated. If the sources are close, then the two peaks
will shift, then merge and at some point we will not recognize that there are in fact two sources.
The choice w(θ) = a(θ) is one of the simplest forms of beamforming.1 It is data independent,
and optimal only for a single source in white noise. One can show that for more than 1 source,
the parameter estimates for the directions θi will be biased: the peaks have a tendency to move
a little bit to each other. Unbiased estimates are obtained only for a single source.
There are other ways of beamforming, in which the beamformer is selected depending on the
data, with higher resolution (sharper peaks) and better statistical properties in the presence
of noise. Alternatively, we may follow a parametric approach in which we pose the model
x(t) = a(θ1 )s1 (t) + a(θ2 )s2 (t) and try to compute the parameters θ1 and θ2 that best fit the
observed data, as we discussed in Section 3.1.4.
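A small simulation of this classical (scanning) beamformer, with two well-separated sources at hypothetical directions 0° and 30° and an arbitrary noise level:

```python
import numpy as np

# Classical ("scanning") beamforming sketch: two sources at 0 and 30 degrees,
# scan w(theta) = a(theta) and inspect the output power (illustrative values).
rng = np.random.default_rng(1)
M, N, delta = 7, 1000, 0.5

def a(theta):
    return np.exp(2j * np.pi * np.arange(M) * delta * np.sin(theta))

A = np.column_stack([a(np.deg2rad(0)), a(np.deg2rad(30))])
S = (rng.standard_normal((2, N)) + 1j * rng.standard_normal((2, N))) / np.sqrt(2)
X = A @ S + 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

scan = np.deg2rad(np.linspace(-90, 90, 361))
power = [np.mean(np.abs(a(t).conj() @ X) ** 2) for t in scan]
print("strongest peak near [deg]:", np.rad2deg(scan[np.argmax(power)]))  # close to 0 or 30
```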
For more general arrays, the array response vector a(θ) is given by (3.5), and we would still pick
w(θ) = a(θ) to steer towards θ, i.e., set
$$w(\theta) = \begin{bmatrix} 1\\ e^{j\phi_1}\\ \vdots\\ e^{j\phi_{M-1}}\end{bmatrix}. \qquad (3.7)$$
Note that when we compute w(θ)H x(t), we apply complex conjugates to the entries of w, so that
φm becomes −φm . It is recognized that the resulting phases −φm are precisely those that are
needed to compensate the phase of the incoming signal at sensor m, so that we sum coherently
for signals coming from direction θ.
Beam shaping In (3.7), the amplitudes of each entry of w were all equal to 1. As a result,
all beams in Fig. 3.9 looked like Dirichlet functions. In particular, they all have quite high
sidelobes.
¹ It is a spatial matched filter, and known as Maximum Ratio Combining in communications.
[Figure: beamforming architecture. After aperture sampling, each antenna signal x_m(t) is beamsteered by a phase e^{−jφ_m}, scaled by a beamshaping weight w_{0,m}^*, and the results are summed to y(t).]
We can address this by tapering, i.e., scale each entry ejφm of w(θ) by a weight w0,m . The
resulting beamformer can be written as
$$w = w_0 \odot w(\theta) = \begin{bmatrix} w_{0,0}\\ w_{0,1}\, e^{j\phi_1}\\ \vdots\\ w_{0,M-1}\, e^{j\phi_{M-1}}\end{bmatrix} = \begin{bmatrix} w_{0,0} & & \\ & \ddots & \\ & & w_{0,M-1}\end{bmatrix} a(\theta)\,. \qquad (3.8)$$
The latter way of writing (as a matrix multiplying a(θ)) will be recognized in a later chapter
when we consider more general beamformers.
The design of w0 to arrive at a desired beam shape follows the same discussion as in Sec. 2.3.3,
where we discussed weighted spatial Fourier transforms.
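As a sketch of the effect of tapering, the following compares the peak sidelobe level of a plain steering vector with that of a Hann-tapered one. The Hann window is just one possible choice of $w_0$; M, ∆ and the steering direction are illustrative.

```python
import numpy as np

# Beam shaping sketch: scaling the steering vector with a taper w0 (a Hann window
# here) lowers the sidelobes at the cost of a wider main beam.
M, delta = 16, 0.5
theta0 = np.deg2rad(20)
scan = np.deg2rad(np.linspace(-90, 90, 2001))

def a(theta):
    return np.exp(2j * np.pi * np.arange(M) * delta * np.sin(theta))

def peak_sidelobe_db(w):
    p = np.abs([w.conj() @ a(t) for t in scan])
    p_db = 20 * np.log10(p / p.max() + 1e-12)
    # Local maxima of the pattern: the largest is the main lobe, the next one
    # is the highest sidelobe.
    is_peak = (p_db[1:-1] > p_db[:-2]) & (p_db[1:-1] > p_db[2:])
    return np.sort(p_db[1:-1][is_peak])[-2]

print("uniform weights, peak sidelobe [dB]:", peak_sidelobe_db(a(theta0)))              # ~ -13
print("Hann taper,      peak sidelobe [dB]:", peak_sidelobe_db(np.hanning(M) * a(theta0)))  # much lower
```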
where s(t) is a narrowband passband signal with center frequency ω0 . However, the modulation
ejω0 t can be dropped from both s(t) and xm (t). This makes the expressions for xm (t) equivalent.
$$R_s = \Sigma_s = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_d^2\end{bmatrix},$$
where the source variances (powers) σi2 are possibly unequal. If σ s = vecdiag(Σs ) is a
vector containing the source powers, then (3.10) becomes
rx = (Ā ◦ A)σ s .
If the sources all have the same power $\sigma_s^2$, then $R_s = \sigma_s^2 I$. This leads to
$$R_x = \sigma_s^2\, A A^H\,, \qquad r_x = \sigma_s^2\, (\bar{A}\circ A)\,\mathbf{1}\,.$$
In the presence of additive noise, the data model becomes
$$x_n = A\, s_n + n_n\,.$$
Similar to the sources, the noise is considered to be zero mean and wide-sense stationary,
$$E[n_n] = 0\,, \qquad R_n = E[n_n n_n^H]\,.$$
Often the noise is moreover assumed to be spatially white with equal power on each antenna, $R_n = \sigma_n^2 I$.
With diagonal source and noise covariance models (with $R_n$ diagonal and $\sigma_n = \mathrm{vecdiag}(R_n)$), the vectored data model can be written as
$$r_x = (\bar{A}\circ A)\,\sigma_s + (I \circ I)\,\sigma_n\,.$$
In practice, the covariance matrix is estimated from N samples as
$$\hat{R}_x = \frac{1}{N}\sum_{n=0}^{N-1} x_n x_n^H\,,$$
which is called the sample correlation matrix. Using a data matrix $X = [x_0, \cdots, x_{N-1}]$, we can also write it as
$$\hat{R}_x = \frac{1}{N}\, X X^H\,.$$
If we define R̂s in a similar way then, in the noiseless case, R̂x = AR̂s AH . With additive noise,
$$\hat{R}_x = A\hat{R}_s A^H + \hat{R}_n + \text{(cross terms)}\,.$$
Since the sources and the noise are zero mean, the cross terms are zero mean, and the covariance
estimate is unbiased:
$$E[\hat{R}_x] = R_x = A R_s A^H + R_n\,.$$
Thus, we can consider R̂x to be equal to Rx plus a zero mean error term E due to the finite
number of samples. For large N , we expect E → 0. What is the (co)variance of R̂x ? Or
equivalently, of E?
To answer this, we work with the vectored covariance matrices rx and r̂x .
For a matrix-valued stochastic variable R̂, its covariance matrix can be defined as the covariance
of r̂, i.e.,
$$\mathrm{cov}[\hat{R}] = \mathrm{cov}[\hat{r}] = E[(\hat{r} - E[\hat{r}])(\hat{r} - E[\hat{r}])^H]\,.$$
Next, insert the definition of $\hat{r} = \frac{1}{N}\sum_n x_n^* \otimes x_n$, and use the fact that $x_i$ is independent of $x_j$ for $i \ne j$ to derive
$$\begin{aligned}
\mathrm{cov}[\hat{R}_x] &= E\Big[\Big(\tfrac{1}{N}\sum_i (x_i^*\otimes x_i) - E[x_i^*\otimes x_i]\Big)\Big(\tfrac{1}{N}\sum_j (x_j^*\otimes x_j) - E[x_j^*\otimes x_j]\Big)^H\Big] \qquad (3.11)\\
&= \frac{1}{N^2}\sum_i\sum_j E\big[(x_i^*\otimes x_i - E[x_i^*\otimes x_i])(x_j^*\otimes x_j - E[x_j^*\otimes x_j])^H\big]\\
&= \frac{1}{N^2}\sum_i E\big[(x_i^*\otimes x_i - E[x_i^*\otimes x_i])(x_i^*\otimes x_i - E[x_i^*\otimes x_i])^H\big]\\
&= \frac{1}{N}\Big( E[(x^*\otimes x)(x^*\otimes x)^H] - E[x^*\otimes x]\, E[x^*\otimes x]^H \Big) \qquad (3.12)\\
&= \frac{1}{N}\, C_x\,,
\end{aligned}$$
where
$$C_x = E[(x_k^*\otimes x_k)(x_k^*\otimes x_k)^H] - E[x_k^*\otimes x_k]\, E[x_k^*\otimes x_k]^H\,. \qquad (3.13)$$
The first term of this expression shows that the covariance of R̂x involves fourth-order correla-
tions. These can often be described in simpler terms using cumulants. A discussion of this is
deferred to Chap. 11.
For the special case where the entries xi of x are zero-mean and jointly Gaussian distributed, it
is known that (for arbitrary indices a, b, c, d = 0, · · · , M − 1)
E[xa x∗b xc x∗d ] = E[xa x∗b ]E[xc x∗d ] + E[xa x∗d ]E[x∗b xc ] + E[xa xc ]E[x∗b x∗d ] . (3.14)
This follows from an expression of the 4th order (joint) cumulant in terms of moments; for
Gaussian random variables this cumulant is zero. “Proper” (or circularly symmetric) complex
variables are such that E[xxT ] = 0. In this case, the last term vanishes.
The LHS of (3.14) represents a 4th order moment. Stacking in a matrix with row-index a + M b
and column-index c + M d, we can write this expression compactly as
$$E[(x^*\otimes x)(x^*\otimes x)^H] = E[x^*\otimes x]\, E[x^*\otimes x]^H + E[x^* x^{*H}]\otimes E[x x^H] + E[(x^*\otimes\mathbf{1})(\mathbf{1}\otimes x)^H]\odot E[(\mathbf{1}\otimes x)(x^*\otimes\mathbf{1})^H]\,.$$
For proper complex variables, the last term vanishes. For this case, if we compare to (3.13), we
see that $C_x = R_x^* \otimes R_x$.
Thus, for zero mean proper complex-valued Gaussian random variables,
$$\mathrm{cov}[\hat{R}_x] = \frac{1}{N}\, R_x^* \otimes R_x\,, \qquad (3.15)$$
while for zero mean non-proper complex variables
$$\mathrm{cov}[\hat{R}_x] = \frac{1}{N}\Big[ R_x^*\otimes R_x + E[(x^*\otimes\mathbf{1})(\mathbf{1}\otimes x)^H]\odot E[(\mathbf{1}\otimes x)(x^*\otimes\mathbf{1})^H]\Big]\,, \qquad (3.16)$$
and for zero mean real-valued Gaussian random variables
$$\mathrm{cov}[\hat{R}_x] = \frac{1}{N}\Big[ R_x\otimes R_x + E[(x\otimes\mathbf{1})(\mathbf{1}\otimes x)^T]\odot E[(\mathbf{1}\otimes x)(x\otimes\mathbf{1})^T]\Big]\,. \qquad (3.17)$$
For a data-independent beamformer w, the output power estimate is $\hat{P} = w^H \hat{R}_x w = (w^*\otimes w)^H \hat{r}$, so that its variance is
$$\mathrm{var}[\hat{P}] = (w^*\otimes w)^H\, \mathrm{cov}[\hat{R}]\,(w^*\otimes w)\,.$$
Assuming zero mean proper complex Gaussian sources, we can insert (3.15):
$$\mathrm{var}[\hat{P}] = \frac{1}{N}\,(w^*\otimes w)^H (R^*\otimes R)(w^*\otimes w) = \frac{1}{N}\,(w^T R^* w^*)\,(w^H R w) = \frac{1}{N}\,|w^H R w|^2 = \frac{1}{N}\,\big(E[\hat{P}]\big)^2\,.$$
In other words, the standard deviation of the spectrum estimate is $1/\sqrt{N}$ times the expected value of the spectrum estimate itself.
This is the same result as for the periodogram [2, Ch. 8]. The result is valid for any data-
independent beamformer w(θ), i.e., also if we apply tapering.
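A Monte-Carlo sketch of this result, with an arbitrary covariance matrix and an arbitrary fixed beamformer (all simulation parameters are illustrative):

```python
import numpy as np

# Monte-Carlo check of the 1/N variance result for the beamformer output power
# P_hat = w^H R_hat w, for proper complex Gaussian data.
rng = np.random.default_rng(2)
M, N, trials = 4, 200, 2000

G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = G @ G.conj().T / M                                  # an arbitrary covariance matrix
w = rng.standard_normal(M) + 1j * rng.standard_normal(M)
L = np.linalg.cholesky(R)

P_hat = np.empty(trials)
for t in range(trials):
    X = L @ (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    R_hat = X @ X.conj().T / N
    P_hat[t] = np.real(w.conj() @ R_hat @ w)

P = np.real(w.conj() @ R @ w)
print("empirical var[P_hat]:", P_hat.var())
print("predicted P^2 / N   :", P ** 2 / N)
```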
3.2.4 Variance
The variance of r̂ is a vector consisting of the diagonal entries of cov[r̂]. The variance of R̂ is
defined as an unfolding of this vector into a matrix. Each entry of this matrix then shows the
variance of the corresponding entry in R̂.
Thus, if D = diag(R) and d = vecdiag(R), then, for zero mean complex proper Gaussian
variables,
$$\mathrm{var}[\hat{r}] = \frac{1}{N}\,\mathrm{vecdiag}(R^*\otimes R) = \frac{1}{N}\, d\otimes d$$
and
$$\mathrm{var}[\hat{R}] = \mathrm{vec}^{-1}(\mathrm{var}[\hat{r}]) = \frac{1}{N}\, d\, d^T\,.$$
Some examples follow.
Independent noise If $x_k = n_k$ is zero mean proper Gaussian noise with covariance $\Sigma_n = \mathrm{diag}(\sigma_n)$ (i.e., the sensors have independent noise with variances $\sigma_{n,i}^2$), then
$$R = \Sigma_n\,, \qquad \mathrm{cov}[\hat{R}] = \frac{1}{N}\,\Sigma_n\otimes\Sigma_n\,, \qquad \mathrm{var}[\hat{R}] = \frac{1}{N}\,\sigma_n\sigma_n^T\,.$$
Single point source If $x_k = a\, s_k$, where $s_k$ is a zero mean proper complex Gaussian source with unit variance (a non-unit variance can be incorporated in a), then
$$R = a a^H\,, \qquad \mathrm{cov}[\hat{R}] = \frac{1}{N}\, a^* a^T \otimes a a^H = \frac{1}{N}\,(a^*\otimes a)(a^*\otimes a)^H\,, \qquad \mathrm{var}[\hat{R}] = \frac{1}{N}\, a^* a^T \odot a a^H\,.$$
Although ⊙ generally does not preserve the rank of matrices, it can be shown that $a^* a^T \odot a a^H$ is rank 1 (see Sec. 5.1.6).
In Sec. 2.5, we introduced radio astronomy. Starting from basic wave propagation, we arrived
at the “Van Cittert-Zernike” measured data model (2.36) of the form
$$V(\omega, b) = \int I(\omega, \zeta)\, e^{-j\frac{\omega}{c}\zeta\cdot b}\, d\zeta\,, \qquad (3.18)$$
which describes the received cross power spectral density V(ω, b) over a baseline b, in terms of the image I(ω, ζ), i.e., the intensity into the direction ζ, and the phase delays $\frac{\omega}{c}\zeta\cdot b$ in that look direction (the geometric delays).
With the tools in the present chapter, we can rewrite this into a data matrix form. Let us first
consider the receiver model in a bit more detail.
The telescope signals are split into narrow subbands; let x(n, k) denote the stacked vector of antenna signals at time sample n in the subband centered at frequency $f_k$. The correlator computes short-term correlation estimates
$$\hat{R}_{p,k} = \frac{1}{N}\sum_{n=(p-1)N}^{pN-1} x(n,k)\, x^H(n,k)\,, \qquad (3.19)$$
where p is the index of the corresponding “short-term interval” (STI) over which is correlated.
The processing chain is summarized in Fig. 3.11.
The duration of an STI depends on the stationarity of the data, which is limited by factors
like Earth rotation and the diameter of the array. For the Westerbork array, a typical value for
the STI is 10 to 30 s; the total observation can last for up to 12 hours. The resulting number
of samples N in a snapshot observation is equal to the product of bandwidth and integration
time and typically ranges from 103 (1 s, 1 kHz) to 106 (10 s, 100 kHz) in radio astronomical
applications.
For our purposes, it is convenient to model the sky as consisting of a collection of Q spatially
discrete point sources, with sq [n, k] the signal of the qth source at time sample n and frequency
fk .
For a single source, we saw in (3.5) that the received signal at an antenna array can be expressed
as
$$x[n,k] = a_q[n,k]\, s_q[n,k]\,,$$
where the array response vector contains phase factors determined by the normalized antenna position vectors
$$z_m[n,k] = \frac{2\pi f_k}{c}\, x_m\,,$$
with $x_m$ the position vector of the mth antenna.
As the earth rotates, the antenna positions are actually functions of time. We can collect them
in a 3 × M matrix
Z[n, k] = [z1 [n, k], · · · , zM [n, k]] .
In this notation,
$$a_q[n,k] = e^{j Z[n,k]^T \zeta_q}\,. \qquad (3.20)$$
For Q sources and additive receiver noise, the data model becomes
$$x[n,k] = \sum_{q=1}^{Q} a_q[n,k]\, s_q[n,k] + n[n,k]\,, \qquad (3.21)$$
where $a_q[n,k]$ is the array response vector for the qth source, consisting of the phase multiplication factors, and n[n, k] is an additive noise vector, due to thermal noise at the receiver. We
will model sq [n, k] and n[n, k] as baseband complex envelope representations of zero mean wide
sense stationary temporally white Gaussian random processes sampled at the Nyquist rate.
For convenience of notation, we will in future usually drop the dependence on the frequency fk
(index k) from the notation.
Previously, in (3.19), we defined correlation estimates R̂p as the output of the data acquisition
process, where the time index p corresponds to the pth short term integration interval (STI),
such that (p − 1)N ≤ n ≤ pN . Due to Earth rotation, the vector aq [n] changes slowly with time,
but we assume that within an STI it can be considered constant and can be represented, with
some abuse of notation, by aq [p]. In that case, x[n] is wide sense stationary over the STI, and
a single STI autocovariance is defined as
$$R_p = E[\, x[n]\, x^H[n]\,]\,, \qquad p = \left\lceil \frac{n}{N}\right\rceil\,, \qquad (3.22)$$
where Rp has size M × M . Each element of Rp represents the interferometric correlation
along the baseline vector between the two corresponding receiving elements. It is estimated by
STI sample covariance matrices R̂p defined in (3.19), and our stationarity assumptions imply
E[R̂p ] = Rp .
If we generalize now to Q sources and add zero mean noise, uncorrelated from antenna to
antenna, as in the signal model (3.21), we obtain the covariance data model
$$R_p = A_p \Sigma_s A_p^H + \Sigma_n\,, \qquad p = 0, 1, 2, \cdots, \qquad (3.23)$$
or, written element-wise (ignoring the noise contribution),
$$(R_p)_{i,j} = \sum_{q=1}^{Q} \sigma_q^2\, e^{j(z_i(p)-z_j(p))^T \zeta_q}\,, \qquad (3.24)$$
where $(R_p)_{i,j}$ is the correlation between antennas i and j at STI interval p, $I(\zeta_q) = \sigma_q^2$ is the brightness (power) of the source in direction $\zeta_q$, $z_i(p)$ is the normalized location vector of the ith antenna at STI p, and $\zeta_q$ is the unit propagation vector from the qth source.
The function I(ζ) is the brightness image (or map) of interest. For our discrete point-source
model, it is
$$I(\zeta) = \sum_{q=1}^{Q} \sigma_q^2\, \delta(\zeta - \zeta_q)\,, \qquad (3.25)$$
where δ(·) is a Kronecker delta, and the direction vector ζ is mapped to the location of “pixels”
in the image (various transformations are possible). Only the pixels ζ q are nonzero, and have
value equal to the source variance σq2 .
Equation (3.24) describes the relation between the visibility model and the desired image, and
it has the form of a Fourier transform; as discussed in Chap. 2.5, it is the Van Cittert-Zernike
theorem [4, 5]. Image formation is essentially the inversion of this relation. We discussed this
in Sec. 2.5. In the present setting, we have only a finite set of observations (as indexed by p). If
we apply the inverse “discrete-space” Fourier transformation to the measured correlation data,
we obtain the dirty image
$$\hat{I}_D(\zeta) := \sum_{i,j,p} (\hat{R}_p)_{ij}\, e^{-j(z_i(p)-z_j(p))^T \zeta}\,. \qquad (3.26)$$
In terms of the measurement data model (3.24), the “expected value” of the image is obtained
by replacing R̂p by Rp , or
$$\begin{aligned}
I_D(\zeta) &:= \sum_{i,j,p} (R_p)_{i,j}\, e^{-j(z_i(p)-z_j(p))^T \zeta}\\
&= \sum_{i,j,p}\sum_{q} \sigma_q^2\, e^{-j(z_i(p)-z_j(p))^T (\zeta-\zeta_q)}\\
&= \sum_{q} I(\zeta_q)\, B(\zeta-\zeta_q)\\
&= I(\zeta)\ast B(\zeta)\,, \qquad (3.27)
\end{aligned}$$
where the dirty beam (or point spread function) is given by
$$B(\zeta) := \sum_{i,j,p} e^{-j(z_i(p)-z_j(p))^T \zeta}\,. \qquad (3.28)$$
This is the same result as in Sec. 2.5, but now for spatially sampled observations. Again, the
dirty image ID (ζ) is the desired image I(ζ) convolved with the dirty beam B(ζ): every point
source excites a beam B(ζ − ζ q ) centered at its location ζ q . Note that B(ζ) is a known function:
it only depends on the locations of the telescopes, or rather the sampled set of telescope baselines
zi (p) − zj (p).
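The following sketch forms such a dirty image on a 1D grid of direction cosines from a single simulated STI covariance matrix; the antenna positions, source directions and powers are made up for illustration.

```python
import numpy as np

# Sketch of the dirty image (3.26) on a 1D grid of direction cosines l,
# from a simulated covariance of a small east-west array (illustrative values).
c, f0 = 3e8, 60e6
pos = np.array([0.0, 7.0, 19.0, 35.0, 52.0, 65.0])     # antenna positions [m], hypothetical
z = 2 * np.pi * f0 / c * pos                           # normalized positions
M = len(pos)

l_src, sigma2 = np.array([-0.4, 0.3]), np.array([2.0, 1.0])    # two point sources
A = np.exp(1j * np.outer(z, l_src))                    # array response vectors
R = A @ np.diag(sigma2) @ A.conj().T + 0.1 * np.eye(M) # covariance model (3.23), one STI

l = np.linspace(-1, 1, 801)
steer = np.exp(1j * np.outer(z, l))                    # e^{j z_i l} for all pixels
dirty = np.real(np.sum(steer.conj() * (R @ steer), axis=0))   # sum_ij R_ij e^{-j(z_i - z_j) l}
print("brightest pixel at l ≈", l[np.argmax(dirty)])   # close to the strongest source (-0.4)
```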
3.4 NOTES
Bibliography
[1] J.G. Proakis and M. Salehi, Communication Systems Engineering. Prentice-Hall, 1994.
[2] M.H. Hayes, Statistical digital signal processing and modeling. Wiley, 1996.
[3] A.J. van der Veen, S.J. Wijnholds, and A.M. Sardarabadi, “Signal processing for radio
astronomy,” in Handbook of Signal Processing Systems, 3rd ed., Springer, November 2018.
ISBN 978-3-319-91734-4.
[4] R.A. Perley, F.R. Schwab, and A.H. Bridle, Synthesis Imaging in Radio Astronomy, vol. 6
of Astronomical Society of the Pacific Conference Series. BookCrafters Inc., 1994.
[5] A.R. Thompson, J.M. Moran, and G.W. Swenson, Interferometry and Synthesis in Radio
Astronomy. New York: Wiley, 2nd ed., 2001.
Contents
4.1 Physical channel properties . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Signal modulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Deterministic data models . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 Frequency-domain data models . . . . . . . . . . . . . . . . . . . . . . 84
4.5 Application: radio astronomy . . . . . . . . . . . . . . . . . . . . . . . 84
4.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Having covered narrowband data models, in this chapter we continue and focus on wideband
data models. These are used in the context of wireless (RF) communication systems, where
convolutions by pulse shape functions and channel propagation delays play an important role.
A data model for wireless communication consists of the following parts (see Fig. 4.1):
1. Source model: signal alphabet, data packets, and modulation by a pulse shape function;
2. Physical channel: multipath propagation over the wireless channel, based on the wave
propagation model of Chapter 2;
[Figure 4.1. Space-time equalizer: the symbol sequences [s_1]_k, ..., [s_d]_k are modulated by a pulse shape g(t); the received antenna signals x_1(t), ..., x_M(t) are processed by a space-time equalizer W to produce estimates [ŝ_1]_k, ..., [ŝ_d]_k.]
The scenario throughout is transmission from a mobile user over the wireless propagation medium (“channel”) to a receiver (“base station”). This allows us to present concepts that are to some extent also relevant in other contexts, such as microphone arrays, radar, GPS receivers, etc.
The propagation of signals through the wireless channel is fairly complicated to model. A correct
treatment would require a complete description of the physical environment, and would not be
very useful for the design of signal processing algorithms. To arrive at a more useful parametric
model, we have to make simplifying assumptions regarding the wave propagation. Provided this
model is reasonably valid, we can, in a second stage, try to derive statistical models for the
parameters to obtain reasonable agreement with measurements.
The number of parameters in an accurate model can be quite high, and from a signal processing
point of view, they might not be very well identifiable. For this reason, another model used in
signal processing is a much less sophisticated unparametrized model. The radio channel is simply
modeled as an FIR (finite impulse response) filter, with main parameters the impulse response
length (in symbols) and the total attenuation or signal-to-noise ratio (SNR). This model is
described in Section 4.3. The parametrized model is a special case, giving structure to the FIR
coefficients.
Jakes’ model A commonly used parametric model is a multiray scattering model, also known
as Jakes’ model (after Jakes [1], see also [2–6]). In this model, the signal follows on its way from
the source to the receiver a number of distinct paths, referred to as multipaths. These arise from
scattering, reflection, or diffraction of the radiated energy on objects that lie in the environment.
The received signal from each path is much weaker than the transmitted signal due to various
scattering and fading effects. Multipath propagation also results in the spreading of the signal
in various dimensions: delay spread in time, Doppler spread in frequency, and angle spread in
space. Each of them has a significant effect on the signal. The mean path loss, shadowing, fast
fading, delay, Doppler spread and angle spread are the main channel characteristics and form
the parameters of the multiray model.
The scattering of the signal in the environment can be divided into three stages: scattering
local to the source at surrounding objects, reflections on distant objects of the few dominant
rays that emerge out of the local clutter, and scattering local to the receiver. See Fig. 4.2.
Scatterers local to the mobile Scattering local to the mobile is caused by buildings and other
objects in the direct vicinity of the mobile (at, say, a few tens of meters). Motion of the mobile
and local scattering give rise to Doppler spread which causes “time-selective fading”: the signal
power can have significant fluctuations over time. While local scattering contributes to Doppler
spread, the delay spread will usually be insignificant because of the small scattering radius.
Likewise, the angle spread will also be small.
Remote scatterers Away from the cluster of local scatterers, the emerging wavefronts may then
travel directly to the base or may be scattered toward the base by remote dominant scatterers,
giving rise to specular multipath. These remote scatterers can be either terrain features (distant
hills) or high rise building complexes. Remote scattering can cause significant delay and angle
spreads.
Scatterers local to the base Once these multiple wavefronts reach the base station, they may
be scattered further by local structures such as buildings or other structures that are in the
vicinity of the base. Such scattering will be more pronounced for low elevation and below-roof-
top antennas. The scattering local to the base can cause significant angle spread which can cause
space-selective fading: different antennas at the base station can receive totally different signal
powers. This fading is time invariant, unlike the time varying space-selective fading caused by
remote scattering.
Doppler spread and time selective fading If mobiles or scatterers are moving, then the phases
of each multipath component are quickly changing relative to each other, hence they add up
differently over time. As mentioned, this causes (fast) fluctuations in the received signal power
over that ray (time-selective fading).
Movement also results in a Doppler spread, i.e., a pure CW tone is spread over a non-zero
spectral bandwidth. If a source moves with a velocity of v m/s towards the receiver, then its
observed frequency is increased by $f_m = v/\lambda$ [Hz], or $\omega_m = 2\pi v/\lambda$. Likewise, if the source moves
away from the receiver, its observed frequency is reduced by fm . If it moves sideways, there is
no shift in frequency.
If there is a ring of scatterers around the mobile, then seen via some reflectors, the mobile
may seem to move away, while via other reflectors, it seems to approach. Thus, we obtain a
distribution of Doppler shifts. If one assumes uniformly distributed scatterers, then the baseband
power spectrum of the vertical electrical field component of the channel is convolved with [1, ch.1]
" 2 #−1/2
3 ω
S(ω) = 1− , |ω| < ωm (4.1)
ωm ωm
The Doppler spectrum described by (4.1) is often called the classical spectrum. For a mobile
traveling at 100 kph, the Doppler spread is approximately fm = 175 Hz in the 1900 MHz band.
Because a convolution in frequency domain translates in pointwise multiplication in time domain,
and the function is non-flat in this case, Doppler spread causes time selective fading. It is usually
characterized by the coherence time of the channel [1], i.e., the time lag over which the Doppler
time function has an autocorrelation larger than 0.5. The larger the Doppler spread, the smaller
the coherence time. The coherence time is in the order of 1/ωm , i.e., approximately 0.9 ms for
fm = 175 Hz. In comparison, the burst length in a single GSM data package is 0.577 ms, so that
the GSM channel can be regarded almost time-invariant during the burst, but not in between
two bursts.
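In code, these numbers follow directly (a small helper; the function name is ours):

```python
import numpy as np

# Doppler spread and coherence time for a moving terminal (numbers as in the text).
c = 3e8

def doppler_and_coherence(v_kph, f0_hz):
    lam = c / f0_hz
    f_m = (v_kph / 3.6) / lam            # maximum Doppler shift [Hz]
    t_c = 1.0 / (2 * np.pi * f_m)        # coherence time, on the order of 1/omega_m [s]
    return f_m, t_c

f_m, t_c = doppler_and_coherence(100, 1900e6)
print(f"f_m = {f_m:.0f} Hz, coherence time ~ {1e3 * t_c:.2f} ms")   # ~175 Hz, ~0.9 ms
```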
Signal processing model Let us ignore the local scattering for the moment, and assume that
there are r rays bouncing off remote objects such as hills or tall buildings. As extension of the
narrowband model (3.4), the received parametric signal model is then usually written as the
convolution
$$x(t) = h(t)\ast s(t)\,, \qquad h(t) = \left[\,\sum_{i=1}^{r} a(\theta_i)\,\beta_i\, g(t-\tau_i)\right], \qquad (4.2)$$
where x(t) is a vector consisting of the M antenna outputs, a(θ) is the array response vector,
and the impulse response g(t) collects all temporal aspects, such as pulse shaping and transmit
and receive filtering. The model parameters of each ray are its (mean) angle-of-incidence θi ,
(mean) path delay τi , and path loss βi . The latter parameter lumps the overall attenuation, all
phase shifts, and possibly the antenna response a0 (θ) as well.
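A minimal sketch of how (4.2) can be used to generate a sampled multiray channel, assuming a ULA, a truncated raised-cosine pulse for g(t), and made-up ray parameters:

```python
import numpy as np

# Sketch of the multiray channel (4.2): a sum of r rays, each with a direction,
# a complex gain and a delay; g(t) is a raised-cosine pulse. All values illustrative.
rng = np.random.default_rng(4)
M, delta, r = 4, 0.5, 3

def a(theta):                                            # ULA response, cf. (3.4)
    return np.exp(2j * np.pi * np.arange(M) * delta * np.sin(theta))

def g(t, alpha=0.35):                                    # raised-cosine pulse, T = 1
    t = np.asarray(t, dtype=float) + 1e-9                # avoid exact singular points
    return np.sinc(t) * np.cos(np.pi * alpha * t) / (1 - (2 * alpha * t) ** 2)

thetas = np.deg2rad([10.0, -20.0, 35.0])                 # ray directions
taus = np.array([0.0, 0.7, 1.8])                         # ray delays [symbol periods]
betas = (rng.standard_normal(r) + 1j * rng.standard_normal(r)) / np.sqrt(2)

t = np.arange(0, 6, 0.5)                                 # sample h(t) at twice the symbol rate
H = sum(np.outer(a(thetas[i]), betas[i] * g(t - taus[i])) for i in range(r))
print(H.shape)                                           # (M antennas) x (time samples)
```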
Each of the rays is itself composed of a large number of “mini-rays” due to scattering close to the
source: all with roughly equal angles and delays, but arbitrary phases. This can be described by
extending the model with additional parameters such as the standard deviations from the mean
Table 4.1. Typical delay, angle and Doppler spreads in cellular applications.

  Environment          delay spread   angle spread   Doppler spread
  Flat rural (macro)    0.5 µs           1°            190 Hz
  Urban (macro)         5 µs            20°            120 Hz
  Hilly (macro)        20 µs            30°            190 Hz
  Mall (micro)          0.3 µs         120°             10 Hz
  Indoors (pico)        0.1 µs         360°              5 Hz
angle θi and mean delay τi , which depend on the radius (aspect ratio) of the scattering region
and its distance to the remote scattering object [7, 8]. For macroscopic models, the standard
deviations are generally small (less than a few degrees, and a fraction of τi ) and are usually but
not always ignored.
The local scattering however has a major effect on the statistics and stationarity of βi . For
example, if all local rays have equal amplitude, then βi is the sum of a large number of arbi-
trary complex numbers, each with equal modulus but random phase, which gives βi a complex
Gaussian distribution. Consequently, its amplitude has a Rayleigh distribution (hence the name
Rayleigh fading). More in general, if there is a strong path with some scattering around it that
causes fluctuation, then often a Rice distribution or log-normal distribution is assumed.
A second effect is that βi = βi (t) is really (slowly) time-varying: if the source is in motion, then
the Doppler shifts and/or the varying location change the phase differences among the rays, so
that the sum can be totally different from one time instant to the next. The maximal Doppler
shift fD is given by the speed of the source (in m/s) divided by the wavelength of the carrier. The
coherence time of the channel is inversely proportional to fD , roughly by a factor of 0.2: βi (t)
can be considered approximately constant for time intervals smaller than this time [4, 9, 10].
Angles and delays are generally assumed to be stationary over much longer periods.
A proper discussion should now present statistical models for θi , βi , and τi . Since this is not the
focus of the book, we omit further details.
Typical channel parameters Angle spread, delay spread, and Doppler spread are important
characterizations of a mobile channel, as it determines the amount of equalization that is re-
quired, but also the amount of diversity that can be obtained. Measurements in macrocells
indicate that up to 6 to 12 dominant paths may be present. Typical channel delay and Doppler
spreads (1800 MHz) are given in table 4.1 [4, 9] (see also references in [5]). Typical angle spreads
are not well known; the given values are suggested by [6].
Before a digital bit sequence can be transmitted over a radio channel, it has to be prepared:
among other things, it has to be transformed into an analog signal in continuous time and
modulated onto a carrier frequency. The various steps are shown in Fig. 4.3. The coding step,
in its simplest form, translates the binary sequence {sk } ∈ {0, 1} into a sequence {s̃k } with
another alphabet, such as {−1, +1}. A digital filter may be part of the coder as well. In linear
modulation schemes, the resulting sequence is then convolved with a pulse shape function g(t),
whereas in phase modulation, it is convolved with some other pulse shape function q(t) to yield
the phase of the modulated signal. The resulting baseband signal u(t) is modulated by the
carrier frequency ω0 to produce the RF signal that will be broadcast.
In this section, a few examples of coding alphabets and pulse shape functions are presented, for
future reference. We do not go into the properties and reasons why certain modulation schemes
are chosen; see e.g. [11] for more details.
Digital alphabets The first step in the modulation process is the coding of the binary sequence
{sk } into some other sequence {s̃k }. The {s̃k } are chosen from an alphabet or constellation, which
might be real or complex. There are many possibilities; common examples are BPSK (binary
phase shift keying), QPSK (quadrature phase shift keying), PAM-m (pulse amplitude modu-
lation), QAM-m (quadrature amplitude modulation), MSK (minimum-shift keying), DQPSK
(differential QPSK), defined as in table 4.2. See also Fig. 4.4. Smaller constellations are more
robust in the presence of noise, because of the larger distance between the symbols. Larger
constellations may lead to higher bitrates, but are harder to detect in noise.
It is possible that the data rate of the output of the coder is different than the input data
rate. E.g., if a binary sequence is coded into QPSK, the data rate halves. (The opposite is also
possible, e.g., in CDMA systems, where each bit is coded into a sequence of 31 or more “chips”.)
Figure 4.5. (a) Family of raised-cosine pulse shape functions, (b) corresponding spectra.
Pulse shape functions The coded digital signal s̃(t) can be described as a sequence of dirac
pulses,
$$\tilde{s}(t) = \sum_{k=-\infty}^{\infty} \tilde{s}_k\, \delta(t-k)\,,$$
where, for convenience, the symbol rate is normalized to T = 1. In linear modulation schemes,
the digital dirac-pulse sequence is convolved by a pulse shape function g(t):
$$u(t) = g(t)\ast \tilde{s}(t) = \sum_{k=-\infty}^{\infty} \tilde{s}_k\, g(t-k)\,. \qquad (4.3)$$
Again, there are many possibilities. The optimum wave form is one that is both localized in time
(to lie within a pulse period of length T = 1) and in frequency (to satisfy the Nyquist criterion
when sampled at a rate 1/T = 1). This is of course impossible, but good approximations exist.
A pulse with perfect frequency localization is the sinc-pulse, defined by
$$g(t) = \frac{\sin\pi t}{\pi t}\,, \qquad G(f) = \begin{cases} 1, & |f| < \frac{1}{2}\\ 0, & \text{otherwise} \end{cases} \qquad (4.4)$$
Raised cosine pulseshape A modification of this pulse leads to the family of raised-cosine
pulseshapes, with better localization properties. They are defined, for α ≤ 1, by [11, ch.6]
$$g(t) = \frac{\sin\pi t}{\pi t}\cdot\frac{\cos\alpha\pi t}{1 - 4\alpha^2 t^2}$$
with corresponding spectrum
$$G(f) = \begin{cases} 1, & |f| < \frac{1}{2}(1-\alpha)\\ \frac{1}{2} - \frac{1}{2}\sin\!\left(\frac{\pi}{\alpha}\big(|f| - \frac{1}{2}\big)\right), & \frac{1}{2}(1-\alpha) < |f| < \frac{1}{2}(1+\alpha)\\ 0, & \text{otherwise.} \end{cases}$$
The spectrum is limited to $|f| \le \frac{1}{2}(1+\alpha)$, so that α represents the excess bandwidth. For
α = 0, the pulse is identical to the sinc pulse (4.4). For other values of α, the amplitude decays
more smoothly in frequency, so it is also known as the rolloff factor. The shape of the rolloff
is that of a cosine, hence the name. In the time domain, the pulses are still infinite in extent.
However, as α increases, the size of the tails diminishes. A common choice is α = 0.35, and to
truncate g(t) outside the interval [−3, 3].
The raised-cosine pulses are designed such that, when sampled at integer time instants, the
only nonzero sample occurs at t = 0. Thus, u(k) = s̃k , and to recover {s̃k } from u(t) is simple,
provided we are synchronized: any fractional delay 0 < τ < 1 results in intersymbol interference.
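A small numerical check of this property; the helper below implements the raised-cosine formula above, including the limit value at $t = \pm 1/(2\alpha)$:

```python
import numpy as np

# Raised-cosine pulse: zero crossings at the integers, so symbol-rate samples of
# u(t) = sum_k s_k g(t - k) are ISI-free, provided sampling is synchronized.
def raised_cosine(t, alpha=0.35):
    t = np.asarray(t, dtype=float)
    num = np.sinc(t) * np.cos(np.pi * alpha * t)
    den = 1 - (2 * alpha * t) ** 2
    return np.where(np.abs(den) < 1e-12, np.pi / 4 * np.sinc(1 / (2 * alpha)), num / den)

k = np.arange(-5, 6)
print(raised_cosine(k))            # 1 at k = 0, (numerically) 0 at all other integers
print(raised_cosine(k + 0.3)[:4])  # a fractional delay gives nonzero samples: ISI
```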
Phase modulation Many other modulation formats exists, in particular phase modulations are
often used. An example is GMSK as used in the GSM system. For signal processing purposes,
these are very often hard to handle. In some cases, these nonlinear modulations can be well
approximated by linear modulations (e.g., GMSK), in other cases, we simply use some general
properties of the resulting signal. E.g., several modulation formats are based on frequency or
phase modulation and satisfy a constant-modulus property (|s(t)| = 1).
In CDMA systems, each symbol $s_k$ is multiplied by a user-specific code sequence $c = [c_0, \cdots, c_{G-1}]^T$ of G “chips”, so that the transmitted symbol sequence becomes
$$\tilde{s} = \begin{bmatrix} s_1 c\\ s_2 c\\ \vdots \end{bmatrix} = s \otimes c\,, \qquad (4.5)$$
which is subsequently modulated by g(t) in the usual way, cf. (4.3). Here, ‘⊗’ denotes a Kronecker
product, which for vectors is defined as indicated. (Properties of Kronecker products are found
in Sec. 5.1.6.) If the original sequence has a symbol duration T = 1, then each code chip has
a duration T /G, and g(t) is scaled accordingly. Thus, in frequency domain the pulse G(f )
occupies G times more bandwidth: this is called spread spectrum modulation. We can also
view the combination of code c and pulse g(t) as a new coded pulse g̃(t) that now has a more
complicated, user-specific form,
$$\tilde{g}(t) = \sum_{i=0}^{G-1} c_i\, g\!\left(t - \frac{i}{G}\right).$$
Typical values of G range from 31 to 1024. The codes are used to distinguish individual users. Instead
of giving each user a dedicated time slot or frequency subband, they get a specific user code.
This permits us to separate a superposition of multiple users at a (basestation) receiver. An
advantage is that more than G users can be active simultaneously. A disadvantage is that, due
[Figure 4.6. Beamformer: the antenna signals x_0(t), ..., x_{M−1}(t) are combined by W to produce estimates [ŝ_1]_k, ..., [ŝ_d]_k of the symbol sequences [s_1]_k, ..., [s_d]_k.]
to the shorter chip duration, the convolution by the channel impulse response has more impact,
and equalization is more complicated.
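The spreading operation (4.5) is literally a Kronecker product; a minimal sketch with a made-up length-7 code:

```python
import numpy as np

# Spreading a symbol sequence by a user code, cf. (4.5): s_tilde = s (kron) c.
s = np.array([1, -1, 1, 1])                       # BPSK symbols
c = np.array([1, 1, -1, 1, -1, -1, 1])            # length-G user code (G = 7, illustrative)
s_tilde = np.kron(s, c)                           # each symbol is replaced by +/- the code

print(s_tilde.shape)                              # (28,): G chips per symbol
# Despreading: correlating each block of G chips with the code recovers the symbols.
print(s_tilde.reshape(len(s), len(c)) @ c / len(c))
```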
In practical systems like the 3rd generation (3G) mobile system UMTS, the used codes are
non-periodic: they differ from symbol to symbol. This requires a simple extension of (4.5) to
use symbol-specific codes ck :
$$\tilde{s} = \begin{bmatrix} s_1 c_1\\ s_2 c_2\\ \vdots \end{bmatrix}.$$
In Sec. 4.1, we have presented a channel model based on physical properties of the radio channel.
Though useful for generating simulated data, it is not always a suitable model for identification
purposes, e.g., if the number of parameters is large, if the angle spreads within a cluster are large
so that parametrization in terms of directions is not possible, or if there is a large and fuzzy
delay spread. In these situations, it is more appropriate to work with an unstructured model,
where the channel impulse responses are posed simply as arbitrary multichannel finite impulse
response (FIR) filters. It is a generalization of the physical channel model considered earlier, in
the sense that at a later stage we can still specify the structure of the coefficients.
In this section, we look at deterministic data models, i.e., no stochastic considerations are used.
In this case, the sampled data is directly placed in a matrix X which is subsequently analyzed.
Consider d narrowband sources received through an instantaneous mixture,
$$x(t) = \sum_{i=1}^{d} a_i\, s_i(t)\,,$$
where as before x(t) is a stack of the outputs of the M antennas. We will usually write this in matrix form:
$$x(t) = A\, s(t)\,, \qquad A = [a_1, \cdots, a_d]\,, \qquad s(t) = \begin{bmatrix} s_1(t)\\ \vdots\\ s_d(t) \end{bmatrix}.$$
Suppose we sample with a period T , normalized to T = 1, and collect a batch of N samples into
a matrix X, then
X = AS
where X = [x(0), · · · , x(N − 1)] and S = [s(0), · · · , s(N − 1)]. The resulting [X = AS] model
is called an instantaneous multi-input multi-output model, or I-MIMO for short. It is a generic
linear model for source separation, valid when the delay spread of the dominant rays is much
smaller than the inverse bandwidth of the signals, e.g., for narrowband signals, in line-of-sight
situations or in scenarios where there is only local scattering. Even though this appears to limit
its applicability, it is important to study it in its own right, since more complicated convolutive
models can often be reduced (after equalization or separation into sufficiently narrow subbands)
to X = AS.
The objective of beamforming for source separation is to construct a left-inverse WH of A, such
that WH A = I and hence WH X = S: see Fig. 4.6. This will recover the source signals from the
observed mixture. It immediately follows that in this scenario it is necessary to have d ≤ M to
ensure interference-free reception, i.e., not more sources than sensors. If we know already (part
of) S, e.g., because of training, then we can estimate W via WH = SX† = SXH (XXH )−1 , where
X† denotes the Moore-Penrose pseudo-inverse of X, here equal to its right inverse (see Chapter
5). With noise, other beamformers may be better.
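A small simulation of training-based separation for the I-MIMO model (random mixing matrix, BPSK symbols; for simplicity the entire block is treated as training):

```python
import numpy as np

# I-MIMO source separation sketch: estimate the beamformer from a known training
# block S (X = AS + noise), then apply it, cf. W^H = S X^H (X X^H)^{-1}.
rng = np.random.default_rng(5)
M, d, N = 6, 2, 200

A = (rng.standard_normal((M, d)) + 1j * rng.standard_normal((M, d))) / np.sqrt(2)
S = np.sign(rng.standard_normal((d, N)))                   # BPSK symbols (all known here)
X = A @ S + 0.05 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

WH = S @ X.conj().T @ np.linalg.inv(X @ X.conj().T)        # W^H = S X^dagger
S_hat = WH @ X
print("symbol error rate:", np.mean(np.sign(S_hat.real) != S))
```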
Coherent multipath If we adopt the multipath propagation model, then A is endowed with a
parametric structure: every column ai is a sum of direction vectors a(θij ), with different fadings
βij . If the ith source is received through ri rays, then
$$a_i = \sum_{j=1}^{r_i} a(\theta_{ij})\,\beta_{ij} = [a(\theta_{i1}), \cdots, a(\theta_{i,r_i})]\begin{bmatrix}\beta_{i1}\\ \vdots\\ \beta_{i,r_i}\end{bmatrix} \qquad (i = 1, \cdots, d)\,.$$
If each source has only a single ray to the receiver array (a line-of-sight situation), then each ai
is a vector on the array manifold, and identification will be relatively straightforward. The more
general case amounts to decomposing a given a-vector into a sum of vectors on the manifold,
which makes identification much harder.
To summarize the parametric structure in a compact way, we could collect all a(θij)-vectors and path attenuation coefficients βij of all rays of all sources in single matrices $A_\theta$ and B,
$$A_\theta = [a(\theta_{11}), \cdots, a(\theta_{1,r_1}), \cdots, a(\theta_{d,1}), \cdots, a(\theta_{d,r_d})] : M\times r\,, \qquad B = \mathrm{diag}(\beta_{11}, \cdots, \beta_{d,r_d}) : r\times r\,.$$
To sum the rays belonging to each source into the single ai -vector of that source, we define a
selection matrix
$$J = \begin{bmatrix} \mathbf{1}_{r_1} & & 0\\ & \ddots & \\ 0 & & \mathbf{1}_{r_d} \end{bmatrix} : r\times d\,, \qquad (4.6)$$
where $r = \sum_{i=1}^{d} r_i$ and $\mathbf{1}_m$ denotes an $m\times 1$ vector consisting of 1’s. Together, this allows us to write
$$X = A S\,, \qquad A = A_\theta B J\,. \qquad (4.7)$$
To extend the instantaneous model to a situation with convolutive channels, let h[k] be a finite
impulse response (FIR) filter. The matrix equation corresponding to a convolution $x[n] = h[n]\ast s[n] = \sum_{k=0}^{L-1} h[k]\, s[n-k]$ is
$$x = H s \quad\Leftrightarrow\quad \begin{bmatrix} x[0]\\ x[1]\\ x[2]\\ \vdots\\ \vdots\\ \vdots\\ x[N-1] \end{bmatrix} = \begin{bmatrix} \boxed{h[0]} & & & 0\\ h[1] & h[0] & & \\ h[2] & h[1] & \ddots & \\ \vdots & h[2] & \ddots & h[0]\\ h[L-1] & \vdots & \ddots & h[1]\\ & h[L-1] & \ddots & h[2]\\ & & \ddots & \vdots\\ 0 & & & h[L-1] \end{bmatrix} \begin{bmatrix} s[0]\\ s[1]\\ \vdots\\ s[N_s-1] \end{bmatrix} \qquad (4.8)$$
where the “box” indicates the location of time-index 0, L is the channel length, Ns is the length
of the input sequence (prior and subsequent symbols are supposed to be zero; this is usually
achieved by a guard interval), and N = Ns + L − 1 is the length of the observation (ignoring
the other samples). Note that H has size Ns + L − 1 × Ns , so H is always tall. If there is no
guard interval, we have to drop the first L − 1 samples of x since they are “contaminated” by
prior symbols, and the top part of H has to be dropped accordingly. Likewise, we will probably
have to drop the last L − 1 rows of H as well, if we have to assume that subsequent symbols
s[Ns ], s[Ns + 1], · · · are nonzero and unknown. This will reduce the size of H to Ns − L + 1 × Ns ,
and it is not tall anymore.
H has a Toeplitz structure: it is constant along diagonals. That structure always appears when
we have time-invariant systems.
Suppose we observe x and know the channel matrix H, and it is tall. The input sequence can be
estimated by taking a left inverse H† of H, such that H† H = I. Since H is tall, we can usually
take
$$H^\dagger = (H^H H)^{-1} H^H\,, \qquad \hat{s} = H^\dagger x = (H^H H)^{-1} H^H x\,. \qquad (4.9)$$
This is a block receiver: all entries of s are estimated simultaneously. If Ns is large, this is not
very efficient.
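A sketch of this block receiver for a single channel: build the tall Toeplitz matrix H of (4.8) and solve the least-squares problem (4.9). The channel taps and block sizes are illustrative.

```python
import numpy as np

# Block receiver sketch for the single-channel convolution model (4.8)-(4.9):
# build the tall Toeplitz H from the impulse response, then invert it by least squares.
rng = np.random.default_rng(6)
h = np.array([1.0, 0.6, -0.3])                      # illustrative channel, L = 3
L, Ns = len(h), 50
N = Ns + L - 1

H = np.zeros((N, Ns))
for k in range(L):                                  # place h[k] on the k-th subdiagonal
    H[np.arange(Ns) + k, np.arange(Ns)] = h[k]

s = np.sign(rng.standard_normal(Ns))                # BPSK symbols with a guard interval
x = H @ s + 0.05 * rng.standard_normal(N)           # received samples
s_hat, *_ = np.linalg.lstsq(H, x, rcond=None)       # s_hat = (H^H H)^{-1} H^H x
print("max symbol error:", np.max(np.abs(s_hat - s)))
```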
Due to the commutativity of the convolution, we can also write x[n] = s[n] ∗ h[n], and hence
$$x = S h \quad\Leftrightarrow\quad \begin{bmatrix} x[0]\\ x[1]\\ \vdots\\ x[L-1]\\ \vdots\\ x[N_s-1]\\ x[N_s]\\ \vdots\\ x[N_s+L-2] \end{bmatrix} = \begin{bmatrix} s[0] & & & 0\\ s[1] & s[0] & & \\ \vdots & \vdots & \ddots & \\ s[L-1] & s[L-2] & \cdots & s[0]\\ \vdots & \vdots & & \vdots\\ s[N_s-1] & s[N_s-2] & \cdots & s[N_s-L]\\ & s[N_s-1] & \cdots & s[N_s-L+1]\\ & & \ddots & \vdots\\ 0 & & & s[N_s-1] \end{bmatrix} \begin{bmatrix} h[0]\\ h[1]\\ \vdots\\ h[L-1] \end{bmatrix} \qquad (4.10)$$
Now S has a Toeplitz structure, it has size Ns + L − 1 × L. This expression can be used to
estimate the channel coefficients in case we know the transmitted symbols (e.g., due to a training
period), i.e., ĥ = S† x, where S† = (SH S)−1 SH . Note that S is tall: we need Ns ≥ 1.
The “0” blocks in S should be replaced by symbols in case the transmitter is not silent be-
fore/after the transmission of the training symbols (i.e., if there is no guard interval). Often
these are unknown. To estimate h we should omit all rows in S that contain unknown entries of
s[n] (and also drop the corresponding entries in x). This results in the model $x_0 = S_0 h$, where $x_0$ and $S_0$ are the parts of x and S corresponding to the rows $x[L-1], \cdots, x[N_s-1]$ of (4.10), i.e., the rows that involve only known symbols. $S_0$ has size $N_s - L + 1 \times L$,
and more samples are needed to make it have a left inverse: Ns ≥ 2L − 1.
4.3.3 Oversampling
Since in (4.9) we aim to invert H, we would like it to be tall. If it is not tall (e.g., due to lack
of a guard interval), we can sometimes make it more tall by considering oversampling. In this
context, oversampling means sampling faster than the symbol rate. Although it does not make
sense to sample much faster than the Nyquist rate, often the Nyquist rate is higher than the
symbol rate. E.g., in Fig. 4.5, we saw examples of the raised cosine pulse shape which is more
compact in time than a sinc pulse, but therefore has excess bandwidth in frequency (controlled
by the parameter α).
In the case of linear modulation, we can define
$$s(t) = \sum_{k=-\infty}^{\infty} s_k\, \delta(t - kT) \qquad (4.11)$$
and define the modulation by a convolution with the pulse shape g(t). For convenience, we
normalize the symbol period to T = 1. Then the modulated signal is
$$u(t) = s(t)\ast g(t) = \sum_k g(t-k)\, s_k\,.$$
As before, let x(t) be the baseband received signal. The impulse response of the channel from the
source to the receiver, h(t), is a convolution of the pulse shaping filter g(t) and the actual channel
response from u(t) to x(t). We can include any propagation delays and unknown synchronization
delays in h(t) as well. The data model is written compactly as the convolution x(t) = h(t) ∗ s(t).
Inserting (4.11) gives (with T = 1)
$$x(t) = \int h(t-t')\sum_k s_k\, \delta(t'-k)\, dt' = \sum_k s_k\, h(t-k)\,. \qquad (4.12)$$
This appears as a discrete-time convolution, even if x(t) and h(t) are continuous-time. An im-
mediate consequence of the FIR assumption is that, at any given moment, at most L consecutive
symbols play a role in x(t). Indeed, for t = n + τ , where n ∈ ZZ and 0 ≤ τ < 1, the convolution
(4.12) can be written as
    x(n + τ) = \sum_{k=0}^{L-1} h(k + τ) s_{n-k} .        (4.13)
Suppose that we sample x(t) at a rate of P times the symbol rate.1 Then (4.13) shows that for
all samples that fall between times n and n + 1, the same L symbols play a role. If we define
    x[n] = \begin{bmatrix} x(n) \\ x(n + \frac{1}{P}) \\ \vdots \\ x(n + \frac{P-1}{P}) \end{bmatrix} ,
    \qquad
    h[k] = \begin{bmatrix} h(k) \\ h(k + \frac{1}{P}) \\ \vdots \\ h(k + \frac{P-1}{P}) \end{bmatrix}        (4.14)

¹ For the raised cosine pulses, we would select P = 2.
then

    x[n] = h[n] ∗ s[n] = \sum_{k=0}^{L-1} h[k] s_{n-k} .        (4.15)
This is the same as we had before, but now using sample vectors consisting of the P samples
that fall within one sample period. Thus, (4.8) becomes
x = H s   ⇔

\begin{bmatrix} x[0] \\ x[1] \\ x[2] \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ x[N-1] \end{bmatrix}
=
\begin{bmatrix}
 h[0]   & 0      &        &        \\
 h[1]   & h[0]   &        &        \\
 h[2]   & h[1]   & \ddots &        \\
 \vdots & h[2]   & \ddots & h[0]   \\
 h[L-1] & \vdots & \ddots & h[1]   \\
        & h[L-1] & \ddots & h[2]   \\
        &        & \ddots & \vdots \\
 0      &        &        & h[L-1]
\end{bmatrix}
\begin{bmatrix} s[0] \\ s[1] \\ \vdots \\ s[N_s-1] \end{bmatrix}
\qquad (4.16)
Now, H is a block-Toeplitz matrix, where each block is a P × 1 vector. We can estimate the
symbols by inverting H as before, ŝ = H† x = (HH H)−1 HH x. Compared to the previous case,
H is a factor P times more tall, which is usually good for inversion.
In fact, it would seem that if we take P very large, then we can make H as tall as we want.
However, it does not make sense to sample (much) faster than the Nyquist rate. If we sample
faster, H might be tall but at some point its columns will not become more orthogonal to each
other. Thus, the condition number of H (see Sec. 5.4.6), an indicator of the amount of noise
enhancement, converges to a constant. Said differently, by oversampling, we collect more signal
energy, but we also collect more noise. A more detailed analysis is needed here, but we can
expect that sampling faster than Nyquist will not give benefits.
    s[n] = \begin{bmatrix} s_n \\ \vdots \\ s_n \end{bmatrix} = s_n ⊗ 1_P
X has size P × N ; its nth column x[n] contains the P samples taken during the nth symbol
period. Based on the FIR assumption, it follows that X has a factorization
X = HS (4.19)
where
H = [h[0]  h[1]  \cdots  h[L-1]] =
\begin{bmatrix}
 h(0)             & h(1)   & \cdots & h(L-1)           \\
 h(\frac{1}{P})   & \cdot  & \cdots & \cdot            \\
 \vdots           & \vdots &        & \vdots           \\
 h(\frac{P-1}{P}) & \cdot  & \cdots & h(L-\frac{1}{P})
\end{bmatrix} : P \times L        (4.20)

S = \begin{bmatrix}
 s_0 & s_1 & \cdots & s_{L-1} & \cdots & s_{N_s-2} & s_{N_s-1} &           & 0         \\
     & s_0 & s_1    & \cdots  & \cdots & \cdots    & s_{N_s-2} & s_{N_s-1} &           \\
     &     & \ddots & \ddots  &        &           &           & \ddots    & \ddots    \\
 0   &     &        & s_0     & \cdots & s_{N_s-L} & \cdots    & s_{N_s-2} & s_{N_s-1}
\end{bmatrix} : L \times N ,
[Figure 4.7: fractionally spaced linear equalizer. The received signal is sampled at P times the symbol rate, passed through a delay line (delay elements z and decimators), and the stacked samples x_k, · · · , x_{k−m+1} are combined with weights w_1^*, · · · , w_{mP}^* to form the output y_k.]
Due to the different organization of the received data into a matrix, not a vector, we have avoided the
Kronecker-repetition of the symbols in S = S ⊗ 1P that was present in (4.16).
A problem with using this factorization to estimate S compared to the estimation based on
(4.16) is that the Toeplitz structure of S is not enforced: in the presence of noise, Ŝ = H† X
is not Toeplitz. The redundancy in S is not exploited. Also, usually P is not very large (e.g.,
P = 2), and therefore, usually H is not tall.
A linear equalizer in the present context can be written as a vector w which combines the rows
of X to generate an output y = wH X. If we consider the model X = HS, then we would require
wH H = [0, · · · , 0, 1, 0, · · · 0], so that equalization by w results in y equal to one of the rows of S.
In the noise-free model of (4.20), it doesn’t really matter which row of S is reconstructed: we
have L options, and they only differ by a delay. Since we only combine the P samples of x(t) in
one symbol period, the equalizer length is one symbol period.
Often, it is much better to filter over multiple sample periods. For a linear equalizer with a
length of m symbol periods, we have to augment X with m − 1 horizontally shifted copies of
itself:
    X = \begin{bmatrix}
     x[0]   & x[1]   & \cdots & x[N-m]  \\
     x[1]   & x[2]   & \cdots & \vdots  \\
     \vdots & \ddots & \ddots & x[N-2]  \\
     x[m-1] & \cdots & x[N-2] & x[N-1]
    \end{bmatrix} : mP \times (N-m+1) .
Each column of X is a regression vector: the memory of the filter. Using X , a linear equalizer
over m symbol periods can be written as y = wH X , which combines mP snapshots: see Fig.
4.7.
    X = H S ,  \qquad  H = [H_1 , \cdots , H_d] ,  \qquad  S = \begin{bmatrix} S_1 \\ \vdots \\ S_d \end{bmatrix} .
If the d sources have the same channel length L, then we could also rearrange this to arrive at a model
as in (4.21), but now with block matrices H of size M P × dL, and d-dimensional vectors sk in
S. The m shifts of H to the left in H then are each over d positions.
Figure 4.8. Multiuser convolutive channel model. Input signals s1 (t), · · · , sd (t) are synchro-
nized dirac-pulse sequences.
In this general case, H has size mM P × d(L + m − 1). A necessary condition for space-time
equalization (the output y is equal to a row of S) is that H is tall, which gives minimal conditions
on m in terms of M, P, d, L:
    m M P ≥ d(L + m − 1)   ⇒   m(M P − d) ≥ d(L − 1)

which implies

    M P > d ,  \qquad  m ≥ \frac{d(L − 1)}{M P − d} .
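As a quick numeric illustration (numbers invented for the example): with M = 2 antennas, oversampling factor P = 2, d = 2 users and channel length L = 5, we have MP = 4 > d and m ≥ d(L − 1)/(MP − d) = 8/2 = 4, so the space-time equalizer must span at least four symbol periods; for a single user (d = 1) the same array only needs m ≥ 4/3, i.e., m = 2.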
where g(t) is the pulse shape function by which the signals are modulated. In this model, there
are r distinct propagation paths, each parameterized by (θi , τi , βi ), where θi is the direction-of-
arrival (DOA), τi is the path delay, and βi ∈ C| is the complex path attenuation (fading). The
vector-valued function a(θ) is the array response vector for an array of M antenna elements to
a signal from direction θ.
Suppose as before that h(t) has finite duration and is zero outside an interval [0, L). Conse-
quently, g(t − τi ) has the same support for all τi . At this point, we can define a parametric “time
manifold” vector function g(τ ), collecting LP samples of g(t − τ ):
    g(τ) = \begin{bmatrix} g(0 − τ) \\ g(\frac{1}{P} − τ) \\ \vdots \\ g(L − \frac{1}{P} − τ) \end{bmatrix} ,
    \qquad  0 ≤ τ ≤ \max_i τ_i .
If we also construct a vector h with samples of h(t),
    h = \begin{bmatrix} h(0) \\ h(\frac{1}{P}) \\ \vdots \\ h(L − \frac{1}{P}) \end{bmatrix} ,

then it follows that

    h = \sum_{i=1}^{r} \beta_i \, \big( g(τ_i) ⊗ a(θ_i) \big) .
Thus, the multiray channel vector is a weighted sum of vectors on the space-time manifold
g(τ ) ⊗ a(θ). Because of the Kronecker product, this is a vector in an LP M -dimensional space,
with more distinctive characteristics than the M -dimensional a(θ)-vector in a scenario without
delay spread. The connection of h with H as in (4.20) is that h = vec(H), i.e., h is a stacking
of all columns of H in a single vector.
We can define, much as before, parametric matrix functions G_τ = [g(τ_1), · · · , g(τ_r)] and A_θ = [a(θ_1), · · · , a(θ_r)], and

    G_τ ◦ A_θ := [g_1 ⊗ a_1 , \cdots , g_r ⊗ a_r] .

(G_τ ◦ A_θ) is a column-wise Kronecker product known as the Khatri-Rao product; its properties
are discussed in Sec. 5.1.6. This gives h = (G_τ ◦ A_θ) B 1_r .
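A small sketch of how such a multiray channel vector could be generated numerically is given below; the pulse shape, array geometry and path parameters are placeholders chosen only to make the example self-contained.

    import numpy as np

    def khatri_rao(G, A):
        """Column-wise Kronecker product: columns are kron(G[:, i], A[:, i])."""
        return np.column_stack([np.kron(G[:, i], A[:, i]) for i in range(G.shape[1])])

    M, L, P = 4, 6, 2                                  # antennas, channel length, oversampling
    taus   = np.array([0.0, 1.3])                      # example path delays
    thetas = np.array([0.1, 0.4])                      # example directions (radians)
    betas  = np.array([1.0, 0.5 - 0.3j])               # example path gains

    def g(tau):                                        # LP samples of a placeholder pulse shape
        t = np.arange(L * P) / P - tau
        return np.sinc(t) * np.exp(-0.1 * t**2)

    def a(theta):                                      # ULA response, half-wavelength spacing
        return np.exp(1j * np.pi * np.sin(theta) * np.arange(M))

    G = np.column_stack([g(t) for t in taus])          # LP x r "time manifold" matrix
    A = np.column_stack([a(t) for t in thetas])        # M  x r array manifold matrix
    h = khatri_rao(G, A) @ betas                       # LPM x 1 multiray channel vector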
Extending now to d sources, we get that the M P ×dL-sized matrix H in (4.20) can be rearranged
into an M P L × d matrix
H0 = [h1 , · · · , hd ] = (Gτ ◦ Aθ )BJ . (4.23)
where J is the selection matrix defined in (4.6) that sums the rays into channel vectors. (Gτ ◦Aθ )
now plays the same role as Aθ in Sec. 4.3.1. Each of its columns is a vector on the space-time
manifold.
4.3.7 Summary
    I-MIMO:      X = A S ,      A = A_θ B J
    FIR-MIMO:    X = H S ,      H ↔ H0 = (G_τ ◦ A_θ) B J        (4.24)
The first part of these model equations is generally valid for linear time-invariant channels,
whereas the second part is a consequence of the adopted multiray model in the form of a
parametric channel model.
Based on this model, the received data matrix X or X has several structural properties. In
several combinations, these are often strong enough to allow us to find the factors A (or H) and S
(or S), even from knowledge of X or X alone. Very often, this will be in the form of a collection
of beamformers (or space-time equalizers) {wi }d1 such that each beamformed output wiH X = si
is equal to one of the source signals, so that it must have the properties of that signal.
One of the most powerful “structures”, on which most systems today rely to a large extent, is
knowledge of part of the transmitted message (a training sequence), so that several columns
of S are known. Along with the received signal X , this allows us to estimate H. Very often, an
unparameterized FIR model is assumed here. Such algorithms are said to use a temporal reference.
Algorithms that do not use this are called blind. Examples of this will be discussed in the coming
chapters.
TBD: for wideband data, consider STFT, split into narrow subbands; this translates a wideband
model into a set of narrowband models. These models can be processed independently, or
(better) jointly.
Diagonalization of Hankel matrix (if circulant).
4.4.1 OFDM
TBD
– Higher resolution implies longer baselines. We will see that this results in shorter integration
time (due to earth rotation), and more (narrower) subbands.
– Higher sensitivity implies a larger number of antennas (typically grouped in stations), longer
observing times, and better calibration (direction dependent).
– Higher survey speed requires a larger total bandwidth, a larger field-of-view, multiple beams,
and direction dependent calibration.
This results in larger data sets, higher computational demands, and the need for better calibra-
tion and imaging algorithms.
The performance of a radio telescope depends on many parameters: the spatial resolution de-
pends on the diameter of the instrument, the number of spatial samples on the number of STI’s,
the finite sample noise in a single STI depends on the number of samples N that we average in
that STI, etc. How should these parameters be designed?
As it turns out, a lot depends on the non-stationarity introduced by the rotation of the earth,
and constraints resulting from a requirement to satisfy the narrowband assumption. Essentially,
we can average samples that differ by phase factors φ = exp(−j2πf τ ) if they satisfy:
– Narrowband condition: f τ should be approximately constant for f ∈ (fmin , fmax ) and all
geometric delays τ . If W = fmax − fmin is the bandwidth, this translates to
W τ ≪ 1 .
For omni-directional antennas, the maximal geometric delay is τmax = D/c, for source signals
arriving in the same direction as the baseline. For directional antennas, D is scaled by sin(θ).
Figure 4.9. Due to earth rotation, the stars appear to move relative to the array.
This condition determines maximum processing bandwidth, as a function of the array diam-
eter D. A signal with a larger total bandwidth should first be split into sufficiently narrow
subbands (“channels”).
– Stationarity condition: f τ should be approximately constant while the baselines move (due
to earth rotation).
This determines the maximum processing time (STI), also a function of the array diameter,
because for longer baselines, τ changes faster.
The latter condition is worked out as follows. The earth rotation rate is ω_e = 2π/(1 day) = 7.27 · 10^{−5}
rad/s. The sky appears to move with this angular speed. “A day” is taken here to be a sidereal
day, i.e., taking into account that the earth makes an extra revolution over the course of a year;
it is about 4 minutes shorter than 24 hours.
As shown in Fig. 4.9, over a small time period ∆t, the earth rotates over a small angle α = ωe ∆t.
If initially we had τ = 0, we now have
    τ = \frac{D}{c} \sin(α) ≈ \frac{D}{c} α .
Thus, the rate of change of τ due to earth rotation is
    \frac{dτ}{dt} = \frac{D}{c} \frac{dα}{dt} = \frac{D}{c} ω_e

² This requires some elaboration.
[Figure: station processing chain — each antenna signal x_m(t) is digitized (A/D), split into subbands by a filter bank (channelization), and the channelized signals x_k[n] are combined per subband by station calibration and beamforming, producing the station output w_k^H x_k[n].]
Integrating phases φ = e−j2πf τ coherently over a period T requires, at the largest frequency
fmax ,
    f_{max} \frac{dτ}{dt} T ≪ 1   ⇒   T ≪ \frac{c}{D f_{max} ω_e} = \frac{λ_{min}}{D ω_e}        (4.26)
This condition limits the STI. It depends on the observing frequency and the array diameter.
As an example, we can look at the design of the first phase of the Square Kilometre Array (SKA), i.e.,
SKA1-Low: a low-frequency aperture array planned for around 2021-2023. Initial SKA1 design
objectives were specified in 2013 [12], but several of these numbers were scaled down later. The
architecture of the instrument is similar to that of LOFAR (commissioned in 2007), but at a
larger scale.
Generally, for the lower frequencies, the idea is that simple non-steerable antennas are grouped
into stations. The beamformed output of a station mimics that of a steerable dish. Next, the
station output signals are combined (correlated) at a central location.
The initial SKA1 design called for 131,072 antennas, divided over 512 stations each with 256
dual-polarized antennas. For SKA1-Low, the frequency range is 50–300 MHz, sampled using
log-periodic (somewhat directional) antennas. The maximal baseline was set at 100 km (later
scaled down to 65 km), and each station has a diameter of 35 m.
The objective of station data processing is to produce beamformed outputs. An important
consideration is that this will reduce the raw datarate by a factor equal to the number of
antennas in a station. If budget permits, a station can produce multiple beamformed outputs,
increasing the survey speed.
The number of coarse frequency channels that are needed is determined by the maximal band-
width that satisfies the narrowband condition, which depends on the station diameter.
[Figure: central correlator — the channelized signals x_k[n] from stations 1, · · · , P are delay-compensated (τ) and cross-correlated per subband to form the correlation estimates R_{ij,k}.]
For the selected SKA1 design parameters, we could choose 250,000 channels with Wchan = 1
kHz. However, this result is valid if the array aperture B is filled with omnidirectional antennas.
In reality, it is filled with stations that have beamformed outputs, and the beams limit the field
of view to
    \sin(θ) = \frac{λ}{D} .

Using (4.25) gives

    W_{chan} ≪ \frac{c D}{B λ} = \frac{D}{B} f_{min} .
Eq. (4.26) gives ?????????? TBD
    T < 1200 \frac{D}{B}
With B = 100 km (the longest baseline) and D = 35 m (station diameter), the maximal
integration time is T < 0.4 sec.
Thus, the output of the central signal processing step consists of complex correlation matrices of size
512 × 512, times 250,000 channels, each 0.4 sec, times 4 polarizations.
Depending on P , it can happen that the correlator produces several times more output
“data” than flows in: the input vectors of size P are transformed into matrices of size P × P ,
and then averaged. If the STI T is too short, it is more efficient to work with the original
data (X-matrices) rather than correlated data (R-matrices). The correlation matrices will be
rank-deficient, indicating redundancy.
“Baseline dependent averaging” (analyzed in [13]) exploits the fact that up to 90% of the base-
lines are short and can be integrated over longer times and larger bandwidths. This may
significantly reduce the datarates.
The hardware bottleneck is perhaps not the required computational complexity (flops) but the
required bandwidth to transport all the data. In particular, a seemingly trivial operation is
this: data comes in from the various stations, each with a large number of subbands, and has
to be rearranged such that for each subband the data for all stations is together. This does not
involve computations but is a massive communication operation nonetheless.
Resolution So far, we have seen that B and D determine the number of subband channels and
the integration length. They also determine the resolution. To see this, consider first a single
station. If the antennas are placed sufficiently dense in a rectangular or circular aperture with
diameter D, then we have seen in (2.21) that the beamwidth is θs ∼ λ/D. This determines the
instantaneous field of view (FOV) of a station, which acts as a dish element in the entire array.
If the array has a diameter B, then its beamwidth is θ ∼ λ/B. This beam partitions the FOV.
Let

    N := \frac{θ_s}{θ} = \frac{B}{D} .
The field of view is covered by N beams in each dimension: we have a resolution of N “pixels” in
each dimension. Thus, without superresolution techniques, we can expect to create an image of
size N × N pixels. For the SKA1-Low example: D = 35 m, B = 100 km, resulting in N ∼ 3000.
Interestingly, N is independent of frequency, but the FOV is frequency dependent. Thus, al-
though the resulting image is size N ×N , the area on the sky which the image covers is frequency
dependent. Lower frequencies (larger λ) cover a larger area.
Number of subband channels For the complete instrument, the longest baseline B determines
the narrowband constraint. The maximum subband bandwidth is (taking into account the
reduced field of view)
    W_{chan} ≪ \frac{D}{B} f_{min} .

For example, we can set

    W_{chan} = 0.1 · \frac{D}{B} f_{min} .

The number of subband channels then becomes

    N_{chan} = \frac{W_{tot}}{W_{chan}} = 10 · \frac{B}{D} · \frac{W_{tot}}{f_{min}} .
Thus, in first approximation, N ∼ B/D determines each dimension of the image, and the number
of frequency bins, i.e., the entire “image cube”. The main drivers for complexity are thus P² and
the ratios B/D (the instrument to station diameter) and W_{tot}/f_{min} (the fractional bandwidth).
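The following back-of-the-envelope script simply evaluates the design rules collected in this section for the quoted SKA1-Low numbers; the constants (the 0.1 safety factor and the 1200 s factor) come from the expressions above and the remaining values are illustrative only.

    # Sketch: evaluating the design rules of this section for the SKA1-Low example numbers.
    c    = 3.0e8          # speed of light [m/s]
    B    = 100e3          # longest baseline [m]
    D    = 35.0           # station diameter [m]
    fmin = 50e6           # lowest observing frequency [Hz]
    Wtot = 250e6          # total bandwidth, 50-300 MHz [Hz]

    N_pix  = B / D                          # image size per dimension, N ~ B/D
    W_chan = 0.1 * (D / B) * fmin           # subband width from the narrowband condition
    N_chan = Wtot / W_chan                  # number of subband channels
    T_sti  = 1200 * D / B                   # max integration time (STI) [s]

    print(f"N ~ {N_pix:.0f} pixels per dimension")
    print(f"W_chan ~ {W_chan/1e3:.2f} kHz, N_chan ~ {N_chan:.2e}")
    print(f"STI ~ {T_sti:.2f} s")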
4.6 NOTES
Bibliography
[1] W.C. Jakes, ed., Microwave Mobile Communications. New York: John Wiley, 1974.
[2] F. Adachi et al., “Cross correlation between the envelopes of 900 MHz signals received
at a mobile radio base station site,” IEE Proceedings, vol. 133, pp. 506–512, Oct. 1986.
[3] W.C.Y. Lee, “Effects on correlation between two mobile radio basestation antennas,” IEEE
Tr. Comm., vol. 21, pp. 1214–1224, Nov. 1973.
[4] W.C.Y. Lee, Mobile Communications Design Fundamentals. New York: John Wiley, 1993.
[5] B. Sklar, “Rayleigh fading channels in mobile digital communication systems, part I: Char-
acterization,” IEEE Communications Magazine, vol. 35, pp. 90–100, July 1997.
[6] A.J. Paulraj and C.B. Papadias, “Space-time processing for wireless communications,”
IEEE Signal Proc. Mag., vol. 14, pp. 49–83, November 1997.
[7] T. Trump and B. Ottersten, “Estimation of nominal direction of arrival and angular spread
using an array of sensors,” Signal Processing, vol. 50, pp. 57–69, April 1996.
[8] B. Ottersten, “Array processing for wireless communications,” in Proc. IEEE workshop on
Stat. Signal Array Proc., (Corfu), pp. 466–473, June 1996.
[9] T.S. Rappaport, Wireless Communications: Principles and Practice. Upper Saddle River,
NJ: Prentice Hall, 1996.
[10] K. Pahlavan and A.H. Levesque, “Wireless data communications,” Proc. IEEE, vol. 82,
pp. 1398–1430, September 1994.
[11] E.A. Lee and D.G. Messerschmitt, Digital Communication. Boston: Kluwer Publishers,
1988.
[12] P.E. Dewdney, W. Turner, R. Millenaar, R. McCool, J. Lazio, and T.J. Cornwell, “SKA1
system baseline design,” Tech. Rep. SKA-TEL-SKO-DD-001, SKA Office, 2013.
[13] S.J. Wijnholds, A.G. Willis, and S. Salvini, “Baseline-dependent averaging in radio inter-
ferometry,” Monthly Notices of the Royal Astronomical Society, vol. 476, pp. 2029–2039,
Feb. 2018.
[14] A.J. van der Veen, S.J. Wijnholds, and A.M. Sardarabadi, “Signal processing for radio
astronomy,” in Handbook of Signal Processing Systems, 3rd ed., Springer, November 2018.
ISBN 978-3-319-91734-4.
Contents
5.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 The QR factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 The singular value decomposition (SVD) . . . . . . . . . . . . . . . . 99
5.5 Pseudo-inverse and the Least Squares problem . . . . . . . . . . . . 105
5.6 The eigenvalue problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.7 The generalized eigenvalue decomposition . . . . . . . . . . . . . . . 109
5.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1 BASICS
5.1.1 Notation
A bold-face letter, such as x, denotes a vector (usually a column vector, but occasionally a row
vector). Matrices are written with capital bold letters. A matrix A has entries a_{ij} , and columns a_i .

¹ On a first reading, the more advanced topics should probably be skipped until needed in a future chapter, as indicated there.
In general,

    y_i = \sum_j A_{ij} x_j ,   i = 1, \cdots , M .

Likewise, two matrices A and B can be multiplied if their “inner dimensions” match (i.e., the
number of columns of A equals the number of rows of B). In that case

    C = AB   ⇔   C_{ij} = \sum_k A_{ik} B_{kj} .
    ‖a‖ ≥ 0 ,  \qquad  ‖a‖ = 0 ⇔ a = 0 .

The Cauchy-Schwarz inequality states that

    |b^H a| ≤ ‖a‖ ‖b‖ ,

and this allows us to define the angle θ between two vectors via

    b^H a = ‖a‖ ‖b‖ \cos(θ) .
The induced matrix 2-norm of a matrix A (also called the spectral norm, or the operator norm)
is
    ‖A‖ := \max_x \frac{‖Ax‖}{‖x‖} .
It represents the largest magnification of a vector x that can be obtained by applying A to it.
Another expression for this is
    ‖A‖^2 = \max_x \frac{x^H A^H A x}{x^H x} .

The Frobenius norm of A is defined as

    ‖A‖_F = \Big( \sum_{ij} |A_{ij}|^2 \Big)^{1/2} .
5.1.5 Trace
For a square matrix A, the trace of A is the sum of its diagonal entries:
    tr[A] = \sum_i A_{ii} .
The Kronecker product of an M × N matrix A with a matrix B is defined as

    A ⊗ B = \begin{bmatrix} a_{11} B & \cdots & a_{1N} B \\ \vdots & & \vdots \\ a_{M1} B & \cdots & a_{MN} B \end{bmatrix} ,

and the Schur-Hadamard (entrywise) product as

    A ⊙ B = \begin{bmatrix} a_{11} b_{11} & \cdots & a_{1N} b_{1N} \\ \vdots & & \vdots \\ a_{M1} b_{M1} & \cdots & a_{MN} b_{MN} \end{bmatrix} ,
provided A and B have the same size.
A rank-one matrix has the form abT . It can be written using the Kronecker product as
    vec(a b^T) = b ⊗ a ,

and similarly, for complex vectors,

    vec(a b^H) = b^* ⊗ a .        (5.1)
This can be readily shown by writing the products in full:

    a b^H = \begin{bmatrix}
      a_1 \bar b_1 & a_1 \bar b_2 & \cdots & a_1 \bar b_N \\
      a_2 \bar b_1 & a_2 \bar b_2 & \cdots & a_2 \bar b_N \\
      \vdots       & \vdots       &        & \vdots       \\
      a_M \bar b_1 & a_M \bar b_2 & \cdots & a_M \bar b_N
    \end{bmatrix} ,
    \qquad
    b^* ⊗ a = \begin{bmatrix}
      \bar b_1 a_1 \\ \bar b_1 a_2 \\ \vdots \\ \bar b_1 a_M \\
      \bar b_2 a_1 \\ \bar b_2 a_2 \\ \vdots \\ \bar b_2 a_M \\
      \vdots \\
      \bar b_N a_1 \\ \bar b_N a_2 \\ \vdots \\ \bar b_N a_M
    \end{bmatrix} .
The identity vec(ABC) = (C^T ⊗ A) vec(B) (property (5.7) below) can be proven using (5.1), by writing
ABC as a sum of rank-one components of the form (a_i c_j^T) b_{ij} , where a_i is the ith column of A,
and c_j^T the jth row of C.

◦ denotes the Khatri-Rao product, i.e., a column-wise Kronecker product:

    A ◦ B := [a_1 ⊗ b_1 \;\; a_2 ⊗ b_2 \;\; \cdots ] .
    vec(a b^H) = b^* ⊗ a                                            (5.2)
    (A ⊗ B)(C ⊗ D) = AC ⊗ BD                                        (5.3)
    (A ⊗ B)(C ◦ D) = AC ◦ BD                                        (5.4)
    (A ◦ B)^H (C ◦ D) = A^H C ⊙ B^H D                               (5.5)
    (a ⊗ B) C = a ⊗ BC                                              (5.6)
    vec(ABC) = (C^T ⊗ A) vec(B)                                     (5.7)
    vec(A diag(b) C) = (C^T ◦ A) b                                  (5.8)
    [a ⊗ b][c ⊗ d]^H = a c^H ⊗ b d^H                                (5.9)
                     = a ⊗ b c^H ⊗ d^H                              (5.10)
                     = c^H ⊗ a d^H ⊗ b
    tr(AB) = vec^T(A^T) vec(B)                                      (5.11)
           = vec^H(A^H) vec(B)                                      (5.12)
    tr(ABCD) = vec^T(A^T)(D^T ⊗ B) vec(C) = vec^H(A^H)(D^T ⊗ B) vec(C)    (5.13)
    tr(A ⊗ B) = tr(A) tr(B)                                         (5.14)
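These identities are easy to check numerically. The sketch below (random matrices of arbitrary sizes, not from the reader) verifies (5.4) and (5.7) with NumPy; vec(·) corresponds to column-major flattening.

    import numpy as np

    rng = np.random.default_rng(1)
    def crandn(*shape):
        return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

    def khatri_rao(A, B):
        return np.column_stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])])

    A, B = crandn(3, 4), crandn(2, 5)
    C, D = crandn(4, 6), crandn(5, 6)
    lhs = np.kron(A, B) @ khatri_rao(C, D)
    rhs = khatri_rao(A @ C, B @ D)
    print(np.allclose(lhs, rhs))                       # (5.4): (A x B)(C o D) = AC o BD

    A, B, C = crandn(3, 4), crandn(4, 5), crandn(5, 6)
    lhs = (A @ B @ C).flatten(order="F")               # vec() stacks the columns
    rhs = np.kron(C.T, A) @ B.flatten(order="F")
    print(np.allclose(lhs, rhs))                       # (5.7): vec(ABC) = (C^T x A) vec(B)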
Let A be a P ×Q matrix. Clearly, vec(A) and vec(AT ) contain the same elements, but organized
differently: they are related by a permutation matrix. This matrix is called the commutation
matrix, and denoted by KP,Q :
vec(AT ) = KP,Q vec(A) .
For any P × Q matrix A and M × N matrix B we have
equal to A, possibly larger than 1. If A = a a^H has rank 1, and B = I, then A ⊙ B = diag(A)
can have full rank while A has rank 1. An exception is that a b^H ⊙ c d^H does have rank 1. This
is shown by considering

    a b^H ⊗ c d^H = (a ⊗ c)(b ⊗ d)^H ,

(which is rank 1) and noting that a b^H ⊙ c d^H is a submatrix of a b^H ⊗ c d^H .
5.2 SUBSPACES
The span of a collection of vectors x_1 , \cdots , x_n is the subspace

    H := { α_1 x_1 + \cdots + α_n x_n \mid α_i ∈ ℂ , ∀i } .

The vectors are called (linearly) independent if

    α_1 x_1 + \cdots + α_n x_n = 0   ⇔   α_1 = \cdots = α_n = 0 .
5.2.2 Basis
An independent collection of vectors {xi } that together span a subspace is called a basis for that
subspace.
If the vectors are orthogonal (x_i^H x_j = 0 for i ≠ j), it is an orthogonal basis.
If moreover, the vectors have norm 1: kxi k = 1, the basis is called orthonormal.
The basis for a subspace is not unique.
Often, we stack the basis vectors xi into a matrix X and, with abuse of terminology, call that
matrix a basis.
5.2.3 Rank
The rank of a matrix X is the number of independent columns (or rows) of X.
A prototype rank-1 matrix is X = a b^H , a prototype rank-2 matrix is X = a_1 b_1^H + a_2 b_2^H , etcetera:
a rank-d matrix is

    X = a_1 b_1^H + a_2 b_2^H + \cdots + a_d b_d^H = A B^H

where A = [a_1 , a_2 , \cdots , a_d ] and B = [b_1 , b_2 , \cdots , b_d ]. Thus, a matrix factorization X = AB^H
where the “inner dimension” is d shows that the rank of X is (at most) d. This decomposition
is called a dyadic decomposition; it is not unique.
The rank cannot be larger than the smallest size of the matrix (when it is equal, the matrix is
full rank, otherwise it is rank deficient). A tall matrix is said to have full column rank if the
rank is equal to the number of columns: the columns are independent. Similarly, a wide matrix
has full row rank if its rank equals the number of rows.
5.2.5 Projection
A square matrix P is a projection if PP = P. It is an orthogonal projection if also PH = P.
The norm of an orthogonal projection is k P k = 1. For an isometry Û, the matrix P = ÛÛH is
an orthogonal projection onto the space spanned by the columns of Û. This is the general form
of an orthogonal projection.
Suppose U = [Û  Û⊥] is unitary, where Û has d columns and Û⊥ has M − d columns. Then,

1. from U^H U = I_M :   Û^H Û = I_d ,   (Û⊥)^H Û⊥ = I_{M−d} ,   Û^H Û⊥ = 0 ;
2. from U U^H = I_M :   Û Û^H + Û⊥ (Û⊥)^H = I_M .
This shows that any vector x ∈ ℂ^M can be decomposed into x = x̂ + x̂⊥ , where x̂ ⊥ x̂⊥ ,

    x̂ = P_c x ∈ ran(Û) ,  \qquad  x̂⊥ = P_c^⊥ x ∈ ran(Û⊥) .

The matrices Û Û^H = P_c and Û⊥ (Û⊥)^H = P_c^⊥ are the orthogonal projectors onto the column
span of X and its orthogonal complement in ℂ^M , respectively.
Similarly, we can find a matrix V̂H whose rows span the row span of X, and augment it with a
matrix V̂⊥ to a unitary matrix V:
    V = [ V̂  V̂⊥ ] ,   where V̂ has d columns and V̂⊥ has N − d columns (V is N × N ).
The matrices V̂ V̂^H = P_r and V̂⊥ (V̂⊥)^H = P_r^⊥ are orthogonal projectors onto the subspaces
in ℂ^N spanned by the columns of V̂ and V̂⊥ , respectively. The columns of V̂⊥ span
the kernel (or nullspace) of X, i.e., the space of vectors a for which Xa = 0.
The interpretation is that q1 is a normalized vector with the same direction as x1 , similarly
[q1 q2 ] is an isometry spanning the same space as [x1 x2 ], etcetera.
If X : M × N is a tall matrix (M ≥ N ), then there is a decomposition
" #
R̂
X = QR = [Q̂ Q̂⊥ ] = Q̂R̂
0
Here, Q is a unitary matrix, R̂ is upper triangular and square. R is upper triangular with
M − N zero rows added. X = Q̂R̂ is called an “economy-size” QR.
If R̂ is nonsingular (all entries on the main diagonal are nonzero), then d = N , the columns
of Q̂ form a basis of the column span of X, and Pc = Q̂Q̂H . If R̂ is rank-deficient, then this
is not true: the column span of Q̂ is too large. However, the QR factorization can be used as
a start in the estimation of an orthogonal basis for the column span of X. Although this has
sometimes been attempted, it is numerically not very robust to use the QR directly to estimate
the rank of a matrix. (Modifications such as a “rank-revealing QR” do exist.)
(for different Q and R). Now, X and R̂ have the same singular values and left singular vectors.
For a given (complex) matrix X of size m × n, where we assume m > n, the SVD is defined by
    X = U Σ V^H ,   U = [u_1 \cdots u_n] ,   Σ = \begin{bmatrix} σ_1 & & & \\ & σ_2 & & \\ & & \ddots & \\ & & & σ_n \end{bmatrix} ,   V = [v_1 \cdots v_n]        (5.17)
where U : m × n is a tall matrix of the same size as X, and V and Σ are n × n. Note that
UH U = I but UUH 6= I because it is an m × m matrix of rank n.
By using these properties, we can readily show:
    Σ = U^H X V ,  \qquad  U Σ = X V ,  \qquad  Σ V^H = U^H X .
The singular values give important information on the dominant directions in the column span
and row span of X. This is seen by writing out the matrix equations, which gives the dyadic
decomposition
    X = U Σ V^H = u_1 σ_1 v_1^H + u_2 σ_2 v_2^H + \cdots + u_n σ_n v_n^H .        (5.19)
Each term of the form uk σk vkH is a rank-1 matrix. If σ1 is large, then the corresponding
component u1 v1H is dominantly present in X, and u1 is the dominant direction in the column
span of X. In fact, u1 σ1 v1H is the best rank-1 approximation of X (in the Least Squares sense).
If σn is zero, then one dimension is missing in the matrix: it is rank deficient by order 1. In
general, if X is of rank d, then only d singular values σ1 , · · · , σd are nonzero. This is also seen
from (5.19) because it will then consist of the sum of d rank-1 components. The best rank-d
approximation of a matrix X is obtained by setting σd+1 = · · · = σn = 0.
Suppose X has rank d, with d < n. Similar to the economy-size SVD, we can write
    X = Û Σ̂ V̂^H ,   Û = [u_1 \cdots u_d] ,   Σ̂ = diag(σ_1 , σ_2 , \cdots , σ_d) ,   V̂ = [v_1 \cdots v_d]        (5.20)
where Σ̂ now has size d × d, and only the nonzero singular values are kept.
Since Û and V̂ are “tall” isometric matrices, we can complement them with orthonormal columns
ud+1 , · · · , um and vd+1 , · · · , vn , respectively, to square unitary matrices,
U = [Û , U⊥ ] , V = [V̂ , V⊥ ] .
We can augment Σ̂ accordingly with zero entries along the diagonal and elsewhere, to arrive at
the original decomposition X = UΣVH in (5.17). Thus, in Σ, the number of nonzero singular
values shows the rank of X.
The columns of Û provide an orthonormal basis for the column span of X. Likewise, the columns
of V̂ are an orthonormal basis for the column span of XH . The complementary matrices also
have a meaning: since V̂H V⊥ = 0, and hence XV⊥ = 0, it is seen that the columns of V⊥ span
the null space of X. Likewise, the columns of U⊥ span the left null space, U⊥H X = 0.
The SVD of X in (5.20) reveals the behavior of the map b = Xa: a is projected onto the column
span of V̂ and rotated in n-space (by VH a), then scaled (by the entries of Σ̂), and finally rotated
in m-space (by Û) to give b.
The latter expression shows that multiplication of X by a unitary matrix does not change the
Frobenius norm. Thus, since Σ = UH XV, we find that
    ‖X‖_F^2 = \sum_i σ_i^2 .
Recall the “induced 2 norm” or matrix 2-norm kXk2 , which measures how much a matrix can
increase the 2-norm of a vector v:
    ‖X‖ = \max_v \frac{‖Xv‖}{‖v‖}        (5.21)
Without loss of generality, we may normalize the vectors v such that ‖v‖ = 1. We can also
insert the SVD; writing v' = V^H v (so that ‖v'‖ = ‖v‖ = 1), we obtain ‖Xv‖^2 = ‖U Σ V^H v‖^2 = ‖Σ v'‖^2 = \sum_i σ_i^2 |v'_i|^2 .
From this we can deduce that the vector v that maximizes the norm is given by v = v1 , the
dominant right singular vector. The matrix 2-norm of X is then seen to be equal to σ1 . The
fact that the 2-norm is attained on the dominant singular vector v1 gives a recursive way to
prove the existence of the SVD: find v1 on which the norm is attained, and the corresponding
σ1 and u1 using
Xv1 = u1 σ1 ,
then subtract (or project out) this rank-1 component and consider the residual X0 , and repeat [2].
An important property that follows from the definition of the norm (5.21) is
kXvk ≤ kXkkvk ∀v
If X̂ = Û Σ̂ V̂^H denotes the rank-d truncation of the SVD (only the d largest singular values are kept), then the approximation errors are

    ‖ X − X̂ ‖_F^2 = \sum_{i=d+1}^{n} σ_i^2 ,  \qquad  ‖ X − X̂ ‖_2^2 = σ_{d+1}^2 .
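The truncation and its error formulas can be illustrated with a few lines of NumPy (random test matrix, d chosen arbitrarily; this is only a sketch):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((8, 5))
    U, s, Vh = np.linalg.svd(X, full_matrices=False)

    d = 2
    X_hat = U[:, :d] @ np.diag(s[:d]) @ Vh[:d, :]      # keep only the d largest singular values

    err_fro = np.linalg.norm(X - X_hat, 'fro')**2      # should equal the sum of discarded sigma_i^2
    err_2   = np.linalg.norm(X - X_hat, 2)             # should equal sigma_{d+1}
    print(err_fro, np.sum(s[d:]**2))
    print(err_2, s[d])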
Figure 5.1. Construction of the left singular vectors and values of the matrix X = [x1 x2 ],
where x1 and x2 have equal length.
Example 5.1. Figure 5.1 shows the construction of the left singular vectors of a matrix
X = [x1 x2 ], whose columns x1 and x2 are of equal length. The largest singular
vector u1 is in the direction of the sum of x1 and x2 , i.e., the “common” direction
of the two vectors, and the corresponding singular value σ1 is equal to σ1 = ‖x_1 + x_2‖/√2.
On the other hand, the smallest singular vector u2 is dependent on the difference x_2 − x_1 , as is
its corresponding singular value: σ2 = ‖x_2 − x_1‖/√2. If
x1 and x2 become more aligned, then σ2 will be smaller and X will be closer to a
singular matrix. Clearly, u2 is the most sensitive direction for perturbations on x1
and x2 .
An example of such a matrix could be A = [a(φ1 ) a(φ2 )], where a(φ) =
[1 φ φ2 · · · φM −1 ]T , where φ is for example related to the direction at which
a signal hits an antenna array, or to the time difference to a reference signal. If
Table 5.1. Singular values of X = [a(φ_1) a(φ_2)] S for several values of M and N :

    M = 3, N = 3 :   σ_1 = 3.44 ,  σ_2 = 0.44
    M = 3, N = 6 :   σ_1 = 4.86 ,  σ_2 = 0.63
    M = 6, N = 3 :   σ_1 = 4.73 ,  σ_2 = 1.29
two directions are close together, then φ1 ≈ φ2 and a(φ1 ) points in about the same
direction as a(φ2 ), which will be the direction of u1 . The smallest singular value, σ2 ,
is dependent on the difference of the directions of a(φ1 ) and a(φ2 ).
For further illustration, consider the following small numerical experiment. Let φ1 = 1,
φ2 = exp(jπ · 0.1), and construct M × N matrices X = [a(φ1) a(φ2)]S, where SS^H = N I.
Since (1/√N) S is co-isometric, the singular values of X are those of A = [a(φ1) a(φ2)] times √N.
The two non-zero singular values of X for some values of M, N are given in Table 5.1. It is
seen that doubling M almost triples the smallest singular value, whereas doubling N only
increases the singular values by a factor √2, which is because the matrices have larger size.
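A sketch of this experiment is given below; since the matrix S is drawn at random here (so that only SS^H ≈ N I), the resulting singular values will only approximately reproduce Table 5.1.

    import numpy as np

    def a(phi, M):                                      # a(phi) = [1, phi, ..., phi^(M-1)]^T
        return phi ** np.arange(M)

    rng = np.random.default_rng(3)
    phi1, phi2 = 1.0, np.exp(1j * np.pi * 0.1)

    for M, N in [(3, 3), (3, 6), (6, 3)]:
        A = np.column_stack([a(phi1, M), a(phi2, M)])
        S = rng.choice([-1.0, 1.0], size=(2, N)) + 0j   # random +-1 symbols, S S^H ~ N I
        X = A @ S
        print(M, N, np.linalg.svd(X, compute_uv=False)[:2])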
    X† = V Σ^{-1} U^H ,        (5.22)
which is a slightly more general expression. Essentially, we are inverting the singular values
here. We can easily verify that X† X = I and
    X X† = U U^H ,  \qquad  X^H X = V Σ^2 V^H ,  \qquad  X X^H = U Σ^2 U^H .
The point of the SVD is that it gives similar information as we obtain from an eigenvalue
decomposition, but (i) it is applicable to any matrix (e.g., non-square matrices), and (ii) it
always exists, whereas the eigenvalue decomposition only exists for “regular” matrices. Also,
there are numerically very robust algorithms to compute the decomposition.
    X̂† = V̂ Σ̂^{-1} Û^H .
This is called the Moore-Penrose pseudo-inverse of X̂. It satisfies the projection properties
X̂ X̂† = P_c and X̂† X̂ = P_r , where P_c is a projection onto the dominant column span of X, and
P_r a projection onto the dominant row span.
The largest singular value of the Moore-Penrose (truncated) pseudo-inverse is σ_d^{-1} , whereas
without truncation it was σ_n^{-1} . This gives a way to control the norm of the inverse, by inverting
only dominant directions in X, and projecting away the other dimensions.
This pseudo-inverse (matlab: pinv) is commonly used if we are not sure if a matrix is full rank.
Typically, we compare the singular values of X to a threshold (ε) and replace them by 0 if they
are below the threshold, leading to X̂. Next, we compute X̂† by inverting the non-zero singular
values.
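A minimal sketch of this thresholded pseudo-inverse is shown below; NumPy's built-in np.linalg.pinv implements a comparable truncation through its rcond argument (the test matrix is invented).

    import numpy as np

    def truncated_pinv(X, tol):
        U, s, Vh = np.linalg.svd(X, full_matrices=False)
        keep = s > tol                                   # singular values below tol are treated as 0
        return (Vh[keep].conj().T / s[keep]) @ U[:, keep].conj().T

    X = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-9], [0.0, 1e-9]])
    print(truncated_pinv(X, tol=1e-6))
    print(np.linalg.pinv(X, rcond=1e-6))                 # comparable built-in behaviour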
    b' = b + e   ⇒   x' = x + A^{-1} e

    ‖ A^{-1} e ‖ ≤ σ_n^{-1} ‖ e ‖ ,  \qquad  ‖ b ‖ ≤ σ_1 ‖ x ‖

    \frac{‖ x' − x ‖}{‖ x ‖} ≤ σ_n^{-1} \frac{‖ e ‖}{‖ x ‖} ≤ σ_n^{-1} σ_1 \frac{‖ e ‖}{‖ b ‖}
This measures the relative change in the solution vector x, and shows that any error in b is
potentially magnified by a factor equal to the condition number.
If a matrix has a poor condition number, the usual strategy is not to invert it directly, but to
do a rank reduction to rank d, and compute the pseudo-inverse as shown before. This will avoid
noise enhancement due to the inversion of non-important components.
    X† X = I_N ,  \qquad  X X† = P_c .
Thus, X† is an inverse on the “short space”, and XX† is a projection onto the column span of
X. It is easy to verify that the solution to b = Xa is given by a = X† b.
If X is rank deficient, then XH X is not invertible, and there is no exact solution to b = Xa. In
this case, we can resort to the Moore-Penrose pseudo-inverse of X, also denoted by X† . It can
be defined in terms of the “economy size” SVD X = ÛΣ̂V̂H (equation (5.18)) as
    X† = V̂ Σ̂^{-1} Û^H .
    1.  X X† X = X              3.  X X† = P_c
    2.  X† X X† = X†            4.  X† X = P_r
In that case, a = X† b yields the best approximation b' = P_c b , i.e., it solves

    \min_a ‖ b − X a ‖^2 ,
where a is chosen to have minimal norm if there is more than one solution (the latter requirement
translates to a = Pr a).
Some other properties of the pseudo-inverse are
and define the projection Pc = ÛÛH . We now take the TLS (column space) approximations
to be X̂ = Pc X = ÛΣ̂V̂1H and Ŷ = Pc Y = ÛΣ̂V̂2H , where V̂1 and V̂2 are the partitions of V̂
corresponding to X and Y respectively. X̂ and Ŷ have the same column span defined by Û,
and are in fact solutions to
    \min_{[X̂ \; Ŷ] \text{ of rank } N}  ‖ [X \; Y] − [X̂ \; Ŷ] ‖_F^2
and A satisfying X̂A = Ŷ is obtained as A = X̂† Ŷ. This A is the TLS solution of XA ≈ Y.
Instead of asking for rank N , we might even insist on a lower rank d.
This decomposition might not exist if eigenvalues are repeated. A classical example of a matrix
that does not have an eigenvalue decomposition is
" #
0 1
A= .
0 0
Inserting a QR factorization T = Q R_T of the eigenvector matrix into A = T Λ T^{-1} gives

    A = Q R_T Λ R_T^{-1} Q^H = Q R Q^H .
The factorization
    A = Q R Q^H ,
with Q unitary and R upper triangular, is called a Schur decomposition. One can show that this
decomposition always exists (although it is not unique); if A is hermitian, then R = R^H, and an
upper triangular hermitian matrix must be diagonal, so in this case the Schur decomposition coincides
with the eigenvalue decomposition (and the SVD). For non-hermitian A, the Schur decomposition avoids
the inversion of the eigenvector matrix T, which might be ill-conditioned (or even non-invertible
in some cases).
R has the eigenvalues of A on the diagonal. Q gives information about “eigen-subspaces”
(invariant subspaces), but doesn’t contain eigenvectors.
This shows that the eigenvalues of XXH are the singular values of X, squared (hence real). The
eigenvectors of XXH are equal to the left singular vectors of X (hence U is unitary). Since the
SVD always exists, the eigenvalue decomposition of XXH always exists (in fact it exists for any
Hermitian matrix C = CH ).
Historically, the SVD was derived out of frustration that the eigenvalue decomposition does
not always exist. By generalizing the eigenvector matrix T to two unitary matrices U, V, a
decomposition was found that does always exist. Despite this connection, the decompositions
are generally different and have different applications.
Ax = λBx ⇔ (A − λB)x = 0
This is a generalization of (5.23) where we had B = I. The solutions λi are called the generalized
eigenvalues. The set of matrices A − λB (for any λ) or the pair (A, B) is called a matrix pencil.
In the above formulation, A and B have the same size but could be rectangular. If A is square
and B is invertible, then we can immediately return to the usual eigenvalue decomposition, by
considering
B−1 Ax = λx
Thus, the generalized eigenvalues of a pair (A, B) are the eigenvalues of B−1 A. Generally, we
try to avoid inverting B, for numerical reasons, or because A and B might have structure that
is otherwise lost. E.g., if B is banded, its inverse is in general not banded.
As in (5.24), we can collect the eigenvectors in a matrix T, such that
AT = BTΛ
where Λ is diagonal. We can also write the solution as a joint matrix decomposition
    A = F Λ_A T^{-1}
    B = F Λ_B T^{-1}        (5.25)

where Λ_B^{-1} Λ_A contains the generalized eigenvalues, and F is an (invertible?) matrix with unit-norm
columns. Indeed, from AT = BTΛ, after having found T we can set W = AT, and then
normalize the columns of W to find W = FΛA . This decomposition is called the Generalized
Eigenvalue Decomposition (GEV).
The form (5.25) shows an application of the GEV, namely a joint diagonalization of two matrices
(A, B). This application is studied in more detail in Chap. 9.
The existence of this decomposition is similar to that of the eigenvalue decomposition: in many
cases, it does not exist, and/or F and T are not invertible. However, if A and B are hermitian
and one of them (typically B) is positive definite, the decomposition exists. For more general
cases, numerical algorithms are more complicated and might run into problems.
Just as the SVD is connected to the eigenvalue decomposition, the generalized eigenvalue de-
composition leads to a Generalized SVD (GSVD):
    A = F Σ_A U^H
    B = F Σ_B V^H        (5.26)
where U, V are unitary, ΣA , ΣB are diagonal and nonnegative real, and F is an invertible matrix
with unit-norm columns. This decomposition always exists; the matrices A and B do not need
to be square, they can even have a different number of columns. Note that if B is square and
invertible, then B^{-1} A = V Σ_B^{-1} Σ_A U^H constitutes an SVD of B^{-1} A.
Actually, various definitions of the GSVD exist. In the usual formulation [2],
    A = U Σ_A X^{-1}
    B = V Σ_B X^{-1}
where U, V are unitary, X is invertible, and Σ2A + Σ2B = I, which gives a connection to the CS
decomposition (“cosine-sine”). However, this formulation has problems in case there is a vector
x in the nullspace of both A and B. In our applications, we often have that case. Therefore,
we will use (5.26).
The Generalized Schur Decomposition (GSD), also called the QZ decomposition, is
    A = Q R_A Z^H
    B = Q R_B Z^H        (5.27)
where Q, Z are unitary, and RA , RB are upper triangular. This decomposition always exists. It
follows from the GEV (5.25) by inserting a QR decomposition for F and another one for T−1 .
The generalized eigenvalues of (A, B) are found by the ratios of the diagonal entries of RA and
RB . The advantage of this decomposition is that it is more stable to compute as it involves
only unitary matrices. This facilitates its computation using 2 × 2 Givens rotations. Generally,
the QZ algorithm is used: the core of this consists of an iteration where the QR decomposition
of B−1 A is implicitly computed, without forming the product.
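For completeness, a small SciPy sketch (random hermitian test matrices, invented for the example) that computes the generalized eigenvalues and the QZ form without ever forming B^{-1}A:

    import numpy as np
    from scipy.linalg import eig, qz

    rng = np.random.default_rng(4)
    M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
    A = M + M.conj().T                         # hermitian A
    M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
    B = M @ M.conj().T + 4 * np.eye(4)         # hermitian positive definite B

    lam, T = eig(A, B)                         # generalized eigenvalues/eigenvectors of (A, B)
    AA, BB, Q, Z = qz(A, B, output='complex')  # QZ form: upper triangular AA, BB
    print(np.sort(lam.real))
    print(np.sort((np.diag(AA) / np.diag(BB)).real))   # same eigenvalues from the QZ form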
5.8 NOTES
A widely-used reference book on linear algebra is Golub & Van Loan [2]. More advanced
properties are found in Horn and Johnson [3]. An extensive tutorial to linear algebra in relation
to signal processing can be found in Moon and Stirling [1].
Bibliography
[1] T. K. Moon and W. C. Stirling, Mathematical methods and algorithms for signal processing.
Prentice Hall, 2000.
[2] G. Golub and C. Van Loan, Matrix Computations. The Johns Hopkins University Press,
1989.
[3] R. Horn and C. Johnson, Matrix Analysis. Cambridge, NY: Cambridge Univ. Press, 1985.
Contents
6.1 Deterministic approach to Matched and Wiener filters . . . . . . . . 114
In this chapter, we look at elementary receiver schemes: the matched filter and Wiener filter in
their non-adaptive forms. They are suitable if we have a good estimate of the channel, or if we
know a segment of the transmitted data, e.g., because of a training sequence. These receivers
are most simple in the context of narrowband antenna array processing, and hence we place the
discussion first in this scenario. The matched filter is shown to maximize the output signal-
to-noise ratio (in the case of a single signal in noise), whereas the Wiener receiver maximizes
the output signal-to-interference plus noise (in the case of several sources in noise). We also
look at the application of these receivers as non-parametric beamformers for direction-of-arrival
estimation. Improved accuracy is possible using parametric data models and subspace-based
techniques: a prime example is the MUSIC algorithm.
We assume that signals are received by M antennas, and that the antenna outputs (after de-
modulation, sampling, A/D conversion) are stacked into vectors xk . According to the model,
xk is a linear combination of d narrowband source signals si,k and noise nk . Initially, we will
consider an even simpler case where there is only one signal in noise. In all cases, we assume
that the noise covariance matrix
    R_n := E[n n^H]

is known, up to a scalar which represents the noise power. The simplest situation is spatially
white noise, for which
Rn = σ 2 I .
Starting from the data model (6.1), let us assume that we have collected N sample vectors. If
we store the samples in an M × N matrix X = [x1 , · · · , xN ], then we obtain that X has a
decomposition
X = AS + N (6.2)
where the rows of S ∈ C| d×N contain the samples of the source signals. Note that we can choose
to put the source powers in either A or S, or even in a separate factor B. Here we will assume
they are absorbed in A, thus the sources have unit powers. Sources may be considered either
stochastic (with probability distributions) or deterministic. If they are stochastic, we assume
they are zero mean, independent and hence uncorrelated,
    E[s_k s_k^H] = I .

If the sources are considered deterministic, we will assume similarly that

    \lim_{N→∞} \frac{1}{N} S S^H = I .
The objective of beamforming is to construct a receiver weight vector w such that the output

    y_k = w^H x_k = ŝ_k        (6.3)

is an estimate of one of the original sources. Which beamformer is “the best” depends on the
optimality criterion, of which there are many. It also makes a difference if we wish to receive
only a single signal, as in (6.3), or all d signals jointly,
    y_k = W^H x_k = ŝ_k        (6.4)
where W = [w1 , · · · , wd ].
We will first look at purely deterministic techniques to estimate the beamformers: here no
explicit statistical assumptions are made on the data. The noise is viewed as a perturbation on
the noise-free data AS, and the perturbations are assumed to be small and equally important
on all entries. Then we will look at more statistically oriented techniques. The noise will be
modeled as a stochastic sequence with a joint Gaussian distribution. Still, we have a choice
whether we consider sk to be a deterministic sequence (known or unknown), or if we associate a
probabilistic distribution to it, for example Gaussian or belonging to a certain alphabet such as
{+1, −1}. In the latter case, we can often improve on the linear receiver (6.3) or (6.4) by taking
into account that the output of the beamformer should belong to this alphabet (or should have
a certain distribution). The resulting receivers will then contain some non-linear components.
In this chapter, we only consider the most simple cases, resulting in the classical linear beam-
formers.
1. A is known, for example we know the directions of the sources and have set A =
[a(θ1 ) · · · a(θd )],
2. S is known, for example we have selected a segment of the data which contains a training
sequence for all sources. Alternatively, for discrete alphabet sources (e.g., Sij ∈ {±1}) we
can be in this situation via decision feedback.
    W^H = A† ,  \qquad  Ŝ = W^H X ,  \qquad  where  \quad  A† = (A^H A)^{-1} A^H .
Note that, indeed, under these assumptions AH A is invertible and A† A = I. If M < d then we
cannot recover the sources exactly: AH A is not invertible (it is a d × d matrix with maximal
rank M ), so that A† A 6= I.
where X† is a right inverse of X. If N ≥ d and the rows of X are linearly independent,1 then
    X† = X^H (X X^H)^{-1} .
This is verified by XX† = I. In both cases, we obtain a beamformer which exactly cancels all
interference, i.e., WH A = I.
Noisy case In the presence of additive noise, we have X = AS + N. Two types of linear least-
squares (LS) minimization problems can now be considered. The first is based on minimizing
the model fitting error,
with A or S known, respectively. The second type of minimization problem is based on mini-
mizing the output error,
also with A or S known, respectively. The minimization problems are straightforward to solve,
and in the same way as before.
The noise contribution at the output is A† N, and if A† is large, the output noise will be large.
To get a better insight for this, introduce the “economy-size” singular value decomposition of
A,
    A = U_A Σ_A V_A^H
where we take UA : m×d with orthonormal columns, ΣA : d×d diagonal containing the nonzero
singular values of A, and VA : d × d unitary. Since
    A = U_A Σ_A V_A^H   ⇒   A† = V_A Σ_A^{-1} U_A^H ,
¹ In the present noiseless case, note that there are only d linearly independent rows in S and X, so for linear independence of the rows of X we need M = d. With noise, X will have full row rank M.

A† is large if Σ_A^{-1} is large, i.e., if A is ill conditioned.
Similarly, for (6.5) with S known we obtain Â = X S† .

This does not specify the beamformer, but staying in the same context of minimizing ‖X − AS‖_F^2 ,
it is natural to take again a Zero-Forcing beamformer so that W^H = Â† . Asymptotically for zero-
mean noise independent of the sources, this gives Â → A: we converge to the true A-matrix.
so that
w1 ⊥ {a2 , · · · , ad } .
Thus, w1 projects out all other sources, except source 1,
    w_1^H x(t) = \sum_{i=1}^{d} w_1^H a_i s_i(t) + w_1^H n(t) = s_1(t) + w_1^H n(t) .
Deterministic output error minimization The second optimization problem (6.6) minimizes
the difference of the output signals to S. For known S, we obtain

    W = (X X^H)^{-1} X S^H = R̂_x^{-1} R̂_{xs}

(we assumed that the source powers are incorporated in A), so that

    R̂_{xs} = \frac{1}{N} X S^H = \frac{1}{N} A S S^H + \frac{1}{N} N S^H → A .

Asymptotically,

    W → R_x^{-1} A ,

where R_x = E[x x^H] is the true data covariance matrix.² With finite samples, we would set

    W = R̂_x^{-1} A .
This is known as the Linear Minimum Mean Square Error (LMMSE) or Wiener receiver. This
beamformer maximizes the Signal-to-Interference-plus-Noise Ratio (SINR) at the output. Since
it does not cancel all interference, WH A 6= I, the output source estimates are not unbiased.
However, it produces estimates of S with minimal deviation, which is often more relevant.
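The sketch below (all simulation parameters invented) contrasts the training-based Wiener receiver W = R̂_x^{-1} R̂_xs with the zero-forcing receiver on simulated narrowband array data.

    import numpy as np

    rng = np.random.default_rng(5)
    M, d, N, sigma = 6, 2, 500, 0.3
    A = (rng.standard_normal((M, d)) + 1j * rng.standard_normal((M, d))) / np.sqrt(2)
    S = rng.choice([-1.0, 1.0], size=(d, N)) + 0j
    Noise = sigma * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    X = A @ S + Noise

    Rx  = X @ X.conj().T / N                   # sample covariance
    Rxs = X @ S.conj().T / N                   # correlation with the known (training) sources
    W_wiener = np.linalg.solve(Rx, Rxs)        # LMMSE receiver, W = Rx^{-1} Rxs
    W_zf     = np.linalg.pinv(A).conj().T      # zero-forcing receiver, W^H = pinv(A)

    S_hat = W_wiener.conj().T @ X              # beamformed outputs (estimates of S)
    print(np.mean(np.abs(S_hat - S)**2))       # output MSE of the Wiener receiver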
Let us now define some performance criteria, based on elementary stochastic assumptions on
the data. For the case of a single signal in noise,
    x_k = a s_k + n_k ,  \qquad  y_k = w^H x_k = (w^H a) s_k + (w^H n_k) .

We assume

    E[|s_k|^2] = 1 ,  \quad  E[s_k n_k^H] = 0 ,  \quad  E[n_k n_k^H] = R_n ,

so that

    E[|y|^2] = (w^H a)(a^H w) + w^H R_n w .
The Signal to Noise Ratio (SNR) at the output can then be defined as
² We thus see that even if we adopt a deterministic framework, we cannot avoid making certain stochastic assumptions on the data.

With d signals (signal 1 of interest, the others considered interferers), we can write

    x_k = a_1 s_{1,k} + A' s'_k + n_k ,

where A' contains the columns of A except for the first one, and similarly for s'_k . Now we can
define two criteria: the Signal to Interference Ratio (SIR), and the Signal to Interference plus
Noise Ratio (SINR):
    W^H A = I   ⇒   w_1^H A = [1, 0, \cdots , 0]   ⇒   w_1^H a_1 = 1 ,  \quad  w_1^H A' = [0, \cdots , 0] ,
and it follows that sir1 (w1 ) = ∞. When W is estimated from a known S, the ZF receiver still
maximizes the SIR, but it is not infinity anymore.
Note that (6.10) defines only the performance with respect to the first signal. If we want to
receive all signals, we need to define a performance vector, with entries for each signal,
SIR(W) := [sir1 (w1 ) · · · sird (wd )]
SINR(W) := [sinr1 (w1 ) · · · sinrd (wd )] .
In graphs, we would usually plot only the worst performance of each vector, or the average of
each vector.
xk = Ask + nk (k = 1, · · · , N ) ⇔ X = AS + N .
Suppose that sk is deterministic, and that the noise samples are independent and identically dis-
tributed in time (temporally white), and spatially white (Rn = I) and jointly complex Gaussian
distributed, so that nk has a probability density
    n_k ∼ CN(0, σ^2 I)   ⇔   p(n_k) = \frac{1}{\sqrt{\pi}\,σ} e^{-\frac{‖n_k‖^2}{σ^2}} .
Because of temporal independence, the probability distribution of N samples is the product of
the individual probability distributions,
    p(N) = \prod_{k=1}^{N} \frac{1}{\sqrt{\pi}\,σ} e^{-\frac{‖n_k‖^2}{σ^2}} .
Since nk = xk −Ask , the probability to receive a certain vector xk (with a known or deterministic
sk ) is thus
    p(x_k | s_k) = \frac{1}{\sqrt{\pi}\,σ} e^{-\frac{‖x_k − A s_k‖^2}{σ^2}}
and hence
    p(X|S) = \prod_{k=1}^{N} \frac{1}{\sqrt{\pi}\,σ} e^{-\frac{‖x_k − A s_k‖^2}{σ^2}}
           = \Big(\frac{1}{\sqrt{\pi}\,σ}\Big)^{N} e^{-\frac{\sum_{k=1}^{N} ‖x_k − A s_k‖^2}{σ^2}}
           = \Big(\frac{1}{\sqrt{\pi}\,σ}\Big)^{N} e^{-\frac{‖X − AS‖_F^2}{σ^2}} .
p(X|S) is called the likelihood of receiving a certain data matrix X, for a certain transmitted data
matrix S. It is of course a probability density function, but in the likelihood interpretation we
regard it as a function of S, for an actual received data matrix X. The Deterministic Maximum
Likelihood technique estimates S as that matrix that maximizes the likelihood of having received
the actual received X, thus
    Ŝ = \arg\max_S \Big(\frac{1}{\sqrt{\pi}\,σ}\Big)^{N} e^{-\frac{‖X − AS‖_F^2}{σ^2}} .        (6.11)
If we take the negative logarithm of p(X|S), we obtain what is called the negative log-likelihood
function. Since the logarithm is a monotonically increasing function, taking it does not change
the location of the maximum. The maximization problem then becomes a minimization over
const + ‖X − AS‖_F^2 /σ^2 , or
    Ŝ = \arg\min_S ‖ X − A S ‖_F^2 .        (6.12)
This is the same model fitting problem as we had before in (6.5). Thus, the deterministic ML
problem is equivalent to the LS model fitting problem in the case of white Gaussian noise.
Stochastic output error minimization In a statistical framework, the output error problem
(6.6) becomes
    \min_w  E[\, |w^H x_k − s_k|^2 \,] .
The cost function is known as the Linear Minimum Mean Square Error. It can be worked out
as follows:
J(w) = E[|wH xk − sk |2 ]
= wH E[xxH ]w − wH E[xs̄k ] − E[sk xH ]w + E[|sk |2 ] .
At this point, note that there is a question whether we regard sk as a stochastic variable or
deterministic. If sk is stochastic with E[|sk |2 ] = 1, then
    J(w) = w^H R_x w − w^H a − a^H w + 1 .
Now differentiate with respect to w. This is a bit tricky since w is complex and functions
of complex variables may not be differentiable (a simple example of a non-analytic function is
f (z) = z̄). There are various approaches (e.g. [1, 8]). A consistent approach is to regard w
and w∗ as independent variables. Let w = u + jv with u and v real-valued, then the complex
gradients to w and w∗ are defined as [8]
∂ ∂
∂u1 J ∂v1 J
1 1 .. 1 ..
∇w J = (∇u J + j∇v J) = . + j
.
2 2 ∂
2
∂
∂ud J ∂vd J
∂ ∂
∂u1 J ∂v1 J
1 1 .. 1 ..
∇w ∗ J = (∇u J − j∇v J) = . − j
.
2 2 ∂
2
∂
∂ud J ∂vd J
with properties

    ∇_w (a^H w) = a^* ,   ∇_w (w^H a) = 0 ,   ∇_w (w^H R w) = R^T w^* ,
    ∇_{w^*} (w^H a) = a ,   ∇_{w^*} (a^H w) = 0 ,   ∇_{w^*} (w^H R w) = R w .
It can further be shown that for a stationary point, it is necessary and sufficient that either
∇w J = 0 or that ∇∗w J = 0: the two are equivalent. Since the latter expression is more simple,
and because it specifies the maximal rate of change, we keep from now on the definition for the
gradient
∇J(w) ≡ ∇∗w J(w) , (6.13)
and we obtain
∇J(w) = Rx w − a .
The minimum of J(w) is attained for
    ∇J(w) = 0   ⇒   w = R_x^{-1} a .
    = a^H a (a^H a + σ^2)^{-1} s_k = \frac{a^H a}{a^H a + σ^2} s_k .
Thus, the expected value of the output is not sk , but a scaled-down version of it.
We assume that we know the variance. In that case, we can prewhiten the data with a square-root
factor R_n^{-1/2} ; this leads to

    W = R_n^{-1} A (A^H R_n^{-1} A)^{-1} .        (6.14)
The Wiener receiver on the other hand will be the same, since Rn is not used at all in the
derivation. This can also be checked:
    \underline{W} = \underline{R}_x^{-1} \underline{A} = (R_n^{-1/2} R_x R_n^{-1/2})^{-1} R_n^{-1/2} A = R_n^{1/2} R_x^{-1} A
    ⇒   W = R_n^{-1/2} \underline{W} = R_x^{-1} A .
For a single source in spatially white noise,

    E[n_k n_k^H] = σ^2 I ,  \qquad  x_k = a s_k + n_k ,

the Zero-Forcing receiver reduces to

    w = a (a^H a)^{-1} = γ_1 a ,
where γ1 is a scalar. Since a scalar multiplication does not change the output SNR, the optimal
beamformer for s in this case is given by
    w_{MF} = a .

In colored noise (R_n ≠ σ^2 I), the corresponding receiver is

    w = R_n^{-1} a (a^H R_n^{-1} a)^{-1} = γ_2 R_n^{-1} a ,

and, dropping the scalar,

    w_{MF} = R_n^{-1} a .
The Wiener receiver for a single source in white noise is

    w = R_x^{-1} a = (a a^H + σ^2 I)^{-1} a = a (a^H a + σ^2)^{-1} ∼ a .

It is equal to a multiple of the matched filter. In colored noise, we whiten to apply the white
noise result:

    w = R_x^{-1} a = (a a^H + R_n)^{-1} a
      = R_n^{-1/2} (\underline{a}\,\underline{a}^H + I)^{-1} \underline{a}        (\underline{a} = R_n^{-1/2} a)
      = R_n^{-1/2} \underline{a} (\underline{a}^H \underline{a} + 1)^{-1}
      ∼ R_n^{-1} a .
For the reception of a single source out of interfering sources plus noise, the Zero-
Forcing receiver, Matched Filter or MRC: w = R_n^{-1} a, and the Wiener receiver:
w = R_x^{-1} a, are asymptotically all equal to a scalar multiple of each other, and hence
will asymptotically give the same performance.
It should be stressed that this equivalence is only an asymptotic result (large N ), because the
interfering sources are not regarded as deterministic, but as stochastic. In finite samples, the
corresponding receivers

    w_{ZF} = R_n^{-1} a ,  \qquad  w_{Wiener} = R̂_x^{-1} a
will be different. Note that Rn is assumed to be known, whereas R̂x is estimated from the
received data.
The above are examples of non-joint receivers: the interference is lumped together with the
noise, and there might as well be many more interferers than antennas. Improved performance
may be possible by a joint estimation of the collection of receivers for all sources.
    SNR_{in} = \frac{σ_s^2}{σ^2} .

This is the SNR at each element of the array. Suppose all entries of a(θ) have unit
norm, |a_i(θ)| = 1. With M antennas, a(θ)^H a(θ) = M . If we choose the matched filter,
or MRC, i.e., w = a(θ), then

    y(t) = w^H x(t) = a^H a s(t) + a^H n(t) = M s(t) + a^H n(t) ,

so that

    SNR_{out} = \frac{M^2 σ_s^2}{a^H σ^2 I a} = \frac{M^2 σ_s^2}{M σ^2} = M · SNR_{in} .
The factor M is the array gain.
    R_s = σ_s^2 a a^H ,  \qquad  R_n = E[n n^H] .

The output SNR of a beamformer w is

    SNR_{out}(w) = \frac{w^H R_s w}{w^H R_n w} .

We now would like to find the beamformer that maximizes SNR_{out} , i.e.,

    w = \arg\max_w \frac{w^H R_s w}{w^H R_n w} .
The expression is known as a Rayleigh quotient, and the solution is known to be given by the
solution of the eigenvalue equation
    R_n^{-1} R_s w = λ_{max} w .        (6.15)
This can be seen as follows: suppose that Rn = I, then the equation is
    \max_{w : ‖w‖ = 1}  w^H R_s w .
Let R_s = U Λ U^H be an eigenvalue decomposition and λ_1 the largest eigenvalue (the (1,1) entry of Λ);
then it is clear that the maximum of the expression is given by choosing w^H U = [1 0 · · · 0]. Thus, the optimal w is the eigenvector
corresponding to the largest eigenvalue, and satisfies the eigenvalue equation Rs w = λ1 w. If
Rn 6= I, then we can first whiten the noise to obtain the result in (6.15).
The solution of (6.15) can be found in closed form, by inserting Rs = σs2 aaH . We obtain
    R_n^{-1} R_s w = λ_{max} w
    ⇔  σ_s^2 R_n^{-1} a a^H w = λ_{max} w
    ⇔  σ_s^2 (R_n^{-1/2} a)(a^H R_n^{-1/2})(R_n^{1/2} w) = λ_{max} (R_n^{1/2} w)
    ⇔  σ_s^2 \underline{a}\,\underline{a}^H \underline{w} = λ_{max} \underline{w}        (\underline{a} = R_n^{-1/2} a , \; \underline{w} = R_n^{1/2} w)
    ⇔  \underline{w} = \underline{a} ,   λ_{max} = σ_s^2 \underline{a}^H \underline{a} ,

and it follows that

    w = R_n^{-1} a ,
which is, as promised, the matched filter in colored noise.
[Figure 6.1: Generalized sidelobe canceler — the array signals x_1 , · · · , x_M pass through the fixed beamformer w_0 and, in a parallel branch, through the blocking matrix C_n followed by the adaptive weights w_n ; the branch outputs are subtracted to form the output y.]
The solution can be found in closed form using Lagrange multipliers and is given by
    w = R_x^{-1} a (a^H R_x^{-1} a)^{-1} .

More generally, the solution of the LCMV problem with constraint C^H w = f is given by

    w = R_x^{-1} C (C^H R_x^{-1} C)^{-1} f .
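Both closed-form solutions are easy to evaluate numerically; in the sketch below the ULA steering vectors, source directions and powers are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    M, N = 8, 2000
    a = lambda th: np.exp(1j * np.pi * np.sin(th) * np.arange(M))   # half-wavelength ULA

    s  = rng.choice([-1.0, 1.0], N) + 0j                            # desired source at 10 degrees
    i1 = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # interferer at 40 deg
    X  = np.outer(a(np.deg2rad(10)), s) + 2 * np.outer(a(np.deg2rad(40)), i1) \
         + 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

    Rx = X @ X.conj().T / N
    a0 = a(np.deg2rad(10))

    # MVDR: w = Rx^{-1} a (a^H Rx^{-1} a)^{-1}
    Rinv_a = np.linalg.solve(Rx, a0)
    w_mvdr = Rinv_a / (a0.conj() @ Rinv_a)

    # LCMV: unity gain towards 10 deg and a null towards 40 deg, i.e. C^H w = f
    C = np.column_stack([a0, a(np.deg2rad(40))])
    f = np.array([1.0, 0.0])
    w_lcmv = np.linalg.solve(Rx, C) @ np.linalg.solve(C.conj().T @ np.linalg.solve(Rx, C), f)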
Generalized Sidelobe Canceler The generalized sidelobe canceler (GSC) represents an alter-
native formulation of the LCMV problem, which provides insight, is useful for analysis, and can
simplify LCMV beamformer implementation. Essentially, it is a technique to convert a con-
strained minimization problem into an unconstrained form. Suppose we decompose the weight
vector w into two orthogonal components, w0 and −v (w = w0 −v), that lie in the range and null
space of C and CH , respectively. These subspaces span the entire space so this decomposition
can be used to represent any w. Since CH v = 0, we must have
    w_0 = C (C^H C)^{-1} f        (6.16)
if w is to satisfy the constraints. (6.16) is the minimum norm solution to the under-determined
system C^H w_0 = f . The vector v is a linear combination of the columns of an M × (M − L)
matrix C_n , i.e., v = C_n w_n , provided the columns of C_n form a basis for the null space of C^H .
The matrix Cn can be obtained from C using any of several orthogonalization procedures, for
Figure 6.2. The Multiple Sidelobe Canceller: interference is estimated from a reference an-
tenna array and subtracted from the primary antenna x0 .
example the QR factorization or the SVD. The structure of the beamformer using the weight
vector w = w0 −Cn wn is depicted in Fig. 6.1. The choice for w0 and Cn implies that w satisfies
the constraints independent of wn and reduces the LCMV problem to the unconstrained problem
    \min_{w_n}  [w_0 − C_n w_n]^H R_x [w_0 − C_n w_n] .
The solution is
    w_n = (C_n^H R_x C_n)^{-1} C_n^H R_x w_0 .
The primary advantage of this implementation stems from the fact that the weights wn are
unconstrained and a data independent beamformer w0 is implemented as an integral part of the
adaptive beamformer. The unconstrained nature of the adaptive weights permits much simpler
adaptive algorithms to be employed and the data independent beamformer is useful in situations
where adaptive signal cancellation occurs.
improved. (If the signal is present at the reference antennas, then of course it will
be canceled as well!)
Call the primary sensor signal x0 and the reference signal vector x. Then the objec-
tive is
min Ekx0 − w x k2 .
H
w
w = R−1
x a a := E[xx̄0 ] .
Subspace-based prefiltering In the noise-free case with less sources than sensors, X = AS
is rank deficient: its rank is d (the number of signals) rather than m (the number of sensors).
As a consequence, once we have found a beamformer w such that wH X = s, one of the source
signals, then we can add any vector w0 such that w0H X = 0 to w, and obtain the same output.
The beamforming solutions are not unique.
The desired beamforming solutions are all in the column span of A. Indeed, any component
orthogonal to this span will not contribute at the output. The most easy way to ensure that our
solutions will be in this span is by performing a dimension-reducing prefiltering. Let F be any
M × d matrix such that span(F) = span(A). Then all beamforming matrices W in the column
span of A are given by
W = FW
where W is a d × d matrix, nonsingular if the beamformers are linearly independent. We will
use the underscore to denote prefiltered variables. Thus, the prefiltered noisy data matrix is
H
X := F X
with structure
H H
X = AS + N , where A := F A , N := F N .
X has only d channels, and is such that WH X = WH X. Thus, the columns of W are d-
dimensional beamformers on the prefiltered data X, and for any choice of W the columns of the
effective beamformer W are all in the column span of A, as desired.
subspace
estim.
To describe the column span of A, introduce the “economy-size” singular value decomposition
of A,
H
A = UA ΣA VA
where we take UA : m×d with orthonormal columns, ΣA : d×d diagonal containing the nonzero
singular values of A, and VA : d × d unitary. Also let U⊥
A be the orthonormal complement of
UA . The columns of UA are an orthonormal basis of the column span of A. The point is that
even if A is unknown, UA can be estimated from the data, as described below (and in more
detail in section 6.5).
1
We assume that the noise is spatially white, with covariance matrix σ 2 I. Let R̂x = N XX
H
be
the noisy sample data covariance matrix, with eigenvalue decomposition
R̂x = ÛΛ̂Û = ÛΣ̂2 Û .
H H
(6.17)
Here, Û is M × M unitary, and Σ̂ is M × M diagonal. Equivalently, these factors follow from
an SVD of the data matrix X directly:
√1 X = ÛΣ̂V̂H
N
We collect the d largest singular values into a d × d diagonal matrix Σ̂s , and collect the corre-
sponding d eigenvectors into Ûs . Asymptotically, Rx satisfies Rx = AAH + σ 2 I, with eigenvalue
decomposition
Rx = UA Σ2A UA + σ 2 I = UA (Σ2A + σ 2 I)UA + σ 2 U⊥ ⊥H
H H
A UA . (6.18)
Since R̂x → Rx as the number of samples N grows, we have that Ûs Σ̂2s ÛHs → UA (Σ2A +σ 2 I)UHA ,
so that Ûs is an asymptotically unbiased estimate of UA . Thus UA and also Σ and Λ can
be estimated consistently from the data, by taking sufficiently many samples. In contrast, VA
H
cannot be estimated like this: this factor is on the “inside” of the factorization AS = UA ΣA VA S
and as long as S is unknown, any unitary factor can be exchanged between VA and S.
Even if we choose F to have the column span of Ûs , there is freedom left. As we will show, a
natural choice is to combine the dimension reduction with a whitening of the data covariance
matrix, i.e., such that Rx := N1 XXH becomes unity: Rx = I. This is achieved if we define F as
F = Ûs Σ̂−1
s . (6.19)
If the noise is colored with covariance matrix σ 2 Rn , where we know Rn but perhaps not the
−1/2
noise power σ 2 , then we first whiten the noise by computing Rn X, and continue as in the
−1/2
white noise case, by computing an SVD of Rn X. The resulting prewhitening/dimension
reducing filter is then
F = Rn−1/2 ÛΣ̂−1 .
W=A
and asymptotically FFH = R−1 x PA . Since PA A = A, the result follows. For finite samples, the
dimension reduction gives a slight difference.
Direct matched filtering Another choice for F that reduces dimensions and that is often taken
if (an estimate of) A is known is by simply setting
F=A
H H H
X = A X = (A A)S + A N
This gives
H H H H H
X = UA X = (UA A)S + UA N = (ΣA VA )S + UA N .
H
The noise is white again, and A = ΣA VA . If we subsequently want to apply a Wiener receiver
in this prefiltered domain, it is given by
Conclusion
So far, we have looked at the receiver problem from a rather restricted viewpoint: the beam-
formers were based on the situation where there is a single source in noise. In the next section
we will also consider beamforming algorithms that can handle more sources. These are based
on an eigenvalue analysis of the data covariance matrix, which is introduced in this section.
Let us first consider the covariance matrix due to d sources and no noise,
H
Rx = ARs A
where Rx has size M × M , A has size M × d and Rs has size d × d. If d < M , then the rank of
Rx is d since A has only d columns. Thus, we can estimate the number of narrow-band sources
from a rank analysis. This is also seen from an eigenvalue analysis: let
H
Rx = UΛU
The remaining M − d eigenvectors from U can be collected in a matrix Un , and they are
orthogonal to Us since U = [Us Un ] is unitary. The subspace spanned by the columns of Us is
called the signal subspace, the orthogonal complement spanned by the columns of Un is known
as the noise subspace (although this is a misnomer since here there is no noise yet and later the
noise will be everywhere and not confined to the subspace). Thus, in the noise-free case,
" #" #
H Λs 0 UHs
Rx = UΛU = [Us Un ]
0 0 UHn
Rx = As Rs As + σ 2 IM .
H
In this case, Rx is full rank: its rank is always M . However, we can still detect the number of
sources by looking at the eigenvalues of Rx . Indeed, the eigenvalue decomposition is derived as
(expressed in terms of the previous decomposition (6.20) and using the fact that U = [Us Un ]
is unitary: Us UHs + Un UHn = IM )
Rx = As Rs AHs + σ 2 IM
= Us Λs UHs + σ 2 (Us UHs + Un UHn )
= Us (Λs + σ"2 Iq )UHs + Un (σ 2 IM −d )U H
# "n # (6.21)
Λs + σ 2 Iq 0 UHs
= [Us Un ]
0 σ 2 IM −d UHn
=: UΛUH
hence Rx has M − d eigenvalues equal to σ 2 , and d that are larger than σ 2 . Thus, we can detect
the number of signals d by comparing the eigenvalues of Rx to a threshold defined by σ 2 .
A physical interpretation of the eigenvalue decomposition can be as follows. The eigenvectors
give an orthogonal set of “directions” (spatial signatures) present in the covariance matrix,
sorted in decreasing order of dominance. The eigenvalues give the power of the signal coming
from the corresponding directions, or the power of the output of a beamformer matched to that
direction. Indeed, let the i’th eigenvector be ui , then this output power will be
H
ui Rui = λi .
The first eigenvector, u1 , is always pointing in the direction from which most energy is coming.
The second one, u2 , points in a direction orthogonal to u1 from which most of the remaining
energy is coming, etcetera.
If only (spatially white) noise is present but no sources, then there is no dominant direction,
and all eigenvalues are equal to the noise power. If there is a single source with power σs2 and
spatial signature a, normalized to kak2 = p, then the covariance matrix is Rx = σs2 aaH + σ 2 I.
It follows from the previous that there is only one eigenvalue larger than σ 2 . The corresponding
1
eigenvector is u1 = a kak , and is in the direction of a. The power coming from that direction is
λ1 = u1 Ru1 = M σs2 + σ 2 .
H
Since there is only one source, the power coming from any other direction orthogonal to u1 is
1
σ 2 , the noise power. Since u1 = a kak ,
aH Rx a uH1 Rx u1
= = λ1 .
aH a uH1 u1
singular value
6 6 6
4 4 4
gap
2 2 2
0 0 0
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
index index index
Thus, the result of using the largest eigenvector as a beamformer is the same as the output
power of a matched filter where the a-vector of the source is known.
With more than one source, this generalizes. Suppose there are two sources with powers σ1 and
σ2 , and spatial signatures a1 and a2 . If the spatial signatures are orthogonal, aH1 a2 = 0, then u1
will be in the direction of the strongest source, number 1 say, and λ1 will be the corresponding
power, λ1 = M σ12 + σ 2 . Similarly, λ2 = M σ22 + σ 2 .
In general, the spatial signatures are not orthogonal to each other. In that case, u1 will point
into the direction that is common to both a1 and a2 , and u2 will point in the remaining direction
orthogonal to u1 . The power λ1 coming from direction u1 will be larger than before because it
combines power from both sources, whereas λ2 will be smaller.
Example 6.4. Instead of the eigenvalue decomposition of R̂x , we may also compute the
singular value decomposition of X:
H
X = UΣV
it is seen that U contains the eigenvectors of R̂x , whereas N1 Σ2 = Λ are the eigen-
values. Thus, the two decompositions give the same information (numerically, it is
often better to compute the SVD).
Figure 6.4 shows singular values of A for d = 2 sources, a uniform linear array with
M = 5 antennas, and N = 10 samples, for
1. well separated angles: large gap between signal and noise singular values,
2. signals from close directions, resulting in a small signal singular value,
3. increased noise level, increasing noise singular values.
Eigenvalues of covariance matrix (900 MHz) vs. time with CW and GSM
1.8
1.6
1.4
1.2
eigenvalue
0.8
0.6
0.4
0.2
0
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35
time [sec]
Example 6.5. The covariance matrix eigenvalue structure can be nicely illustrated on
data collected at the Westerbork telescope array. We selected a narrow band slice
(52 kHz) of a GSM uplink data file, around 900 MHz. In this subband we have two
sources: a continuous narrow band (sine wave) signal which leaked in from a local
oscillator, and a weak GSM signal. From this data we computed a sequence of short
term data covariance matrices R̂x0.5ms based on 0.5 ms averages. Figure 6.5 shows
the time evolution of the eigenvalues of these matrices. The largest eigenvalue is due
to the CW signal and is always present. The GSM source is intermittent: at time
intervals where it is present the number of large eigenvalues increases to two. The
remaining eigenvalues are at the noise floor, σ 2 . The small step in the noise floor
after 0.2 s is due to a periodically switched calibration noise source at the input of
the telescope front ends.
In the previous sections, we have assumed that the source matrix S or the array matrix A is
known. We can now generalize the situation and only assume that the array response is known
as a function of the direction parameter θ. Then the directions of arrival (DOA’s) of the signals
are estimated and used to generate the beamformer weights. The beamformers are in fact the
same as we derived in the previous section, except that we specify them in terms of a(θ) and
subsequently scan θ to find directions where there is “maximal response” (e.g., in the sense of
maximal output SNR).
This yields
R̂−1
x a(θ)
w=
a(θ) R̂−1
H
x a(θ)
and thus the direction estimate is
1
θ̂ = min .
θ a(θ) R̂−1
H
x a(θ)
To make a spectral graph as in the classical beamformer, the expression is inverted to obtain a
search for maxima,
θ̂ = max a(θ) R̂−1
H
x a(θ) .
θ
MUSIC
MVDR
20 Classical BF
15
10
Power [dB]
−5
−10
40 50 60 70 80 90 100 110 120 130
Angle [deg]
Figure 6.6. Spatial spectra corresponding to the classical beamformer, MVDR, and MUSIC.
The DOA’s are estimated as the maxima of the spectra.
For multiple signals choose again the d largest local maxima. The MVDR is also illustrated in
Fig. 6.6.
MVDR beamformer:
1
w(p) = µR−1 a(p) , µ= .
a(p)H R−2 a(p)
This beamformer is known as the “Adapted Angular Response” (AAR) [9]. The resulting image
is
a(p)H R−1 a(p)
IAAR (p) = w(p)H Rw(p) = .
[a(p)H R−2 a(p)]2
It has a high resolution and suppresses sidelobe interference under the white noise constraint. It
was proposed for use in radio astronomy image formation in [10], the resulting image was called
LS-MVI.
Rx = As Rs AHs + σ 2 IM
= Us (Λs + σ 2 Iq )UHs + Un (σ 2 IM −d )UHn
As discussed before, the eigenvalues give information on the number of sources (by counting how
many eigenvalues are larger than σ 2 ). However, the decomposition shows more than just the
number of sources. Indeed, the columns of Us span the same subspace as the columns of A. This
is clear in the noise-free case (6.20), but the decomposition (6.21) shows that the eigenvectors
contained in Us and Un respectively are the same as in the noise-free case. Thus,
H
span(Us ) = span(A) , Un A = 0 . (6.22)
Given a correlation matrix R̂x estimated from the data, we compute its eigenvalue decomposi-
tion. From this we can detect the rank d from the number of eigenvalues larger than σ 2 , and
we can estimate Us and hence the subspace spanned by the columns of A. Although we cannot
directly identify each individual column of A, its subspace estimate can nonetheless be used to
determine the directions, since we know that
A = [a(θ1 ) , · · · , a(θd )]
If a(θ) is known as a function of θ, then we can select the unknown parameters [θ1 , · · · , θd ] to
make the estimate of A fit the subspace Us . Several algorithms are based on this idea. Below we
discuss an effective algorithm that is widely used, the MUSIC (Multiple SIgnal Classification)
algorithm.
Note that it is crucial that the noise is spatially white. For colored noise, an extension (whitening)
is possible but we have to know the coloring.
Assume that d < M . Since col(Us ) = col{a(θ1 ), · · · , a(θd )}, we have
H
Un a(θi ) = 0 , (1 ≤ i ≤ d) (6.23)
The MUSIC algorithm estimates the directions of arrival by choosing the d lowest local minima
of the cost function
kÛHn a(θ)k2 a(θ)H Ûn ÛHn a(θ)
JM U SIC (θ) = = (6.24)
ka(θ)k2 a(θ)H a(θ)
where ÛHn is the sample estimate of the noise subspace, obtained from an eigenvalue decompo-
sition of R̂x . To obtain a ‘spectral-like’ graph as before (it is called a pseudo-spectrum), we
plot the inverse of JM U SIC (θ). See Fig. 6.6. Note that this eigenvalue technique gives a higher
resolution than the original classical spectrum, also because its sidelobes are much more flat.
Note, very importantly, that as long as the number of sources is smaller than the number of
sensors (d < M ), the eigenvalue decomposition of the true Rx allows to estimate exactly the
DOAs. This means that if the number of samples N is large enough, we can obtain estimates
with arbitrary precision. Thus, in contrast to the beamforming techniques, the MUSIC algorithm
provides statistically consistent estimates.
An important limitation is still the failure to resolve closely spaced signals in small samples
and at low SNR scenarios. This loss of resolution is more pronounced for highly correlated
signals. In the limiting case of coherent signals, the property (6.23) is violated because the rank
of Rx becomes smaller than the number of sources (the dimension of Un is too large), and the
method fails to yield consistent estimates. To remedy this problem, techniques such as “spatial
smoothing” as well as extensions of the MUSIC algorithm have been derived.
In the previous sections, we have looked at matched filtering in the context of array signal
processing. Let us now look at how this applies to temporal filtering.
No intersymbol interference We start with a fairly simple case, namely the reception of a
symbol sequence s(t) convolved with a pulse shape function g(t):
gs1 gs2
x1 x2
function has a duration of less than T , so that g(t) has support only on the interval [0, 1i. We
sample x(t) at a rate P , where P is the (integer) oversampling rate. The samples of x(t) are
stacked in vectors
x(k)
x(k + 1 )
P
xk = ..
.
P −1
x(k + P )
See also Fig. 6.7. If we are sufficiently synchronized, this means that
x(k) g(0)
x(k + 1 g( 1 )
P ) P
xk = gsk ⇔ .
.
= .
.
sk (6.25)
. .
P −1
x(k + P ) g( PP−1 )
or
X = gs , X = [x1 x2 ··· xN ] , s = [s1 s2 ··· sN ] .
The matched filter in this context is simply gH . It has a standard interpretation as a convolution
P −1
or integrate-and-dump filter. Indeed, yk = gH xk = Pi=0 g(i)x(k + Pi ). This can be viewed as a
convolution by the reverse filter gr (t) := g(T − t):
P
X
gr ( Pi )x(k + 1 − i
H
yk = g xk = P)
i=1
With intersymbol interference In practise, pulse shape functions are often a bit larger than
1 symbol period. Also, we might not be able to achieve perfect synchronization. Thus let us
define a shift of g over some delay τ , and assume for simplicity that the result has support on
g1 s1 g2 s1
x1 x2
[0, 2T i (although with pulse shapes longer than a symbol period, it would in fact be more correct
to have a support of [0, 3T i):
g(0 − τ )
1
g( − τ )
gτ := . P
..
g(2 − P1 − τ )
Now, gτ is spread over two symbol periods, and we can define
" #
g
gτ = 1
g2
After convolution of g(t − τ ) by the symbol sequence s(t), sampling at rate P , and stacking, we
obtain that the resuling sample vectors xk are the sum of two symbol sequences (see Fig. 6.8):
x(k) g(0 − τ ) g(1 − τ )
1 1 1
x(k + P) g( − τ ) g( − τ )
xk = g1 sk + g2 sk−1 ⇔ . = .P sk + . P sk−1
.. .. ..
P −1 1 1
x(k + P ) g(1 − P − τ ) g(2 − P − τ )
or in matrix form
" #
s1 s2 · · · sN
X = Gτ S ⇔ [x1 x2 ··· xN ] = [g1 g2 ]
s0 s1 · · · sN −1
In this case, there is intersymbol interference: a sample vector xk contains the contributions of
more than a single symbol.
A matched filter in this context would be GHτ , at least if Gτ is tall: P ≥ 2. In the current
situation (impulse response length including fractional delay shorter than 2 symbols) this is the
case as soon as we do any amount of oversampling. After matched filtering, the output yk has
two entries, each containing a mixture of the symbol sequence and one shift of this sequence.
The mixture is given by " #
H g1H g1 g1H g2
Gτ Gτ =
g2H g1 g2H g2
Thus, if g1 is not orthogonal to g2 , the two sequences will be mixed and further equalization
(‘beamformer’ on yk ) will be necessary. The matched filter in this case only serves to make the
output more compact (2 entries) in case P is large.
More in general, we can stack the sample vectors to obtain
" # " # s2 s3 · · · sN
x1 x2 · · · xN −1 0 g1 g2
X = Gτ S ⇔ = s1 s2 · · · sN −1
x2 x3 · · · xN g1 g2 0
s0 s1 · · · sN −2
Gτ is tall if 2P ≥ 3. It is clear that for any amount of oversampling (P > 1) this is satisfied.
We can imagine several forms of filtering based on this model.
The second term is regarded as part of the noise. As such, it has a covariance matrix
" #
g2 g2H 0
Rn =
0 g1 g1H
If g1 is not orthogonal to g2 , then the noise due to ISI is not zero. Since these vectors are
dependent on τ , this will generally be the case. With temporally white noise added to the
samples, there will also be a contribution σ 2 (g1H g1 + g2H g2 ) to the output noise variance.4
4
In actuality, the noise will not be white but shaped by the receiver filter.
3. Zero-forcing filtering and selection of one output. This solution can be regarded as the
matched filter of item 1, followed by a de-mixing step (multiplication by (GτH Gτ )−1 ), and
selection of one of the outputs. The resulting filter is
0
H −1 H †
w = Gτ (Gτ Gτ ) 1 = (Gτ Gτ ) gτ
0
and the output will be [s1 s2 · · ·. Note that in principle we could select also one of the
other outputs, this would give only a shift in the output sequence (starting with [s0 s1 · · ·]
or [s2 s3 · · ·]). With noise, however, reconstructing the center sequence is likely to give
the best performance since it carries the most energy.
4. Wiener filtering. This is
w = R̂−1
X gτ .
Under noise-free conditions, this is asympotically equal to w = (GG H )† gτ , i.e., the zero-
forcing filter. In the presence of noise, however, it is more simply implemented by direct
inversion of the data covariance matrix. Among the linear filtering schemes considered
here, the Wiener filter is probably the preferred filter since it maximizes the output SINR.
As we have seen before, the Wiener filter is asymptotically also equal to a scaling of R−1
n gτ ,
i.e., the result of item 2, taking the correlated ISI-noise into account. (This equivalence
can however only be shown if there is some amount of additive noise as well, or else Rn
and RX are not invertible.)
Delay estimation In general, the delay τ by which the data is received is unknown and has
to be estimated from the data as well. This is a question very related to that of the DOA
estimation considered in the previous section. Indeed, in an ISI-free model xk = gτ sk , the
problem is similar to xk = a(θ)sk , but for a different functional. The traditional technique in
communications is to use the “classical beamformer”: scan the matched filter over a range of τ ,
and take that τ that gives the peak response. As we have seen in the previous sections, this is
optimal if there is only a single component in noise, i.e., no ISI. With ISI, the technique relies
on a sufficient orthogonality of the columns of Gτ . This is however not guaranteed, and the
resolution may be poor.
We may however also use the MUSIC algorithm. This is implemented here as follows: compute
the SVD of X , or the eigenvalue decomposition of RX . In either case, we obtain a basis Us for
the column span of X . In noise-free conditions or asymptotically for a large number of samples,
we know that the rank of X is 3, so that Us has 3 columns, and that
" #
0 g 1 g2
span{Us } = span{Gτ } = span{ }
g1 g2 0
Thus, gτ is in the span of Us . Therefore,
gτ ⊥ Un = (Us )⊥
40
35
30
[dB]
25
20
15
10
0
0 0.2 0.4 0.6 0.8 1
delay [T]
Figure 6.9. Delay estimation: spectra corresponding to the matched filter and MUSIC. The
true delay is 0.2T .
Bibliography
[1] D.H. Johnson and D.E. Dudgeon, Array Signal Processing: Concepts and Techniques.
Prentice-Hall, 1993.
[2] S.M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall,
1993.
[3] S. Haykin, Adaptive Filter Theory. Englewood Cliffs (NJ): Prentice-Hall, 1992.
[4] R. A. Monzingo and T. W. Miller, Introduction to Adaptive Arrays. New-York: Wiley-
Interscience, 1980.
[5] H. Krim and M. Viberg, “Two decades of array signal processing research: The parametric
approach,” IEEE Signal Processing Magazine, vol. 13, pp. 67–94, July 1996.
[6] B.D. van Veen and K.M. Buckley, “Beamforming: A versatile approach to spatial filtering,”
IEEE ASSP Magazine, vol. 5, pp. 4–24, Apr. 1988.
[7] L.L. Scharf, Statistical Signal Processing. Reading, MA: Addison-Wesley, 1991.
[8] D.H. Brandwood, “A complex gradient operator and its application in adaptive array the-
ory,” IEE Proc., parts F and H, vol. 130, pp. 11–16, Feb. 1983.
[10] C. Ben-David and A. Leshem, “Parametric high resolution techniques for radio astronomical
imaging,” IEEE J. Sel. Topics in Signal Processing, vol. 2, pp. 670–684, Oct. 2008.
Contents
7.1 Maximum Likelihood formulation to direction finding . . . . . . . . 145
7.2 Covariance Matching; Weighted Subspace Fitting . . . . . . . . . . . 145
7.3 Gauss-Newton Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.4 Application to Radio Astronomy imaging . . . . . . . . . . . . . . . . 145
Contents
8.1 Prelude: Shift-invariance . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Direction estimation using the ESPRIT algorithm . . . . . . . . . . 148
8.3 Delay estimation using ESPRIT . . . . . . . . . . . . . . . . . . . . . 157
8.4 Frequency estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.5 System identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.6 Real processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
In Chapter 6, we have looked at the MVDR and MUSIC algorithms for direction finding. It
was seen that MUSIC provides high-resolution estimates for the directions-of-arrival (DOAs).
However, these algorithms need a search over the parameter α, and extensive calibration data
(i.e., the function a(α) for a finely sampled range of α). In this present chapter, we look at
the ESPRIT algorithm for direction estimation. This algorithm does not require a search or
calibration data, but assumes a special array configuration that allows to solve for the DOAs
algebraically, by solving an eigenvalue problem. The same algorithm applies to delay estimation
and to frequency estimation.
In this chapter, we will be involved in estimating the parameter θ of vectors with the following
polynomial or “Vandermonde” structure
1
θ
θ2
a(θ) = , |θ| = 1 .
..
.
θN
The phase of θ provides either the angle-of-arrival of signals, its frequency, or its relative delay,
depending on the interpretation of a phase shift in the application. The structure of a(θ) is
rich, and there are several ways to estimate θ from it after it has been perturbed by noise. A
simple (but statistically suboptimal) method is to look for the ratios of entries of a with their
neighbor: each ratio ai+1
ai is equal to θ. To obtain a good estimate in the presence of noise, we
would rather take the average over these ratios:
N
1 X ai+1
θ̂ = . (8.1)
N i=1 ai
This usually provides a very reasonable estimate of θ. The property which has been used here
is that of shift-invariance of the vector: if we shift a(θ) over one position, we obtain the same
vector, but multiplied by θ. Indeed, define the subvectors
1 θ
2
θ
θ
x= .. , y=
.. .
. .
θN −1 θN
θ = (x x)−1 x y = x† y .
H H
y = xθ ⇒ (8.2)
If the entries of x are on the unit circle, then (xH x)−1 = N1 and (xH )i = a1i , and the two
estimates of θ are the same.1 The “algorithm” to compute θ in (8.2) is readily extended to
superpositions of multiple vectors a(θi ) of the same form, and this is the principle underlying
many subspace-based algorithms for harmonic retrieval, direction finding, and rational system
identification. The prototype algorithm for this is the ESPRIT algorithm, which was originally
proposed for direction finding.
As in previous chapters, we assume that all signals are narrowband with respect to the propaga-
tion delay across the array, so that this delay translates to a phase shift. We consider a simple
propagation scenario, in which there is no multipath and sources have only one ray towards the
receiving antenna array. Since no delays are involved, all measurements are simply instanta-
neous linear combinations of the source signals. Each source has only one ray, so that the data
model is
X = AS .
1
Otherwise, the estimates are slightly different: the ratios in (8.1) should be weighted by |ai |2 to obtain the
same result. This deemphasizes ratios with a poor SNR.
s(t)
x3 ∆ y3
x1 ∆ y1
x2 ∆ y2
A = [a(α1 ), · · · , a(αd )] contains the array response vectors. The rows of S contain the signals,
multiplied by the fading parameters (amplitude scalings and phase rotations).
Computationally attractive ways to compute {αi } and hence A are possible for certain regu-
lar antenna array configurations for which a(α) becomes a shift-invariant or similar recursive
structure. This is the basis for the ESPRIT algorithm (Roy, Kailath and Paulraj 1987 [1]).
The constraint on the array geometry imposed by ESPRIT is that of sensor doublets: the array
consists of two subarrays, denoted by
x1 (t) y1 (t)
.. ..
x(t) = . , y(t) = .
xM (t) yM (t)
where each sensor yi has an identical response as xi , and is spaced at a constant displacement
vector ∆ (wavelengths) from xi . It is important that the displacement vector is the same for all
sensor pairs (both in length and in direction). The antenna response ai (α) for the pair (xi , yi )
is arbitrary and may be different for other pairs.
d
X
xi (t) = ai (αk )sk (t)
k=1
Xd d
X
yi (t) = ai (αk )ej2π∆ sin(αk ) sk (t) = ai (αk )θk sk (t)
k=1 k=1
where θk = ej2π∆ sin(αk ) is the phase rotation due to the propagation of the signal from the
x-antenna to the corresponding y-antenna.
In terms of the vectors x and y, we have
d
X
x(t) = a(αk )sk (t)
k=1 (8.3)
Xd
y(t) = a(αk )θk sk (t)
k=1
The ESPRIT algorithm does not assume any structure on a(α). It will instead use the phase
relation between x(t) and y(t).
If we collect N samples in matrices X and Y, we obtain the data model
X = AS
(8.4)
Y = AΘS
where
θ1
One special case in which the shift-invariant structure occurs is that of a uniform linear array
(ULA) with M + 1 antennas. For such an array, with interelement spacing ∆ wavelengths, we
have seen that
1
θ
j2π∆ sin(α)
a(θ) = .. , θ = e
. (8.5)
.
θM
If we now split the array into two overlapping subarrays, the first (x) containing antennas 1 to
which gives precisely the model (8.3), where a in (8.3) is one entry shorter than in (8.5).
8.2.3 Algorithm
Given the data X and Y, we first stack all data in a single matrix Z of size 2M × N with model
" # " #
X A
Z= = Az S , Az = .
Y AΘ
(In the case of a ULA with M + 1 antennas, we stack the available antenna outputs vertically
but do not duplicate the antennas; Z will then have size M + 1 × N ). Since Z has rank d, we
compute an (economy-size) SVD
H
Z = Ûz Σ̂z V̂z , (8.6)
where Ûz : 2M × d has d columns which together span the column space of Z. The same space
is spanned by the columns of Az , so that there must exist a d × d invertible matrix T that maps
one basis into the other, i.e., such that
" #
AT
Ûz = Az T = (8.7)
AΘT
For M ≥ d, Ûx is “tall”, and if we assume that A has full column rank, then Ûx has a left-inverse
so that
Û†x Ûy = T−1 ΘT .
The matrix on the left hand side is known from the data. Since Θ is a diagonal matrix, the
matrix product on the right hand side is recognized as an eigenvalue equation: T−1 contains
the eigenvectors of Û†x Ûy (scaled arbitrarily to unit norm), and the entries of Θ on the diagonal
are the eigenvalues. Hence we can simply compute the eigenvalue decomposition of Û†x Ûy , take
the eigenvalues {θi } (they should be on the unit circle), and compute the DOAs αi from each
of them. This comprises the ESPRIT algorithm.
Note that the SVD of Z in (8.6) along with the definition of T in (8.7) as Ûz = Az T implies
that
Z = Ûz Σ̂z V̂zH , Z = Az S = Az TT−1 S
⇒ T−1 S = Σ̂z V̂zH = ÛHz Z
⇒ S = TÛHz Z
Hence, after having obtained T from the eigenvectors, a zero-forcing beamformer W on Z such
that S = WH Z is given by
H
W = Ûz T .
Thus, source separation is straightforward in this case and essentially reduced to an SVD and
an eigenvalue problem.
If the two subarrays are spaced by at most half a wavelength, then the DOAs are directly
recovered from the diagonal entries of Θ, otherwise they are ambiguous (two different values of
α give the same θ). Such an ambiguity does not prevent the construction of the beamformer W
from T, and source separation is possible nonetheless. Because the rows of T are determined
only up to a scaling, the correct scaling of the rows of S cannot be recovered unless we know the
average power of each signal or the array manifold A. This is of course inherent in the problem
definition.
With noise, essentially the same algorithm is used. If we assume that the number of sources d
is known, then we compute the SVD of the noisy Z, and set ÛZ equal to the principal d left
singular vectors. This is the best estimate of the subspace spanned by the columns of A, and
asymptotically (infinite samples) identical to it. Thus, for infinitely many samples we obtain the
correct directions: the algorithm is asymptotically unbiased (consistent). For finite samples, an
estimated eigenvalue θ̂ will not be on the unit circle, but we can easily map it to the unit circle
by dividing by |θ̂|.
Compared to the beamforming algorithms in Chap. 6, which locate sources due to their peaks
in a spatial power spectrum, the ESPRIT algorithm finds the exact directions of arrival under
noise-free conditions, or asymptotically as the number of samples grows large. This is due to the
parametric assumptions: an exact number of point sources, less than the number of antennas.
Under these conditions, the sources may be arbitrarily close. On the other hand, the algorithm
will fail if some of the sources are diffuse: such sources must be modeled as part of the noise. If
the noise is not white, the noise covariance must be estimated and whitened.
θ̄ = θ−1
θ̄M
1 1 1
1
M −1
θ 1 θ̄ θ̄ θ
−M
= a(θ)θ−M .
a(θ) = ⇒ Πā(θ) =: ..
= . = θ
..
.
.. . ..
. . . .
θM 1 θ̄M 1 θM
Ze = [Z , ΠX̄]
then this will double the number of observations but will not increase the rank, since
Ze = Az [S , Θ−1 S] .
Using this structure, it is also possible to transform Ze to a real-valued matrix, by simple linear
operations on its rows and columns [2, 3]. As we saw in chapter 6, there are many other direction
finding algorithms that are applicable. For the case of a ULA in fact a better algorithm is known
to be MODE [4]. Although ESPRIT is statistically suboptimal, its performance is usually
quite adequate. Its interest lies also in its straightforward generalization to more complicated
estimation problems in which shift-invariance structure is present.
8.2.6 Performance
Figure 8.2 shows the results of a simulation with 2 sources with directions −10◦ , 10◦ , a ULA( λ2 )
with 6 antennas, and N = 40 samples. The first graph shows the mean value, the second the
3.5
5
std(θ) [degrees]
3
0 2.5
2
−5
1.5
1
−10
0.5
−15 0
−10 −5 0 5 10 15 20 −10 −5 0 5 10 15 20
SNR [dB] SNR [dB]
Figure 8.2. Mean and standard deviations of ESPRIT and MUSIC estimates as function of
SNR
Mean value of estimated directions Standard deviation of direction estimates
10 5
ESPRIT ESPRIT
8 MUSIC 4.5 SNR = 10 dB MUSIC
M=6
6 4 N=40
estimated directions [degrees]
4 3.5
std(θ) [degrees]
2 3
0 2.5
−2 2
−4 1.5
−6 1
−8 0.5
−10 0
0 5 10 15 20 0 5 10 15 20
DOA separation [degrees] DOA separation [degrees]
Figure 8.3. Mean and standard deviations of ESPRIT and MUSIC estimates as function of
DOA separation
standard deviation (averaged over the two sources), which indicates the accuracy of an individual
estimate. For sufficient SNR, the performance of both algorithms is approximately the same.
Figure 8.3 shows the same for varying separation of the two sources, with an SNR of 10 dB.
For small separation, the performance of ESPRIT drops because the matrix A drops in rank:
it appears to have only 1 independent column rather than 2. If we select two singular vectors,
then this subspace will not be shift-invariant, and the algorithm produces bad estimates: both
the mean value and the standard deviation explode. MUSIC, on the other hand, selects the null
space and scans for vectors orthogonal to it. If we ask for 2 vectors, it will in this case produce
two times the same vector since there is only a single maximum in the MUSIC spectrum. It
is seen that the estimates become biased towards a direction centered between the two sources
(= 0◦ ), but that the standard deviation gets smaller since the algorithm consistenly picks this
center.
The performance of both ESPRIT and MUSIC is noise limited: without noise, the correct DOAs
are obtained. With noise and asymptotically many samples, N → ∞, the correct DOAs are
obtained as well, since the subspace spanned by Ûz is asymptotically identical to that obtained
in the noise-free case, the span of the columns of A.
In the above, we assumed that there was no multipath: each source had only one path to the
antenna array. However, the X = AS model is also valid if sources have multiple rays towards
the array, as long as the delay differences are small compared to the signal bandwidth, so that
they can be represented by phase shifts. This is known as coherent multipath (see also Sec.
4.3.1).
Pd
Let d be the number of sources, ri the number of rays belonging to source i, and r = 1 ri the
total number of rays (assumed to be distinct). In that case, a more detailed model is
where Aθ : M × r is the Vandermonde matrix associated to the DOAs of the rays, and J : r × d
is a selection matrix which adds groups of rays to source signals, e.g.,
1 0
1 0
J=
0 1
0 1
in case of two sources, each with two rays. B is a diagonal scaling matrix representing the
different amplitudes (fadings) of each ray, including phase offsets. Because the rank of X is still
d, the SVD of X can retrieve only a d-dimensional subspace Û, so that
H
Û = (Aθ BJ)T , S = TÛ X .
It is clear that blind beamforming is more challenging now: we try to find T such that each
column of Û is represented by a sum of r Vandermonde vectors, rather than only d vectors, and
r is not known.
To solve this problem algebraically using ESPRIT-type techniques,2 we first try to restore the
rank to r. This is possible if the number of antennas M is sufficiently large, in fact M ≥
r + max(ri ). In that case, we can form a block-Hankel matrix out of Û by taking vertical shifts
of it:
Um := [Û(1) Û(2) · · · Û(m) ] : (M − m + 1) × md . (8.9)
Here, Û(i) is a submatrix of Û consisting of its i-th till M − m + i-th row, and m is known as the
spatial smoothing factor [5, 6]. With the above model, we have that Um satisfies the factorization
Ûu = A0θ BR ,
H
T = (RÛu )Um .
which gives both Θ and R, up to scaling of its rows. At this point, we have recovered T =
(RÛHu )Um , up to multiplication at the left by an arbitrary diagonal matrix. The next objective
is to estimate T from the structure of T in (8.10). This is now a much simpler task: we have
available m matrices of size r × d, after correction by suitable powers of Θ−1 all equal to JT.
The structure of J ensures that this matrix has only d distinct rows, which are the d rows of T.
Hence, it suffices to estimate these d unique rows, which is a simple clustering problem if the
rows of T are sufficiently different. This determines both T and J, i.e., the assignment of rays
to sources. With T in hand, we have our blind beamformer as before: WH = TÛH .
2
Other techniques such as MODE are directly applicable to the coherent case without modifications.
A channel matrix H can be estimated from training sequences, or sometimes “blindly” (without
training). Very often, we do not need to know the details of H if our only purpose is to recover
the signal matrix S. But there are several situations as well where it is interesting to pose a
multipath propagation model, and try to resolve the individual propagation paths. This would
give information on the available delay and angle spread, for the purpose of diversity. It is often
assumed that the directions and delays of the paths do not change quickly, only their powers
(fading parameters), so that it makes sense to estimate these parameters. If the channel is
well-characterized by this parametrized model, then fitting the channel estimate to this model
will lead to a more accurate receiver. Another application is wireless localization.
8.3.1 Principle
Let us consider first the simple case already introduced in Sec. 6.7. Assume we have a vector g0
corresponding to N samples of an FIR pulse shape function g(t), sampled with period T above
the Nyquist rate,
g(0)
g(T )
g(t) ↔ g0 = .. .
.
g((N − 1)T )
Similarly, we can consider a delayed version of g(t):
g(0 − τ )
g(T − τ )
g(t − τ ) ↔ gτ = .. .
.
g((N − 1)T − −τ )
The number of samples N is chosen such that at the maximal possible delay, g(t−τ ) has support
only on the interval [0, N T i symbols.
Given gτ and knowing g0 , how do we estimate τ ? Note here that τ does not have to be a multiple
of T , so that gτ is not exactly a shift of the samples in g0 . A simple “pattern matching” with
entry-wise shifts of g0 will not give an exact result.
We can however make use of the fact that a Fourier transformation maps a delay to a certain
phase progression. Let
N −1
2π
e−jωi k g(kT ) ,
X
g̃(ωi ) = ωi = i , i = 0, 1, · · · , N − 1 .
k=0
N
g̃0 = F g0 , g̃τ = F gτ
If τ is an integer multiple of T , then it is straightforward to see that the Fourier transform g̃τ
of the sampled version of g(t − τ ) is given by
1 1
φτ /T φτ /T
(φτ /T )2 = diag(g̃0 ) · (φτ /T )2
g̃τ = g̃0 (8.13)
..
..
. .
(φτ /T )N −1 (φτ /T )N −1
where represents entrywise multiplication of the two vectors. The same holds true for any τ
if g(t) is bandlimited and sampled at or above the Nyquist rate.
Thus, we will assume that g(t) is bandlimited and sampled at such a rate that (8.13) is valid even
if τ is not an integer multiple of T . The next step is to do a deconvolution of g(t) in frequency
domain, by entrywise dividing g̃τ by g̃0 . Obviously, this can be done only on intervals where g̃0
is nonzero. Pulse shapes are bandlimited, and if we sample above Nyquist, some entries of g̃0
will be close to zero. If necessary, a selection matrix has to be applied to select only the nonzero
interval.
Next, we factor diag(g̃0 ) out of g̃τ and obtain
Note that f (φ) has the same structure as a(θ) for a ULA. Hence, we can apply the ESPRIT
algorithm in the same way as before to estimate φ from z, and subsequently τ . In the present
case, we simply split z into two subvectors x and y, one a shift of the other, and from the model
y = xφ we can obtain φ = x† y, from which we can compute τ .
LP samples
LP samples LWmax
Lg P L
1 1
0.8 P=2
L=9 0.8
0.6 Lg=6
rolloff=0.3 0.6
0.4
0.2 0.4
0
0.2
−0.2
0
0 2 4 6 8 −P/2 −0.5 0 0.5 P/2
(a) time [T] (b) frequency
Figure 8.4. Definition of parameters: (a) time domain, (b) frequency domain.
Oversampled pulse shapes There are some details that we skipped in the preceding discussion.
First of all, we assumed g(t) has a representation as an FIR filter. Because of the truncation
to length N , the spectrum of g(t) widens and sampling at a rate 1/T introduces some aliasing
due to spectral folding. This will eventually lead to a small bias in the delay estimate. To avoid
this, we can oversample the channel.
To give a specific example, assume that g(t) is a raised cosine pulse, as in Fig. 8.4. For conve-
nience of notation we normalize the time axis and set T = 1/P , where P is the oversampling
factor (in the figure, P = 2), so that in the DFT frequency domain we have N samples within
the fundamental interval −P/2 < F < P/2. Clearly, g(t) is bandlimited, and only L = N/P
frequency domain samples are significant. In the deconvolution step, we cannot divide out g̃0 ,
because we will be dividing by small numbers.
Let Jg̃ : L × N be a selection matrix for g̃, such that Jg̃ g̃ has the desired entries. For later use,
we require that the selected frequencies appear in increasing order, which with the definition of
the DFT in (8.12) means that the final dL/2e samples of g̃0 should be moved up front: Jg̃ has
the form
" #
0 0 IdL/2e
Jg̃ = : L×N.
IbL/2c 0 0
Next, we can factor diag(Jg̃ g̃0 ) out of Jg̃ g̃τ and obtain
(τi , βi )
x(t)
sk g(t)
Since there are now multiple components in F and only a single vector z, we cannot simply
estimate the parameters from this single vector by splitting it in x and y: this would allow only
to estimate a model with a single component. However, we can use the shift-invariance of the
vectors f (·) to construct a matrix out of z as
Z = [z(0) , z(1) , · · · , z(m−1) ] , (N − m + 1 × m) (8.18)
where
zi+1
(i)
zi+2
z := ..
.
zN −m+i
is a subvector of z containing the i + 1-st till the N − m + i-th entry. If we define f (φ)(i) similarly,
then
φi
1
i+1
φ φ i
f (φ)(i) = i+2 = 2 φ =: f 0 (φ)φi .
φ φ
.. ..
. .
Thus, Z has the model
Z = F0 B , F0 = [f 0 (φ1 ) , · · · , f 0 (φr )]
φ1
Thus, there is a limit on the number of rays that can be estimated: not more than half the
number of samples in frequency domain. If this condition cannot be satisfied, we need to use
multiple antennas. This is discussed in Sec. 9.3.
The ESPRIT algorithm can also be used to estimate frequencies. Consider a signal x(t) which
is the sum of d harmonic components,
d
X
x(t) = βi ejωi t (8.19)
i=1
Suppose that we uniformly sample this signal with period T (satisfying the Nyquist criterion,
here −π ≤ ωi T < π), and have available x(T ), x(2T ), · · · , x(N T ). We can then collect the
samples in a data matrix Z with m rows,
x1 x2 x3 · · ·
x2 x3 x4 · · ·
Z= . . . , xk = x(kT ) .
.. .. ..
xm xm+1 · · · xN
From (8.19), we see that this matrix satisfies the model
1 ··· 1
φ1 ··· φd β φ β1 φ21 · · ·
1 1
. ..
φ21 ··· φ2d
Z = AS := .
. .
.. ..
. . βd φd βd φ2d · · ·
φm−1
1 ··· φm−1
d
where φi = ejωi T . Since the model is the same as before, we can estimate the phase factors {φi }
as before using ESPRIT, and from these the frequencies {ωi } follow uniquely, since the Nyquist
condition was assumed to hold.
The parameter m has to be chosen larger than d. A larger m will give more accurate estimates,
however if N is fixed then the number of columns of Z (= N − m + 1) will get smaller and there
is a tradeoff. For a single sinusoid in noise, one can show that the most accurate estimate is
obtained by making Z rectangular with 2 times more columns than rows, m = N3 .
Linear time-invariant (LTI) systems can be represented using state-space models. This is in
particular convenient in the case of systems with multiple inputs and multiple outputs (MIMO).
The time-invariance gives rise to a shift invariance property, which allows to identify the state-
space matrices.
x−1
u−1 y−1
z z
x0
u0 y0
z z
x1
u1 y1 xk
z z
x2 A C
uk yk
D
u2 y2 B
z z xk+1
(a) (b)
Figure 8.6. LTI state space model. (a) Mapping of an input sequence {ui } to an output
sequence {yi } using an intermediate state sequence {xi }. The state dimension is
d = 2. Due to causality, the signal flow is from top to bottom. The delay operator
z −1 denotes a time shift here. (b) The operation at a particular time instant k
is a linear map from input uk and current state xk to output yk and next state
xk+1 .
The familiar state space model used to describe causal LTI systems is (for a system with a scalar
input uk and a scalar output yk ),
Here, xk is the state vector (assumed to have d entries), A is a d × d state transition matrix,
B and C T are d × 1 vectors, and D is a scalar (see Fig. 8.6). The integer d is called the state
dimension or system order. All finite dimensional linear systems can be described in this way.
The representation (8.20) is not at all unique. An equivalent system representation (yielding the
same input-output relationship) is obtained by applying a state transformation R (an invertible
d × d matrix) to define a new state vector x0k = Rxk . The equivalent system is
x0k+1 = A0 x0k + B0 uk
yk = C0 x0k + Duk
The eigenvalues of A remain invariant under this transformation since R−1 AR is a similarity
transformation. The eigenvalues of A are directly related to the poles of the system; for stability,
they are required to be bounded by 1.
The impulse response of this system is
CA2 B
H
h = [··· 0 D CB CAB ···] . (8.21)
The realization problem is to find a state space representation that matches a given impulse
response. As pointed out above, this representation is not unique.
Then, using (8.21) and comparing to (8.22) shows that H has a factorization as
H = OC
For a minimal realization, C and O have by definition full rank d. Since H is an outer product
of rank d matrices, it must be of rank d itself. Even for minimal realizations, there is of course
1. Given the impulse reponse, construct the Hankel matrix H as in (8.22). Determine the
rank d, and any factorization H = OC, where O and C are of full rank d. The SVD is a
robust tool for doing this.
2. At this point, we know that C and O have the shift-invariant structure of equation (8.23).
Use this property to derive
OA = O↑ ⇒ A = O† O↑
Because O is of full row rank d, we have O† = (OH O)−1 OH . This determines A. The
matrices B, C and D follow simply as
B = C(:,1)
C = O(1,:)
D = h0
where the subscript (:, 1) denotes the first column of the associated matrix, and (1, :) the
first row.
In practice, H should have finite size. This issue can be dealt with relatively easily. Further,
in practice we may not have the impulse response, but only a single input signal u plus its
corresponding output signal y. With some effort, we can adapt the algorithm to this situation.
See Verhaegen [7, 8] for details.
8.7 NOTES
The ESPRIT algorithm was originally proposed by Roy and Kailath in [9, 10]. See [11, 12] for
overviews.
Delay estimation using ESPRIT was proposed in [13].
Bibliography
[1] R. Roy and T. Kailath, “ESPRIT – Estimation of Signal Parameters via Rotational Invari-
ance Techniques,” IEEE Trans. Acoust., Speech, Signal Proc., vol. 37, pp. 984–995, July
1989.
[2] M. Haardt and J. Nossek, “Unitary ESPRIT: how to obtain increased estimation accuracy
with a reduced computational burden,” IEEE Trans. Signal Proc., vol. 43, pp. 1232–1242,
May 1995.
[3] M. Zoltowski, M. Haardt, and C. Mathews, “Closed-form 2-D angle estimation with rect-
angular arrays in element space or beamspace via Unitary ESPRIT,” IEEE Trans. Signal
Proc., vol. 44, pp. 316–328, February 1996.
[4] P. Stoica and K. Sharman, “Maximum Likelihood methods for direction-of-arrival estima-
tion,” IEEE Trans. Acoust., Speech, Signal Proc., vol. 38, pp. 1132–1143, July 1990.
[5] T. Shan, M. Wax, and T. Kailath, “On spatial smoothing for direction-of-arrival estimation
of coherent signals,” IEEE Trans. Acoust. Speech Signal Proc., vol. 33, pp. 806–811, April
1985.
[6] U. Pillai and B. Kwon, “Forward/backward spatial smoothing techniques for coherent signal
identification,” IEEE Trans. Acoust., Speech, Signal Proc., vol. 37, pp. 8–15, January 1989.
[7] M. Verhaegen and P. Dewilde, “Subspace model identification. Part 1: The Output Error
state space model identification class of algorithms,” Int. J. Control, vol. 56, no. 5, pp. 1187–
1210, 1992.
[8] M. Verhaegen and P. Dewilde, “Subspace model identification. Part 2: Analysis of the ele-
mentary Output-Error state-space model identification algorithm,” Int. J. Control, vol. 56,
no. 5, pp. 1211–1241, 1992.
[9] R. Roy, A. Paulraj, and T. Kailath, “ESPRIT—a subspace rotation approach to estimation
of parameters of cisoids in noise,” IEEE Trans. Acoust., Speech, Signal Proc., vol. 34,
pp. 1340–1342, Oct. 1986.
[10] R. Roy, ESPRIT. PhD thesis, Stanford Univ., Stanford, CA, 1987.
[12] A. van der Veen, E. Deprettere, and A. Swindlehurst, “Subspace based signal analysis using
singular value decomposition,” Proceedings of the IEEE, vol. 81, pp. 1277–1308, Sept. 1993.
[13] A. van der Veen, M. Vanderveen, and A. Paulraj, “Joint angle and delay estimation using
shift-invariance properties,” subm. IEEE Signal Processing Letters, Aug. 1996.
Contents
9.1 Joint azimuth and elevation estimation . . . . . . . . . . . . . . . . . 169
9.2 Connection to the Khatri-Rao product structure . . . . . . . . . . . 173
9.3 Joint angle and delay estimation . . . . . . . . . . . . . . . . . . . . . 175
9.4 Joint angle and frequency estimation . . . . . . . . . . . . . . . . . . 180
9.5 Multiple invariances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
In the previous chapter, we have seen how direction finding of narrowband sources using a ULA,
delay estimation, and frequency estimation, all lead to a similar data model that shows shift-
invariance, and can be solved using the ESPRIT algorithm. In this chapter, we extend this to
a number of joint estimation techniques: determining the two-dimensional directions of arrival,
joint angle-delay estimation, and joint angle-frequency estimation. The data models related to
these applications have the same Khatri-Rao product structure, as well as the shift-invariance
structure. Therefore, the estimation algorithms can be based on an extension of the ESPRIT
algorithm to two dimensions. In many cases, the shift invariance is not even needed to enable
source separation: the Khatri-Rao product structure suffices.
dxz
z3
dxy
x3 y3
dxz
z1
dxy dxz z
2
x1 y1 d
x xy y2
2
Figure 9.2. Possible array configurations: a uniform rectangular array (URA), an L-shaped
array, and a +-shaped array.
obtained. Impinging on every array are d narrowband non-coherent signals sk (t). In a direct
extension of the data models in Chap. 8, we obtain the data model (ignoring the noise for the
moment)
X
= Ax S = AS X A
Y = Ay S = AΦS ⇔ Y = AΦ S . (9.1)
Z = A S = AΘS
z Z AΘ
We write A = Ax for brevity of notation. We will not assume a more detailed structure of A:
all structure in the problem is obtained by the assumption of shift invariance in the relation of
Ax to Ay and Az .
Fig. 9.2 shows some other antenna configurations that lead to the required shift invariance: a
Uniform Rectangular Array (URA), an L-shaped array, and a +-shaped array. The latter two
have a larger aperture for the same number of antennas, but it is sparsely filled, and the number
of baselines that are used is less: A has fewer rows. Hence, it is hard to say a priori which
geometry is preferred.
Due to the shift-invariance of the array, Ay = Ax Φ and Az = Ax Θ, where Φ and Θ are diagonal
Φ = diag(φ1 , φ2 , · · · , φd )
Θ = diag(θ1 , θ2 , · · · , θd )
The DOA problem is to estimate Φ and Θ from (X, Y, Z). From these matrices, the 2D angles
of arrival can directly be computed.
At this point, of course, the ESPRIT algorithm can be applied separately to (X, Y) and (X, Z).
This will produce two sets of angles. However, the angles are listed in random order. How can
the correct pairs be found?
Since there are d sources, we would like to reduce the problem to matrices of size d × d. As
before, this is done by computing an SVD. First, construct the combined data matrix
X
K= Y .
Z
In view of the model (9.1), we know that without noise, K has rank d. Therefore, we can
compute the ‘economy-size’ SVD,
H
K = UΣV
where U has d columns. In the case of noise, we need to actually compute the truncated SVD
where we take the dominant d singular components. Also, if we assume array configurations as
shown in Fig. 9.2, where X, Y, Z share some of the same elements, then the SVD is computed
on a data matrix K that contains only the unique elements.
Partitioning U in the same way as K gives
X Ux
H
Y Uy ΣV .
=
Z Uz
The d columns of U span the signal subspace. Comparing to the model (9.1), we find that there
must be a d × d invertible matrix T that maps one basis of the subspace to the other:
Ux Ax A
Uy = Ay T = AΦ T (9.3)
Uz Az AΘ
This implies
U K = ΣV = T−1 S
H H
Assuming Ux has a left inverse (this requires at least M ≥ d), compute My = U†x Uy and
Mz = U†x Uz , both of size d × d Then these have model
(
My = T−1 ΦT
(9.5)
Mz = T−1 ΘT
This shows that both My and Mz are jointly diagonalized by the same T. (Here, we mean
diagonalization by a similarity transform; we will see another form of diagonalization later on.)
These two equations are redundant: already one of the two will allow us to compute T. If the
eigenvalues Φ are distinct, then we can compute T from an eigenvalue decomposition of My ;
its eigenvector matrix T is unique up to a permutation and a scaling of its columns; this will
translate to an unknown scaling and permutation of the rows of S. The scaling can be fixed by
prior knowledge on the powers of the sources, or on the norms of the columns of A. In this case,
we can compute T from My and apply it to Mz to find Θ = TMz T−1 .
Similarly, if the eigenvalues Θ are distinct, we can compute T from an eigenvalue decomposition
of Mz , and then use T to compute Φ.
In either case, we find the correct correspondence between the entries of Φ and those of Θ, i.e.,
one pair for each source. This correct pairing does not happen if we compute the two eigenvalue
decompositions separately, as generally the eigenvalues will appear in random order.
With noise, we use the SVD to compute the dominant subspace U and proceed as above to
find My and Mz . However, now there is not a single T that exactly diagonalizes both matrices.
We would aim to find a single T to diagonalize as much as possible both matrices. This Joint
Approximate Diagonalization problem has several formulations and several algorithms have been
proposed, and a decent treatment warrants a separate chapter.
In one formulation, we can propose a QR decomposition of T−1 = QR (where Q is unitary and
R is upper triangular), so that
(
My = QRy QH
(9.6)
Mz = QRz QH
where Ry and Rz are upper triangular. This translates the problem into a joint Schur decom-
position. Since Q is unitary, it can be composed of 2 × 2 rotations (called Jacobi rotations),
which leads to numerically stable algorithms. The main diagonals of Ry and Rz give us Φ and
Θ, respectively.
where ◦ denotes the Khatri-Rao product (column-wise Kronecker product); see Sec. 5.1.6 for its
definition and some properties.
Likewise, we can write (9.3) compactly as
Ux
Uy = (F ◦ A)T
Uz
Note that this Khatri-Rao product structure is the only property that was needed to derive
the joint diagonalization model (9.5), via (9.4) and subsequently removing one matrix (Ux ) by
inversion. Thus, whenever we have this structure, we can transform it into joint diagonalization.
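As a quick numerical check of this structure (a sketch with arbitrary test sizes, not from the text), the stacked block column [A; AΦ; AΘ] indeed equals the Khatri-Rao product F ◦ A when F collects the rows [1, · · · , 1], [φ1, · · · , φd], [θ1, · · · , θd]; scipy.linalg.khatri_rao implements the column-wise Kronecker product.

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(2)
M, d = 5, 3

A = rng.standard_normal((M, d)) + 1j * rng.standard_normal((M, d))
phi = np.exp(1j * rng.uniform(0, 2 * np.pi, d))
theta = np.exp(1j * rng.uniform(0, 2 * np.pi, d))

F = np.vstack([np.ones(d), phi, theta])                         # 3 x d
stacked = np.vstack([A, A @ np.diag(phi), A @ np.diag(theta)])  # [A; A Phi; A Theta]

# Column-wise Kronecker (Khatri-Rao) product reproduces the stacked block column
print(np.allclose(stacked, khatri_rao(F, A)))
```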
Further, note that here we expanded F into its rows; each row leading to a matrix of the form
shown in (9.4). But the form U = (F ◦ A)T is multilinear: we can also expand along T or
A, and arrive at a joint diagonalization model. E.g., expanding T = [t1 , · · · , td ] and likewise
U = [u1 , · · · , ud ] gives
uk = (F ◦ A) tk   ⇔   Uk = A Dk F^T ,    k = 1, · · · , d ,
where Uk is an M × 3 matrix such that vec(Uk) = uk, and Dk is a diagonal matrix such that
diag(Dk) = tk. We used (5.8) to derive this. If F^T is square and invertible (but here it is not;
this requires some preprocessing), then we can premultiply the set of matrices with a left inverse
of U1, U1^† = F^{-T} D1^{-1} A^†, to obtain
U1^† Uk = F^{-T} D1^{-1} Dk F^T ,    k = 1, · · · , d ,
which is again a joint diagonalization (by similarity, now with eigenvector matrix F^{-T}).
Returning to the subspace blocks in (9.3), we can also premultiply Ux, Uy and Uz by Ux^H rather
than by a pseudo-inverse.¹ Since Ux = AT, this gives three d × d matrices of the form
Mx = Ux^H Ux = B T ,    My = Ux^H Uy = B Φ T ,    Mz = Ux^H Uz = B Θ T ,
where B = T^H A^H A is a square invertible matrix (assuming A is tall and full rank).
¹ This still singles out one data matrix and transfers its noise over to the other matrices. It would be better to
compute the column span of A from an SVD of [Ux , Uy , Uz ], i.e., stacked in a block row, and use this joint
estimate to reduce the dimensions to size d × d.
This is also a joint diagonalization problem, but now “by congruence” and not by similarity.
Also this problem has been well studied and several algorithms have been proposed. One tech-
nique to proceed is to insert QR factorizations B = QR and T = R′Z (where R, R′ are upper
triangular and Q, Z are unitary matrices). Then the problem has the form
Mx = Q Rx Z ,    My = Q Ry Z ,    Mz = Q Rz Z ,    (9.7)
where Rx , Ry , Rz are upper triangular. We thus need to find unitary matrices Q, Z to make
Mx , My , Mz upper triangular. From the main diagonals of Rx , Ry and Rz , we can recover Φ
and Θ.
This problem is a “joint” generalized Schur decomposition, see Sec. 5.7. The matrices Q, Z can
be found using a generalization of the QZ algorithm. Note that a good starting point for the
iteration is available by first computing the solution to a single generalized eigenvalue problem.
Comparing (9.7) to (9.6), we see that two unitary matrices Q, Z are used instead of only one
Q. At the same time, three matrices Mx , My , Mz are available, rather than two. The number
of degrees of freedom in two unitary matrices is about equal to that of a single general matrix.
Thus, in the present case, the number of equations and number of unknowns has about the same
balance.
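As an aside (a sketch, not the iterative algorithm described above), in the noise-free case scipy.linalg.qz applied to the pair (Mx, My) already returns unitary Q, Z that triangularize both matrices, and the same Q, Z also (nearly) triangularize Mz; this is exactly the kind of starting point mentioned above. Here B is generated as an arbitrary invertible matrix rather than as T^H A^H A, which does not affect the illustration.

```python
import numpy as np
from scipy.linalg import qz

rng = np.random.default_rng(3)
d = 4

phi = np.exp(1j * rng.uniform(0, 2 * np.pi, d))
theta = np.exp(1j * rng.uniform(0, 2 * np.pi, d))
B = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))   # plays the role of T^H A^H A
T = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

Mx, My, Mz = B @ T, B @ np.diag(phi) @ T, B @ np.diag(theta) @ T

# Generalized (QZ) Schur decomposition of the pair (Mx, My):
# Mx = Q Rx Z^H, My = Q Ry Z^H with Rx, Ry upper triangular, Q, Z unitary
Rx, Ry, Q, Z = qz(Mx, My, output='complex')

# In the noise-free case the same (Q, Z) also triangularizes Mz
Rz = Q.conj().T @ Mz @ Z
print(np.allclose(np.tril(Rz, -1), 0, atol=1e-8))

# Paired estimates from the main diagonals
print(np.round(np.diag(Ry) / np.diag(Rx), 3))   # phi (in some common order)
print(np.round(np.diag(Rz) / np.diag(Rx), 3))   # theta, paired with phi above
```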
We have seen that the Khatri-Rao structure of the form X = (F ◦ A)T is the root to the
joint diagonalization model. This structure is an instance of a more general canonical polyadic
decomposition (CPD) of the data matrix, in this case of a third order tensor. A CPD aims to
find a low multi-linear rank approximation of a given tensor. The model in the present context
is similar to parallel factor analysis (PARAFAC). A CPD is more general because it allows more
than 3 dimensions, gives exact conditions on the dimensions in relation to the rank such that
there is a ‘unique’ decomposition, and allows for sparse representations. The tensor framework
also gives access to other decompositions such as a block term decomposition (BTD).
A second application that leads to a joint diagonalization problem is the following. In Sec. 8.3,
we studied the multipath estimation problem. Starting from a channel estimate h(t), we want
to estimate the individual path delays, directions of arrival, and path gains of each ray, as shown
in Fig. 9.3. With multiple antennas, the channel model is
h(t) = Σ_{i=1}^{r} a(αi) βi g(t − τi) .
Figure 9.3. Multipath channel model: the source sk passes through the pulse shape g(t) and arrives via rays with parameters (αi, τi, βi) at the antennas x1, · · · , xM; a space-time equalizer produces the estimate ŝk.
Here, the pulse shape g(t) is known, and the antenna response vector a(α) is known as function
of α. We assume a ULA with interelement spacing ∆ wavelengths, so that
a(θ) = [ 1, θ, θ^2, · · · , θ^{M−1} ]^T ,    θ = e^{j2π∆ sin(α)} .
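A minimal sketch of this steering vector, with an assumed number of antennas M and spacing ∆ (in wavelengths):

```python
import numpy as np

def ula_response(alpha_deg, M=6, delta=0.5):
    """a(theta) = [1, theta, ..., theta^(M-1)]^T for an M-element ULA with spacing
    delta wavelengths, where theta = exp(j 2 pi delta sin(alpha))."""
    theta = np.exp(2j * np.pi * delta * np.sin(np.deg2rad(alpha_deg)))
    return theta ** np.arange(M)

print(np.round(ula_response(20.0), 3))   # steering vector for a 20-degree direction
```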
Assume h(t) is sampled above the Nyquist rate and that we collect N samples. Also assume
that the entire support of each g(t − τi ) is contained in these samples. We stack the samples of
h(t) into a vector as before,
h = [ h0 ; h1 ; · · · ; hN−1 ] = [ h(0) ; h(T) ; · · · ; h((N − 1)T) ] = Σ_{i=1}^{r} [ gτi ⊗ a(αi) ] βi = [G ◦ A] b .    (9.8)
The N samples of g(t − τi) are stacked in the vector gτi.
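The following sketch assembles h = [G ◦ A]b as in (9.8) for a toy scenario; the raised-cosine pulse, the parameter values, and the helper functions are illustrative assumptions and not part of the text.

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(4)
M, N, r, T = 6, 32, 3, 1.0          # antennas, samples, rays, sample period (assumed values)

def pulse(t):
    """Illustrative pulse g(t): a raised cosine supported on [0, 4T)."""
    return np.where((t >= 0) & (t < 4 * T), 0.5 - 0.5 * np.cos(2 * np.pi * t / (4 * T)), 0.0)

def ula_response(alpha, delta=0.5):
    return np.exp(2j * np.pi * delta * np.sin(alpha)) ** np.arange(M)

alphas = rng.uniform(-np.pi / 3, np.pi / 3, r)              # ray angles
taus = rng.uniform(0, 8 * T, r)                             # ray delays (support stays inside N samples)
b = rng.standard_normal(r) + 1j * rng.standard_normal(r)    # ray fadings

t = np.arange(N) * T
G = np.stack([pulse(t - tau) for tau in taus], axis=1)      # N x r, columns g_tau_i
A = np.stack([ula_response(a) for a in alphas], axis=1)     # M x r, columns a(alpha_i)

h = khatri_rao(G, A) @ b                                    # MN x 1, the model (9.8)
print(h.shape)
```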
The equation shows the Khatri-Rao structure, which in the previous section we established to
be the root of the joint diagonalization model. How does this work out here?
– With just a single matrix H, we do not have sufficient information to uniquely determine
  its factorization: there is no “joint” diagonalization.
– Alternatively, we can consider the individual blocks hi of h (one per time sample), which
  have the model
  hi = A diag(gi) b ,    i = 0, · · · , N − 1 .
  Here, joint diagonalization also doesn’t work because we just have single vectors hi, not
  matrices.
Thus, before we can proceed, we need to find a way to expand a single vector into a matrix. We
have seen in Chap. 8 that if A corresponds to a ULA (or at least has a doublet structure, which
is sufficient), we can apply spatial smoothing to do this. Alternatively, after a DFT and
deconvolution, we can use the similar structure that results for G to do the same kind of smoothing.
Indeed, this is what we did in Sec. 8.3 on single-antenna data. There, we applied a DFT to the
time domain samples in h, resulting in a vector z, and then in (8.18) constructed a matrix Z
from m shifts of z so that
Extending the results in Sec. 8.3 to multiple antennas, we see that Z has a model
where
f(φ) = [ 1, φ, φ^2, · · · , φ^{N−1} ]^T ,    φ := e^{−j2πτ/(NT)} ,    Fφ = [ f(φ1), · · · , f(φr) ] .
Thus consider Z in (9.10), and assume m is large enough such that Z has rank r (in the noise-free
case). As usual, we proceed by computing the (truncated) SVD of Z,
Z = U Σ V^H ,
where we truncate to rank r, i.e., U has r columns. Comparing to the model (9.11) we see
U = (F ◦ A) T ,    B = T U^H Z .
To estimate T, we form two types of selection matrices: a pair to select submatrices of F, and
a pair to select from A,
To estimate Φ, we take submatrices consisting of the first and respectively last M (N − 1) rows
of U, i.e.,
Uxφ = Jxφ U , Uyφ = Jyφ U ,
whereas to estimate Θ we stack, for all N blocks, its first and respectively last M − 1 rows:
This is again a joint diagonalization problem, where a single matrix T can diagonalize two data
matrices. Having found the eigenvalue matrices Φ and Θ, we can retrieve the delays and angles
of each ray. The correct pairing of angles to delays follows simply from the fact that they share
the same eigenvectors.
More multipath With a straightforward extension of this approach, we can estimate the mul-
tipath parameters of d sources, where each source is received via a superposition of rays, each
with its own angle θi , delay τi , and fading βi . The corresponding data model is
X = (G ◦ A)BJS , (9.14)
where B is the diagonal matrix containing all fading parameters, and J is an r × d selection
matrix which assigns each ray to one of the sources. A similar model was derived in (4.23).
The presence of J and S requires some additional processing steps: we try to estimate r > d
components from a rank-d matrix. This is the same problem as we encountered in the “coherent
multipath” problem in Sec. 8.2.7, and we can proceed in the same way.
In summary, an important property of the joint processing is that it allows us to simultaneously
identify parameters of many more rays than we have antennas: by combining with the time
domain, we extend A to G ◦ A.
By combining with Sec. 9.1, the algorithm has an elegant extension to the estimation of delays
and both azimuth and elevation angles. This results in a joint diagonalization problem of three
matrices. Similar generalizations occur if we have a non-uniform array with multiple baselines.
Exploiting fading diversity At the start of the section, we mentioned the multipath model
h = [G ◦ A] b .    (9.9)
Since only a single channel vector is available, we needed to exploit shift invariance of A or
(after the DFT) G to expand this to a matrix that admits joint diagonalization.
However, in mobile communication we often experience fast fading. In this case, angles and
delays of h remain more or less constant over time (in the order of microseconds), but b fluctu-
ates. If we obtain multiple channel estimates hk with constant angles and delays, but each with
different fading amplitudes bk , then
Joint diagonalization algorithms will allow us to estimate the columns of A and the rows of
G, without making any further assumptions on the structure of A or G: we do not need shift
invariance.
Fading diversity can also be exploited in other applications. As a simple example, consider d
independent narrowband unit-power sources impinging on an antenna array. In the kth trans-
mission block, the received data is
(ignoring the noise), where bk are the complex amplitudes, including the source powers. Subject
to fading, we assume that these are different for each k. Then the correlation matrix of xk [n] is
Rk = A diag(|bk|^2) A^H .
This again leads to a joint diagonalization problem. We do not need to make assumptions on
the structure of A to be able to separate the sources.
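A small sketch of this scenario (with assumed sizes and an unstructured random A): several correlation matrices Rk are generated, and the model is verified by checking that one matrix, here taken as the pseudo-inverse of the true A, diagonalizes all of them by congruence. A joint diagonalization algorithm would estimate such a separating matrix blindly from the Rk alone.

```python
import numpy as np

rng = np.random.default_rng(5)
M, d, K = 8, 3, 5            # antennas, sources, transmission blocks (assumed values)

A = rng.standard_normal((M, d)) + 1j * rng.standard_normal((M, d))   # unstructured array response

# One correlation matrix per block, R_k = A diag(|b_k|^2) A^H
Rs = []
for k in range(K):
    b = rng.standard_normal(d) + 1j * rng.standard_normal(d)          # block fading amplitudes
    Rs.append(A @ np.diag(np.abs(b) ** 2) @ A.conj().T)

# All R_k are diagonalized "by congruence" by the same matrix; here we only verify
# the model using W = A^dagger (which a blind algorithm would have to estimate)
W = np.linalg.pinv(A)
print(all(np.allclose(W @ R @ W.conj().T, np.diag(np.diag(W @ R @ W.conj().T)), atol=1e-10)
          for R in Rs))
```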
A somewhat different scenario than what we considered before, which however leads to the same
type of data models (and thus the same beamforming algorithms), is the following. Suppose
that we observe a frequency band of interest, and want to separate all sources that are present.
Assume that the sources are narrowband, typically with different carrier frequencies, but that
the spectra might be partly overlapping. The objective is to construct a beamformer to separate
the sources based on differences in angles or carrier frequencies. This is a problem of joint angle-
frequency estimation [1, 2]. We will assume that the sample rates in this application are much
higher than the data rates of each source, and that there is only coherent multipath, although
generalizations are possible.
Suppose that the narrowband signals have a bandwidth of less than 1/T, so that they can be
sampled with a period T to satisfy the Nyquist rate. We normalize to T = 1. Also assume that
the bandwidth of the band to be scanned is P times larger: after demodulation to IF we have to
sample at rate P . Without multipath, the data model of the modulated sources at the receiver
is
d
X 2π
x(t) = a(θi )βi ej P fi t
si (t)
1
X = [ Aθ B s(0)                    Aθ B Φ^P s(1)                    · · ·
      Aθ B Φ s(1/P)                Aθ B Φ^{P+1} s(1 + 1/P)          · · ·
      ⋮                            ⋮
      Aθ B Φ^{m−1} s((m−1)/P)      Aθ B Φ^{P+m−1} s(1 + (m−1)/P)    · · · ]
Let us assume at this point that P ≫ m. In that case, s(t) is relatively bandlimited with respect
to the observed band, which allows us to make the crucial assumption that
s(t) ≈ s(t + 1/P) ≈ · · · ≈ s(t + (m−1)/P) .
Fφ is as in (9.11), and only has a different interpretation: φ is now related to the carrier
frequency. FP is similar to Fφ except for a transpose and different powers, and the pointwise
multiplication represents the modulation on the signals. Obviously, beamforming will not remove
this modulation but after estimating Φ, we can easily correct for it.
If we do consider coherent multipath, the data model becomes
The column span of this model has precisely the same structure as X in (9.14) before, and hence
we can use the same algorithm to find the beamformer.
If sources are assumed not to have equal carrier frequencies and m > d, we can separate them
based on the structure of Fφ only. In this case we do not need the array structure and an
arbitrary array can be used, but we do not recover the DOAs. If frequencies can be close,
however, we will have to separate the signals based on differences in angles as well. It is then
also necessary to restore the rank of X to r by spatial smoothing.
TBD
Direct extension: Using both short and long baselines to improve resolution
Swindlehurst: MI-ESPRIT [?]
Lemma (in context of JAFE)
9.6 NOTES
Section 9.1 discussed the use of antenna triplets to derive the 2D ESPRIT algorithm. Instead of
triplets, we can also consider two ULAs oriented in two different directions, e.g., in an L-shape
or a +-shape. Extensions to more general 2-D arrays on which the ESPRIT algorithm works are
straightforward to derive, see e.g., [3]. The main issues are the preservation of shift-invariance
properties, and the correct pairing of the estimated path parameters using a coupled eigenvalue
method.
Joint angle-delay estimation is covered in [4–8].
The IQML-2D method of [9] was originally developed for estimating the two-dimensional modes
of sinusoids in Gaussian noise. As it is based on ML, it is expected to show high performance
and convergence to the CRB for large number of samples. It can be used to determine angles
and delays if both manifolds have Vandermonde structure.
Joint diagonalization problems such as encountered in this chapter have received wide interest
in the 1990s. If eigenvalues are distinct, then already a single matrix allows to compute the
separating beamformer. To achieve this situation, one line of approaches was to form linear
combinations of the two matrices to ensure that the combination has distinct eigenvalues: see
e.g., [3]. Several Jacobi-type algorithms have been proposed as well, although some of these
assume that T is a unitary matrix [10–23].
Although these algorithms usually give good performance, the problem of joint diagonalization
with non-hermitian matrices has not yet been optimally solved. It is very relevant to study such
overdetermined eigenvalue problems. Indeed, a third matrix arises if we use a two-dimensional
uniform antenna array, by which we can measure both azimuth and elevation, or any other
array with multiple independent baselines. We will see several other examples of joint eigenvalue
problems later in this book.
Bibliography
[1] M.D. Zoltowski and C.P. Mathews, “Real-time frequency and 2-D angle estimation with
sub-Nyquist spatio-temporal sampling,” IEEE Trans. Signal Proc., vol. 42, pp. 2781–2794,
October 1994.
[2] K.-B. Yu, “Recursive super-resolution algorithm for low-elevation target angle tracking in
multipath,” IEE Proceedings - Radar, Sonar and Navigation, vol. 141, pp. 223–229, August
1994.
[3] M.D. Zoltowski, M. Haardt, and C.P. Mathews, “Closed-form 2-D angle estimation with
rectangular arrays in element space or beamspace via Unitary ESPRIT,” IEEE Trans.
Signal Proc., vol. 44, pp. 316–328, February 1996.
[5] J. Gunther and A.L. Swindlehurst, “Algorithms for blind equalization with multiple an-
tennas based on frequency domain subspaces,” in Proc. IEEE ICASSP, (Atlanta, GA),
pp. 2421–2424, 1996.
[6] M. Wax and A. Leshem, “Joint estimation of directions-of-arrival and time-delays of multi-
ple reflections of known signal,” IEEE Trans. Signal Proc., vol. 45, pp. 2477–2484, October
1997.
[7] M.C. Vanderveen, C.B. Papadias, and A. Paulraj, “Joint angle and delay estimation
(JADE) for multipath signals arriving at an antenna array,” IEEE Communications Letters,
vol. 1, pp. 12–14, January 1997.
[8] A.J. van der Veen, M.C. Vanderveen, and A. Paulraj, “Joint angle and delay estimation
using shift-invariance techniques,” IEEE Trans. Signal Proc., vol. 46, pp. 405–418, February
1998.
[9] M.P. Clark and L.L. Scharf, “Two-dimensional modal analysis based on maximum likeli-
hood,” IEEE Trans. Signal Processing, vol. 42, pp. 1443–52, June 1994.
[10] A.J. van der Veen, P.B. Ober, and E.F. Deprettere, “Azimuth and elevation computation in
high resolution DOA estimation,” IEEE Trans. Signal Proc., vol. 40, pp. 1828–1832, July
1992.
[11] M. Haardt, Efficient One-, Two-, and Multidimensional High-Resolution Array Signal Pro-
cessing. PhD thesis, TU München, Munich, Germany, 1997.
[12] Y. Hua, “Estimating two-dimensional frequencies by matrix enhancement and matrix pen-
cil,” IEEE Trans. Signal Proc., vol. 40, pp. 2267–2280, September 1992.
[13] J.F. Cardoso and A. Souloumiac, “Blind beamforming for non-Gaussian signals,” IEE Proc.
F (Radar and Signal Processing), vol. 140, pp. 362–370, December 1993.
[14] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, “A blind source separa-
tion technique using second-order statistics,” IEEE Trans. Signal Proc., vol. 45, pp. 434–
444, February 1997.
[16] L. De Lathauwer, Signal Processing Based on Multilinear Algebra. PhD thesis, KU Leuven,
Leuven, Belgium, 1997.
[17] A.J. van der Veen and A. Paulraj, “An analytical constant modulus algorithm,” IEEE
Trans. Signal Processing, vol. 44, pp. 1136–1155, May 1996.
[19] M.T. Chu, “A continuous Jacobi-like approach to the simultaneous reduction of real ma-
trices,” Lin. Alg. Appl., vol. 147, pp. 75–96, 1991.
[21] B.D. Flury and B.E. Neuenschwander, “Simultaneous diagonalization algorithms with ap-
plications in multivariate statistics,” in Approximation and Computation (R.V.M. Zahar,
ed.), pp. 179–205, Basel: Birkhäuser, 1995.
[22] J.-F. Cardoso and A. Souloumiac, “Jacobi angles for simultaneous diagonalization,” SIAM
J. Matrix Anal. Appl., vol. 17, no. 1, pp. 161–164, 1996.
[23] M. Wax and J. Sheinvald, “A least-squares approach to joint diagonalization,” IEEE Signal
Proc. Letters, vol. 4, pp. 52–53, February 1997.
FACTOR ANALYSIS
Contents
10.1 The Factor Analysis problem . . . . . . . . . . . . . . . . . . . . . . . 186
10.2 Computing the Factor Analysis decomposition . . . . . . . . . . . . . 189
10.3 Rank detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
10.4 Extensions of the Classical Model . . . . . . . . . . . . . . . . . . . . 199
10.5 Application to interference cancellation . . . . . . . . . . . . . . . . . 201
10.6 Application to array calibration . . . . . . . . . . . . . . . . . . . . . . 207
10.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Many array signal processing algorithms are at some point based on the eigenvalue decompo-
sition, which is used e.g., to make a distinction between the “signal subspace” and the “noise
subspace”. By using orthogonal projections, part of the noise is projected out and only the signal
subspace remains. This can then be used for applications such as high-resolution direction-of-
arrival estimation, blind source separation, etc. In these applications, it is commonly assumed
that the noise is spatially white. However, this is valid only after suitable calibration.
Factor analysis considers covariance data models where the noise is uncorrelated but has un-
known powers at each sensor, i.e., the noise covariance matrix is an arbitrary diagonal with
positive real entries. In these cases the familiar eigenvalue decomposition (EVD) has to be
replaced by a more general “Factor Analysis” decomposition (FAD), which then reveals all rel-
evant information. It is a very relevant model for the early stages of data processing in radio
astronomy, because at that point the instrument is not yet calibrated and the noise powers on
the various antennas may be quite different.
As it turns out, this problem has been studied in the psychometrics, biometrics and statistics
literature since the 1930s (but usually for real-valued matrices) [1, 2]. The problem has received
much less attention in the signal processing literature. In this chapter, we describe the FAD,
some applications, and some algorithms for computing it.
Consider an array of P sensors receiving the signals of Q sources, with data model
x[n] = A s[n] + n[n] ,
where A = [a1, · · · , aQ] contains the array response vectors. In this model, A is unknown, and
the array response vectors are unstructured, i.e., we do not consider a directional model for
them. The source vector s[n] and noise vector n[n] are considered zero mean i.i.d. complex
Gaussian, i.e., the corresponding covariance matrices are diagonal.
The data model leads to a model for the data covariance matrix as
R = A Σs A^H + Σn ,
where Σs is the (diagonal) source covariance matrix, and Σn is the (diagonal) noise covariance
matrix.
For given R, can we estimate A, Σs, and Σn? If A has no special structure (such as imposed by
a parametrically known array response vector), then we cannot distinguish A from A′ = A Σs^{1/2}:
without loss of generality, we can scale the source signals such that the source covariance matrix
Σs is the identity.
Therefore, in this section we will consider a data covariance matrix of the form
R = A A^H + D    (10.2)
where D is the (diagonal) noise covariance matrix, and A has full column rank Q. We assume
Q < P so that AAH is rank deficient. Many signal processing algorithms are based on computing
an eigenvalue decomposition of R as R = UΛUH , where U is unitary and Λ is a diagonal matrix
containing the eigenvalues in descending order.
• If D = 0 (no noise), then R has rank Q and the eigenvalue decomposition specializes to
" #" #
H Λs UHs
R = UΛ0 U = [Us Un ]
0 UHn
where Λs contains the Q nonzero eigenvalues and Us the corresponding eigenvectors. The
range of Us is called the signal subspace, its orthogonal complement Un the noise subspace.
Since without noise R = AAH , we see that the column span of Us equals the column span
of A, i.e., ran(Us ) = ran(A).
• For spatially white noise, D = σ^2 I, we can write D = σ^2 U U^H, and the eigenvalue decom-
position becomes
R = U Λ U^H = U (Λ0 + σ^2 I) U^H = [ Us Un ] [ Λs + σ^2 I  0 ; 0  σ^2 I ] [ Us^H ; Un^H ] .    (10.3)
Hence, all eigenvalues are raised by σ^2, but the eigenvectors are unchanged. Algorithms
based on Us can thus proceed as if there was no noise, which explains the widespread use of
the EVD and related subspace estimation algorithms in many array signal processing applications.
See e.g., Chap. 8.
• If the noise is not uniform, then D is an unknown diagonal matrix, and the EVD of R
does not reveal the signal subspace Us .
In practice, we are given a finite number of samples x[n], n = 0, · · · , N − 1, and compute the
sample covariance matrix
R̂ = (1/N) Σ_{n=0}^{N−1} x[n] x[n]^H .
For large enough N , this estimate is close, but not quite equal, to R.
The objective for factor analysis is, for given R̂, to identify A and D, as well as the factor
dimension Q. This can be seen as an extension of the eigenvalue decomposition, to be used if
the noise covariance is not σ 2 I but an unknown diagonal.
It is clear that for an arbitrary Hermitian matrix R, the factorization R = AA^H + D can exist in
its exact form only for Q ≥ P, in which case we can set D = 0 (or any other value), which makes
the factorization useless. Hence, for a noise-perturbed matrix, we wish to detect the smallest
Q which gives a “reasonable fit”, and we will assume that Q < P is sufficiently small so that
unique decompositions exist. What we consider reasonable depends on N, as the accuracy of R̂
(or: its covariance) scales with 1/N.
The number of available observations is equal to the number of (real) parameters in R̂, which
is P (real) entries on the main diagonal and P (P − 1) (real) parameters for the off-diagonal
(complex) entries, taking into account Hermitian symmetry. In total these are P 2 observations.
The number of unknowns is 2P Q (real) parameters for A, and P parameters for D, minus
the number of constraints to make A unique. A constraint which is often used is to make the
columns of A′ := D^{-1/2} A orthogonal, or equivalently, that A^H D^{-1} A is diagonal (this is motivated
in Sec. 10.1.3 below). This gives Q^2 − Q constraints on the parameters of A. Further restricting
the first row of A to be real gives another Q constraints. In total, the number of equations
minus the number of unknowns is
s = P^2 − (2PQ + P − Q^2) = (P − Q)^2 − P .
This number is also called the degree of freedom, and plays a role in the asymptotic modeling of
the likelihood. Requiring s > 0 leads to the condition Q < P − √P. This is an upper bound on
the factor rank.
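A tiny helper (an illustration, not from the text) that evaluates this degree of freedom and the resulting upper bound on the factor rank:

```python
import numpy as np

def fa_degrees_of_freedom(P, Q):
    """Number of equations minus unknowns for complex factor analysis: s = (P - Q)^2 - P."""
    return (P - Q) ** 2 - P

def fa_qmax(P):
    """Largest integer Q that still satisfies Q < P - sqrt(P), i.e. s > 0."""
    return int(np.ceil(P - np.sqrt(P))) - 1

for P in (8, 20, 100):
    print(P, fa_qmax(P), fa_degrees_of_freedom(P, fa_qmax(P)))
```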
Even if we satisfy this constraint, D is not always unique, as seen from the following example.
Consider R = A1 A1^H + D1, where
A1 = [ 1 0 ; 1 1 ; ⋮ ⋮ ; 1 1 ]
(size P × 2: the first column is all ones, the second column is all ones except for a zero in the first entry).
Then we also have R = A2 A2^H + D2, where
A2 = √2 [ 1/2 ; 1 ; ⋮ ; 1 ] ,    D2 = D1 + (1/2) e1 e1^T ,
and ei is the ith column of the identity matrix. The problem in this case is caused by a
submatrix of A1 being rank-deficient. This can be considered an uncommon technicality that
can be detected after the factors have been estimated. Throughout the rest of the chapter, we
assume that D can be identified uniquely.
10.1.3 Constraints on A
If D is identifiable, then A is unique up to a rotation Q. We can make A unique by adding
additional constraints. This essentially amounts to choosing a non-redundant parametrization.
Not all algorithms require this, but it may be needed to avoid singularities during the compu-
tation of the Cramer-Rao Bound (CRB) or when we use Newton gradient descent techniques.
For complex data, Q^2 constraint equations are needed. Common constraints are to force the
columns of A to be orthogonal with respect to a certain weight matrix W > 0, i.e. to require
that A^H W A is diagonal.
In more detail, suppose we have estimated D, then we can whiten the noise covariance matrix
in R:
R̃ := D^{-1/2} R D^{-1/2} = (D^{-1/2} A)(A^H D^{-1/2}) + I ,
and identify D^{-1/2} A V = Ũ, or A = D^{1/2} Ũ V^H, where V is an arbitrary unitary factor. If we
choose V = I, we obtain that A^H D^{-1} A = Λ̃s is diagonal. We can use this as a constraint to obtain a
more unique parametrization of A. Note that A is not yet quite unique, because in the complex
case each column of A can be scaled by an arbitrary complex phase, and the columns may be
reordered as well.
If we compute a matrix A without satisfying constraints, the required transformation Q such
that A′ = AQ satisfies the constraints is easily determined afterwards. Hence, in most algo-
rithms the constraints do not play a role. In the literature, constraints such as making A^H D^{-1} A
diagonal have been introduced in an attempt to interpret the resulting “latent factors”, but
without prior structural information on A, these attempts are often futile.
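A minimal sketch of determining this transformation afterwards (sizes and data are assumed): the rotation is taken as the unitary eigenvector matrix of the Hermitian matrix A^H D^{-1} A.

```python
import numpy as np

rng = np.random.default_rng(6)
P, Q_rank = 8, 3

A = rng.standard_normal((P, Q_rank)) + 1j * rng.standard_normal((P, Q_rank))
d = rng.uniform(1.0, 5.0, P)                      # diagonal of D (noise powers)

# A^H D^{-1} A is Hermitian; its (unitary) eigenvector matrix is the rotation we need
Mw = A.conj().T @ (A / d[:, None])
_, Qrot = np.linalg.eigh(Mw)
A_c = A @ Qrot                                    # constrained A' = A Q

C = A_c.conj().T @ (A_c / d[:, None])             # should now be (real) diagonal
print(np.allclose(C, np.diag(np.diag(C)), atol=1e-10))
```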
Factor analysis is a classical problem. It was introduced in 1904 [3] and over time, several
algorithms were proposed [4–6], all for real data matrices (although readily extended to the
complex case). In this section we briefly review some of these approaches.
Consider again the model
R = A A^H + D ,    (10.5)
where A has Q columns, and D is a diagonal matrix with positive diagonal elements. As data,
we are given a sample covariance matrix R̂ based on N samples,
R̂ = (1/N) Σ_{n=0}^{N−1} x[n] x[n]^H .
1. Detection: given R̂, estimate Q. The hypothesis that the factor rank is q is denoted by
Hq . We can formulate this as a likelihood ratio test.
Figure 10.1. Ad hoc method: away from the main diagonal, all submatrices have at most
rank Q. This can be used to estimate the main diagonal. The figure shows how
this is done for Q = 1.
If we separate detection from identification, then for the latter, the objective is to estimate the
factors A and D from R̂, where the number of columns Q of A is known. We first present
an ad-hoc algorithm, which gives some insight in the problem. Then, we look at Maximum
Likelihood (ML)-type algorithms, and in particular consider a Weighted Least Squares (WLS)
formulation that is minimized using fast-converging Gauss-Newton iterations.
The classical approach minimizes the least squares cost
min_{A,D} ‖ R̂ − A A^H − D ‖_F^2    (10.6)
by an alternating least-squares (ALS) approach, where ‖·‖_F is the Frobenius norm. First, for
a given A, (10.6) is minimized with respect to D and in the next stage, D is held constant and
a new A is found. Both problems can be optimized in closed form.
Let the subscript (k) denote the iteration count. The iteration steps are
D(k+1) := diag(R̂ − A(k) A(k)^H)    (10.7)
U(k+1) Λ(k+1) U(k+1)^H := R̂ − D(k+1)    [EVD]    (10.8)
A(k+1) := Us,(k+1) Λs,(k+1)^{1/2} ,    (10.9)
where U(k+1) and Λ(k+1) follow from an eigenvalue decomposition, and Us,(k+1) and Λs,(k+1)
contain the Q dominant eigenvectors and corresponding eigenvalues. A Weighted Least Squares
formulation could be considered instead of (10.6), leading to similar iterations, but involving the
EVD of D^{-1/2} R̂ D^{-1/2}, if we take D^{-1} as a weight.
The iteration is usually initialized by taking
As for most ALS approaches, the rate of convergence is slow (linear). An EVD is required at
each iteration, which makes it prohibitive for large problems.
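A compact NumPy sketch of the ALS iteration (10.7)–(10.9); the initialization (starting from D = 0, i.e., the dominant eigenpairs of R̂) and the synthetic test data are assumptions, since they are not specified here.

```python
import numpy as np

def fa_als(R_hat, Q, n_iter=200):
    """Alternating least squares for R ~ A A^H + D, following (10.7)-(10.9).
    Assumed initialization: D_(0) = 0, i.e. start from the top-Q eigenpairs of R_hat."""
    P = R_hat.shape[0]
    D = np.zeros(P)
    A = np.zeros((P, Q), dtype=complex)
    for _ in range(n_iter):
        # (10.8)-(10.9): EVD of R_hat - D, keep the Q dominant eigenpairs
        w, U = np.linalg.eigh(R_hat - np.diag(D))
        idx = np.argsort(w)[::-1][:Q]
        A = U[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
        # (10.7): re-estimate the noise powers from the residual diagonal
        D = np.real(np.diag(R_hat - A @ A.conj().T))
    return A, D

# Small synthetic test (sizes and distributions are illustrative)
rng = np.random.default_rng(7)
P, Q, N = 10, 2, 10000
A0 = rng.standard_normal((P, Q)) + 1j * rng.standard_normal((P, Q))
D0 = rng.uniform(1.0, 5.0, P)
S = (rng.standard_normal((Q, N)) + 1j * rng.standard_normal((Q, N))) / np.sqrt(2)
Noise = (rng.standard_normal((P, N)) + 1j * rng.standard_normal((P, N))) * np.sqrt(D0[:, None] / 2)
X = A0 @ S + Noise
R_hat = X @ X.conj().T / N

A_est, D_est = fa_als(R_hat, Q)
print(np.round(D_est, 2))
print(np.round(D0, 2))
```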
Parametrization Let us write the model as R(θ) = AAH + D, where the vector θ represents
the unknown parameters in the model. Since A is complex, a direct representation of its entries
gives complex parameters. We could represent them as independent real and purely imaginary
components, but a popular alternative is to represent them using Wirtinger operators [7, App.2],
[8]: for an unknown complex parameter θi we consider its conjugate θi∗ as an independent
parameter while real parameters are represented only once. Using this method we define the
parameter vector as
θ = [ θA ; θA∗ ; θD ] ,    (10.10)
where
θ A = vec(A)
θ A∗ = vec(A∗ )
θ D = diag(D) = d .
This parametrization is redundant: it does not implement the Q2 constraints we need to place
on A to make it unique. However, it is more convenient to do this at a later stage.
Using this parameterization and properties of Kronecker products (5.7) and (5.8), we have
r := vec(R) = (A∗ ⊗ IP) θA + (IP ◦ IP) θD .    (10.11)
To show how r depends on θA∗, let K be the exchange matrix defined by vec(A^T) = K vec(A)
(cf. (5.15)). Then we can also write r as
r = (IP ⊗ A) vec(A^H) + (IP ◦ IP) d = (IP ⊗ A) K θA∗ + (IP ◦ IP) θD .    (10.12)
(Recall the Wirtinger derivatives of a function f of a complex variable z = x + jy:
∂f/∂z = (1/2)(∂f/∂x − j ∂f/∂y) ,    ∂f/∂z∗ = (1/2)(∂f/∂x + j ∂f/∂y) .)
Stacking the three Jacobian blocks, the Jacobian of r with respect to θ is
J := ∂r/∂θ^T = [ JA  JA∗  JD ] ,    (10.13)
where
JA = A∗ ⊗ IP ,    JA∗ = (IP ⊗ A) K ,    JD = IP ◦ IP .    (10.14)
ML cost and Fisher score If we assume that the samples x are generated by zero mean
complex proper Gaussian sources, i.e.,
p(x; θ) = (1 / (π^P det(R))) exp[ −x^H R^{-1} x ] ,
The maximum likelihood approach aims to find a θ that maximizes this function. To this end,
we find the gradient of the likelihood function (called the Fisher score) and set it equal to zero.
For complex parameters, the Fisher score for a proper Gaussian distributed signal is defined as
g(θ) = [ gA ; gA∗ ; gD ] = (∂ log p(X; θ) / ∂θ)^H ,    i.e.,    [g(θ)]j = ∂ log p(X; θ) / ∂θj∗ .
Inserting (10.15), the jth entry of g(θ) can be evaluated as
[g(θ)]j = −N ∂ log det(R)/∂θj∗ − N ∂ tr(R^{-1} R̂)/∂θj∗ .
We need some results for matrix differentials [8, p.53]:
∂ det(R) = det(R) tr(R^{-1} ∂R) ,    ∂R^{-1} = −R^{-1} (∂R) R^{-1} .
This gives
[g(θ)]j = −N tr[ R^{-1} ∂R/∂θj∗ ] + N tr[ R^{-1} (∂R/∂θj∗) R^{-1} R̂ ] .
Next, we use some properties of Kronecker products (see Sec. 5.1.6):
tr(AB) = vec^H(A^H) vec(B) ,
tr(ABCD) = vec^H(A^H) (D^T ⊗ B) vec(C) .
This results in
[g(θ)]j = −N vec^H(∂R/∂θj∗) vec(R^{-1}) + N vec^H(∂R/∂θj∗) (R^{-T} ⊗ R^{-1}) vec(R̂)
        = N (∂vec(R)/∂θj∗)^H (R^{-T} ⊗ R^{-1}) vec(R̂ − R) .
Finally, stacking for all j and using (10.13), we find a compact expression for g(θ) as
g(θ) = N J^H (R^{-T} ⊗ R^{-1}) vec(R̂ − R) .    (10.16)
This is a general expression. Let us now look at our specific parametrization for R: inserting
(10.13) into (10.16), the elements of the Fisher score g(θ) become
gA = N (A^T R^{-T} ⊗ R^{-1}) vec(R̂ − R) = N vec[ R^{-1} (R̂ − R) R^{-1} A ]    (10.17)
gA∗ = gA^∗    (10.18)
gD = N vecdiag[ R^{-1} (R̂ − R) R^{-1} ] .    (10.19)
The ML technique requires us to set (10.17) and (10.19) equal to zero, but unfortunately this
does not produce a closed-form solution. One approach to numerically compute the ML estimate
is to consider Newton-Raphson-like algorithms, as these provide quadratic convergence. Besides
the gradient, we will also need an expression for the Hessian.
The Scoring Method The scoring algorithm is a variant of the Newton-Raphson algorithm
where the gradient is the Fisher score (10.16) and the Hessian is replaced by the Fisher infor-
mation matrix [9]. The Fisher information matrix (FIM) is defined as
F = −E[ ∂g(θ) / ∂θ^T ] ,
where the expectation is over the data (i.e., R̂). Inserting (10.16), and realizing that after the
expectation only the derivative of vec(R̂ − R) results in a nonzero contribution, gives
F = N J^H (R^{-T} ⊗ R^{-1}) J ,
where J is given by (10.13). The resulting iterations in the scoring algorithm are
θ(k+1) = θ(k) + µ(k) δ ,    (10.20)
where θ(k) is the current estimate of the parameters, µ(k) is a step size, and
δ = [ δA ; δA∗ ; δD ]
is the solution of
F(k) δ = g(k) ,    (10.21)
where g(k) = g(θ(k)) is the Fisher score and F(k) = F(θ(k)) is the FIM.
the parametrization is redundant (see Sec. 10.1.2), the FIM is singular. However, this does not
need to cause complications because (10.16) shows that g(k) is in the column span of F(k) , so
that the system of equations has a solution, and (taking the minimum-norm solution) standard
convergence results for the scoring method follow.
A problem with the scoring method is that the matrix F quickly becomes large, as its dimension
is equal to the number of unknown parameters. Solving (10.21) then becomes unattractive.
Similarly, we also do not want to directly work with R^{-T} ⊗ R^{-1} as it is a matrix of size P^2 × P^2.
Another problem is that R changes each iteration cycle and its inverse has to be recomputed
each time.
Covariance matching techniques We can view Factor Analysis as a special case of covariance
matching, as studied in Chap. 7. In this approach, the ML problem is replaced by a Weighted
Least Squares (WLS) fitting of the sample covariance. The large sample properties of the
estimators are the same. Solving this nonlinear least squares problem using gradient descent
techniques is closely connected to the scoring algorithm.
The corresponding Nonlinear Weighted Least Squares (NLWLS) problem is
θ̂ = arg min_θ ‖ W^{1/2} [r̂ − r(θ)] ‖^2 = arg min_θ [r̂ − r(θ)]^H W [r̂ − r(θ)] ,    (10.22)
where r = vec(R), r̂ = vec(R̂). The optimal weighting matrix W is the inverse of the covariance
matrix of r̂. We derived in Sec. 3.2.2 that this covariance is equal to C = (1/N)(R∗ ⊗ R). Because
we only have access to the sample covariance matrices R̂, we use instead
W = N (R̂∗ ⊗ R̂)^{-1} = N (R̂^{-∗} ⊗ R̂^{-1}) ,
and then θ̂ asymptotically (for large N ) converges to the optimal ML solution for a Gaussian
distributed data matrix.
This is precisely in context of [10], and we can use one of the algorithms proposed there: Gauss-
Newton iterations, the scoring algorithm, or sequential estimation algorithms. Here, we derive
the Gauss-Newton iterations.
Gauss-Newton algorithm for solving NLWLS For the Gauss-Newton iteration, the Hessian
is replaced by the Gramian of the Jacobians [11]. The updates are similar to the scoring method
updates (10.20):
θ (k+1) = θ (k) + µ(k) δ, (10.24)
where δ is the direction of descent. To find δ we need to solve
B(θ(k)) δ = g(θ(k)) ,    (10.25)
where
g(θ) = J^H(θ) W [r̂ − r(θ)]    (10.26)
B(θ) = J^H(θ) W J(θ) .    (10.27)
Closed-form solution for direction of descent A complicated derivation [12] that we omit
here shows how we can solve for δ D inside δ in closed form. Define
W̃ := R̂^{-1} − R̂^{-1} A (A^H R̂^{-1} A)^{-1} A^H R̂^{-1}
B̃D := JD^H (W̃^T ⊗ W̃) JD = W̃^T ⊙ W̃    [using JD = I ◦ I and (5.5); ⊙ denotes the entrywise product]
g̃D := JD^H (W̃^T ⊗ W̃) vec[ R̂ − R(θ) ] .
Alternating Weighted Least Squares (AWLS) algorithm If we take step size µ = 1, then
the closed-form result simplifies to
θD^{(k+1)} = θD^{(k)} + δD .
W̃ acting on R̂ can be interpreted as “projecting out” the contribution of the term A A^H in
R̂, after which the remaining term D can be estimated. The final simplification uses the fact
that W̃ R̂ W̃ = W̃.
The result can be formulated as the Alternating Weighted Least Squares (AWLS) algorithm [12].
First, for given D(k) , compute
U Λ U^H := D(k)^{-1/2} R̂ D(k)^{-1/2}    [EVD]
A(k+1) := D(k)^{1/2} Us (Λs − I)^{1/2} ,
where the first line represents an eigenvalue decomposition, and in the second line, Λs contains
the largest Q eigenvalues of Λ, and Us the corresponding eigenvectors. This step is similar to
the (prewhitened) alternating LS algorithm in Sec. 10.2.2. Alternatively, (10.28) could have
been used.
Next, let W = R̂^{-1}, and update the estimate of D:
These two steps are alternated until convergence. In this algorithm, all computations are on
matrices of size P × P , which makes the computational complexity of the same order as that of
an EVD: O(P 3 ).
10.2.4 Convergence
The following simulation experiment gives an indication on the convergence speed. We use
P = 100 sensors, N = 1000 samples. The matrix A is chosen randomly with a standard
complex Gaussian distribution (i.e. each element is distributed as CN (0, 1)) and D is chosen
randomly with a uniform distribution between 1 and 5.
Convergence is gauged by looking at the norm of the gradient.
AWLS is tested against a range of other algorithms which are described in [12]. In the graph, the
“Ad Hoc” method is the ALS, “Joreskog” is an implementation of WLS using Fletcher-Powell
iterations [13] as used in many standard toolboxes, while “CM” is the Constrained Maximization
algorithm [14], which was derived from the EM algorithm, is straightforward to implement, and
shows quadratic convergence. “KLD/EM” is another representative of an EM algorithm with a
straightforward implementation [15].
As seen in Fig. 10.2, the AWLS algorithm converges fastest (in 10-15 iterations), while the ALS
and EM algorithms have slow convergence (over 1000 iterations for large Q).
Figure 10.2. Convergence of the various algorithms: norm of the gradient versus the number of iterations (two panels).
The detection problem is to estimate the factor rank Q (i.e., the number of columns of A). In
array processing, this relates to detecting the number of sources that the array is exposed to. An
extensive literature exists on this topic; here we limit the discussion to a general likelihood ratio
test (GLRT) [16], which is used to decide whether the FA model fits a given sample covariance
matrix. We can use the GLRT to design a constant false alarm ratio detector. In the special case
where Q = 0, this test indicates whether there are any sources active during the measurement
(we detect whether R is diagonal). The largest permissible value of Q is that for which the
number of equations minus the number of unknown (real) parameters s = (P − Q)^2 − P > 0, or
Qmax < P − √P. For larger Q, there is no identifiability of A and D: any sample covariance
matrix R̂ can be fitted.
Let Rq denote the covariance matrix of the FA model with q sources,
Rq = A A^H + D ,    where A : P × q, D diagonal ,
and let CN (0, Rq ) denote the zero-mean complex normal distribution with covariance Rq . To
find Q using the GLRT, we define a collection of hypotheses
H0 : x(k) ∼ CN (0, R0 ) .
Under Hq , respectively H0 , the maximum values of the log-likelihood are (dropping constants)
Here, λ = L0 /Lq is the test statistic (likelihood ratio), and we will reject Hq and accept H0 if
λ > γ, where γ is a predetermined threshold. Typically, γ is determined such that we obtain an
acceptable “false-alarm” rate (i.e., the probability that we accept H0 instead of Hq , while Hq is
actually true). To establish γ, we need to know the statistics of λ under Hq .
Generalizing the results from the real-valued case [1, 2], we obtain that for moderately large N
(say N > 50), the test statistic 2 log(λ) has approximately a χ^2_s distribution, where s is equal
to “the number of free parameters” under Hq (the number of equations minus the number of
unknowns). For the complex case, we saw that this number is s = (P − q)^2 − P degrees of
freedom.
In view of results of Box and Bartlett, a better fit of the distribution of 2 log(λ) to a χ^2_s distri-
bution is obtained by replacing N in (10.30) by [1, 2]
N′ = N − (1/6)(2P + 11) − (2/3) Q .
To detect Q, we start with q = 0, and apply the test for increasing values of q until it is
accepted, or until q > Qmax . In that case, the hypothesis H0 is accepted, i.e., the given R̂ is an
unstructured covariance matrix. A disadvantage of this process is that the model parameters
for each q have to be estimated, which can become quite cumbersome if P is large.1
However, note that if the GLRT passes for a given estimate Q0 it also passes for any Q > Q0,
and if it fails it also fails for any Q < Q0. Therefore, instead of a linear search for Q we can use
a binary search. The maximum number of possible sources for FA is given by Qmax < P − √P.
In a binary search, we split the entire interval into two segments, and test on the boundary to
decide in which interval the solution must lie. Proceeding recursively in this way, the number
of needed FA estimates is on average log2 (Qmax ) + 1, which is reasonable even for large P .
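A skeleton of this binary search (an illustration; glrt_accepts is a hypothetical placeholder for the χ²-based test described above, assumed to accept every q larger than the first accepted one):

```python
import numpy as np

def fa_qmax(P):
    # Largest Q with (P - Q)^2 - P > 0, i.e. Q < P - sqrt(P)
    return int(np.ceil(P - np.sqrt(P))) - 1

def detect_rank(R_hat, glrt_accepts):
    """Binary search for the smallest factor rank accepted by the GLRT.

    glrt_accepts(R_hat, q) -> bool is a hypothetical placeholder for the chi-square
    test described above (fit the rank-q FA model and compare 2 log(lambda) with
    the threshold)."""
    lo, hi = 0, fa_qmax(R_hat.shape[0])
    if not glrt_accepts(R_hat, hi):
        return None            # even Qmax rejected: treat R_hat as unstructured (H0)
    while lo < hi:
        mid = (lo + hi) // 2   # test in the middle of the remaining interval
        if glrt_accepts(R_hat, mid):
            hi = mid
        else:
            lo = mid + 1
    return hi
```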
We present two extensions of the classical model: joint and extended factor analysis.
¹ Also, as for any sequential hypothesis test, the actual false alarm rate that is achieved is unknown, because
the tests are not independent.
Figure 10.3. Residual interference power for eigenanalysis, factor analysis, and whitened eigenanalysis (J = 8, Q = 1, nominal noise power 0 dB): left, versus the maximal deviation in noise power from nominal [dB] (N = 500); right, versus N (maximal deviation 3 dB).
The results are shown in Fig. 10.3. The left graph shows the residual interference power for
varying maximal deviations, the right graph shows the residual for varying number of samples
N , and a maximal deviation of 3 dB of the noise powers. The figures indicate that already for
small deviations of the noise powers it is essential to take this into account, by using the FAD
instead of the EVD. Furthermore, the estimates from the factor analysis are nearly as good as
can be obtained via whitening with known noise powers.
In the previous paragraph, we projected out the interference dimension, and this effectively
reduces the number of antennas (dishes) by the number of detected interferers. An alternative
is to use a reference antenna array, with antennas that receive a good copy of the interfering
signals, but have little gain towards the desired sky sources. So suppose we have a primary array
with p0 antennas, and a reference array with p1 antennas. The received signal model is
where the subscripts 0 and 1 refer to the primary and reference array, respectively, v0 (t) contains
the desired sky source signals, s(t) the q interfering signals, and ni (t) the noise on each array.
Collecting all antenna signals into a single vector x(t), we can write
where A : p × q has q columns corresponding to the q interferers. The covariance matrix of x(t)
can be partitioned as
R = [ R00  R01 ; R10  R11 ] .
where we are interested in estimating the unknown square matrix Ψ00 and, for an uncalibrated
array, Σ1 is unknown. Thus, the appropriate masking matrix M such that Ψ = M ⊙ Ψ is
M = [ 11^T  0 ; 0  I ] .
This is an EFA model and we can apply the corresponding algorithms for estimating A and Ψ.
Each R̂ will give us an estimate Ψ̂, and Ψ̂00 is simply the upper left sub-block of this matrix.
A necessary condition for identification is that the degree of freedom s > 0. Compared to FA,
we see that the p parameters of Σ are now replaced by the p0^2 + p1 (real) parameters in Ψ. Thus,
we require
s = p^2 + q^2 − 2pq − (p0^2 + p1)
to be larger than 0. With p = p0 + p1 and solving for the number of reference antennas p1 we find
p1 > q − (p0 − 1/2) + √( q + (p0 − 1/2)^2 ) .    (10.36)
Thus, if p0 is small, we need p1 > q + √q, and if p0 is large, we need p1 > q.
If R̂ is based on a short-term interval (a snapshot estimate of the covariance) and we have
multiple snapshots, then we can apply JEFA to estimate the varying interfering subspaces while
exploiting that the sky covariance is constant and common among all snapshots.
We show two examples with experimental data, taken from [19].
Example 10.1. To test the algorithm on actual data, we have made a short observation
of the strong astronomical source 3C48 contaminated by Afristar satellite signals.
The primary array consists of p0 = 3 of the 14 telescope dishes of the Westerbork
Synthesis Radio Telescope (WSRT), located in The Netherlands. As reference signals
we use p1 = 27 of 52 elements of a focal–plane array that is mounted on another dish
of WSRT which is set off-target (see Fig. 10.4) such that it has no dish gain towards
the astronomical source nor to the interferer.
We recorded 13.4 seconds of data with 80 MS/s, and processed these offline. Using
short-term windowed Fourier transforms, the data was first split into 8192 frequency
bins (from which we used 1537), and subsequently correlated and averaged over
M = 4048 samples to obtain N = 64 short-term covariance matrices.
Fig. 10.5(a) shows the autocorrelations and crosscorrelations on the primary antennas
and Fig. 10.5(b) shows the autocorrelation of 6 reference antennas. The interference
is clearly seen in the spectrum. The interference consists of a lower and higher
frequency part. The low frequency part is stronger on the reference antenna and
the higher part is stronger on the primary antenna. However, because of a relatively
large number of reference antennas the total INR, as we will see, is high enough for
the algorithms to be effective.
Because no calibration step has been performed we use a generalized likelihood ratio
test (GLRT) [20] to detect if each frequency bin is contaminated with RFI and then
we use EFA to estimate the noise powers and the signal spatial signature. The result
of whitening the spectrum with the estimated result of EFA is shown in Fig. 10.6(a).
The resulting auto- and crosscorrelation spectra after filtering are shown in Fig.
10.6(b). The autocorrelation spectra are almost flat, and close to 1 (the whitened
noise power). The cross-correlation spectra show that the spatial filtering with the
reference antenna has removed the RFI within the sensitivity of the telescope. Also
it shows the power of using EFA at this stage in the processing chain, as it is not
required for the array to be calibrated.
Example 10.2. In a second experiment, we use raw data from LOFAR station RS409
(100-200 MHz). Data from 46 (out of 48) x-polarization receiving elements are
sampled with a frequency of 200 MHz and correlated. Samples are then divided into
1024 subbands with the help of tapering and an FFT. From these samples we form
N = 4 covariance matrices with an integration time of 19 ms (M = 1862) for each
subband. No calibration was done on the resulting covariance matrices.
The LOFAR HBA has a hierarchy of antennas, where a single receiving element
output is the result of analog beamforming on 16 antennas (4 × 4) in a tile.
Figure 10.5. Observed spectrum from (a) the primary telescopes, (b) 6 of the reference antennas.
Figure 10.6. After EFA, the covariance matrices can be whitened: (a) spectrum of the primary antenna after whitening, (b) averaged normalized correlation coefficients after filtering (p = 30, p0 = 3, Nf = 1537, N = 64, M = 4048).
Figure 10.7. Received autocorrelation spectrum R00 (100–200 MHz).
During
the measurements the analog beamformers were tracking the strong astronomical
source Cygnus A.
The received spectrum is shown in Fig. 10.7. Above 174 MHz, the spectrum is
heavily contaminated by wideband Digital Audio Broadcast (DAB) transmissions.
We have used 6 of the 46 receiving elements as reference array for our filtering
techniques and the rest as primary array. Because we do not have dedicated reference
antennas and because the data is already beamformed the assumption that the source
is too weak at each short integration time (19 ms) is not completely valid. Also the
assumption that the sky sources are much weaker on the reference antennas is not
valid in this case because the reference array elements are also following Cygnus A.
Finally, we have the same exposure to the RFI on the secondary array as we have
on the primary so there is no additional RFI gain for the secondary array.
To illustrate the performance of the filtering technique we produce snapshot images
of the sky (i.e., images based on a single covariance matrix). For an uncontaminated
image, we have chosen subband 250 at 175.59 MHz, see Fig. 10.8(a), while for RFI-
contaminated data we take subband 247 at 175.88 MHz, see Fig. 10.8(b). These two
subbands have been chosen because they are close to each other (in frequency) and
we expect that the astronomical images for these bands would be similar. Subband
247 is heavily contaminated and has a 10 dB flux increase on the auto-correlations
and a 20 dB increase on the cross-correlations.
The repeated source visible in Fig. 10.8(a) is Cygnus A; the repetition is due to the
spatial aliasing which occurs at these frequencies (the tiles are separated by more
than half a wavelength). The contaminated image in Fig. 10.8(b) shows no trace of
Cygnus A; note the different amplitude scale which has been increased by a factor
100.
Fig. 10.9 shows the image after using EFA. The image is very similar to the clean
image in Fig. 10.8(a).
Before we can do any beamforming, we need to calibrate the array. Indeed, in the previous
chapters, we assumed we fully knew the array response function, and in many cases, we even
assumed omnidirectional antennas (i.e., the individual antennas have the same unit response in
all directions). Before we are in this situation, we need to estimate these responses. Generally,
this involves a single test source that we scan across the array, but this is not always practical
once the array is out of the factory and deployed in the field. A particular example is radio
astronomy, where the “antennas” are large dishes or beamformed stations, and the calibrator
sources are strong celestial objects. Obviously we have no control over them and cannot switch
them off, but on the other hand their positions and source powers are accurately known from
tables. In this section, it is shown how factor analysis can be used to solve the problem of
calibration.
The calibration problem does not only involve the antenna response functions, it also involves
the receiver noise present on each antenna. In previous chapters, we usually assumed the noise
was spatially white: independent and of equal power on each antenna. However, before calibra-
tion the receiver noise generally has different powers on each antenna. These also need to be
estimated.
So far we ignored the beam shape of the individual elements (antennas or dishes) of the array.
In fact, any antenna has its own directional response b(ζ), where ζ denotes a unit-length source
direction vector (see (2.6)). This function is called the primary beam. For simplicity, it is
generally assumed that the primary beam is equal for all elements in the array, although this is
also subject to calibration. With Q point sources, we will collect the resulting samples of the
primary beam into a vector b = [b(ζ 1 ), · · · , b(ζ Q )]T . These coefficients are seen as gains that
(squared) will multiply the source powers σq2 . The general shape of the primary beam b(ζ) is
known from electromagnetic modeling during the design of the antenna. If this is not sufficiently
accurate, then it has to be calibrated.
We also have direction-independent differences in gains and phases among the antennas, e.g.,
due to differences in the receiver chains of each element in the array. Initially these are also
unknown and have to be estimated. We thus have an unknown vector g (size P × 1) with
complex entries that each multiply the output signal of each antenna.
Also the noise powers of each element are unknown and generally unequal to each other. We
will still assume that the noise is independent from element to element. We can thus model the
Figure 10.8. (a) Clean subband 250, (b) contaminated subband 247.
Figure 10.9. Dirty subband 247 after filtering with EFA.
Here, the array response matrix A(θ) is a known function of the source direction vectors
{ζ 1 , · · · , ζ Q }, suitably parametrized by the vector θ (with typically two direction cosines per
source).
The modified data model that captures the unknown gain/phase/noise effects and replaces
(10.37) is then
R = [Γ A(θ) B] Σs [B^H A(θ)^H Γ^H] + Σn    (10.38)
where Γ = diag(g) is a diagonal with unknown receiver complex gains, and B = diag(b) contains
the samples of the primary beam (the directional response of each antenna). Usually, Γ and B
are considered to vary only slowly with time and frequency, so that we can combine multiple
covariance matrices Rm,k with the same Γ and B.
In some cases, the source directions are disturbed as well, e.g. due to atmospheric effects (or due
to ionospheric delays in radio astronomy). In first order, we can replace A(θ) by A(θ 0 ), where
θ 0 differs from θ due to the shift in apparent direction of each source. The modified data model
that captures the above effects is thus
known with sufficient accuracy, i.e., we assume that A and Σs are known. We can then write
(10.38) as
R = Γ A Σ A^H Γ^H + Σn    (10.41)
where Σ = B Σs B^H is a diagonal with apparent source powers. With B unknown, Σ is unknown,
but estimating Σ is precisely the problem of estimating source powers in given directions: a
problem we studied before. Thus, once we have estimated Σ and know Σs, we can easily
estimate the directional gains B. The problem thus reduces to estimating the diagonal matrices
Γ, Σ and Σn from a model of the form (10.41).
Single calibrator source For some cases, e.g., radio telescope arrays where the elements are
traditional telescope dishes, the field of view is quite narrow (degrees) and we may assume that
there is only a single calibrator source in the observation. Then Σ = σ^2 is a scalar and the
problem reduces to
R = g σ^2 g^H + Σn ,
and since g is unknown, we could even absorb the unknown σ in g (it is not separately identifi-
able). The structure of R is a rank-1 matrix g σ^2 g^H plus a diagonal Σn. This is recognized as a
“rank-1 factor analysis” model.
Figure 10.10. Gain magnitude and noise power estimates, as function of frequency, for an
observation of the astronomical source 3C48.
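A small sketch of this rank-1 factor analysis for gain calibration (the synthetic gains and noise powers are assumptions; the alternating updates are in the same spirit as the ALS of Sec. 10.2 and recover g only up to a common phase):

```python
import numpy as np

rng = np.random.default_rng(8)
P = 8

g = (1 + 0.3 * rng.standard_normal(P)) * np.exp(1j * rng.uniform(0, 2 * np.pi, P))  # gains
d = rng.uniform(0.5, 2.0, P)                      # unknown noise powers
R = np.outer(g, g.conj()) + np.diag(d)            # sigma^2 absorbed in g, as in the text

# Rank-1 factor analysis by simple alternating updates
d_est = np.zeros(P)
for _ in range(500):
    w, U = np.linalg.eigh(R - np.diag(d_est))
    g_est = U[:, -1] * np.sqrt(max(w[-1], 0.0))   # dominant eigenpair -> estimate of g
    d_est = np.real(np.diag(R - np.outer(g_est, g_est.conj())))

# g is identifiable only up to a common phase; remove it before comparing
g_est *= np.exp(-1j * np.angle(g_est[0] * np.conj(g[0])))
print(np.max(np.abs(g_est - g)), np.max(np.abs(d_est - d)))
```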
In general, there are more calibrator sources (Q) in the field of view, and we have to solve
(10.41). Using Factor Analysis, we can first solve for A′ and Σn in
R = A′ A′^H + Σn .
Then, since A′ is not unique, we can identify
A′ = Γ A Σ^{1/2} Q ,
where Q is an unknown unitary matrix. Equivalently, define R′ = A′ A′^H = R − Σn, then
R′ = Γ A Σ A^H Γ^H ,
where both Γ and Σ are diagonal and unknown. We may resort to an Alternating Least Squares
approach. If Γ is considered known, then we can correct R′ for it, so that we have precisely
the same problem as we considered before, (??), and we can solve for Σ using the techniques
discussed in Section 7.4. Alternatively, with Σ known, we can say we know a reference model
R0 = A Σ A^H, and the problem is to identify the element gains Γ = diag(g) from a model of the
form
R′ = Γ R0 Γ^H .
Estimating the general model In the more general case (10.40), viz.
R = (G ⊙ A) Σs (G ⊙ A)^H + Σn ,
where ⊙ denotes the pointwise (Hadamard) product, we have an unknown full matrix G. We
assume A and Σs known. Since A pointwise multiplies G and G is unknown, we might as well
omit A from the equations without loss of generality. For the same reason also Σs can be
omitted. This leads to a problem of the form
R = G G^H + Σn ,
where G : P × Q and Σn (diagonal) are unknown. This problem is recognized as a rank-Q factor
analysis problem. For reasonably small Q, as compared to the size P of R, the factor G can be
solved for, again using algorithms for covariance matching such as in [10].
It is important to note that G can be identified only up to a unitary factor V at the right:
G0 = GV would also be a solution. This factor makes the gains unidentifiable unless we
introduce more structure to the problem.
10.7 NOTES
Bibliography
[1] Derrick Norman Lawley and A.E. Maxwell, Factor analysis as a statistical method. 2nd.
ed., New York: Am. Elsevier Publ., 1971.
[2] K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis. Academic Press, 1979.
[3] C. Spearman, “The proof and measurement of association between two things,” The Amer-
ican Journal of Psychology, vol. 15, pp. 72–101, Jan 1904.
[4] Walter Ledermann, “On a problem concerning matrices with variable diagonal elements,”
Proceedings of the Royal Society of Edinburgh, vol. 60, pp. 1–17, 1 1940.
[6] Sik Yum Lee, “The Gauss-Newton algorithm for the Weighted Least Squares factor analy-
sis,” Journal of the Royal Statistical Society, vol. 27, June 1978.
[7] Peter J. Schreier, Statistical Signal Processing of Complex-Valued Data. Cambridge Uni-
versity Press, 2010.
[8] Are Hjørungnes, Complex-Valued Matrix Derivatives with Applications in Signal Processing
and Communications. Cambridge University Press, 2011.
[9] Steven M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, vol. I.
Prentice Hall, 1993.
[10] B. Ottersten, P. Stoica, and R. Roy, “Covariance matching estimation techniques for ar-
ray signal processing applications,” Digital Signal Processing, A Review Journal, vol. 8,
pp. 185–210, July 1998.
[11] P. Gill, W. Murray, and M.H. Wright, Practical optimization. London: Academic Press,
1981.
[12] A.M. Sardarabadi and A.J. van der Veen, “Complex factor analysis and extensions,” IEEE
Tr. Signal Processing, vol. 66, February 2018.
[13] Karl G. Jöreskog and Arthur S. Goldberger, “Factor analysis by generalized least squares,”
Psychometrika, vol. 37, pp. 243–260, Sep 1972.
[14] J.-H. Zhao, Philip Yu, and Qibao Jiang, “ML estimation for factor analysis: EM or non-
EM?,” Statistics and Computing, vol. 18, pp. 109–123, 2008. 10.1007/s11222-007-9042-y.
[15] A.-K. Seghouane, “An iterative projections algorithm for ML factor analysis,” in IEEE
Workshop on Machine Learning for Signal Processing, pp. 333–338, Oct. 2008.
[16] S.M. Kay, Fundamentals of Statistical Signal Processing. Volume II: Detection Theory.
Upper Saddle River, NJ: Prentice Hall PTR, 1998.
[17] J. Raza, A.-J. Boonstra, and A.-J. van der Veen, “Spatial filtering of RF interference in radio astronomy,” IEEE Signal Processing Letters, vol. 9, Mar. 2002.
[18] S. van der Tol and A. J. van der Veen, “Performance analysis of spatial filtering of RF
interference in radio astronomy,” IEEE Transactions on Signal Processing, vol. 53, pp. 896–
910, Mar. 2005.
[19] A. Mouri Sardarabadi, A.-J. van der Veen, and A.-J. Boonstra, “Spatial Filtering of RF
Interference in Radio Astronomy Using a Reference Antenna Array,” IEEE Trans. Signal
Process., vol. 64, pp. 432–447, Jan 2016.
[20] A. Leshem and A.-J. van der Veen, “Multichannel detection of Gaussian signals with un-
calibrated receivers,” IEEE Signal Processing Letters, vol. 8, no. 4, pp. 120–122, 2001.
[21] A. J. Boonstra and A. J. van der Veen, “Gain calibration methods for radio telescope
arrays,” IEEE Trans. Signal Processing, vol. 51, pp. 25–38, Jan. 2003.
[22] D. R. Fuhrmann, “Estimation of sensor gain and phase,” IEEE Trans. Signal Processing,
vol. 42, pp. 77–87, Jan. 1994.
[23] S. J. Wijnholds and A. J. Boonstra, “A multisource calibration method for phased ar-
ray telescopes,” in Fourth IEEE Workshop on Sensor Array and Multi-channel Processing
(SAM), (Waltham (Mass.), USA), July 2006.
[24] S. J. Wijnholds and A. J. van der Veen, “Multisource self-calibration for sensor arrays,”
IEEE Tr. Signal Processing, vol. 57, pp. 3512–3522, Sept. 2009.
[25] A.J. van der Veen, S.J. Wijnholds, and A.M. Sardarabadi, “Signal processing for radio
astronomy,” in Handbook of Signal Processing Systems, 3rd ed., Springer, November 2018.
ISBN 978-3-319-91734-4.
[26] Derrick N. Lawley, “The estimation of factor loadings by the method of maximum likelihood,” Proceedings of the Royal Society of Edinburgh, vol. 60, no. 1, pp. 64–82, 1940.
[28] David J. Bartholomew, Martin Knott, and Irini Moustaki, Latent Variable Models and
Factor Analysis: A Unified Approach. John Wiley and Sons, 2011.
[29] M. Viberg, P. Stoica, and B. Ottersten, “Array processing in correlated noise fields based on instrumental variables and subspace fitting,” IEEE Trans. Signal Process., vol. 43, pp. 1187–1199, Jan. 1995.
[30] V. Nagesha and S. M. Kay, “Maximum likelihood estimation for array processing in colored noise,” IEEE Trans. Signal Process., vol. 44, pp. 169–180, Feb. 1996.
[32] M. Wax, J. Sheinvald, and A. J. Weiss, “Detection and localization in colored noise via
generalized least squares,” IEEE Tr. Signal Process., vol. 44, pp. 1734–1743, July 1996.
[34] Donald Rubin and Dorothy Thayer, “EM algorithms for ML factor analysis,” Psychome-
trika, vol. 47, pp. 69–76, 1982. 10.1007/BF02293851.
[35] Alexander Shapiro, “Weighted minimum trace factor analysis,” Psychometrika, vol. 47,
no. 3, pp. 243–264, 1982.
[36] Alexander Shapiro, “Rank-reducibility of a symmetric matrix and sampling theory of min-
imum trace factor analysis,” Psychometrika, vol. 47, no. 2, pp. 187–199, 1982.
[37] James Saunderson, Venkat Chandrasekaran, Pablo A Parrilo, and Alan S Willsky, “Diagonal
and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting,” SIAM
Journal on Matrix Analysis and Applications, vol. 33, no. 4, pp. 1395–1416, 2012.
[38] E.J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” arXiv
preprint arXiv:0912.3599, 2009.
[39] Emmanuel J Candès and Benjamin Recht, “Exact matrix completion via convex optimiza-
tion,” Foundations of Computational mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[40] A.-J. Boonstra and A.-J. van der Veen, “Gain calibration methods for radio telescope
arrays,” IEEE Tr. Signal Processing, vol. 51, pp. 25–38, Jan. 2003.
[41] A-J. van der Veen, A. Leshem, and A-J. Boonstra, “Array signal processing for radio
astronomy,” Experimental Astronomy (EXPA), vol. 17, no. 1-3, pp. 231–249, 2004. ISSN
0922-6435.
[42] A-J. van der Veen, A. Leshem, and A-J. Boonstra, “Array signal processing for radio
astronomy,” in The Square Kilometre Array: An Engineering Perspective (P.J. Hall, ed.),
pp. 231–249, Dordrecht: Springer, 2005. ISBN 1-4020-3797-x. Reprinted from Experimental
Astronomy, 17(1-3),2004.
[43] S.J. Wijnholds and A.-J. van der Veen, “Multisource self-calibration for sensor arrays,”
Signal Processing, IEEE Transactions on, vol. 57, pp. 3512–3522, Sept 2009.
[44] S.J. Wijnholds, S. van der Tol, R. Nijboer, and A.-J. van der Veen, “Calibration challenges
for future radio telescopes,” IEEE Signal Processing Magazine, vol. 27, pp. 30–42, Jan 2010.
[45] A. Mouri Sardarabadi and A.-J. van der Veen, “Application of Krylov based methods in
calibration for radio astronomy,” in 2014 IEEE 8th Sensor Array and Multichannel Signal
Processing Workshop (SAM), pp. 153–156, June 2014.
[47] A. Mouri Sardarabadi and A.-J. van der Veen, “Subspace estimation using factor analysis,”
in 2012 IEEE 7th Sensor Array and Multichannel Signal Processing Workshop (SAM),
pp. 477 –480, June 2012.
Contents
11.1 Fourth-order Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.2 Data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.3 JADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.4 Application: ACMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.3 JADE