Numerical Methods for Computational Science and Engineering
Important links:
• Lecture Git repository: https://gitlab.math.ethz.ch/NumCSE/NumCSE.git
(Clone this repository to get access to most of the C++ codes in the lecture document and
homework problems. ➙ Git guide)
• Lecture recording: http://www.video.ethz.ch/lectures/d-math/2016/autumn/401-0663-00L.html
• Tablet notes: http://www.sam.math.ethz.ch/~grsam/HS16/NumCSE/NCSE16_Notes/
• Homework problems: https://people.math.ethz.ch/~grsam/HS16/NumCSE/NCSEProblems.pdf
Contents
0 Introduction
0.0.1 Focus of this course
0.0.2 Goals
0.0.3 To avoid misunderstandings
0.0.4 Reporting errors
0.0.5 Literature
0.1 Specific information
0.1.1 Assistants and exercise classes
0.1.2 Study center
0.1.3 Assignments
0.1.4 Information on examinations
0.2 Programming in C++11
0.2.1 Function Arguments and Overloading
0.2.2 Templates
0.2.3 Function Objects and Lambda Functions
0.2.4 Multiple Return Values
0.2.5 A Vector Class
0.3 Creating Plots with MathGL
0.3.1 MathGL Documentation (by J. Gacon)
0.3.2 MathGL Installation
0.3.3 Corresponding Plotting Functions of MATLAB and MathGL
0.3.4 The Figure Class
0.3.4.1 Introductory example
0.3.4.2 Figure Methods
9 Eigenvalues
9.1 Theory of eigenvalue problems
9.2 “Direct” Eigensolvers
9.3 Power Methods
9.3.1 Direct power method
9.3.2 Inverse Iteration [?, Sect. 7.6], [?, Sect. 5.3.2]
9.3.3 Preconditioned inverse iteration (PINVIT)
9.3.4 Subspace iterations
9.3.4.1 Orthogonalization
9.3.4.2 Ritz projection
9.4 Krylov Subspace Methods
Index
Symbols
Examples
Glossary
Chapter 0
Introduction
✄ on (efficient and stable) implementation in C++ based on the numerical linear algebra library Eigen (a Domain Specific Language embedded into C++)
• issues of high-performance computing (HPC, shared and distributed memory parallelisation, vectorization)
☞ 401-0686-10L High Performance Computing for Science and Engineering (HPCSE, Profs. M. Troyer
and P. Koumoutsakos)
263-2800-00L Design of Parallel and High-Performance Computing (Prof. T. Höfler)
However, note that these other courses partly rely on knowledge of elementary numerical methods, which
is covered in this course.
(0.0.2) Prerequisites
This course will take for granted basic knowledge of linear algebra, calculus, and programming, that you
should have acquired during your first year at ETH.
Fig. 1 (sketch): the topics of this course (eigenvalue problems, numerical integration of ODEs, linear systems of equations, least squares problems, interpolation, quadrature) resting on the foundations of analysis, linear algebra, and programming in C++.
They are vastly different in terms of ideas, design, analysis, and scope of application. They are the items in a toolbox, some only loosely related by the common purpose of being building blocks for codes for numerical simulation.
Despite the diverse nature of the individual topics covered in this course, some depend on others for
providing essential building blocks. The following directed graph tries to capture these relationships. The
arrows have to be read as “uses results or algorithms of”.
0. Introduction, 0. Introduction 9
NumCSE, AT’15, Prof. Ralf Hiptmair c SAM, ETH Zurich, 2015
[Dependency graph among the main topics: quadrature (∫ f(x) dx, Chapter 7), eigenvalues (Ax = λx, Chapter 9), Krylov methods (Chapter 10), least squares (‖Ax − b‖ → min, Chapter 3), function approximation (Chapter 6), non-linear least squares (‖F(x)‖ → min, Section 8.6).]
Any one-semester course “Numerical methods for CSE” will cover only selected chapters and sec-
tions of this document. Only topics addressed in class or in homework problems will be relevant
for the final exam!
I am a student of computer science. After the exam, may I safely forget everything I have learned in this
mandatory “numerical methods” course? No, because it is highly likely that other courses or projects
will rely on the contents of this course:
[Sketch: follow-up fields and the topics from this course they rely on, among them: singular value decomposition and least squares (computational statistics, machine learning); function approximation, numerical quadrature, numerical integration, interpolation (numerical methods for PDEs); least squares (computer graphics); eigensolvers and sparse linear systems (graph theoretic algorithms).]
Hardly anyone will need everything covered in this course, but most of you will need something.
0.0.2 Goals
These course materials are neither a textbook nor comprehensive lecture notes.
They are meant to be supplemented by explanations given in class.
✦ the lecture material is not designed to be self-contained, but is to be studied beside attending the
course or watching the course videos,
✦ this document is not meant for mere reading, but for working with,
✦ turn pages all the time and follow the numerous cross-references,
✦ study the relevant section of the course material when doing homework problems,
✦ study referenced literature to refresh prerequisite knowledge and for alternative presentation of the
material (from a different angle, maybe), but be careful about not getting confused or distracted by
information overload.
• understand another third when making a serious effort to solve the homework problems,
• hopefully understand the remaining third when studying for the examination after the end of
the course.
As the documents will always be in a state of flux, they will inevitably and invariably teem with small errors,
mainly typos and omissions.
Please report errors in the lecture material through the Course Wiki!
When reporting an error, please specify the section and the number of the paragraph, remark, equation,
etc. where it hides. You need not give a page number.
0.0.5 Literature
Parts of the following textbooks may be used as supplementary reading for this course. References to
relevant sections will be provided in the course material.
✦ [?] U. Ascher and C. Greif, A First Course in Numerical Methods, SIAM, Philadelphia, 2011.
✦ [?] W. Dahmen and A. Reusken, Numerik für Ingenieure und Naturwissenschaftler, Springer, Heidelberg, 2006.
Good reference for large parts of this course; provides a lot of simple examples and lucid explana-
tions, but also rigorous mathematical treatment.
(Target audience: undergraduate students in science and engineering)
Available for download at PDF
✦ [?] M. Hanke-Bourgeois, Grundlagen der Numerischen Mathematik und des Wissenschaftlichen Rechnens, Mathematische Leitfäden, B.G. Teubner, Stuttgart, 2002.
Gives detailed description and mathematical analysis of algorithms and relies on MATLAB. Profound
treatment of theory way beyond the scope of this course. (Target audience: undergraduates in
mathematics)
Classical introductory numerical analysis text with many examples and detailed discussion of algo-
rithms. (Target audience: undergraduates in mathematics and engineering)
Can be obtained from website.
Modern discussion of numerical methods with profound treatment of theoretical aspects (Target
audience: undergraduate students in mathematics).
✦ [?] W. Gander, M.J. Gander, and F. Kwok, Scientific Computing, Texts in Computational Science and Engineering, Springer, 2014.
Essential prerequisite for this course is a solid knowledge in linear algebra and calculus. Familiarity with
the topics covered in the first semester courses is taken for granted, see
✦ [?] K. Nipp and D. Stoffer, Lineare Algebra, vdf Hochschulverlag, Zürich, 5 ed., 2002.
✦ [?] M. Gutknecht, Lineare Algebra, lecture notes, SAM, ETH Zürich, 2009, available online.
✦ [?] M. Struwe, Analysis für Informatiker, lecture notes, ETH Zürich, 2009, available online.
Though the assistants' email addresses are provided above, their use should be restricted to cases of emergency:
In general refrain from sending email messages to the lecturer or the assistants. They will not
be answered!
Questions should be asked in class (in public or during the break in private), in the tutorials, or
in the study center hours.
0.1.3 Assignments
A steady and persistent effort spent on homework problems is essential for success in this course.
You should expect to spend 4-6 hours per week on trying to solve the homework problems. Since many
involve small coding projects, the time it will take an individual student to arrive at a solution is hard to
predict.
✦ The weekly assignments will be a few problems from the NCSE Problem Collection available online
as PDF. The particular problems to be solved will be communicated on Friday every week.
Please note that this problem collection is being compiled during this semester. Thus, make sure
that you obtain the most current version every week.
✦ Some or all of the problems of an assignment sheet will be discussed in the tutorial classes on
Monday 10 days after the problems have been assigned.
✦ A few problems on each sheet will be marked as core problems. Every participant of the course is
strongly advised to try and solve at least the core problems.
✦ If you want your tutor to examine your solution of the current problem sheet, please put it into the
plexiglass trays in front of HG G 53/54 by the Thursday after the publication. You should submit your
codes using the online submission interface. This is voluntary, but feedback on your performance
on homework problems can be important.
✦ Please clearly mark the homework problems that you want your tutor to inspect.
✦ You are encouraged to hand-in incomplete and wrong solutions, you can receive valuable feedback
even on incomplete attempts.
C++ codes for both the classroom and homework problems are made available through a git repository
also accessible through Gitlab (Link):
The Gitlab toplevel page gives a short introduction into the repository for the course and provides a link to
online sources of information about Git.
Download is possible via Git or as a zip archive. Which method you choose is up to you, but note that updating via Git is more convenient.
➣ Shell command to download the git repository:
> git clone https://gitlab.math.ethz.ch/NumCSE/NumCSE.git
Updating the repository to fetch upstream changes is then possible by executing > git pull inside the
NumCSE folder.
Note that by default participants of the course will have read access only. However, if you want to contribute corrections and enhancements of lecture or homework codes, you are invited to submit a merge request. Beforehand you have to inform your tutor so that a personal Gitlab account can be set up for you.
For instructions on how to compile assignments or lecture codes see the README file.
Dates:
The term exams are regarded as central elements and as such are graded on a pass/fail basis.
Admission to the main exam is conditional on passing at least one term exam
Only students who could not take part in one of the term exams for cogent reasons like illness (doc-
tor’s certificate required!) may take part in the make-up term exam. Please contact Daniele Casati
([email protected]) by email, if you think that you are eligible for the make-up term exam,
and attach all required documentation. You will be informed, whether you are admitted.
Only students who have failed both term exams can take part in the repetition term exam in spring next
year. This is their only chance to be admitted to the main exam in Summer 2017.
Thursday January 26, 2017, 9:00 - 12:00, HG G 1
✦ Dry-run for computer based examination:
TBA, registration via course website
✦ Subjects of examination:
• All topics, which have been addressed in class or in a homework problem (including the home-
work problems not labelled as “core problems”)
✦ Lecture documents will be available as PDF during the examination. The corresponding final version
of the lecture documents will be made available on TBA
✦ You may bring a summary of up to 10 pages A4 in your own handwriting. No printouts and copies
are allowed.
• Everybody who passed at least one of the term exams, the make-up term exam, or the repetition
term exam for last year’s course and wants to repeat the main exam, will be allowed to do so.
• Bonus points earned in term exams in last year’s course can be taken into account for this course’s
main exam.
• If you are going to repeat the main exam, but also want to earn a bonus through this year’s term
exams, please declare this intention before the mid-term exam.
C++11 is the current ANSI/ISO standard for the programming language C++. On the one hand, it offers a wealth of features and possibilities. On the other hand, this can be confusing and even prone to inconsistencies. A major cause of inconsistent design is the requirement of backward compatibility with the C programming language and the earlier standard C++98.
However, C++ has become the main language in computational science and engineering and high per-
formance computing. Therefore this course relies on C++ to discuss the implementation of numerical
methods.
• a collection of abstract data containers and basic algorithms provided by the Standard Template Library (STL).
Supplementary reading. A popular book for learning C++ that has been upgraded to include the features of the new C++11 standard is [?].
The book [?] gives a comprehensive presentation of the new features of C++11 compared to earlier
versions of C++.
There are plenty of online reference pages for C++, for instance http://en.cppreference.com
and http://www.cplusplus.com/.
• We use the command line build tool CMake, see web page.
• The compilers supporting all features of C++ needed for this course are Clang and GCC. Both are open source projects and free. CMake will automatically select a suitable compiler on your system (Linux or Mac OS X).
• A command line tool for debugging is lldb, see short introduction by Till Ehrengruber, student of
CSE@ETH.
The following sections highlight a few particular aspects of C++11 that may be important for code devel-
opment in this course.
Argument types are an integral part of a function declaration in C++. Hence the following functions are
different
int* f(int);             // use this in the case of a single numeric argument
double f(int*);          // use only if a pointer to an integer is given
void f(const MyClass &); // use when called for a MyClass object
and the compiler selects the function to be used depending on the type of the arguments following rather
sophisticated rules, refer to overload resolution rules. Complications arise, because implicit type conver-
sions have to be taken into account. In case of ambiguity a compile-time error will be triggered. Functions
cannot be distinguished by return type!
For member functions (methods) of classes an additional distinction can be introduced by the const spec-
ifier:
struct MyClass {
  double f(double);       // use for a mutable object of type MyClass
  double f(double) const; // use this version for a constant object
  ...
};
The second version of the method f is invoked for constant objects of type MyClass.
In C++ unary and binary operators like =, ==, +, -, *, /, +=, -=, *=, /=, %, &&, ||, etc. are regarded
as functions with a fixed number of arguments (one or two). For built-in numeric and logic types they are
defined already. They can be extended to any other type, for instance
MyClass operator+(const MyClass &, const MyClass &);
MyClass operator+(const MyClass &, double);
MyClass operator+(const MyClass &); // unary + !
The same selection rules as for function overloading apply. Of course, operators can also be introduced
as class member functions.
C++ gives complete freedom to overload operators. However, the semantics of the new operators should
be close to the customary use of the operator.
If the argument is declared to be passed by value, a temporary copy of the argument is created through the copy constructor or the move constructor of MyClass when f is invoked. The new temporary object is a local variable inside the function body. If the argument is passed by (non-const) reference instead, the argument is passed to the scope of the function and can be changed inside the function; no copies are created. If one wants to avoid the creation of temporary objects, which may be costly, but also wants to indicate that the argument will not be modified inside f, then the declaration should read
void f(const MyClass &x); // Argument x passed by constant reference.
In the pass-by-value case, if the object passed as the argument is a temporary whose scope is about to end, or if std::move() tags it as disposable, the move constructor of MyClass is invoked, which will usually do a shallow copy only. Refer to Code 0.2.22 for an example.
0.2.2 Templates
The template mechanism supports parameterization of definitions of classes and functions by type. An example of a function template is
template <typename ScalarType, typename VectorType>
VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y)
{ return (alpha*x+y); }
Depending on the concrete types of the arguments the compiler will instantiate particular versions of this function, for instance saxpy<float,double>, when alpha is of type float and both x and y are of type double. In this case the return type will be double (the deduced VectorType).
For the above example the compiler will be able to deduce the types ScalarType and VectorType
from the arguments. The programmer can also specify the types directly through the < >-syntax as in
saxpy<double, double>(a, x, y);
if an instantiation for all arguments of type double is desired. In case the arguments do not supply enough information about the type parameters, specifying (some of) them through < > is mandatory.
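As a quick, self-contained illustration (not taken from the lecture codes), the function template can be exercised with std::valarray<double>, which happens to support the required operations alpha*x+y:

  #include <iostream>
  #include <valarray>

  template <typename ScalarType, typename VectorType>
  VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y)
  { return (alpha*x+y); }

  int main() {
    std::valarray<double> x{1.0, 2.0, 3.0}, y{1.0, 1.0, 1.0};
    // automatic type deduction: ScalarType = double, VectorType = std::valarray<double>
    std::valarray<double> z = saxpy(2.0, x, y);
    // explicit instantiation through the < >-syntax
    std::valarray<double> w = saxpy<double, std::valarray<double>>(2.0, x, y);
    for (double v : z) std::cout << v << ' ';  // prints: 3 5 7
    std::cout << std::endl;
    return 0;
  }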
A class template defines a class depending on one or more type parameters, for instance
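(The class template used in the original listing is not reproduced in this extract; the following is a hypothetical sketch that is merely consistent with the usage shown below, all member names being assumptions.)

  template <typename T>
  class MyClsTempl {
  public:
    MyClsTempl(void) {}                  // default constructor
    MyClsTempl(const T &t) : data(t) {}  // constructor taking an object of type T
    // member function template: the type U is deduced from the call arguments
    template <typename U>
    T memfn(const T &t, const U &u) const { return data; }
  private:
    T data;                              // data member of the parameter type
  };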
Types MyClsTempl<T> for a concrete choice of T are instantiated when a corresponding object is de-
clared, for instance via
double x = 3.14;
MyClass myobj;                        // default construction of an object
MyClsTempl<double> tinstd;            // instantiation for T = double
MyClsTempl<MyClass> mytinst(myobj);   // instantiation for T = MyClass
MyClass ret = mytinst.memfn(myobj,x); // instantiation of member function for
                                      // U = double, automatic type deduction
The types spawned by a template for different parameter types have nothing to do with each other.
The parameter types for a template have to provide all type definitions, member functions, operators, and data needed for the instantiation (“compilation”) of the class or function template.
A function object is an object of a type that provides an overloaded “function call” operator (). Function objects can be implemented in two different ways:
(I) through special classes like the following, which realizes a function R → R
class MyFun {
public:
  ...
  double operator() (double x) const; // evaluation operator
  ...
};
The evaluation operator can take more than one argument and need not be declared const.
(II) through lambda functions, anonymous function objects defined in place with the syntax
[<capture list>] (<arguments>) -> <return type> { body; }
where <capture list> is a list of variables from the local scope to be passed to the lambda function; an & indicates passing by reference,
<arguments> is a comma separated list of function arguments complete with types,
<return type> is an optional return type; often the compiler will be able to deduce the return type from the definition of the function.
Function classes should be used when the function is needed in different places, whereas lambda functions are preferable for short functions intended for single use.
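The listing referred to in the next paragraph is not included in this extract; a minimal sketch of the same idea (summing up the entries of a vector through a lambda function that captures a local variable by reference) could read:

  #include <algorithm>
  #include <iostream>
  #include <vector>

  int main() {
    std::vector<double> v{1.2, 2.3, 3.4};
    double sum = 0.0;
    // the lambda captures the local variable 'sum' by reference ([&sum]) and
    // can therefore change its value in the surrounding scope
    std::for_each(v.begin(), v.end(), [&sum](double x) { sum += x; });
    std::cout << "sum = " << sum << std::endl; // prints: sum = 6.9
    return 0;
  }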
In this code the lambda function captures the local variable sum by reference, which enables the lambda
function to change its value in the surrounding scope.
The special class std::function provides types for general polymorphic function wrappers.
std::function<return type(arg types)>
void stdfunctiontest(void) {
  // Vector of function objects with a particular signature
  std::vector<std::function<double(double, double)>> fnvec;
  // Store reference to a regular function
  fnvec.push_back(binop);
  // Store a lambda function
  fnvec.push_back([](double x, double y) -> double { return y / x; });
  for (auto fn : fnvec) { std::cout << fn(3, 2) << std::endl; }
}
In C++ this is possible by using the tuple utility. For instance, the following function computes the minimal and maximal element of a vector and also returns its cumulative sum. It returns all these values.
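The function itself is not reproduced in this extract; a sketch of such a function (the name extcumsum and the exact return type are assumptions) might be:

  #include <algorithm>
  #include <numeric>
  #include <tuple>
  #include <vector>

  // returns minimum, maximum, and the vector of cumulative (partial) sums
  // of the entries of a vector, packed into a single tuple
  std::tuple<double, double, std::vector<double>>
  extcumsum(const std::vector<double> &v) {
    auto mm = std::minmax_element(v.begin(), v.end()); // iterators to min/max
    std::vector<double> cs(v.size());
    std::partial_sum(v.begin(), v.end(), cs.begin());  // v[0], v[0]+v[1], ...
    return std::tuple<double, double, std::vector<double>>(*mm.first, *mm.second, cs);
  }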
This code snippet shows how to extract the individual components of the tuple returned by the previous
function.
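That snippet is not part of this extract either; assuming the sketch of extcumsum given above, the extraction could look like this:

  #include <iostream>
  #include <tuple>
  #include <vector>

  void tupledemo(const std::vector<double> &v) {
    double minv, maxv;        // variables receiving the first two components
    std::vector<double> cs;   // receives the vector of partial sums
    // std::tie builds a tuple of references that is assigned the returned tuple
    std::tie(minv, maxv, cs) = extcumsum(v);
    std::cout << "min = " << minv << ", max = " << maxv << std::endl;
    // alternatively, single components can be read off with std::get
    auto t = extcumsum(v);
    std::cout << "total sum = " << std::get<2>(t).back() << std::endl;
  }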
Be careful: many temporary objects might be created! A demonstration of this hidden cost is given in
Exp. 0.2.39.
Since C++ is an object oriented programming language, datatypes defined by classes play a pivotal role in
every C++ program. Here, we demonstrate the main ingredients of a class definition and other important
facilities of C++ for the class MyVector meant for objects representing vectors from R n . The codes can
be found in ➺ GITLAB.
55   // Euclidean norm
56   double norm(void) const;
57   // Euclidean inner product
58   double operator*(const MyVector &) const;
59   // Output operator
60   friend std::ostream &
61   operator<<(std::ostream &, const MyVector &mv);
62
Note the use of a public static data member dbg in Line 63 that can be used to control debugging output
by setting MyVector::dbg = true or MyVector::dbg = false.
The class MyVector uses a C-style array and dynamic memory management with new and delete to
store the vector components. This is for demonstration purposes only and not recommended.
Arrays in C++
In C++ use the STL container std::vector<T> for storing data in contiguous memory locations.
C++11 code 0.2.17: Constructor for constant vector, also default constructor, see Line 28
MyVector::MyVector(std::size_t _n, double _a) : n(_n), data(nullptr) {
  if (dbg) cout << "{Constructor MyVector(" << _n << ") called" << '}' << endl;
  if (n > 0) { data = new double[n]; std::fill_n(data, n, _a); }
}
This constructor can also serve as default constructor (a constructor that can be invoked without any
argument), because defaults are supplied for all its arguments.
The following two constructors initialize a vector from sequential containers according to the conventions
of the STL.
C++11 code 0.2.18: Templated constructors copying vector entries from an STL container
1 template <typename Container>
2 MyVector::MyVector(const Container &v) : n(v.size()), data(nullptr) {
3   if (dbg) cout << "{MyVector(length " << n
4                 << ") constructed from container" << '}' << endl;
5   if (n > 0) {
6     double *tmp = (data = new double[n]);
7     for (auto i : v) *tmp++ = i; // foreach loop
8   }
9 }
Note the use of the new C++11 facility of a “foreach loop” iterating through a container in Line 7.
C++11 code 0.2.19: Constructor initializing vector from STL iterator range
template <typename Iterator>
MyVector::MyVector(Iterator first, Iterator last) : n(0), data(nullptr) {
  n = std::distance(first, last);
  if (dbg) cout << "{MyVector(length " << n
                << ") constructed from range" << '}' << endl;
  if (n > 0) {
    data = new double[n];
    std::copy(first, last, data);
  }
}
The copy constructor listed next relies on the STL algorithm std::copy to copy the elements of an
existing object into a newly created object. This takes n operations.
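The copy constructor itself is not contained in this extract; a sketch consistent with the description (a deep copy of all n entries via std::copy; the debug message is an assumption modeled on the traces below) would be:

  MyVector::MyVector(const MyVector &mv) : n(mv.n), data(nullptr) {
    if (dbg) cout << "{Copy construction of MyVector(length " << n << ")}" << endl;
    if (n > 0) {
      data = new double[n];
      std::copy(mv.data, mv.data + n, data); // deep copy, n operations
    }
  }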
An important new feature of C++11 is move semantics which helps avoid expensive copy operations. The
following implementation just performs a shallow copy of pointers and, thus, for large n is much cheaper
than a call to the copy constructor from Code 0.2.21. The source vector is left in an empty vector state.
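The move constructor is likewise not reproduced here; a typical implementation matching the description (shallow copy of the pointer, source left as an empty vector) might read:

  MyVector::MyVector(MyVector &&mv) : n(mv.n), data(mv.data) {
    if (dbg) cout << "{Move construction of MyVector(length " << n << ")}" << endl;
    mv.n = 0;          // leave the source object in an empty vector state
    mv.data = nullptr; // so that its destructor does not free the stolen data
  }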
The following code demonstrates the use of std::move() to mark a vector object as disposable and
allow the compiler the use of the move constructor. The code also uses left multiplication with a scalar,
see Code 0.2.35.
This code produces the following output. We observe that v1 is empty after its data have been “stolen” by
v2.
{MyVector(length 8) constructed from container}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Move construction of MyVector(length 8)}
v1 = [ ]
v2 = [2.4, 4.6, 6.8, 9, 11.2, 13.4, 15.6, 17.8]
v3 = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9]
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
We observe that the object v1 is reset after having been moved to v3.
! Use std::move only for special purposes like above and only if an object has a move constructor. Otherwise a ’move’ will trigger a plain copy operation. In particular, do not use std::move on objects at the end of their scope, e.g., within return statements.
The next operator effects copy assignment of an rvalue MyVector object to an lvalue MyVector. This
involves O(n) operations.
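The listing of this copy assignment operator is not included in this extract; a sketch of a typical implementation could be:

  MyVector &MyVector::operator=(const MyVector &mv) {
    if (this == &mv) return *this;   // nothing to do on self-assignment
    if (n != mv.n) {                 // reallocate only if the sizes differ
      delete[] data;
      n = mv.n;
      data = (n > 0) ? new double[n] : nullptr;
    }
    std::copy(mv.data, mv.data + n, data); // O(n) copy of the entries
    return *this;
  }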
C++11 code 0.2.27: Type conversion operator: copies contents of vector into STL vector
MyVector::operator std::vector<double>() const {
  if (dbg) cout << "{Conversion to std::vector, length = " << n << '}' << endl;
  return std::vector<double>(data, data + n);
}
The bracket operator [] can be used to fetch and set vector components. Note that index range checking
is performed; an exception is thrown for invalid indices. The following code also gives an example of
operator overloading as discussed in § 0.2.3.
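The corresponding listing is missing from this extract; a sketch of the pair of access operators (range checking via std::out_of_range from <stdexcept> is an assumption) might be:

  double &MyVector::operator[](std::size_t i) {
    if (i >= n) throw std::out_of_range("MyVector: index out of range");
    return data[i]; // returns a reference, hence usable as lvalue
  }

  double MyVector::operator[](std::size_t i) const {
    if (i >= n) throw std::out_of_range("MyVector: index out of range");
    return data[i];
  }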
Componentwise direct comparison of vectors. Can be dangerous in numerical codes, cf. Rem. 1.5.36.
The transform method applies a function to every vector component and overwrites it with the value
returned by the function. The function is passed as an object of a type providing a ()-operator that accepts
a single argument convertible to double and returns a value convertible to double.
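The transform member function is not reproduced in this extract; a sketch (to be placed inside the class definition of MyVector) could be:

  template <typename Functor>
  MyVector &transform(Functor &&f) {
    // apply f to every component and overwrite it with the returned value
    for (std::size_t i = 0; i < n; ++i) data[i] = f(data[i]);
    return *this;
  }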
The following code demonstrates the use of the transform method in combination with
1. a function object of the following type
The output is
8 operations, mv transformed = [3.2, 4.3, 5.4, 6.5, 7.6, 8.7, 9.8, 10.9]
8 operations, mv transformed = [5.2, 6.3, 7.4, 8.5, 9.6, 10.7, 11.8, 12.9]
Final vector = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9]
Operator overloading provides the “natural” vector operations in R n both in place and with a new vector
created for the result.
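The corresponding listings are not included in this extract; a sketch of the typical in-place/out-of-place pair for vector addition (error handling via std::length_error is an assumption) might read:

  MyVector &MyVector::operator+=(const MyVector &mv) {
    if (n != mv.n) throw std::length_error("MyVector: size mismatch in +=");
    for (std::size_t i = 0; i < n; ++i) data[i] += mv.data[i]; // in place
    return *this;
  }

  MyVector MyVector::operator+(const MyVector &mv) const {
    MyVector tmp(*this); // new vector created for the result
    tmp += mv;
    return tmp;
  }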
C++11 code 0.2.35: Non-member function for left multiplication with a scalar
MyVector operator*(double alpha, const MyVector &mv) {
  if (MyVector::dbg) cout << "{operator a*, MyVector of length "
                          << mv.n << '}' << endl;
  MyVector tmp(mv); tmp *= alpha;
  return (tmp);
}
Adopting the notation of some linear algebra texts, the operator * has been chosen to designate the Euclidean inner product of two vectors of equal length:
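The member function is declared in the header excerpt above (double operator*(const MyVector &) const); a sketch of its implementation (the error handling is an assumption):

  double MyVector::operator*(const MyVector &mv) const {
    if (n != mv.n) throw std::length_error("MyVector: size mismatch in dot product");
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += data[i] * mv.data[i]; // sum of products
    return s;
  }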
At least for debugging purposes every reasonably complex class should be equipped with output function-
ality.
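The output operator itself is not part of this extract; a sketch matching the bracketed output format seen in the traces above (the exact formatting is an assumption) is:

  std::ostream &operator<<(std::ostream &o, const MyVector &mv) {
    o << "[ ";
    for (std::size_t i = 0; i < mv.n; ++i)
      o << mv.data[i] << ((i + 1 < mv.n) ? ", " : " ");
    return o << "]";
  }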
The following code highlights the use of operator overloading to obtain readable and compact expressions
for vector arithmetic.
We run the code and trace calls. This is printed to the console:
{MyVector(length 8) constructed from container}
{MyVector(length 8) constructed from container}
{dot *, MyVector of length 8}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator -=, MyVector of length 8}
{norm: MyVector of length 8}
{operator /, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
Several temporary objects are created and destroyed and quite a few copy operations take place. The situation would be even worse without move semantics: if we had not supplied a move constructor, a few more copy operations would have been triggered. Even worse, the frequent copying of data runs a high risk of cache misses. This is certainly not an efficient way to do elementary vector operations, though it looks elegant at first glance.
Gram-Schmidt orthonormalization has been taught in linear algebra and its theory will be revisited in
§ 1.5.1. Here we use this simple algorithm from linear algebra to demonstrate the use of the vector class
MyVector defined in Code 0.2.13.
The templated function gramschmidt takes a sequence of vectors stored in a std::vector object. The actual vector type is passed as a template parameter. It has to supply length and norm member functions as well as the in-place arithmetic operations -=, /=, and assignment. Note the use of the highlighted methods of the std::vector class.
template <class Vec>
std::vector<Vec> gramschmidt(const std::vector<Vec> &A, double eps = 1E-14) {
  const int k = A.size();      // no. of vectors to orthogonalize
  const int n = A[0].length(); // length of vectors
  cout << "gramschmidt orthogonalization for " << k << ' ' << n << "-vectors" << endl;
  std::vector<Vec> Q({A[0] / A[0].norm()}); // output vectors
  for (int j = 1; (j < k) && (j < n); ++j) {
    Q.push_back(A[j]);
    for (int l = 0; l < j; ++l) Q.back() -= (A[j] * Q[l]) * Q[l];
    if (Q.back().norm() < eps * A[j].norm()) { // premature termination?
      Q.pop_back(); break;
    }
    Q.back() /= Q.back().norm(); // normalization
  }
  return (Q); // return at end of local scope
}
This driver program calls a function that initializes a sequence of vectors and then orthonormalizes them
by means of the Gram-Schmidt algorithm. Eventually orthonormality of the computed vectors is tested.
Please pay attention to
C++11 code 0.2.44: Initialization of a set of vectors through a functor with two arguments
template <typename Functor>
std::vector<myvec::MyVector>
initvectors(std::size_t n, std::size_t k, Functor &&f) {
  std::vector<MyVector> A{};
  for (int j = 0; j < k; ++j) {
    A.push_back(MyVector(n));
    for (int i = 0; i < n; ++i)
      (A.back())[i] = f(i, j);
  }
  return (A);
}
MathGL is a huge open-source plotting library for scientific graphics. It can be used with many programming languages, in particular also with C++. Mainly we will use the Figure library (implemented by J. Gacon, student of CSE@ETH) introduced below (Section 0.3.4).
However, for some special plots using MathGL directly can be necessary. A full documentation can be found at http://mathgl.sourceforge.net/doc_en/index.html.
First of all note that the MathGL plot commands do not take std::vectors or Eigen vectors as arguments but only mglData.
NOTE: The Figure environment takes care of all this conversion and formatting!
For Eigen::RowVectorXd we must first rearrange the vector, as the data() method returns a pointer to the column-major data.
If you're using Linux the easiest way to install MathGL is via the command line: apt-get install mathgl (Ubuntu/Debian) or dnf install mathgl (Fedora).
MathGL offers a subset of MATLAB's large array of plotting functions. The following tables list corresponding commands/methods.

Plotting in 1-D (MATLAB command → MathGL equivalent):
• axis([0,5,-2,2]) → gr.Ranges(0,5,-2,2). Default in MATLAB: autofit (axis auto); default in MathGL: x=-1:1, y=-1:1. Workaround: gr.Ranges(x.Minimal(), x.Maximal(), y.Minimal(), y.Maximal())
• axis([0,5,-inf,inf]) → gr.Range('x',0,5)
• axis([-inf,inf,-2,2]) → gr.Range('y',-2,2)
• xlabel('x-axis') → gr.Label('x', "x-axis")
• ylabel('y-axis') → gr.Label('y', "y-axis")
• legend('sin(x)', 'x^2') → gr.AddLegend("sin(x)","b"); gr.AddLegend("\\x^2", "g"); gr.Legend()
• legend('exp(x)') → gr.AddLegend("exp(x)","b")
• legend('boxoff') → gr.Legend(1,1,"")
• legend('x','Location','northwest') → gr.AddLegend("x","b"); gr.Legend(0,1)
• legend('cos(x)','Orientation','horizontal') → gr.AddLegend("cos(x)","b"); gr.Legend("#-")
Legend alignment in MathGL is given by a position pair from (0,1) (0.5,1) (1,1) / (0,0.5) (0.5,0.5) (1,0.5) / (0,0) (0.5,0) (1,0); values larger than 1 will give a position outside of the graph. Default is (1,1).
• plot(y) → gr.Plot(y)
• plot(t,y) → gr.Plot(t,y)
• plot(t0,y0,t1,y1) → gr.Plot(t0,y0); gr.Plot(t1,y1)
• plot(t,y,'b+') → gr.Plot(t,y,"b+")
• print('myfig','-depsc') → gr.WriteEPS("myfig.eps")
• print('myfig','-dpng') → gr.WritePNG("myfig.png") (compile with flag -lpng)
• title('Plot title') → gr.Title("Plot title") (title high above plot), or gr.Subplot(1,1,0,"<_") followed by gr.Title("Plot title") (title directly above plot)
Plotting in 2-D (MATLAB command → MathGL equivalent):
• colorbar → gr.Colorbar()
• mesh(Z) → gr.Mesh(Z)
• mesh(X,Y,Z) → gr.Mesh(X,Y,Z)
• surface(Z) → gr.Surf(Z)
• surface(X,Y,Z) → gr.Surf(X,Y,Z)
• pcolor(Z) → gr.Tile(Z)
• pcolor(X,Y,Z) → gr.Tile(X,Y,Z)
• plot3(X,Y,Z) → gr.Plot(X,Y,Z)
Additionally, you have to add gr.Rotate(50,60) before the plot command for MathGL to create a 3-D box, otherwise the result is 2-D.
The Figure library is an interface to MathGL. By taking care of formatting and layout it allows a very simple, fast and easy use of the powerful plotting library.
This library depends on MathGL (and optionally on Eigen), so the installation requires a working version of these dependencies.
This short example code shows how the Figure class can be used.

int main() {
  std::vector<double> x(10), y(10);
  for (int i = 0; i < 10; ++i) {
    x[i] = i; y[i] = std::exp(-0.2 * i) * std::cos(i);
  }
  Eigen::VectorXd u = Eigen::VectorXd::LinSpaced(500, 0, 9),
                  v = (u.array().cos() * (-0.2 * u).array().exp()).matrix();
  mgl::Figure fig;
  fig.plot(x, y, " +r").label("Sample Data");
  fig.plot(u, v, "b").label("Function");
  fig.legend();
  fig.save("plot.eps");
  return 0;
}
Definition:
void grid(const bool &on = true,
          const std::string &gridType = "-",
          const std::string &gridCol = "h")
Restrictions: None.
Examples:
mgl::Figure fig;
fig.plot(x, y);
fig.grid();               // set grid
fig.save("plot.eps");

mgl::Figure fig;
fig.plot(x, y);
fig.grid(true, "!", "h"); // grey (-> h) fine (-> !) mesh
fig.save("plot.eps");
Definition:
void xlabel(const std::string &label,
            const double &pos = 0)
Restrictions: None.
Examples:
mgl::Figure fig;
fig.plot(x, y, "g+"); // 'g+' equals MATLAB '+-g'
fig.xlabel("Linear x axis");
fig.save("plot.eps");

mgl::Figure fig;
fig.xlabel("Logarithmic x axis"); // no restrictions on call order
fig.setlog(true, true);
fig.plot(x, y, "g+");
fig.save("plot.eps");
Definition:
void ylabel(const std::string &label,
            const double &pos = 0)
Restrictions: None.
This method adds a legend to a plot. Legend entries have to be defined by the label method given after
the plot command.
Definition:
void legend(const double &xPos = 1,
            const double &yPos = 1)
Restrictions: None.
Examples:
mgl::Figure fig;
fig.plot(x0, y0).label("My Function");
fig.legend();          // 'activate' legend
fig.save("plot");

mgl::Figure fig;
fig.plot(x0, y0).label("My Function");
fig.legend(0.5, 0.25); // set position to (0.5, 0.25)
fig.save("plot");

mgl::Figure fig;
fig.plot(x0, y0).label("My Function");
fig.save("plot"); // legend won't appear as legend() hasn't been called
Restrictions: All plots will use the latest setlog options or default if none have been set.
Examples:
mgl::Figure fig;
fig.setlog(true, false); // -> semilogx
fig.plot(x0, y0);
fig.setlog(false, true); // -> semilogy
fig.plot(x1, y1);
fig.setlog(true, true);  // -> loglog
fig.plot(x2, y2);
fig.save("plot.eps");    // ATTENTION: all plots will use loglog scale

mgl::Figure fig;
fig.plot(x, y);
fig.save("plot.eps");    // -> default (= linear) scaling
Definition:
template <typename yVector>
void plot(const yVector &y,
          const std::string &style = "")
Restrictions: xVector and yVector must have a size() method, which returns the size of the vec-
tor and a data() method, which returns a pointer to the first element in the vector.
Furthermore x and y must have same length.
Examples:
mgl::Figure fig;
fig.plot(x, y, "g;"); // green and dashed linestyle
fig.save("data.eps");

mgl::Figure fig;
fig.plot(x, y);       // OK - style is optional
fig.save("data.eps");

mgl::Figure fig;
Definition:
template <typename xVector, typename yVector, typename zVector>
void plot3(const xVector &x,
           const yVector &y,
           const zVector &z,
           const std::string &style = "")
Examples:
mgl::Figure fig;
fig.plot3(x, y, z);
fig.save("trajectories.eps");
Definition:
void fplot(const std::string &function,
           const std::string &style = "")
Restrictions: None.
Examples:
mgl::Figure fig;
fig.fplot("(3*x^2 - 4.5/x)*exp(-x/1.3)");
fig.fplot("5*sin(5*x)*exp(-x)", "r").label("5sin(5x)*e^{-x}");
fig.ranges(0.5, 5, -5, 5); // be sure to set ranges for fplot!
fig.save("plot.eps");

mgl::Figure fig;
fig.plot(x, y, "b").label("Benchmark");
fig.fplot("x^2", "k;").label("O(x^2)");
// here we don't set the ranges as it uses the range given by the
// x,y data and we use fplot to draw a reference line O(x^2)
fig.save("runtimes.eps");
Definition:
void ranges(const double &xMin,
            const double &xMax,
            const double &yMin,
            const double &yMax)
Restrictions: xMin < xMax, yMin < yMax and ranges must be > 0 for axis in logarithmic scale.
Examples:
mgl::Figure fig;
fig.ranges(-1, 1, -1, 1);
fig.plot(x, y, "b");

mgl::Figure fig;
fig.plot(x, y, "b");
fig.ranges(0, 2.3, 4, 5); // ranges can be called before or after 'plot'

mgl::Figure fig;
fig.ranges(-1, 1, 0, 5);
fig.setlog(true, true);   // will run but MathGL will throw a warning
fig.plot(x, y, "b");
Saves the graphics currently stored in a figure object to file. The default format is EPS.
Definition:
void save(const std::string &file)
Examples:
mgl::Figure fig;
fig.save("plot.eps"); // OK

mgl::Figure fig;
fig.save("plot");     // OK - will be saved as plot.eps

mgl::Figure fig;
fig.save("plot.png"); // OK - but needs -lpng flag!
Definition:
template <typename Matrix>
MglPlot &spy(const Matrix &A, const std::string &style = "b");
Restrictions: None.
Examples:
mgl::Figure fig;
fig.spy(A); // e.g. A = Eigen::MatrixXd or Eigen::SparseMatrix<double>
Restrictions: The vectors x and y must have the same dimensions and the matrix T must be of dimension N × 3, with N being the number of triangles.
Examples:
mgl::Figure fig;
fig.triplot(T, x, y, "b?"); // '?' enumerates all vertices of the mesh

mgl::Figure fig;
fig.triplot(T, x, y, "bF"); // 'F' will yield a solid background
fig.triplot(T, x, y, "k");  // draw black mesh on top
Definition:
void title(const std::string &text)
Restrictions: None.
Line colors (upper-case letters will give a darker version of the lower-case version):
blue b, green g, red r, cyan c, magenta m, yellow y, gray h, green-blue l, sky-blue n, orange q, green-yellow e, blue-violet u, purple p.
Line styles:
none (empty), solid -, dashed ;, small dashed =, long dashed |, dotted :, dash-dotted j, small dash-dotted i. "None" is used as follows: "r*" gives red stars without any lines.
Line markers (symbol: style character):
plus +, circle o, diamond d, dot ., upward triangle ^, downward triangle v, left triangle <, right triangle >, circled dot #., boxed plus #+, boxed cross #x.
This chapter heavily relies on concepts and techniques from linear algebra as taught in the 1st semester
introductory course. Knowledge of the following topics from linear algebra will be taken for granted and
they should be refreshed in case of gaps:
• Operations involving matrices and vectors [?, Ch. 2]
• Computations with block-structured matrices
• Linear systems of equations: existence and uniqueness of solutions [?, Sects. 1.2, 3.3]
• Gaussian elimination [?, Ch. 2]
• LU-decomposition and its connection with Gaussian elimination [?, Sect. 2.4]
The lowest level of real arithmetic available on computers consists of the elementary operations “+”, “−”, “∗”, “\”, “^”, usually implemented in hardware. The next level comprises computations on finite arrays of real numbers, the elementary linear algebra operations (BLAS). On top of them we build complex algorithms involving iterations and approximations.
Elementary operations in R
Hardly anyone will contemplate implementing elementary operations on binary data formats themselves; similarly, well tested and optimised code libraries should be used for all elementary linear algebra operations in simulation codes. This chapter will introduce you to such libraries and how to use them smartly.
Contents
1.1 Fundamentals
1.1.1 Notations
1.1.2 Classes of matrices
1.1 Fundamentals
1.1.1 Notations
The notations in this course try to adhere to established conventions. Since these may not be universal,
idiosyncrasies cannot be avoided completely. Notations in textbooks may be different, beware!
In this course, K will designate either R (real numbers) or C (complex numbers); complex arithmetic [?,
Sect. 2.5] plays a crucial role in many applications, for instance in signal processing.
K^n =ˆ vector space of column vectors with n components in K.
Unless stated otherwise, in mathematical formulas vector components are indexed from 1!
✎ two notations: x = [x_1 ... x_n]^T → x_i, i = 1, ..., n, and x ∈ K^n → (x)_i, i = 1, ..., n
✦ Selecting sub-vectors:
✎ notation: x = [x_1 ... x_n]^T ➣ (x)_{k:l} = (x_k, ..., x_l)^T, 1 ≤ k ≤ l ≤ n
✦ j-th unit vector: e_j = [0, ..., 1, ..., 0]^T, (e_j)_i = δ_ij, i, j = 1, ..., n.
✎ notation: Kronecker symbol δ_ij := 1, if i = j, δ_ij := 0, if i ≠ j.
The colon (:) range notation is inspired by MATLAB's matrix addressing conventions, see Section 1.2.1. (A)_{k:l,r:s} is a matrix of size (l − k + 1) × (s − r + 1).
✦ Transposed matrix:
A^T = [ a_11 ... a_1m ; ... ; a_n1 ... a_nm ]^T := [ a_11 ... a_n1 ; ... ; a_1m ... a_nm ] ∈ K^{m,n}
Most matrices occurring in mathematical modelling have a special structure. This section presents a few
of these. More will come up throughout the remainder of this chapter; see also [?, Sect. 4.3].
The creation of special matrices can usually be done by special commands or functions in the various
languages or libraries dedicated to numerical linear algebra, see § 1.2.5, § 1.2.13.
A little terminology to quickly refer to matrices whose non-zero entries occupy special locations:
Definition 1.1.8. Symmetric positive definite (s.p.d.) matrices → [?, Def. 3.31], [?, Def. 1.22]
M ∈ K^{n,n}, n ∈ N, is symmetric (Hermitian) positive definite (s.p.d.), if
M = M^H and ∀x ∈ K^n: x^H M x > 0 ⇔ x ≠ 0.
Lemma 1.1.9. Necessary conditions for s.p.d. → [?, Satz 3.33], [?, Prop. 1.18]
To compute the minimum of a C^2-function iteratively by means of Newton's method (→ Sect. 8.4), a linear system of equations with the s.p.d. Hessian as system matrix has to be solved in each step.
The solution of many equations in science and engineering boils down to finding the minimum of some (energy, entropy, etc.) function, which accounts for the prominent role of s.p.d. linear systems in applications.
• We consider two matrices A, B ∈ R n,m , both with at most N ∈ N non-zero entries. What is the
maximal number of non-zero entries of A + B?
Whenever algorithms involve matrices and vectors (in the sense of linear algebra) it is advisable to rely on
suitable code libraries or numerical programming environments.
1.2.1 MATLAB
Many textbooks, for instance [?] and [?], rely on MATLAB to demonstrate the actual implementation of numerical algorithms. So did earlier versions of this course. The current version has dumped MATLAB and, hence, this section can be skipped safely.
In its basic form MATLAB is an interpreted scripting language without strict type-binding. This, together with its uniform IDE across many platforms, makes it a very popular tool for rapid prototyping and testing in CSE.
• MATLAB documentation accessible through the Help menu or through this link,
• MATLAB's help facility through the commands help <function> or doc <function>,
• A concise MATLAB primer, one of many available online, see also here
➣ In MATLAB vectors are represented as n × 1-matrices (column vectors) or 1 × n-matrices (row vectors).
Note: The treatment of vectors as special matrices is consistent with the basic operations from matrix
calculus.
☞ v = size(A) yields a row vector v of length 2 with v(1) containing the number of rows and v(2) containing the number of columns of the matrix A.
☞ numel(A) returns the total number of entries of A; if A is a (row or column) vector, we get the length
of A.
Access (rvalue & lvalue) to components of a vector and entries of a matrix in MATLAB is possible through the ()-operator:
☞ r = v(i): retrieve i-th entry of vector v. i must be an integer and smaller or equal numel(v).
☞ r = A(i,j): get matrix entry (A)i,j for two (valid) integer indices i and j.
☞ r = A(end-1,end-2): get matrix entry (A)n−1,m−2 of an n × m-matrix A.
! In case the matrix A is too small to contain an entry (A)i,j , write access to A(i,j) will automatically
trigger a dynamic adjustment of the matrix size to hold the accessed entry. The other new entries are
filled with zeros.
% Caution: matrices are dynamically expanded when
% out of range entries are accessed
M = [1,2,3;4,5,6]; M(4,6) = 1.0; M,

Output:
M =
  1 2 3 0 0 0
  4 5 6 0 0 0
  0 0 0 0 0 0
  0 0 0 0 0 1
For any two (row or column) vectors I, J of positive integers A(I,J) selects the submatrix
[(A)_{i,j}]_{i∈I, j∈J} ∈ K^{♯I,♯J}.
A(I,J) can be used as both r-value and l-value; in the former case the maximal components of I and J have to be smaller than or equal to the corresponding matrix dimensions, lest MATLAB issue the error message Index exceeds matrix dimensions. In the latter case, the size of the matrix is grown, if needed, see § 1.2.2.
Inside square brackets [ ] the following two matrix construction operators can be used:
• ,-operator =
ˆ adding another matrix to the right (horizontal concatenation)
• ;-operator =
ˆ adding another matrix at the bottom (vertical concatenation)
(The ,-operator binds more strongly than the ;-operator!)
! Matrices joined by the ,-operator must have the same number of rows.
Matrices concatenated vertically must have the same number of columns
✄ Filling a small 3 × 2 matrix with rows [1 2], [3 4], [5 6]:
A = [1,2;3,4;5,6];
or, column by column,
A = [[1;3;5],[2;4;6]];
✄ Initialization of vectors in MATLAB:
column vectors x = [1;2;3];
row vectors y = [1,2,3];
✄ Building a matrix from blocks:

% MATLAB script demonstrating the construction of a matrix from blocks
A = [1,2;3,4]; B = [5,6;7,8];
C = [A,B;-B,A], % use concatenation

Output:
C =
   1  2  5  6
   3  4  7  8
  -5 -6  1  2
  -7 -8  3  4
In MATLAB v = (a:s:b), where a, b, s are real numbers, creates a row vector as initialised by the following code:
i f ((b >= a) && (s > 0))
v = [a]; w h i l e (v( end )+s <= b), v = [v,v( end )+s]; end
Examples:
>> v = (3:-0.5:-0.3)
v = 3.0000 2.5000 2.0000 1.5000 1.0000 0.5000 0
>> v = (1:2.5:-13)
v = Empty matrix: 1-by-0
In general we could also pass a matrix as “loop index vector”. In this case the loop variable will run through the columns of the matrix:
% MATLAB loop over the columns of a matrix
M = [1,2,3;4,5,6];
for i = M; i, end

Output: i takes the values [1;4], [2;5], [3;6], i.e. the columns of M.
✦ A' =ˆ Hermitian transpose of a matrix A; transposing without complex conjugation is done by transpose(A).
✦ triu(A) and tril(A) return the upper and lower triangular parts of a matrix A as r-value (copy): if A ∈ K^{m,n}, then (triu(A))_{i,j} = (A)_{i,j} if i ≤ j, and 0 else; (tril(A))_{i,j} = (A)_{i,j} if i ≥ j, and 0 else.
✦ diag(A) for a matrix A ∈ K^{m,n}, min{m,n} ≥ 2, returns the column vector [(A)_{i,i}]_{i=1,...,min{m,n}} ∈ K^{min{m,n}}.
1.2.2 Python
Python is a widely used general-purpose and open source programming language. Together with packages like NumPy and Matplotlib it delivers functionality similar to MATLAB for free. For interactive computing IPython can be used. All those packages belong to the SciPy ecosystem.
Python features good documentation and several scientific distributions are available (e.g. Anaconda, Enthought) which contain the most important packages. On most Linux distributions the SciPy ecosystem is also available in the software repository, as well as many other packages including for example the Spyder IDE delivered with Anaconda.
A good introductory tutorial to numerical Python are the SciPy lectures. The full documentation of NumPy and SciPy can be found here. For former MATLAB users there is also a guide. The scripts in these lecture notes follow the official Python style guide.
Note that in Python we have to import the numerical packages explicitly before use. This is normally done at the beginning of the file with lines like import numpy as np and from matplotlib import pyplot as plt. Those import statements are often skipped in these lecture notes to focus on the actual computations. But you can always assume the import statements as given here, e.g. np.ravel(A) is a call to a NumPy function and plt.loglog(x, y) is a call to a Matplotlib pyplot function.
Python is not used in the current version of the lecture. Nevertheless a few Python codes are supplied in order to convey similarities and differences to implementations in MATLAB and C++.
The basic numeric data type in Python is NumPy's n-dimensional array. Vectors are normally implemented as 1D arrays and no distinction is made between row and column vectors. Matrices are represented as 2D arrays.
There are many possibilities listed in the documentation for how to create, index and manipulate arrays.
An important difference to MATLAB is that all arithmetic operations are normally performed element-wise, e.g. A * B is not the matrix-matrix product but element-wise multiplication (in MATLAB: A.*B). Also A * v does a broadcasted element-wise product. For the matrix product one has to use np.dot(A, B) or A.dot(B) explicitly.
1.2.3 Eigen
Currently, the most widely used programming language for the development of new simulation software in scientific and industrial high-performance computing is C++. In this course we are going to use and discuss Eigen as an example of a C++ library for numerical linear algebra (an “embedded” domain specific language: DSL).
Eigen is a header-only C++ template library designed to enable easy, natural and efficient numerical linear algebra: it provides data structures and a wide range of operations for matrices and vectors, see below. Eigen also implements many more fundamental algorithms (see the documentation page or the discussion below).
Eigen relies on expression templates to allow the efficient evaluation of complex expressions involving matrices and vectors. Refer to the example given in the Eigen documentation for details.
Here Scalar is the underlying scalar type of the matrix entries, which must support the usual operations
’+’, ’-’, ’*’, ’/’ and the compound assignments ’+=’, ’*=’, etc. Usually the scalar type will be either double, float, or complex<>.
The cardinal template arguments RowsAtCompileTime and ColsAtCompileTime can pass a fixed
size of the matrix, if it is known at compile time. There is a specialization selected by the template argument
Eigen::Dynamic supporting variable size “dynamic” matrices.
Note that in Line 23 we could have relied on automatic type deduction via auto vectprod = ....
However, often it is safer to forgo this option and specify the type directly.
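For concreteness, the following minimal sketch (not one of the lecture codes; variable names are chosen purely for illustration) shows how these template parameters translate into fixed-size and dynamic-size matrix declarations:

#include <Eigen/Dense>
#include <complex>

int main() {
  // Fixed-size 3x3 matrix of doubles: dimensions known at compile time
  Eigen::Matrix<double, 3, 3> A;
  A.setZero();
  // Dynamic-size matrix of complex numbers: dimensions fixed only at runtime
  Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> B(4, 5);
  B.setOnes();
  // Dynamic column vector = dynamic matrix with a single column
  Eigen::Matrix<double, Eigen::Dynamic, 1> v(10);
  v.setLinSpaced(10, 0.0, 1.0);
  return 0;
}

Fixed-size matrices are stored without heap allocation, which pays off for small matrices of known size.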
The following convenience data types are provided by E IGEN, see documentation:
• MatrixXd =
ˆ generic variable size matrix with double precision entries
• VectorXd, RowVectorXd = ˆ dynamic column and row vectors
(= dynamic matrices with one dimension equal to 1)
• MatrixNd with N = 2, 3, 4 for small fixed size square N × N -matrices (type double)
• VectorNd with N = 2, 3, 4 for small column vectors with fixed length N .
The d in the type name may be replaced with i (for int), f (for float), and cd (for complex<double>)
to select another basic scalar type.
All matrix types feature the methods cols(), rows(), and size(), returning the number of columns, the number of rows,
and the total number of entries, respectively.
Access to individual matrix entries and vector components, both as Rvalue and Lvalue, is possible through
the ()-operator taking two arguments of type index_t. If only one argument is supplied, the matrix is
accessed as a linear array according to its memory layout. For vectors, that is, matrices where one
dimension is fixed to 1, the []-operator can replace () with one argument, see Line 21 of Code 1.2.12.
The entry access operator (int i,int j) allows the most direct setting of matrix entries; there is
hardly any runtime penalty.
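As a small illustration (a hypothetical snippet, not taken from the lecture repository), entry access in both l-value and r-value roles looks as follows:

#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A(2, 3);
  A << 1, 2, 3,
       4, 5, 6;
  A(1, 2) = 7.0;                 // l-value access: set entry in row 1, column 2
  std::cout << A(0, 1) << '\n';  // r-value access, prints 2
  // Single-index access follows the (default column major) memory layout:
  std::cout << A(2) << '\n';     // third stored entry, i.e. A(0,1) = 2
  Eigen::VectorXd v(3);
  v << 1, 2, 3;
  v[1] = -1.0;                   // the []-operator is allowed for vectors only
  return 0;
}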
Of course, in EIGEN dedicated functions take care of the initialization of the special matrices (identity, zero, and diagonal matrices) introduced earlier:
Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n,n);
Eigen::MatrixXd O = Eigen::MatrixXd::Zero(n,m);
Eigen::MatrixXd D = d_vector.asDiagonal();
A versatile way to initialize a matrix relies on a combination of the operators << and ,, which allows the
construction of a matrix from blocks:
MatrixXd mat3(6,6);
mat3 <<
MatrixXd::Constant(4,2,1.5), // top row, first block
MatrixXd::Constant(4,3,3.5), // top row, second block
MatrixXd::Constant(4,1,7.5), // top row, third block
MatrixXd::Constant(2,4,2.5), // bottom row, left block
MatrixXd::Constant(2,2,4.5); // bottom row, right block
The matrix is filled top to bottom, left to right; the block dimensions have to match (as in MATLAB).
The method block(int i,int j,int p,int q) returns a reference to the submatrix with upper left
corner at position (i, j) and size p × q.
The methods row(int i) and col(int j) provide a reference to the corresponding row and column of
the matrix. Even more specialised access methods are
topLeftCorner(p,q), bottomLeftCorner(p,q),
topRightCorner(p,q), bottomRightCorner(p,q),
topRows(q), bottomRows(q),
leftCols(p), and rightCols(q),
with obvious purposes.
C++11 code 1.2.16: Demonstration code for access to matrix blocks in EIGEN ➺ GITLAB
2  template <typename MatType>
3  void blockAccess(Eigen::MatrixBase<MatType> &M)
4  {
5    using index_t = typename Eigen::MatrixBase<MatType>::Index;
6    using entry_t = typename Eigen::MatrixBase<MatType>::Scalar;
7    const index_t nrows(M.rows()); // No. of rows
8    const index_t ncols(M.cols()); // No. of columns
9
10   cout << "Matrix M = " << endl << M << endl; // Print matrix
11   // Block size half the size of the matrix
12   index_t p = nrows/2, q = ncols/2;
13   // Output submatrix with left upper entry at position (i,i)
14   for (index_t i = 0; i < min(p,q); i++)
15     cout << "Block(" << i << ',' << i << ',' << p << ',' << q
16          << ") = " << M.block(i,i,p,q) << endl;
17   // l-value access: modify sub-matrix by adding a constant
18   M.block(1,1,p,q) += Eigen::MatrixBase<MatType>::Constant(p,q,1.0);
19   cout << "M = " << endl << M << endl;
20   // r-value access: extract sub-matrix
21   MatrixXd B = M.block(1,1,p,q);
22   cout << "Isolated modified block = " << endl << B << endl;
23   // Special sub-matrices
24   cout << p << " top rows of m = " << M.topRows(p) << endl;
25   cout << p << " bottom rows of m = " << M.bottomRows(p) << endl;
26   cout << q << " left cols of m = " << M.leftCols(q) << endl;
27   cout << q << " right cols of m = " << M.rightCols(q) << endl;
28   // r-value access to upper triangular part
29   const MatrixXd T = M.template triangularView<Upper>(); //
30   cout << "Upper triangular part = " << endl << T << endl;
31   // l-value access to lower triangular part
32   M.template triangularView<Lower>() *= -1.5; //
33   cout << "Matrix M = " << endl << M << endl;
34 }
E IGEN offers views for access to triangular parts of a matrix, see Line 29 and Line 32, according to
M.triangularView<XX>()
where XX can stand for one of the following: Upper, Lower, StrictlyUpper, StrictlyLower,
UnitUpper, UnitLower, see documentation.
For column and row vectors references to sub-vectors can be obtained by the methods head(int
length), tail(int length), and segment(int pos,int length).
Note: Unless the preprocessor switch NDEBUG is set, E IGEN performs range checks on all indices.
Since operators like M ATLAB’s .* are not available, E IGEN uses the Array concept to furnish entry-wise
operations on matrices. An E IGEN-Array contains the same data as a matrix, supports the same meth-
ods for initialisation and access, but replaces the operators of matrix arithmetic with entry-wise actions.
Matrices and arrays can be converted into each other by the array() and matrix() methods, see
documentation for details.
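A minimal sketch (hypothetical code, not from the lecture repository) of switching between the matrix and array views for entry-wise operations:

#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(3, 3);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(3, 3);
  // Entry-wise product and quotient via the Array interface
  Eigen::MatrixXd C = (A.array() * B.array()).matrix();  // like A.*B in MATLAB
  Eigen::MatrixXd D = (A.array() / B.array()).matrix();  // like A./B
  // Entry-wise functions are also available on arrays
  Eigen::MatrixXd E = A.array().exp().matrix();
  std::cout << C.norm() << ' ' << D.norm() << ' ' << E.norm() << std::endl;
  return 0;
}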
The application of a functor (→ Section 0.2.3) to all entries of a matrix can also be done via the
unaryExpr() method of a matrix:
// Apply a lambda function to all entries of a matrix
auto fnct = [](double x) { return (x + 1.0/x); };
cout << "f(m1) = " << endl << m1.unaryExpr(fnct) << endl;
☞ E IGEN is used as one of the base libraries for the Robot Operating System (ROS), an open source
project with strong ETH participation.
☞ The geometry processing library libigl uses EIGEN as its basic linear algebra engine. At ETH Zurich it is being
used and developed at the Interactive Geometry Lab and the Advanced Technologies Lab.
All numerical libraries store the entries of a (generic = dense) matrix A ∈ K m,n in a linear array of length
mn (or longer). Accessing entries entails suitable index computations.
Two natural options for “vectorisation” of a matrix A ∈ K^{m,n}:
row major:    (A)_{i,j} ↔ A_arr(n*(i-1) + (j-1)) ,
column major: (A)_{i,j} ↔ A_arr((j-1)*m + (i-1)) .
Both in MATLAB and EIGEN the single index access operator relies on the linear data layout. In MATLAB:
A = [1 2 3;4 5 6;7 8 9]; A(:)'   % column major: returns 1 4 7 2 5 8 3 6 9
In P YTHON the default data layout is row major, but it can be explicitly set. Further, array transposition
does not change any data, but only the memory order and array shape.
In E IGEN the data layout can be controlled by a template argument; default is column major.
C++11 code 1.2.22: Single index access of matrix entries in EIGEN ➺ GITLAB
2  void storageOrder(int nrows=6, int ncols=7)
3  {
4    cout << "Different matrix storage layouts in Eigen" << endl;
5    // Template parameter ColMajor selects column major data layout
6    Matrix<double,Dynamic,Dynamic,ColMajor> mcm(nrows,ncols);
7    // Template parameter RowMajor selects row major data layout
8    Matrix<double,Dynamic,Dynamic,RowMajor> mrm(nrows,ncols);
9    // Direct initialization; lazy option: use int as index type
10   for (int l=1, i=0; i < nrows; i++)
11     for (int j=0; j < ncols; j++, l++)
12       mcm(i,j) = mrm(i,j) = l;
13
14   cout << "Matrix mrm = " << endl << mrm << endl;
15   cout << "mcm linear = ";
16   for (int l=0; l < mcm.size(); l++) cout << mcm(l) << ',';
17   cout << endl;
18
18
The function call storageOrder(3,3), cf. Code 1.2.22, yields the output
1 Different matrix storage layouts in Eigen
2 Matrix mrm =
3 1 2 3
4 4 5 6
5 7 8 9
6 mcm linear = 1,4,7,2,5,8,3,6,9,
7 mrm linear = 1,2,3,4,5,6,7,8,9,
Mapping a column-major matrix to a column vector with the same number of entries is called vectorization
or linearization in numerical linear algebra, in symbols
vec : K^{n,m} → K^{n·m} ,   vec(A) = [ (A)_{:,1} ; (A)_{:,2} ; … ; (A)_{:,m} ] .   (1.2.24)
MATLAB offers the built-in command reshape for changing the dimensions of a matrix A ∈ K^{m,n}:
B = reshape(A,k,l); % error, in case kl ≠ mn
This command creates a k × l-matrix by just reinterpreting the linear array of entries of A as data for
a matrix with k rows and l columns. Regardless of the size and entries of the matrices the following test
will always produce an all-true result in equal:
if (prod(size(A)) ~= (k*l)), error('Size mismatch'); end
B = reshape(A,k,l);
equal = (B(:) == A(:));
N UM P Y offers the function np.reshape for changing the dimensions of a matrix A ∈ K m,n :
# read elements of A in row major order (default)
B = np.reshape(A, (k, l)) # error, in case kl ≠ mn
B = np.reshape(A, (k, l), order=’C’) # same as above
# read elements of A in column major order
B = np.reshape(A, (k, l), order=’F’)
# read elements of A as stored in memory
B = np.reshape(A, (k, l), order=’A’)
This command creates a k × l-array by reinterpreting the array of entries of A as data for an array
with k rows and l columns. The order in which the elements of A are read can be set by the order
argument: row major (default, ’C’), column major (’F’), or A’s internal storage order, i.e. row major if
A is row major and column major if A is column major (’A’).
If you need a reshaped view of a matrix’s data in EIGEN you can obtain it via the raw data array belonging
to the matrix. Then use this information to create a matrix view by means of Map → documentation.
The function producing the output below has to be called with a mutable (l-value) matrix object. A sample output is printed next:
1 Matrix M =
2 0 −1 −2 −3 −4 −5 −6
3 1 0 −1 −2 −3 −4 −5
4 2 1 0 −1 −2 −3 −4
5 3 2 1 0 −1 −2 −3
6 4 3 2 1 0 −1 −2
7 5 4 3 2 1 0 −1
8 reshaped to 2x21 =
9 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1 −6 −4 −2
10 1 3 5 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1
11 Scaled (!) matrix M =
12 −0 1.5 3 4.5 6 7.5 9
13 − 1.5 −0 1.5 3 4.5 6 7.5
14 −3 − 1.5 −0 1.5 3 4.5 6
15 − 4.5 −3 − 1.5 −0 1.5 3 4.5
16 −6 − 4.5 −3 − 1.5 −0 1.5 3
17 − 7.5 −6 − 4.5 −3 − 1.5 −0 1.5
18 Matrix S =
19 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1 −6 −4 −2
20 1 3 5 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1
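The code that produced the output above is not reproduced here. The following is only a minimal sketch, under the assumption of the default column major layout and an even number of entries, of how such a reshaping view can be set up with Eigen::Map (function and variable names are made up for illustration):

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

void reshapeDemo(MatrixXd &M) {  // must be a mutable (l-value) matrix
  std::cout << "Matrix M = " << std::endl << M << std::endl;
  // Reinterpret the raw (column major) data of M as a 2 x (size/2) matrix
  Eigen::Map<MatrixXd> S(M.data(), 2, M.size() / 2);
  std::cout << "reshaped to 2x" << M.size() / 2 << " = " << std::endl << S << std::endl;
  // S is only a *view*: scaling the view also scales the entries of M
  S *= -1.5;
  std::cout << "Scaled (!) matrix M = " << std::endl << M << std::endl;
}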
Modern CPUs feature several levels of memory (registers, L1 cache, L2 cache, . . . , main memory) of
different latency, bandwidth, and size. Frequently accessing memory locations with widely different
addresses results in many cache misses and will considerably slow down the CPU.
MATLAB-code 1.2.30: Timing for row and column oriented matrix access in MATLAB
1  % Timing for row/column operations on matrices
2  % We conduct K runs in order to reduce the risk of skewed measurements
3  % due to OS activity during the MATLAB run.
4  K = 3; res = [];
5  for n=2.^(4:13)
6    A = randn(n,n);
7
8    t1 = realmax;
9    for k=1:K, tic;
10     for j = 1:n-1, A(:,j+1) = A(:,j+1) - A(:,j); end;
11     t1 = min(toc,t1);
12   end
13   t2 = realmax;
14   for k=1:K, tic;
15     for i = 1:n-1, A(i+1,:) = A(i+1,:) - A(i,:); end;
16     t2 = min(toc,t2);
17   end
18   res = [res; n, t1, t2];
19 end
20
29 figure; loglog(res(:,1),res(:,2),'r+', res(:,1),res(:,3),'m*');
30 xlabel('{\bf n}','fontsize',14);
31 ylabel('{\bf runtime [s]}','fontsize',14);
32 legend('A(:,j+1) = A(:,j+1) - A(:,j)','A(i+1,:) = A(i+1,:) - A(i,:)',...
33        'location','northwest');
34 print -depsc2 '../PICTURES/accessrtlog.eps';
C++11 code 1.2.31: Timing for row and column oriented matrix access for EIGEN ➺ GITLAB
2  void rowcolaccesstiming(void)
3  {
4    const int K = 3; // Number of repetitions
PYTHON-code 1.2.32: Timing for row and column oriented matrix access in PYTHON
1  import numpy as np
2  import timeit
3  from matplotlib import pyplot as plt
4
5  def col_wise(A):
6      for j in range(A.shape[1] - 1):
7          A[:, j + 1] -= A[:, j]
8
9  def row_wise(A):
10     for i in range(A.shape[0] - 1):
11         A[i + 1, :] -= A[i, :]
12
16 k = 3
17 res = []
18 for n in 2**np.mgrid[4:14]:
19     A = np.random.normal(size=(n, n))
20
29 plt.figure()
30 plt.plot(ns, t1s, '+', label='A[:, j + 1] -= A[:, j]')
31 plt.plot(ns, t2s, 'o', label='A[i + 1, :] -= A[i, :]')
32 plt.xlabel(r'n')
33 plt.ylabel(r'runtime [s]')
34 plt.legend(loc='upper left')
35 plt.savefig('../PYTHON_PICTURES/accessrtlin.eps')
36
37 plt.figure()
38 plt.loglog(ns, t1s, '+', label='A[:, j + 1] -= A[:, j]')
39 plt.loglog(ns, t2s, 'o', label='A[i + 1, :] -= A[i, :]')
40 plt.xlabel(r'n')
41 plt.ylabel(r'runtime [s]')
42 plt.legend(loc='upper left')
43 plt.savefig('../PYTHON_PICTURES/accessrtlog.eps')
44
45 plt.show()
Fig. 28: runtime [s] vs. matrix size n (doubly logarithmic scale) for the four access patterns:
MATLAB ‘A(:,j+1) = A(:,j+1) - A(:,j)’, MATLAB ‘A(i+1,:) = A(i+1,:) - A(i,:)’, EIGEN row access, EIGEN column access.
Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz × 4
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3, -DNDEBUG
The compiler flags -O3 and -DNDEBUG are essential. The C++ code would be significantly slower if the default compiler options were used!
For both the MATLAB and EIGEN codes we observe a glaring discrepancy of the CPU time required for accessing
entries of a matrix in rowwise or columnwise fashion. This reflects the impact of features of the underlying
hardware architecture, like cache size and memory bandwidth:
Interpretation of timings: Since matrices in MATLAB are stored column major, all the matrix elements in a
column occupy contiguous memory locations, which will all reside in the cache together. Hence, column
oriented access will mainly operate on data in the cache even for large matrices. Conversely, row oriented
access addresses matrix entries that are stored in distant memory locations, which incurs frequent cache
misses (cache thrashing).
The impact of hardware architecture on the performance of algorithms will not be taken into account in
this course, because hardware features tend to be both intricate and ephemeral. However, for modern
high performance computing it is essential to adapt implementations to the hardware on which the code is
supposed to run.
First we refresh the basic rules of vector and matrix calculus. Then we will learn about a very old program-
ming interface for simple dense linear algebra operations.
What you should know from linear algebra [?, Sect. 2.2]:
✦ vector space operations in matrix space K m,n (addition, multiplication with scalars)
✦ dot product: x, y ∈ K^n, n ∈ N:   x·y := x^H y = ∑_{i=1}^{n} x̄_i y_i ∈ K
(in EIGEN: x.dot(y) or x.adjoint()*y, x, y =ˆ column vectors)
✦ tensor product: x ∈ K^m, y ∈ K^n:   xy^H = [ x_i ȳ_j ]_{i=1,...,m; j=1,...,n} ∈ K^{m,n}
(in EIGEN: x*y.adjoint(), x, y =ˆ column vectors)
✦ All are special cases of the matrix product:
A ∈ K^{m,n}, B ∈ K^{n,k} :   AB = [ ∑_{j=1}^{n} a_{ij} b_{jl} ]_{i=1,...,m; l=1,...,k} ∈ K^{m,k} .   (1.3.1)
Recall from linear algebra basic properties of the matrix product: for all K-matrices A, B, C (of suitable
sizes) and α, β ∈ K:
associative: (AB)C = A(BC) ,
bi-linear: (αA + βB)C = α(AC) + β(BC) ,   C(αA + βB) = α(CA) + β(CB) ,
non-commutative: AB ≠ BA in general .
Fig. 29: visualisation of the dimensions in the matrix product: an m × n matrix times an n × k matrix yields an m × k matrix.
To understand what is going on when forming a matrix product, it is often useful to decompose it into
matrix×vector operations in one of the following two ways:
A ∈ K^{m,n}, B ∈ K^{n,k}:
AB = [ A(B)_{:,1}  …  A(B)_{:,k} ]   (matrix assembled from columns),
AB = [ (A)_{1,:}B ; … ; (A)_{m,:}B ]   (matrix assembled from rows).   (1.3.4)
A “mental image” of matrix multiplication is useful for telling special properties of product matrices.
For instance, zero blocks of the product matrix can be predicted easily in the following situations using the
idea explained in Rem. 1.3.3 (try to understand how):
Fig. 30, Fig. 31: visualisation of how zero blocks in one of the factors lead to predictable zero blocks in the product matrix.
A clear understanding of matrix multiplication enables you to “see”, which parts of a matrix factor matter
in a product:
Fig. 32: visualisation of which parts of a matrix factor matter in the product when the other factor contains a zero block.
“Seeing” the structure/pattern of a matrix product: [spy plots of the factor matrices and of their products, generated by the code below]
These nice renderings of the so-called patterns of matrices, that is, the distribution of their non-zero
entries, have been created by a special EIGEN/Figure command for visualizing the structure of a matrix:
fig.spy(M)
8  int main() {
9    int n = 100;
10   MatrixXd A(n,n), B(n,n); A.setZero(); B.setZero();
11   A.diagonal() = VectorXd::LinSpaced(n,1,n);
12   A.col(n-1) = VectorXd::LinSpaced(n,1,n);
13   A.row(n-1) = RowVectorXd::LinSpaced(n,1,n);
14   B = A.colwise().reverse();
15   MatrixXd C = A*A, D = A*B;
16   mgl::Figure fig1, fig2, fig3, fig4;
17   fig1.spy(A); fig1.save("Aspy_cpp");
18   fig2.spy(B); fig2.save("Bspy_cpp");
19   fig3.spy(C); fig3.save("Cspy_cpp");
20   fig4.spy(D); fig4.save("Dspy_cpp");
21   return 0;
22 }
This code also demonstrates the use of diagonal(), col(), row() for L-value access to parts of a
matrix.
The following result is useful when dealing with matrix decompositions that often involve triangular matri-
ces.
Lemma 1.3.9. Group of regular diagonal/triangular matrices
If A and B are both diagonal / upper triangular / lower triangular, then AB and, for regular A, also A^{−1}
are diagonal / upper triangular / lower triangular, respectively.
It is important to know the different effect of multiplying with a diagonal matrix from left or right:
Fig. 33: runtime [s] vs. vector length n (doubly logarithmic scale) for the multiplication with a scaling (diagonal) matrix, cf. the timing codes below.
C++11 code 1.3.11: Timing multiplication with scaling matrix in EIGEN ➺ GITLAB
2  int nruns = 3, minExp = 2, maxExp = 14;
3  MatrixXd tms(maxExp-minExp+1,4);
4  for (int i = 0; i <= maxExp-minExp; ++i) {
5    Timer tbad, tgood, topt; // timer class
6    int n = std::pow(2, minExp + i);
7    VectorXd d = VectorXd::Random(n,1), x = VectorXd::Random(n,1), y(n);
8    for (int j = 0; j < nruns; ++j) {
9      MatrixXd D = d.asDiagonal(); //
8  nruns = 3
9  res = []
10 for n in 2**np.mgrid[2:15]:
11     d = np.random.uniform(size=n)
12     x = np.random.uniform(size=n)
13
Hardly surprisingly, the component-wise multiplication of the two vectors is much faster than the intermediate
initialisation of a diagonal matrix (mainly populated by zeros) followed by the computation of a matrix×vector
product. Nevertheless, such blunders keep on haunting numerical codes. Do not rely solely on EIGEN’s
optimizations!
Simple operations on rows/columns of matrices, cf. what was done in Exp. 1.2.29, can often be expressed
as multiplication with special matrices: for instance, given A ∈ K^{n,m} we obtain B by adding row (A)_{j,:} to
row (A)_{j+1,:}, 1 ≤ j < n.
Realisation through matrix product:   B = T A ,  with the transformation matrix T ∈ K^{n,n} equal to the
identity matrix except for one additional unit entry, (T)_{j+1,j} = 1.
The matrix multiplying A from the left is a specimen of a transformation matrix, a matrix that coincides
with the identity matrix I except for a single off-diagonal entry.
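As a small illustration (hypothetical snippet, not one of the lecture codes), the direct row operation and its realisation as a matrix product can be compared in EIGEN:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main() {
  const int n = 4, j = 1;           // add row j to row j+1 (0-based indices here)
  MatrixXd A = MatrixXd::Random(n, n);
  // Transformation matrix: identity plus a single off-diagonal unit entry
  MatrixXd T = MatrixXd::Identity(n, n);
  T(j + 1, j) = 1.0;
  MatrixXd B = T * A;               // B differs from A only in row j+1
  // Direct row operation for comparison
  MatrixXd C = A;
  C.row(j + 1) += C.row(j);
  std::cout << (B - C).norm() << std::endl;  // ~0 up to roundoff
  return 0;
}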
A vector space (V, K, +, ·), where V is additionally equipped with a bi-linear and associative “multiplica-
tion” is called an algebra. Hence, the vector space of square matrices K n,n with matrix multiplication is an
algebra with unit element I.
Given matrix dimensions M, N, K ∈ N and block sizes 1 ≤ n < N (n′ := N − n), 1 ≤ m < M (m′ :=
M − m), 1 ≤ k < K (k′ := K − k), we start from the following matrices:
A11 ∈ K^{m,n} ,  A12 ∈ K^{m,n′} ,  A21 ∈ K^{m′,n} ,  A22 ∈ K^{m′,n′} ,
B11 ∈ K^{n,k} ,  B12 ∈ K^{n,k′} ,  B21 ∈ K^{n′,k} ,  B22 ∈ K^{n′,k′} .
These matrices serve as sub-matrices or matrix blocks and are assembled into larger matrices
A = [ A11 A12 ; A21 A22 ] ∈ K^{M,N} ,   B = [ B11 B12 ; B21 B22 ] ∈ K^{N,K} .
It turns out that the matrix product AB can be computed by the same formula as the product of simple
2 × 2-matrices:
[ A11 A12 ; A21 A22 ][ B11 B12 ; B21 B22 ] = [ A11B11 + A12B21   A11B12 + A12B22 ; A21B11 + A22B21   A21B12 + A22B22 ] .   (1.3.16)
Fig. 34: visualisation of the block partitioning of A ∈ K^{M,N} and B ∈ K^{N,K} and of the resulting block-wise matrix product.
Bottom line: one can compute with block-structured matrices in almost (∗) the same ways as with matrices
with real/complex entries, see [?, Sect. 1.3.3].
(∗): you must not use the commutativity of multiplication (because matrix multiplication is not
! commutative).
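The block product formula (1.3.16) can be checked directly with EIGEN's block access methods; the following is a small sketch with made-up sizes and names, not one of the lecture codes:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main() {
  const int M = 5, N = 4, K = 6, m = 2, n = 3, k = 2;  // dimensions and block sizes
  MatrixXd A = MatrixXd::Random(M, N), B = MatrixXd::Random(N, K);
  // Sub-matrices (blocks) of A and B
  MatrixXd A11 = A.topLeftCorner(m, n),        A12 = A.topRightCorner(m, N - n);
  MatrixXd A21 = A.bottomLeftCorner(M - m, n), A22 = A.bottomRightCorner(M - m, N - n);
  MatrixXd B11 = B.topLeftCorner(n, k),        B12 = B.topRightCorner(n, K - k);
  MatrixXd B21 = B.bottomLeftCorner(N - n, k), B22 = B.bottomRightCorner(N - n, K - k);
  // Assemble the product block by block according to (1.3.16)
  MatrixXd C(M, K);
  C.topLeftCorner(m, k)             = A11 * B11 + A12 * B21;
  C.topRightCorner(m, K - k)        = A11 * B12 + A12 * B22;
  C.bottomLeftCorner(M - m, k)      = A21 * B11 + A22 * B21;
  C.bottomRightCorner(M - m, K - k) = A21 * B12 + A22 * B22;
  std::cout << (C - A * B).norm() << std::endl;  // ~0 up to roundoff
  return 0;
}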
BLAS (Basic Linear Algebra Subprograms) is a specification (API) that prescribes a set of low-level rou-
tines for performing common linear algebra operations such as vector addition, scalar multiplication, dot
products, linear combinations, and matrix multiplication. They are the de facto low-level routines for linear
algebra libraries (Wikipedia).
The BLAS API is standardised by the BLAS technical forum and, due to its history dating back to the 70s,
follows conventions of FORTRAN 77, see the Quick Reference Guide for examples. However, wrappers for
other programming languages are available. CPU manufacturers and/or developers of operating systems
usually supply highly optimised implementations:
• OpenBLAS: open source implementation with some general optimisations, available under BSD
license.
• ATLAS (Automatically Tuned Linear Algebra Software): open source BLAS implementation with
auto-tuning capabilities. Comes with C and FORTRAN interfaces and is included in Linux distribu-
tions.
• Intel MKL (Math Kernel Library): commercial, highly optimised BLAS implementation available for all
Intel CPUs. Used by most proprietary simulation software and also by MATLAB.
6  t1 = realmax;
7  % loop based implementation (no BLAS)
8  for l=1:nruns
9    tic;
10   for i=1:n, for j=1:n
11     for k=1:n, C(i,j) = C(i,j) + A(i,k)*B(k,j); end
12   end, end
13   t1 = min(t1, toc);
14 end
15 t2 = realmax;
16 % dot product based implementation (BLAS level 1)
17 for l=1:nruns
18   tic;
19   for i=1:n
20     for j=1:n, C(i,j) = dot(A(i,:),B(:,j)); end
21   end
22   t2 = min(t2, toc);
23 end
24 t3 = realmax;
25 % matrix-vector based implementation (BLAS level 2)
26 for l=1:nruns
27   tic;
28   for j=1:n, C(:,j) = A*B(:,j); end
29   t3 = min(t3, toc);
30 end
31 t4 = realmax;
32 % BLAS level 3 matrix multiplication
33 for l=1:nruns
34   tic; C = A*B; t4 = min(t4, toc);
35 end
36 times = [times; n t1 t2 t3 t4];
37 end
38
39 figure('name','mmtiming');
40 loglog(times(:,1),times(:,2),'r+-',...
41        times(:,1),times(:,3),'m*-',...
42        times(:,1),times(:,4),'b^-',...
43        times(:,1),times(:,5),'kp-');
44 title('Timings: Different implementations of matrix multiplication');
45 xlabel('matrix size n','fontsize',14);
46 ylabel('time [s]','fontsize',14);
47 legend('loop implementation','dot product implementation',...
48        'matrix-vector implementation','BLAS gemm (MATLAB *)',...
49        'location','northwest');
50
51 print -depsc2 '../PICTURES/mvtiming.eps';
Fig. 35: time [s] vs. matrix size n (doubly logarithmic scale) for the four implementations of matrix multiplication.
Platform: Mac OS X 10.6, Intel Core 7, 2.66 GHz, L2 256 kB, L3 4 MB, Mem 4 GB, MATLAB 7.10.0 (R2010a).
In MATLAB we can achieve a tremendous gain in execution speed by relying on compact matrix/vector
operations that invoke efficient BLAS routines.
Advice: avoid loops in MATLAB and replace them with vectorised operations.
To some extent the same applies to EIGEN code; a corresponding timing script is given here:
34      C = A * B;
35      t4.stop();
36    }
37    timings(p,0)=n; timings(p,1)=t1.min(); timings(p,2)=t2.min();
38    timings(p,3)=t3.min(); timings(p,4)=t4.min();
39  }
40  std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
41  // Plotting
42  mgl::Figure fig;
43  fig.setFontSize(4);
44  fig.setlog(true, true);
45  fig.plot(timings.col(0), timings.col(1), "+r-").label("loop implementation");
46  fig.plot(timings.col(0), timings.col(2), "*m-").label("dot-product implementation");
47  fig.plot(timings.col(0), timings.col(3), "^b-").label("matrix-vector implementation");
48  fig.plot(timings.col(0), timings.col(4), "ok-").label("Eigen matrix product");
49  fig.xlabel("matrix size n"); fig.ylabel("time [s]");
50  fig.legend(0.05, 0.95); fig.save("mmtiming");
51 }
Fig. 36: time [s] vs. matrix size n (doubly logarithmic scale) for the EIGEN implementations. Note that loops are not punished as severely as in MATLAB!
The same applies to PYTHON code; a corresponding timing script is given here:
6  def mm_loop_based(A, B, C):
7      m, n = A.shape
8      _, p = B.shape
9      for i in range(m):
10         for j in range(p):
11             for k in range(n):
12                 C[i, j] += A[i, k] * B[k, j]
13     return C
14
15 def mm_blas1(A, B, C):
16     m, n = A.shape
17     _, p = B.shape
18     for i in range(m):
19         for j in range(p):
20             C[i, j] = np.dot(A[i, :], B[:, j])
21     return C
22
23 def mm_blas2(A, B, C):
24     m, n = A.shape
25     _, p = B.shape
26     for i in range(m):
27         C[i, :] = np.dot(A[i, :], B)
28     return C
29
30 def mm_blas3(A, B, C):
31     C = np.dot(A, B)
32     return C
33
34 def main():
35     nruns = 3
36     res = []
37     for n in 2**np.mgrid[2:11]:
38         print('matrix size n = {}'.format(n))
39         A = np.random.uniform(size=(n, n))
40         B = np.random.uniform(size=(n, n))
41         C = np.random.uniform(size=(n, n))
42
...     label='matrix-vector implementation')
58     plt.loglog(ns, tblas3s, '^', label='BLAS gemm (np.dot)')
59     plt.legend(loc='upper left')
60     plt.savefig('../PYTHON_PICTURES/mvtiming.eps')
61     plt.show()
62
63 if __name__ == '__main__':
64     main()
BLAS routines are grouped into “levels” according to the amount of data and computation involved (asymp-
totic complexity, see Section 1.4.1 and [?, Sect. 1.1.12]):
• Level 1: vector operations such as scalar products and vector norms.
asymptotic complexity O(n) (with n =ˆ vector length),
e.g.: dot product: ρ = x⊤y
• Level 2: matrix-vector operations such as matrix-vector multiplications.
asymptotic complexity O(mn) (with (m, n) =ˆ matrix size),
e.g.: matrix×vector update: y = αAx + βy
• Level 3: matrix-matrix operations such as matrix additions or multiplications.
asymptotic complexity often O(mnk) (with (m, n, k) =ˆ matrix sizes),
e.g.: matrix product: C = AB
(EIGEN counterparts of these three operations are sketched below.)
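For orientation only, here is a small sketch (not part of BLAS itself, names and sizes made up) of EIGEN one-liners performing operations of the corresponding levels:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

int main() {
  const int m = 3, n = 4, k = 5;
  MatrixXd A = MatrixXd::Random(m, n), B = MatrixXd::Random(n, k);
  VectorXd x = VectorXd::Random(n), z = VectorXd::Random(n), y = VectorXd::Random(m);
  const double alpha = 2.0, beta = 0.5;
  double rho = x.dot(z);           // level 1: dot product, O(n)
  y = alpha * A * x + beta * y;    // level 2: gemv-type update, O(mn)
  MatrixXd C = A * B;              // level 3: gemm-type product, O(mnk)
  std::cout << rho << "\n" << y.transpose() << "\n" << C << std::endl;
  return 0;
}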
Syntax of BLAS calls:
The functions have been implemented for different types, and are distinguished by the first letter of the
function name. E.g. sdot is the dot product implementation for single precision and ddot for double
precision.
xDOT(N,X,INCX,Y,INCY)
– x ∈ {S, D}, scalar type: S =ˆ type float, D =ˆ type double
– N =ˆ length of vector (modulo stride INCX)
– X =ˆ vector x: array of type x
– INCX =ˆ stride for traversing vector X
– Y =ˆ vector y: array of type x
– INCY =ˆ stride for traversing vector Y
• vector operations y = αx + y
xAXPY(N,ALPHA,X,INCX,Y,INCY)
– x ∈ {S, D, C, Z}: S =ˆ type float, D =ˆ type double, C =ˆ type complex
– N =ˆ length of vector (modulo stride INCX)
– ALPHA =ˆ scalar α
– X =ˆ vector x: array of type x
– INCX =ˆ stride for traversing vector X
– Y =ˆ vector y: array of type x
– INCY =ˆ stride for traversing vector Y
✦ BLAS LEVEL 2: matrix-vector operations, asymptotic complexity O(mn), (m, n) =ˆ matrix size
xGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
– x ∈ {S, D, C, Z}, scalar type: S =ˆ type float, D =ˆ type double, C =ˆ type complex
– M, N =ˆ size of matrix A
– ALPHA =ˆ scalar parameter α
– A =ˆ matrix A stored in a linear array of length M · N (column major arrangement),
(A)_{i,j} = A[ M·(j − 1) + i ]   (FORTRAN-style 1-based indexing).
– X =ˆ vector x: array of type x
– INCX =ˆ stride for traversing vector X
– BETA =ˆ scalar parameter β
– Y =ˆ vector y: array of type x
– INCY =ˆ stride for traversing vector Y
• BLAS LEVEL 3: matrix-matrix operations, asymptotic complexity O(mnk), (m, n, k) =ˆ matrix sizes
The BLAS calling syntax seems queer in light of modern object oriented programming paradigms, but it
is a legacy of FORTRAN77, which was (and partly still is) the programming language in which BLAS libraries are implemented.
It is a very common situation in scientific computing that one has to rely on old codes and libraries imple-
mented in an old-fashioned style.
When calling BLAS library functions from C, all arguments have to be passed by reference (as pointers),
in order to comply with the argument passing mechanism of FORTRAN77, which is the model followed by
BLAS.
14 int main() {
15   const int n = 5;      // length of vector
16   const int incx = 1;   // stride
17   const int incy = 1;   // stride
18   double alpha = 2.5;   // scaling factor
19
24   for (size_t i = 0; i < n; i++) {
25     x[i] = 3.1415 * i;
26     y[i] = 1.0 / (double)(i+1);
27   }
28
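The excerpt above omits the declaration of the vectors and the actual BLAS call. A self-contained sketch (not the original lecture code) might look as follows; it assumes the common FORTRAN name-mangling convention with a trailing underscore (daxpy_) and linking against a BLAS library (e.g. with -lblas):

#include <iostream>

// Declaration of the FORTRAN77 BLAS routine daxpy: y <- alpha*x + y.
// All arguments are passed by reference (as pointers), as required by FORTRAN77.
extern "C" void daxpy_(const int *n, const double *alpha,
                       const double *x, const int *incx,
                       double *y, const int *incy);

int main() {
  const int n = 5, incx = 1, incy = 1;
  const double alpha = 2.5;
  double x[n], y[n];
  for (int i = 0; i < n; i++) {
    x[i] = 3.1415 * i;
    y[i] = 1.0 / (double)(i + 1);
  }
  daxpy_(&n, &alpha, x, &incx, y, &incy);  // every argument passed as a pointer
  for (int i = 0; i < n; i++) std::cout << y[i] << ' ';
  std::cout << std::endl;
  return 0;
}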
When using EIGEN in a mode that includes an external BLAS library, all these calls are wrapped into EIGEN
methods.
Example 1.3.24 (Using Intel Math Kernel Library (Intel MKL) from E IGEN)
The Intel Math Kernel Library is a highly optimized math library for Intel processors and can be called
directly from EIGEN (see “Using Intel® Math Kernel Library from Eigen”) when the correct compiler flags are used.
C++-code 1.3.25: Timing of matrix multiplication in EIGEN for MKL comparison ➺ GITLAB
2  //! script for timing different implementations of matrix multiplications
3  void mmeigenmkl() {
4    int nruns = 3, minExp = 6, maxExp = 13;
5    MatrixXd timings(maxExp-minExp+1,2);
6    for (int p = 0; p <= maxExp-minExp; ++p) {
7      Timer t1; // timer class
8      int n = std::pow(2, minExp + p);
9      MatrixXd A = MatrixXd::Random(n,n);
10     MatrixXd B = MatrixXd::Random(n,n);
11     MatrixXd C = MatrixXd::Zero(n,n);
12     for (int q = 0; q < nruns; ++q) {
13       t1.start();
14       C = A * B;
15       t1.stop();
16     }
17     timings(p,0)=n; timings(p,1)=t1.min();
18   }
19   std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
20 }
Timing results:
n E IGEN sequential [s] E IGEN parallel [s] MKL sequential [s] MKL parallel [s]
64 1.318e-04 1.304e-04 6.442e-05 2.401e-05
128 7.168e-04 2.490e-04 4.386e-04 1.336e-04
256 6.641e-03 1.987e-03 3.000e-03 1.041e-03
512 2.609e-02 1.410e-02 1.356e-02 8.243e-03
1024 1.952e-01 1.069e-01 1.020e-01 5.728e-02
2048 1.531e+00 8.477e-01 8.581e-01 4.729e-01
4096 1.212e+01 6.635e+00 7.075e+00 3.827e+00
8192 9.801e+01 6.426e+01 5.731e+01 3.598e+01
Fig. 37: runtime [s] vs. matrix size n; Fig. 38: runtime scaled by the problem size vs. matrix size n (both doubly logarithmic), for EIGEN sequential, EIGEN parallel, MKL sequential, and MKL parallel matrix multiplication.
Large scale numerical computations require immense resources and execution time of numerical codes
often becomes a central concern. Therefore, much emphasis has to be put on
1. designing algorithms that produce a desired result with (nearly) minimal computational effort (de-
fined precisely below),
4. implementing codes that make optimal use of hardware resources and capabilities,
While Item 2–Item 4 are out of the scope of this course and will be treated in more advanced lectures,
Item 1 will be a recurring theme.
The following definition encapsulates what is regarded as a measure for the “cost” of an algorithm in
computational mathematics.
The computational effort required by a numerical code amounts to the number of elementary operations
(additions, subtractions, multiplications, divisions, square roots) executed in a run.
Fifty years ago counting elementary operations provided good predictions of runtimes, but nowadays this
is no longer true.
! The computational effort involved in a run of a numerical code is only loosely related
to overall execution time on modern computers.
This is conspicuous in Exp. 1.2.29, where algorithms incurring exactly the same computational effort took
different times to execute.
The reason is that on today’s computers a key bottleneck for fast execution is latency and bandwidth of
memory, cf. the discussion at the end of Exp. 1.2.29 and [?]. Thus, concepts like I/O-complexity [?, ?]
might be more appropriate for gauging the efficiency of a code, because they take into account the pattern
of memory access.
The concept of computational effort from Def. 1.4.1 is still useful in a particular context:
• Problem size parameters in numerical linear algebra usually are the lengths and dimensions of the
vectors and matrices that an algorithm takes as inputs.
• Worst case indicates that the maximum effort over a set of admissible data is taken into account.
We write F(n) = O(G(n)) for two functions F, G : N → R if there exist a constant C > 0 and
n∗ ∈ N such that
F(n) ≤ C G(n)   ∀ n ≥ n∗ .
More generally, F(n₁, . . . , n_k) = O(G(n₁, . . . , n_k)) for two functions F, G : N^k → R means the
existence of a constant C > 0 and a threshold value n∗ ∈ N such that
F(n₁, . . . , n_k) ≤ C G(n₁, . . . , n_k)   whenever n₁, . . . , n_k ≥ n∗ .
Of course, the definition of the Landau symbol leaves ample freedom for stating meaningless bounds;
an algorithm that runs with linear complexity O(n) can be correctly labelled as possessing O(exp(n))
complexity.
Yet, whenever the Landau notation is used to describe asymptotic complexities, the bounds have to be
sharp in the sense that no function with slower asymptotic growth will be possible inside the O. To make
this precise we stipulate the following.
Whenever the asymptotic complexity of an algorithm is stated as O(nα log β n exp(γnδ )) with non-
negative parameters α, β, γ, δ ≥ 0 in terms of the problem size parameter n, we take for granted
that choosing a smaller value for any of the parameters will no longer yield a valid (or provable)
asymptotic bound.
In particular
✦ complexity O(n) means that the complexity is not O(nα ) for any α < 1,
✦ complexity O(exp(n)) excludes asymptotic complexity O(n p ) for any p ∈ R.
Terminology: If the asymptotic complexity of an algorithm is O(n p ) with p = 1, 2, 3 we say that it is of
“linear”, “quadratic”, and “cubic” complexity, respectively.
§ 1.4.2 warned us that computational effort and, thus, asymptotic complexity, of an algorithm for a concrete
problem on a particular platform may not have much to do with the actual runtime (the blame goes to
memory hierarchies, internal pipelining, vectorisation, etc.).
To a certain extent, the asymptotic complexity allows us to predict the dependence of the runtime of a
particular implementation of an algorithm on the problem size (for large problems).
For instance, an algorithm with asymptotic complexity O(n2 ) is likely to take 4× as much time when the
problem size is doubled.
If the conjecture holds true, then the points (ni , ti ) will approximately lie on a straight line with slope
α in a doubly logarithmic plot (which can be created in M ATLAB by the loglog plotting command and
in E IGEN with the Figure-command fig.setlog(true, true);).
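A tiny sketch (with made-up timing data) of how the exponent α in a conjectured runtime law t(n) ≈ C n^α can be estimated from measured pairs (n_i, t_i):

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical measured (n_i, t_i) pairs, e.g. from a timing experiment
  std::vector<double> n = {64, 128, 256, 512, 1024};
  std::vector<double> t = {2.1e-4, 8.5e-4, 3.4e-3, 1.4e-2, 5.5e-2};
  // If t ~ C*n^alpha, then alpha ~ log(t_{i+1}/t_i) / log(n_{i+1}/n_i)
  for (std::size_t i = 0; i + 1 < n.size(); ++i)
    std::cout << "alpha estimate: "
              << std::log(t[i+1] / t[i]) / std::log(n[i+1] / n[i]) << std::endl;
  return 0;
}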
Performing elementary BLAS-type operations through simple (nested) loops, we arrive at the following
obvious complexity bounds:
operation           description                                      #mul/div   #add/sub    asymptotic complexity
dot product         (x ∈ R^n, y ∈ R^n) ↦ x^H y                       n          n − 1       O(n)
tensor product      (x ∈ R^m, y ∈ R^n) ↦ x y^H                       nm         0           O(mn)
matrix product(∗)   (A ∈ R^{m,n}, B ∈ R^{n,k}) ↦ AB                  mnk        mk(n − 1)   O(mnk)
(∗): The O(mnk) complexity bound applies to “straightforward” matrix multiplication according to
(1.3.1).
For m = n = k there are (sophisticated) variants with better asymptotic complexity, e.g., the divide-and-conquer
Strassen algorithm [?] with asymptotic complexity O(n^{log₂ 7}):
Start from A, B ∈ K^{n,n} with n = 2ℓ, ℓ ∈ N. The idea relies on the block matrix product (1.3.16) with
A_{ij}, B_{ij} ∈ K^{ℓ,ℓ}, i, j ∈ {1, 2}. Let C := AB be partitioned accordingly: C = [ C₁₁ C₁₂ ; C₂₁ C₂₂ ]. Then tedious
elementary computations reveal
C₁₁ = Q₀ + Q₃ − Q₄ + Q₆ ,
C₂₁ = Q₁ + Q₃ ,
C₁₂ = Q₂ + Q₄ ,
C₂₂ = Q₀ + Q₂ − Q₁ + Q₅ ,
where each Q_k is a product of certain sums and differences of the blocks A_{ij}, B_{ij}.
Beside a considerable number of matrix additions (computational effort O(n²)) it takes only 7 multiplications
of matrices of size n/2 to compute C! Strassen’s algorithm boils down to the recursive application
of these formulas for n = 2^k, k ∈ N.
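The products Q₀, …, Q₆ are not spelled out above; the following sketch uses the standard Strassen products, which are consistent with the combination formulas quoted for C₁₁, C₁₂, C₂₁, C₂₂. It is an illustrative implementation (not a lecture code) and assumes that n is a power of two:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Recursive Strassen multiplication for square matrices with n = 2^k.
MatrixXd strassen(const MatrixXd &A, const MatrixXd &B) {
  const int n = A.rows();
  if (n <= 64) return A * B;  // fall back to the ordinary product for small n
  const int l = n / 2;
  MatrixXd A11 = A.topLeftCorner(l, l),    A12 = A.topRightCorner(l, l),
           A21 = A.bottomLeftCorner(l, l), A22 = A.bottomRightCorner(l, l),
           B11 = B.topLeftCorner(l, l),    B12 = B.topRightCorner(l, l),
           B21 = B.bottomLeftCorner(l, l), B22 = B.bottomRightCorner(l, l);
  // The seven Strassen products (standard choice, matching the formulas above)
  MatrixXd Q0 = strassen(A11 + A22, B11 + B22);
  MatrixXd Q1 = strassen(A21 + A22, B11);
  MatrixXd Q2 = strassen(A11, B12 - B22);
  MatrixXd Q3 = strassen(A22, B21 - B11);
  MatrixXd Q4 = strassen(A11 + A12, B22);
  MatrixXd Q5 = strassen(A21 - A11, B11 + B12);
  MatrixXd Q6 = strassen(A12 - A22, B21 + B22);
  MatrixXd C(n, n);
  C.topLeftCorner(l, l)     = Q0 + Q3 - Q4 + Q6;
  C.topRightCorner(l, l)    = Q2 + Q4;
  C.bottomLeftCorner(l, l)  = Q1 + Q3;
  C.bottomRightCorner(l, l) = Q0 + Q2 - Q1 + Q5;
  return C;
}

int main() {
  const int n = 256;  // power of two
  MatrixXd A = MatrixXd::Random(n, n), B = MatrixXd::Random(n, n);
  std::cout << (strassen(A, B) - A * B).norm() << std::endl;  // small residual
  return 0;
}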
A refined algorithm of this type can achieve complexity O(n^{2.36}), see [?].
In computations involving matrices and vectors the complexity of algorithms can often be reduced by performing
the operations in a particular order:
We consider the multiplication with a rank-1 matrix. Matrices with rank 1 can always be obtained as the
tensor product of two vectors, that is, the matrix product of a column vector and a row vector. Given
a ∈ K^m, b ∈ K^n, x ∈ K^n we may compute the vector y = ab⊤x in two ways:
y = (ab⊤)x ,  (1.4.12)             y = a(b⊤x) .  (1.4.13)
T = (a*b.transpose())*x;           t = a*b.dot(x);
➤ complexity O(mn)                 ➤ complexity O(n + m) (“linear complexity”)
[Fig.: runtime vs. problem size for the two evaluation orders (1.4.12) and (1.4.13); platform details omitted.]
The asymptotic complexity of the straightforward evaluation of triu(AB⊤)x, when supplied with two low-rank matrices A, B ∈ K^{n,p}, p ≪ n, in terms of n → ∞
obviously is O(n²), because an intermediate n × n-matrix AB⊤ is built.
First, consider the case of a tensor product (= rank-1) matrix, that is, p = 1, A ↔ a = [a₁, . . . , a_n]⊤ ∈ K^n,
B ↔ b = [b₁, . . . , b_n]⊤ ∈ K^n. Then
y = triu(ab⊤)x = [ a₁b₁  a₁b₂  ⋯  a₁b_n ; 0  a₂b₂  ⋯  a₂b_n ; ⋮ ; 0  ⋯  0  a_nb_n ] x
              = diag(a₁, . . . , a_n) · T · [ b₁x₁ ; ⋮ ; b_nx_n ] ,
with the upper triangular matrix of ones T, (T)_{i,j} = 1 for i ≤ j and 0 otherwise.
Thus, the core problem is the fast multiplication of a vector with the upper triangular matrix T, described
in EIGEN syntax by Eigen::MatrixXd::Ones(n,n).triangularView<Eigen::Upper>().
Note that multiplication of a vector x with T yields the vector of partial sums of the components of x, accumulated
starting from the last component. This can be achieved by invoking the special C++ command std::partial_sum
from the numeric header (documentation). We also observe that
AB⊤ = ∑_{ℓ=1}^{p} (A)_{:,ℓ} ((B)_{:,ℓ})⊤ ,
so that the computations for the special case p = 1 discussed above can simply be reused p times!
C++11 code 1.4.16: Efficient multiplication with the upper triangular part of a rank-p matrix in EIGEN ➺ GITLAB
2  //! Computation of y = triu(AB^T)x
3  //! Efficient implementation with backward cumulative sum
4  //! (partial_sum)
5  template <class Vec, class Mat>
6  void lrtrimulteff(const Mat& A, const Mat& B, const Vec& x, Vec& y) {
7    const int n = A.rows(), p = A.cols();
8    assert(n == B.rows() && p == B.cols()); // size mismatch
9    for (int l = 0; l < p; ++l) {
10     Vec tmp = (B.col(l).array() * x.array()).matrix().reverse();
11     std::partial_sum(tmp.data(), tmp.data()+n, tmp.data());
12     y += (A.col(l).array() * tmp.reverse().array()).matrix();
13   }
14 }
This code enjoys the obvious complexity of O( pn) for p, n → ∞, p < n. The code offers an example of
a function templated with its argument types, see § 0.2.5. The types Vec and Mat must fit the concept of
E IGEN vectors/matrices.
The next concept from linear algebra is important in the context of computing with multi-dimensional arrays.
A function evaluating (A ⊗ B)x directly, when invoked with two matrices A ∈ K^{m,n} and B ∈ K^{l,k} and a vector x ∈ K^{nk},
will suffer an asymptotic complexity of O(m · n · l · k), determined by the size of the intermediate dense
matrix A ⊗ B ∈ K^{ml,nk}.
The idea is to form the products Bx j , j = 1, . . . , n, once, and then combine them linearly with coefficients
given by the entries in the rows of A:
C++11 code 1.4.19: Efficient multiplication of Kronecker product with vector in EIGEN ➺ GITLAB
2  //! @brief Multiplication of Kronecker product with vector y = (A ⊗ B)x. Elegant way using reshape
3  //! WARNING: using Matrix::Map we assume the matrix is in ColMajor format,
//!          *beware* you may incur bugs if the matrix is in RowMajor instead
4  //! @param[in] A Matrix m × n
5  //! @param[in] B Matrix l × k
6  //! @param[in] x Vector of dim nk
7  //! @param[out] y Vector y = kron(A,B)*x of dim ml
8  template <class Matrix, class Vector>
9  void kronmultv(const Matrix &A, const Matrix &B, const Vector &x, Vector &y) {
10   unsigned int m = A.rows(); unsigned int n = A.cols();
11   unsigned int l = B.rows(); unsigned int k = B.cols();
12   // 1st matrix mult. computes the products Bx_j
13   // 2nd matrix mult. combines them linearly with the coefficients of A
14   Matrix t = B * Matrix::Map(x.data(), k, n) * A.transpose(); //
15   y = Matrix::Map(t.data(), m*l, 1);
16 }
Recall the reshaping of a matrix in E IGEN in order to understand this code: Rem. 1.2.27.
The asymptotic complexity of this code is determined by the two matrix multiplications in Line 14. This
yields the asymptotic complexity O(lkn + mnl ) for l, k, m, n → ∞.
Note that different reshaping is used in the P YTHON code due to the default row major storage order.
From linear algebra [?, Sect. 4.4] or Ex. 0.2.41 we recall the fundamental algorithm of Gram-Schmidt
orthogonalisation of an ordered finite set {a1 , . . . , ak }, k ∈ N, of vectors aℓ ∈ K n :
Input: {a₁, . . . , a_k} ⊂ K^n   (GS)
1: q₁ := a₁ / ‖a₁‖₂   % 1st output vector
2: for j = 2, . . . , k do
3: {  % Orthogonal projection
      q_j := a_j
4:    for ℓ = 1, 2, . . . , j − 1 do
5:    { q_j ← q_j − (a_j · q_ℓ) q_ℓ }
6:    if ( q_j = 0 ) then STOP
7:    else { q_j ← q_j / ‖q_j‖₂ }
8: }
Output: {q₁, . . . , q_j}
In linear algebra we have learnt that, if it does not STOP prematurely, this algorithm will compute
orthonormal vectors q₁, . . . , q_k satisfying
Span{q₁, . . . , q_ℓ} = Span{a₁, . . . , a_ℓ} ,   (1.5.2)
for all ℓ ∈ {1, . . . , k}. More precisely, if a₁, . . . , a_ℓ, ℓ ≤ k, are linearly independent, then the Gram-Schmidt
algorithm will not terminate before the (ℓ + 1)-th step.
✎ Notation: ‖·‖₂ =ˆ Euclidean norm of a vector ∈ K^n
The following code implements the Gram-Schmidt orthonormalization of a set of vectors passed as the
columns of a matrix A ∈ R n,k .
We will soon learn the rationale behind the odd test in Line 13.
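Code 1.5.3 itself is not reproduced at this point (it is available from the GITLAB repository); the following EIGEN sketch merely illustrates the algorithm and a termination test of the kind referred to above (the tolerance 1e-9 is taken from the PYTHON excerpt below):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Sketch of Gram-Schmidt orthonormalisation of the columns of A,
// following the pseudocode above (not the original Code 1.5.3).
MatrixXd gramschmidt_sketch(const MatrixXd &A) {
  MatrixXd Q = A;
  Q.col(0).normalize();                      // first output vector
  for (int j = 1; j < A.cols(); ++j) {
    // orthogonal projection onto the span of the previously computed q's
    Q.col(j) -= Q.leftCols(j) * (Q.leftCols(j).adjoint() * A.col(j));
    // termination test with a relative tolerance (cf. the remark on Line 13)
    if (Q.col(j).norm() <= 1e-9 * A.col(j).norm()) {
      return Q.leftCols(j);                  // premature termination
    }
    Q.col(j).normalize();
  }
  return Q;
}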
6 nq = np.linalg.norm(q)
7 i f nq < 1e-9 * np.linalg.norm(A[:, j]):
8 br eak
9 Q = np.column_stack([Q, q / nq])
10 return Q
Note the different loop range due to the zero-based indexing in P YTHON.
This property can easily be tested numerically, for instance by computing Q⊤Q for a matrix
Q = [q₁, . . . , q_k] ∈ R^{n,k}.
C++11 code 1.5.7: Wrong result from Gram-Schmidt orthogonalisation in EIGEN ➺ GITLAB
2  void gsroundoff(MatrixXd& A) {
3    // Gram-Schmidt orthogonalization of columns of A, see Code 1.5.3
4    MatrixXd Q = gramschmidt(A);
5    // Test orthonormality of columns of Q, which should be an
6    // orthogonal matrix according to theory
7    cout << setprecision(4) << fixed << "I = "
8         << endl << Q.transpose()*Q << endl;
9    // EIGEN's stable internal Gram-Schmidt orthogonalization by
10   // QR-decomposition, see Rem. 1.5.9 below
11   HouseholderQR<MatrixXd> qr(A.rows(),A.cols()); //
12   qr.compute(A); MatrixXd Q1 = qr.householderQ(); //
13   // Test orthonormality
14   cout << "I1 = " << endl << Q1.transpose()*Q1 << endl;
15   // Check orthonormality and span property (1.5.2)
16   MatrixXd R1 = qr.matrixQR().triangularView<Upper>();
17   cout << scientific << "A-Q1*R1 = " << endl << A-Q1*R1 << endl;
18 }
We test the orthonormality of the output vectors of Gram-Schmidt orthogonalization for a special matrix
A ∈ R^{10,10}, a so-called Hilbert matrix, defined by (A)_{i,j} = (i + j − 1)^{−1}. Then Code 1.5.7 produces the
following output:
I =
1.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 1.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 1.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 -0.0000 -0.0000 1.0000 0.0000 -0.0008 -0.0007 -0.0007 -0.0006
0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 -0.0540 -0.0430 -0.0360 -0.0289
-0.0000 -0.0000 -0.0000 -0.0000 -0.0008 -0.0540 1.0000 0.9999 0.9998 0.9996
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0430 0.9999 1.0000 1.0000 0.9999
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0360 0.9998 1.0000 1.0000 1.0000
-0.0000 -0.0000 -0.0000 -0.0000 -0.0006 -0.0289 0.9996 0.9999 1.0000 1.0000
Obviously, the vectors produced by the function gramschmidt fail to be orthonormal, contrary to the
predictions of rigorous results from linear algebra!
However, Line 11, Line 12 of Code 1.5.7 demonstrate another way to orthonormalize the columns of a
matrix using E IGEN’s built-in class template HouseholderQR:
I1 =
1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000
0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 1.0000 0.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 0.0000
-0.0000 -0.0000 0.0000 -0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 1.0000 0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000
-0.0000 0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000 1.0000 -0.0000
0.0000 -0.0000 0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 -0.0000 1.0000
Now we observe apparently perfect orthogonality (1.5.6) of the columns of the matrix Q1 in Code 1.5.7.
There is another algorithm that reliably yields the theoretical output of Gram-Schmidt orthogonalization.
Computers cannot compute “properly” in R : numerical computations may not respect the laws of
analysis and linear algebra!
In Code 1.5.7 we saw the use of the EIGEN class HouseholderQR<MatrixType> for the purpose of
Gram-Schmidt orthogonalisation. The underlying theory and algorithms will be explained later in Section
3.3.3. There we will have the following insight:
➣ Up to signs the columns of the matrix Q available from the QR-decomposition of A are the same
vectors as produced by the Gram-Schmidt orthogonalisation of the columns of A.
Code 1.5.7 demonstrates a case where a desired result can be obtained by two algebraically equiv-
alent computations, that is, they yield the same result in a mathematical sense. Yet, when im-
! plemented on a computer, the results can be vastly different. One algorithm may produce junk
(“unstable algorithm”), whereas the other lives up to the expectations (“stable algorithm”)
Supplement to Exp. 1.5.5: despite its ability to produce orthonormal vectors, we get as output for D=A-Q1*R1
in Code 1.5.7:
D =
2.2204e-16 3.3307e-16 3.3307e-16 1.9429e-16 1.9429e-16 5.5511e-17 1.3878e-16 6.9389e-17 8.3267e-17 9.7145e-17
0.0000e+00 1.1102e-16 8.3267e-17 5.5511e-17 0.0000e+00 5.5511e-17 -2.7756e-17 0.0000e+00 0.0000e+00 4.1633e-17
-5.5511e-17 5.5511e-17 2.7756e-17 5.5511e-17 0.0000e+00 0.0000e+00 0.0000e+00 -1.3878e-17 1.3878e-17 1.3878e-17
0.0000e+00 5.5511e-17 2.7756e-17 2.7756e-17 0.0000e+00 1.3878e-17 -1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 1.3878e-17 4.1633e-17
-2.7756e-17 2.7756e-17 1.3878e-17 4.1633e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 2.7756e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17 2.0817e-17
0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.0817e-17 2.7756e-17
1.3878e-17 1.3878e-17 1.3878e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 6.9389e-18 -6.9389e-18 1.3878e-17
0.0000e+00 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 0.0000e+00 0.0000e+00 1.3878e-17 1.3878e-17
➥ The computed QR-decomposition apparently fails to meet the exact algebraic requirements stipulated
by Thm. 3.3.9. However, note the tiny size of the “defect”.
The two different QR-decompositions (3.3.3.1) and (3.3.3.1) of a matrix A ∈ K n,k , n, k ∈ N, can be
computed in E IGEN as follows:
HouseholderQR<MatrixXd> qr(A);
// Full QR-decomposition (3.3.3.1)
Q = qr.householderQ();
// Economical QR-decomposition (3.3.3.1)
thinQ = qr.householderQ() *
MatrixXd::Identity(A.rows(), s t d ::min(A.rows(), A.cols()));
The returned matrices Q and R correspond to the QR-factors Q and R as defined above. See the
discussion of whether the “economical” decomposition is truly economical: some expression template
magic is involved.
The two different QR-decompositions (3.3.3.1) and (3.3.3.1) of a matrix A ∈ K^{n,k}, n, k ∈ N, can be
computed in PYTHON as follows:
1 Q, R = np.linalg.qr(A, mode=’reduced’) # Economical (3.3.3.1)
2 Q, R = np.linalg.qr(A, mode=’complete’) # Full (3.3.3.1)
The returned matrices Q and R correspond to the QR-factors Q and R as defined above.
The reason, why computers must fail to execute exact computations with real numbers is clear:
Computer = finite automaton  ➢  can handle only finitely many numbers, not R: the machine numbers, set M
The set of machine numbers M cannot be closed under elementary arithmetic operations
+, −, ·, /, that is, when adding, multiplying, etc., two machine numbers the result may not belong
to M.
The results of elementary operations with operands in M have to be mapped back to M, an oper-
ation called rounding.
The impact of roundoff means that mathematical identities may not carry over to the computational realm.
As we have seen above in Exp. 1.5.5: computers cannot compute “properly”!
numerical computations ≠ analysis / linear algebra
This introduces a new and important aspect in the study of numerical algorithms!
Now we give a brief sketch of the internal structure of machine numbers ∈ M:
machine number ∈ M :   x = ± 0.d₁d₂…d_m · B^E ,   leading digit d₁ never = 0 ,
with m digits for the mantissa and a fixed number of digits for the exponent E.
Clearly, there is a largest element of M and two that are closest to zero. These are mainly determined by
the range for the exponent E, cf. Def. 1.5.15, and in C++ they can be queried via the
std::numeric_limits<double>::max() and std::numeric_limits<double>::min()
functions. Other properties of arithmetic types can be queried accordingly from the numeric_limits header.
From Def. 1.5.15 it is clear that there are equi-spaced sections of M and that the gaps between machine
numbers are bigger for larger numbers: starting at B^{e_min − 1} the spacing is B^{e_min − m}, in the next section
B^{e_min − m + 1}, then B^{e_min − m + 2}, and so on. The gap around 0 is partly filled with non-normalized numbers.
Non-normalized numbers violate the lower bound for the mantissa in Def. 1.5.15.
(1.5.18) IEEE standard 754 for machine numbers → [?], [?, Sect. 2.4], → link
No surprise: for modern computers B = 2 (binary system), the other parameters of the universally
implemented machine number system are
single precision : m = 24∗ ,E ∈ {−125, . . . , 128} ➣ 4 bytes
double precision : m = 53∗ ,E ∈ {−1021, . . . , 1024} ➣ 8 bytes
∗: including bit indicating sign
The standardisation of machine numbers is important, because it ensures that the same numerical algo-
rithm, executed on different computers will nevertheless produce the same result.
Output:
1 inf
2 0
3 inf
4 -nan
E = e_max, M ≠ 0   =ˆ  NaN = Not a number → exception
E = e_max, M = 0   =ˆ  Inf = Infinity → overflow
E = 0              =ˆ  non-normalized numbers → underflow
E = 0, M = 0       =ˆ  number 0
☞ C++ does not always fulfill the requirements of the IEEE 754 standard and it needs to be checked
with std::numeric_limits<T>::is_iec559.
8  int main() {
9    cout << std::numeric_limits<double>::is_iec559 << endl
10        << std::defaultfloat << numeric_limits<double>::min() << endl
11        << std::hexfloat << numeric_limits<double>::min() << endl
12        << std::defaultfloat << numeric_limits<double>::max() << endl
13        << std::hexfloat << numeric_limits<double>::max() << endl;
14 }
1 true
2 2.22507e−308
Output: 3 0010000000000000
4 1.79769e+308
5 7fefffffffffffff
Output:
1 2.22044604925031e−16
2 6.75015598972095e−14
3 −1.60798663273454e−09
Can you devise a similar calculation, whose result is even farther off zero? Apparently the rounding that
inevitably accompanies arithmetic operations in M can lead to results that are far away from the true
result.
ε_abs := |x − x̃| ,     ε_rel := |x − x̃| / |x| .
The number of correct (significant, valid) digits of an approximation x̃ of x ∈ K is defined through the
relative error:
If ε_rel := |x − x̃| / |x| ≤ 10^{−ℓ}, then x̃ has ℓ correct digits, ℓ ∈ N₀.
(Recall that argmin x F( x ) is the set of arguments of a real valued function F that makes it attain its (global)
minimum.)
Of course, ➊ above is not possible in a strict sense, but the effect of both steps can be realised and yields
a floating point realization of ⋆ ∈ {+, −, ·, /}.
Let us denote by EPS the largest relative error (→ Def. 1.5.24) incurred through rounding:
EPS := max_{x ∈ I\{0}}  |rd(x) − x| / |x| ,   (1.5.30)
For machine numbers according to Def. 1.5.15 EPS can be computed from the defining parameters B
(base) and m (length of mantissa) [?, p. 24]:
However, when studying roundoff errors, we do not want to delve into the intricacies of the internal repre-
sentation of machine numbers. This can be avoided by just using a single bound for the relative error due
to rounding, and, thus, also for the relative error potentially suffered in each elementary operation.
There is a small positive number EPS, the machine precision, such that for the elementary arithmetic operations ⋆ ∈ {+, −, ·, /} and “hard-wired” functions∗ f ∈ {exp, sin, cos, log, …} it holds that
x ⋆̃ y = (x ⋆ y)(1 + δ) ,   f̃(x) = f(x)(1 + δ)   ∀ x, y ∈ M ,  with |δ| ≤ EPS.
Output:
1 2.22044604925031 e−16
Knowing the machine precision can be important for checking the validity of computations or coding ter-
mination conditions for iterative approximations.
cout.precision(25);
double eps = numeric_limits<double>::epsilon();
cout << fixed << 1.0 + 0.5*eps << endl
     << 1.0 - 0.5*eps << endl
     << (1.0 + 2/eps) - 2/eps << endl;

In fact, the following “definition” of EPS is sometimes used: EPS is the smallest positive number ∈ M for which 1 +̃ EPS ≠ 1 (in M).

Output:
1 1.0000000000000000000000000
2 0.9999999999999998889776975
3 0.0000000000000000000000000
We find that 1 +̃ EPS = 1 actually complies with the “axiom” of roundoff error analysis, Ass. 1.5.32:
1 = (1 + EPS)(1 + δ)  ⇒  |δ| = EPS/(1 + EPS) < EPS ,
2/EPS = (1 + 2/EPS)(1 + δ)  ⇒  |δ| = EPS/(2 + EPS) < EPS .
!  Do we have to worry about these tiny roundoff errors?
Since results of numerical computations are almost always polluted by roundoff errors:
Tests like if (x == 0) are pointless and even dangerous, if x contains the result of a numerical computation.
!  Remedy: test if (abs(x) < eps*s) ..., where s ≙ a positive number, compared to which |x| should be small.
We saw a first example of this practice in Code 1.5.3, Line 13.
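A minimal sketch of this pattern (a hypothetical helper, not one of the lecture codes; the scale s is an assumption supplied by the caller):

#include <cmath>
#include <limits>
// Decide whether a computed value x is "numerically zero" relative to a
// problem-dependent scale s (e.g. the norm of the input data).
bool isNumericallyZero(double x, double s) {
  const double eps = std::numeric_limits<double>::epsilon();
  return std::abs(x) < eps * s;   // never test x == 0.0 for computed x
}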
1 5.68175492717434 e−322
Output: 2 3.15248510554597
A simple example teaching how to avoid overflow during the computation of the norm of a 2D vector [?, Ex. 2.9]:
r = √(x² + y²) = |x| · √(1 + (y/x)²) , if |x| ≥ |y| ,
r = √(x² + y²) = |y| · √(1 + (x/y)²) , if |y| > |x| .
Straightforward evaluation of √(x² + y²) overflows when |x| > √(max |M|) or |y| > √(max |M|); with the rescaled formulas the argument of the square root stays of moderate size. ➢ no overflow!
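A short C++ sketch of this rescaling (not one of the lecture codes):

#include <cmath>
// Overflow-avoiding evaluation of r = sqrt(x^2 + y^2): scale by the larger
// of |x|, |y| so that the argument of sqrt stays close to 1.
double norm2d(double x, double y) {
  double ax = std::abs(x), ay = std::abs(y);
  if (ax == 0.0 && ay == 0.0) return 0.0;
  if (ax >= ay) { double q = ay / ax; return ax * std::sqrt(1.0 + q * q); }
  else          { double q = ax / ay; return ay * std::sqrt(1.0 + q * q); }
}
// (The C++11 standard library provides std::hypot(x, y) for the same purpose.)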
1.5.4 Cancellation
In general, predicting the impact of roundoff errors on the result of a multi-stage computation is very diffi-
cult, if possible at all. However, there is a constellation that is particularly prone to dangerous amplification
of roundoff and still can be detected easily.
The following simple E IGEN code computes the real roots of a quadratic polynomial p(ξ ) = ξ 2 + αξ + β
by the discriminant formula
p(ξ₁) = p(ξ₂) = 0 ,   ξ_{1,2} = ½ (−α ± √D) ,  if D := α² − 4β ≥ 0 .   (1.5.40)
C++11-code 1.5.41: Discriminant formula for the real roots of p(ξ) = ξ² + αξ + β ➺ GITLAB
//! C++ function computing the zeros of a quadratic polynomial
//! ξ ↦ ξ² + αξ + β by means of the familiar discriminant
//! formula ξ_{1,2} = ½(−α ± √(α² − 4β)). However
//! this implementation is vulnerable to round-off! The zeros are
//! returned in a column vector
Vector2d zerosquadpol(double alpha, double beta) {
  Vector2d z;
  // (remaining steps sketched here; the complete listing is on GITLAB)
  double D = std::pow(alpha, 2) - 4 * beta;   // discriminant
  if (D < 0) throw "no real zeros";
  double wD = std::sqrt(D);
  z << (-alpha - wD) / 2, (-alpha + wD) / 2;  // both roots from (1.5.40)
  return z;
}
This formula is applied to the quadratic polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) after its coefficients α, β have been computed from γ, which will have introduced small relative roundoff errors (of size EPS).
Observation: for large γ the computed small root may be fairly inaccurate (large relative error), whereas the large root is obtained with a small relative error; the relative errors in the coefficients lead to a “wrong” small root.
In order to understand why the small root is much more severely affected by roundoff, note that its com-
putation involves the subtraction of two large numbers, if γ is large. This is the typical situation, in which
cancellation occurs.
We look at the exact subtraction of two almost equal positive numbers both of which have small relative
errors (red boxes) with respect to some desired exact value (indicated by blue boxes). The result of the
subtraction will be small, but the errors may add up during the subtraction, ultimately constituting a large
fraction of the result.
Cancellation ≙ subtraction of almost equal numbers
(➤ extreme amplification of relative errors)
[Fig. 41: the (absolute) errors of the two operands carry over unchanged into the small result of the subtraction, where they make up a large fraction. The roundoff error introduced by the subtraction itself is negligible.]
We consider two positive numbers x, y of about the same size, afflicted with relative errors ≈ 10⁻⁷. This means that their seventh decimal digits are perturbed, indicated here by ∗. When we subtract the two numbers, the perturbed digits are shifted to the left (the leading digits cancel and zeros are padded at the end), resulting in a possible relative error of ≈ 10⁻³.
Again, this example demonstrates that cancellation wreaks havoc through error amplification, not through
the roundoff error due to the subtraction.
Example 1.5.45 (Cancellation when evaluating difference quotients → [?, Sect. 8.2.6], [?, Ex. 1.3])
f′(x) = lim_{h→0} (f(x + h) − f(x)) / h .
This suggests the following approximation of the derivative by a difference quotient with small but finite h > 0:
f′(x) ≈ (f(x + h) − f(x)) / h   for |h| ≪ 1 .
Results from analysis tell us that the approximation error should tend to zero for h → 0. More precise quantitative information is provided by the Taylor formula for a twice continuously differentiable function [?, p. 5]:
(f(x + h) − f(x))/h − f′(x) = ½ h f″(ξ)   for some ξ = ξ(x, h) ∈ [min{x, x + h}, max{x, x + h}] .   (1.5.47)
We investigate the approximation of the derivative by difference quotients for f = exp, x = 0, and
different values of h > 0:
[Fig. 42: relative error of the difference quotient approximation as a function of h.]
Line 16 (of the corresponding code): the literal "1.1" instead of 1.1 prevents the conversion to double.
Obvious culprit: cancellation when computing the numerator of the difference quotient for small |h|
leads to a strong amplification of inevitable errors introduced by the evaluation of the transcendent
exponential function.
We witness the competition of two opposite effects: Smaller h results in a better approximation of the
derivative by the difference quotient, but the impact of cancellation is the stronger the smaller |h|.
Approximation error  | f′(x) − (f(x + h) − f(x))/h |  → 0  as h → 0 ,
Impact of roundoff  → ∞  as h → 0 .
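The competition of the two effects can be reproduced with a few lines of C++ (a sketch, not the lecture's code; the exact numbers depend on the platform):

#include <cmath>
#include <iostream>
// One-sided difference quotients for f = exp at x = 0 (exact derivative 1):
// the error first decreases with h and then grows again once cancellation in
// exp(x+h) - exp(x) dominates.
int main() {
  const double x = 0.0;
  for (double h = 1e-1; h > 1e-16; h /= 10.0) {
    double dq = (std::exp(x + h) - std::exp(x)) / h;   // difference quotient
    std::cout << "h = " << h << ", error = " << std::abs(dq - std::exp(x)) << "\n";
  }
}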
In order to provide a rigorous underpinning for our conjecture, in this example we embark on our first
roundoff error analysis merely based on the “Axiom of roundoff analysis” Ass. 1.5.32: As in the computa-
tional example above we study the approximation of f ′ ( x ) = e x for f = exp, x ∈ R .
(Note that the estimate for the term (e^h − 1)/h is a particular case of (1.5.47).)
relative error:  |e^x − d̃f| / e^x ≈ h + 2·EPS/h  →  min  for  h = √(2·EPS) ,
where d̃f denotes the computed difference quotient.
In double precision: √(2·EPS) = 2.107342425544702 · 10⁻⁸.
In the numerical experiment of Ex. 1.5.45 we computed the relative error of the result by subtraction, see Code 1.5.48. Of course, massive cancellation will occur! Do we have to worry?
In this case cancellation can be tolerated, because we are interested only in the magnitude of the relative error. Even if that magnitude is itself affected by a large relative error, the information is not compromised.
For example, if the relative error has the exact value 10⁻⁸ but can be computed only with a huge relative error of 10%, then the perturbed value still lies in the range [0.9·10⁻⁸, 1.1·10⁻⁸]. Therefore it still has the correct magnitude and still permits us to determine the number of valid digits correctly.
The matrix created by the M ATLAB command A = hilb(10), the so-called Hilbert matrix, has columns
that are almost linearly dependent.
a² − b² = (a + b)(a − b) ,  a, b ∈ ℝ .
We evaluate this term by means of two algebraically equivalent algorithms for the input data a = 1.3, b = 1.2 in 2-digit decimal arithmetic with standard rounding. (“Algebraically equivalent” means that the two algorithms produce the same results in the absence of roundoff errors.)

Algorithm A                        Algorithm B
x := a ·̃ a = 1.7 (rounded)         x := a +̃ b = 2.5 (exact)
y := b ·̃ b = 1.4 (rounded)         y := a −̃ b = 0.1 (exact)
x −̃ y = 0.30 (exact)               x ·̃ y = 0.25 (exact)
Algorithm B produces the exact result, whereas Algorithm A fails to do so. Is this pure coincidence or an
indication of the superiority of algorithm B? This question can be answered by roundoff error analysis. We
demonstrate the approach for the two algorithms A & B and general input a, b ∈ ℝ.
Roundoff error analysis heavily relies on Ass. 1.5.32 and dropping terms of “higher order” in the machine
precision, that is terms that behave like O(EPSq ), q > 1. It involves introducing the relative roundoff error
for every elementary operation through a factor (1 + δ), |δ| ≤ EPS.
Algorithm A:
x = a²(1 + δ₁) ,  y = b²(1 + δ₂) ,
f̃ = (a²(1 + δ₁) − b²(1 + δ₂))(1 + δ₃) = f + a²δ₁ − b²δ₂ + (a² − b²)δ₃ + O(EPS²) ,
|f̃ − f| / |f| ≤ EPS · (a² + b² + |a² − b²|) / |a² − b²| + O(EPS²) = EPS · (1 + (a² + b²)/|a² − b²|) + O(EPS²) ,   (1.5.53)
where the O(EPS²) terms will be neglected.
For a ≈ b the relative error of the result of Algorithm A will be much larger than the machine
precision EPS. This reflects cancellation in the last subtraction step.
Algorithm B:
x = (a + b)(1 + δ₁) ,  y = (a − b)(1 + δ₂) ,
f̃ = (a + b)(a − b)(1 + δ₁)(1 + δ₂)(1 + δ₃) = f + (a² − b²)(δ₁ + δ₂ + δ₃) + O(EPS²) ,
|f̃ − f| / |f| ≤ |δ₁ + δ₂ + δ₃| + O(EPS²) ≤ 3·EPS + O(EPS²) .   (1.5.54)
The reason is that input data and initial intermediate results are usually not as much tainted by roundoff errors as numbers computed after many steps.
The following examples demonstrate a few fundamental techniques for steering clear of cancellation by
using alternative formulas that yield the same value (in exact arithmetic), but do not entail subtracting two
numbers of almost equal size.
Example 1.5.56 (Stable discriminant formula → Ex. 1.5.39, [?, Ex. 2.10])
If ξ 1 and ξ 2 are the two roots of the quadratic polynomial p(ξ ) = ξ 2 + αξ + β, then ξ 1 · ξ 2 = β (Vieta’s
formula). Thus once we have computed a root, we can obtain the other by simple division.
Idea:
➊ Depending on the sign of α compute “stable root” without cancellation.
➋ Compute other root from Vieta’s formula (avoiding subtraction)
➥ Invariably, we add numbers with the same sign in Line 17 and Line 21.
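The lecture's Code 1.5.42 is not reproduced here; the following is a minimal sketch of the same idea (the sign case distinction guarantees same-sign addition, the second root comes from Vieta's formula):

#include <cmath>
#include <Eigen/Dense>
// Stable computation of the real roots of p(xi) = xi^2 + alpha*xi + beta,
// assuming the discriminant D = alpha^2 - 4*beta is nonnegative.
Eigen::Vector2d zerosquadpolstab(double alpha, double beta) {
  Eigen::Vector2d z;
  double wD = std::sqrt(alpha * alpha - 4 * beta);  // sqrt of discriminant
  if (alpha >= 0) {
    double t = 0.5 * (-alpha - wD);   // "stable" root: same-sign addition
    z << t, beta / t;                 // other root via Vieta: xi1*xi2 = beta
  } else {
    double t = 0.5 * (-alpha + wD);   // "stable" root
    z << beta / t, t;
  }
  return z;
}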
[Fig. 44: roundoff in the computation of the zeros of a parabola — relative error in ξ₁ versus γ (range 0…1000, scale ×10⁻¹¹) for the unstable Code 1.5.42 and the stable variant.]
Observation: the new code also computes the small root of the polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) (expanded into monomials) with small relative error, uniformly in γ.
[Fig. 45: relative error of the naive evaluation of 1 − cos(x) as a function of x ∈ [10⁻⁷, 1] (log-log scale); the error grows dramatically as x → 0 due to cancellation.]
Analytic manipulations offer ample opportunity to rewrite expressions in equivalent form immune to
cancellation.
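An example of such a rewriting, matching the situation of Fig. 45 (a sketch, not one of the lecture codes): 1 − cos x = 2 sin²(x/2), which involves no subtraction of nearly equal numbers.

#include <cmath>
// Naive form: cancellation for |x| << 1.
double one_minus_cos_naive(double x)  { return 1.0 - std::cos(x); }
// Analytically equivalent, cancellation-free form.
double one_minus_cos_stable(double x) { double s = std::sin(0.5 * x); return 2.0 * s * s; }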
[Figs. 46, 47: regular n-gon inscribed in the unit circle, split into n isosceles triangles with apex angle α_n := 2π/n.]
Area of the n-gon:
A_n = n · cos(α_n/2) · sin(α_n/2) = (n/2) · sin α_n = (n/2) · sin(2π/n) .
A recursion formula for A_n (doubling n in every step) is derived from the half-angle identity
sin(α_n/2) = √((1 − cos α_n)/2) = √((1 − √(1 − sin² α_n)) / 2) .
Initial approximation: A₆ = (3/2)√3 .
The approximation deteriorates after applying the recursion formula many times:
n | A_n | A_n − π | sin α_n
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589794 0.500000000000000
24 3.105828541230250 -0.035764112359543 0.258819045102521
48 3.132628613281237 -0.008964040308556 0.130526192220052
96 3.139350203046872 -0.002242450542921 0.065403129230143
192 3.141031950890530 -0.000560702699263 0.032719082821776
384 3.141452472285344 -0.000140181304449 0.016361731626486
768 3.141557607911622 -0.000035045678171 0.008181139603937
1536 3.141583892148936 -0.000008761440857 0.004090604026236
3072 3.141590463236762 -0.000002190353031 0.002045306291170
6144 3.141592106043048 -0.000000547546745 0.001022653680353
12288 3.141592516588155 -0.000000137001638 0.000511326906997
24576 3.141592618640789 -0.000000034949004 0.000255663461803
49152 3.141592645321216 -0.000000008268577 0.000127831731987
98304 3.141592645321216 -0.000000008268577 0.000063915865994
196608 3.141592645321216 -0.000000008268577 0.000031957932997
393216 3.141592645321216 -0.000000008268577 0.000015978966498
786432 3.141593669849427 0.000001016259634 0.000007989485855
1572864 3.141592303811738 -0.000000349778055 0.000003994741190
3145728 3.141608696224804 0.000016042635011 0.000001997381017
6291456 3.141586839655041 -0.000005813934752 0.000000998683561
12582912 3.141674265021758 0.000081611431964 0.000000499355676
25165824 3.141674265021758 0.000081611431964 0.000000249677838
50331648 3.143072740170040 0.001480086580246 0.000000124894489
100663296 3.159806164941135 0.018213511351342 0.000000062779708
201326592 3.181980515339464 0.040387861749671 0.000000031610136
402653184 3.354101966249685 0.212509312659892 0.000000016660005
805306368 4.242640687119286 1.101048033529493 0.000000010536712
1610612736 6.000000000000000 2.858407346410207 0.000000007450581
sin(α_n/2) = √((1 − cos α_n)/2) = √((1 − √(1 − sin² α_n)) / 2) :
for α_n ≪ 1 we have √(1 − sin² α_n) ≈ 1  ➤  cancellation in the numerator!
We arrive at an equivalent formula not vulnerable to cancellation, essentially using the identity (a + b)(a − b) = a² − b² in order to eliminate the difference of square roots in the numerator:
sin(α_n/2) = √((1 − √(1 − sin² α_n)) / 2)
           = √( (1 − √(1 − sin² α_n)) (1 + √(1 − sin² α_n)) / (2 (1 + √(1 − sin² α_n))) )
           = √( (1 − (1 − sin² α_n)) / (2 (1 + √(1 − sin² α_n))) )
           = sin α_n / √( 2 (1 + √(1 − sin² α_n)) ) .
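Both recursions fit into a few lines of C++ (a sketch under the assumptions above, not one of the lecture codes):

#include <cmath>
#include <iostream>
// Unstable and stable recursion for s_n := sin(alpha_n), alpha_n = 2*pi/n;
// the area of the inscribed n-gon is A_n = n/2 * s_n. Doubling n in each
// step uses the half-angle formula in its two equivalent forms.
int main() {
  double s_unstable = std::sqrt(3.0) / 2.0;   // sin(alpha_6)
  double s_stable   = s_unstable;
  double n = 6.0;
  for (int k = 0; k < 30; ++k) {
    // unstable: subtraction of nearly equal numbers once alpha_n is small
    s_unstable = std::sqrt((1.0 - std::sqrt(1.0 - s_unstable * s_unstable)) / 2.0);
    // stable: algebraically equivalent, no cancellation
    s_stable = s_stable / std::sqrt(2.0 * (1.0 + std::sqrt(1.0 - s_stable * s_stable)));
    n *= 2.0;
    std::cout << n << ": unstable A_n = " << n / 2.0 * s_unstable
              << ", stable A_n = " << n / 2.0 * s_stable << "\n";
  }
}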
Using the stable recursion, we observe better approximation for polygons with more corners:
n | A_n | A_n − π | sin α_n
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589793 0.500000000000000
24 3.105828541230249 -0.035764112359544 0.258819045102521
48 3.132628613281238 -0.008964040308555 0.130526192220052
96 3.139350203046867 -0.002242450542926 0.065403129230143
192 3.141031950890509 -0.000560702699284 0.032719082821776
384 3.141452472285462 -0.000140181304332 0.016361731626487
768 3.141557607911857 -0.000035045677936 0.008181139603937
1536 3.141583892148318 -0.000008761441475 0.004090604026235
3072 3.141590463228050 -0.000002190361744 0.002045306291164
6144 3.141592105999271 -0.000000547590522 0.001022653680338
12288 3.141592516692156 -0.000000136897637 0.000511326907014
24576 3.141592619365383 -0.000000034224410 0.000255663461862
49152 3.141592645033690 -0.000000008556103 0.000127831731976
98304 3.141592651450766 -0.000000002139027 0.000063915866118
196608 3.141592653055036 -0.000000000534757 0.000031957933076
393216 3.141592653456104 -0.000000000133690 0.000015978966540
786432 3.141592653556371 -0.000000000033422 0.000007989483270
1572864 3.141592653581438 -0.000000000008355 0.000003994741635
3145728 3.141592653587705 -0.000000000002089 0.000001997370818
6291456 3.141592653589271 -0.000000000000522 0.000000998685409
12582912 3.141592653589663 -0.000000000000130 0.000000499342704
25165824 3.141592653589761 -0.000000000000032 0.000000249671352
50331648 3.141592653589786 -0.000000000000008 0.000000124835676
100663296 3.141592653589791 -0.000000000000002 0.000000062417838
201326592 3.141592653589794 0.000000000000000 0.000000031208919
402653184 3.141592653589794 0.000000000000001 0.000000015604460
805306368 3.141592653589794 0.000000000000001 0.000000007802230
1610612736 3.141592653589794 0.000000000000001 0.000000003901115
[Fig. 48: approximation error |A_n − π| versus n (log-log scale) for the unstable and the stable recursion; the stable recursion reaches errors near machine precision, while the unstable one eventually drifts away from π.]
A related example: approximating exp(x) by summing its Taylor series yields the approximation ẽxp(x) with the following results:
x | approximation ẽxp(x) | exp(x) | |exp(x) − ẽxp(x)| / exp(x)
-20 6.1475618242e-09 2.0611536224e-09 1.982583033727893
-18 1.5983720359e-08 1.5229979745e-08 0.049490585500089
-16 1.1247503300e-07 1.1253517472e-07 0.000534425951530
-14 8.3154417874e-07 8.3152871910e-07 0.000018591829627
-12 6.1442105142e-06 6.1442123533e-06 0.000000299321453
-10 4.5399929604e-05 4.5399929762e-05 0.000000003501044
-8 3.3546262812e-04 3.3546262790e-04 0.000000000662004
-6 2.4787521758e-03 2.4787521767e-03 0.000000000332519
-4 1.8315638879e-02 1.8315638889e-02 0.000000000530724
-2 1.3533528320e-01 1.3533528324e-01 0.000000000273603
0 1.0000000000e+00 1.0000000000e+00 0.000000000000000
2 7.3890560954e+00 7.3890560989e+00 0.000000000479969
4 5.4598149928e+01 5.4598150033e+01 0.000000001923058
6 4.0342879295e+02 4.0342879349e+02 0.000000001344248
8 2.9809579808e+03 2.9809579870e+03 0.000000002102584
10 2.2026465748e+04 2.2026465795e+04 0.000000002143799
12 1.6275479114e+05 1.6275479142e+05 0.000000001723845
14 1.2026042798e+06 1.2026042842e+06 0.000000003634135
16 8.8861105010e+06 8.8861105205e+06 0.000000002197990
18 6.5659968911e+07 6.5659969137e+07 0.000000003450972
20 4.8516519307e+08 4.8516519541e+08 0.000000004828737
[Fig. 49: value of the k-th summand x^k/k! of the exponential series for x = −20, plotted against the index k (scale ×10⁷).]
Observation: the summands reach magnitudes of roughly 4·10⁷ with alternating sign, while the result exp(−20) ≈ 2·10⁻⁹. The tiny result emerges from the near-cancellation of huge intermediate terms, so their roundoff errors completely swamp it.
A simple remedy for negative arguments:
exp(x) = 1 / exp(−x) ,  if x < 0 .
Recall the Taylor expansion formula in one dimension for a function f that is m + 1 times continuously differentiable in a neighborhood of x [?, Satz 5.5.1]:
f(x + h) = Σ_{k=0}^{m} (1/k!) f^(k)(x) h^k + R_m(x, h) ,   R_m(x, h) = (1/(m+1)!) f^(m+1)(ξ) h^(m+1) ,
for some ξ ∈ [min{x, x + h}, max{x, x + h}] and for all sufficiently small |h|. Here R_m(x, h) is called the remainder term and f^(k) denotes the k-th derivative of f.
Cancellation in (1.5.66) can be avoided by replacing exp(a), a > 0, with a suitable Taylor expansion around a = 0 and then dividing by a:
(exp(a) − 1)/a = Σ_{k=0}^{m} (1/(k+1)!) a^k + R_m(a) ,   R_m(a) = (1/(m+1)!) exp(ξ) a^m  for some 0 ≤ ξ ≤ a .
For a similar discussion see [?, Ex. 2.12].
Issue: How to choose the number m of terms to be retained in the Taylor expansion? We have to pick
m large enough such that the relative approximation error remains below a prescribed threshold tol. To
estimate the relative approximation error, we use the expression for the remainder together with the simple
estimate (exp(a) − 1)/a > 1 for all a > 0:
rel. err. = | (e^a − 1)/a − Σ_{k=0}^{m} a^k/(k+1)! | / ((e^a − 1)/a) ≤ (1/(m+1)!) exp(ξ) a^m ≤ (1/(m+1)!) exp(a) a^m .
For a = 10⁻³ we get
m     | 1          | 2          | 3          | 4          | 5
bound | 1.0010e-03 | 5.0050e-07 | 1.6683e-10 | 4.1708e-14 | 8.3417e-18
Hence, keeping m = 3 terms is enough for achieving about 10 valid digits.
In the corresponding MATLAB code the truncated Taylor expansion v = 1.0 + (1.0/2 + 1.0/6*a)*a is used for small |a|, while the straightforward expression v = (exp(a)-1.0)/a is used otherwise.
[Fig. 50: relative error of the straightforward evaluation (exp(a)-1.0)/a versus the argument a (log-log scale, errors ranging from ≈ 10⁻¹⁶ up to ≈ 10⁻⁶ as a → 0); the error is computed by comparison with MATLAB's built-in stable evaluation of exp(x) − 1.]
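In C++ the same cancellation problem can be sidestepped with the standard-library function std::expm1 (a sketch, not one of the lecture codes):

#include <cmath>
// Evaluation of g(a) = (exp(a) - 1)/a for a != 0.
double g_naive(double a)  { return (std::exp(a) - 1.0) / a; }   // cancellation for |a| << 1
double g_stable(double a) { return std::expm1(a) / a; }         // std::expm1 evaluates exp(a)-1 accurately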
We have seen that a particular “problem” can be tackled by different “algorithms”, which produce different
results due to roundoff errors. This section will clarify what distinguishes a “good” algorithm from a rather
abstract point of view.
Note: In this course, both the data space X and the result space Y will always be subsets of finite dimen-
sional vector spaces.
We consider the “problem” of computing the product Ax for a given matrix A ∈ K^{m,n} and a given vector x ∈ K^n.
➣ • data space X = K^{m,n} × K^n (input is a matrix and a vector),
  • result space Y = K^m (space of column vectors),
  • problem function F : X → Y, F(A, x) := Ax.
Norms provide tools for measuring errors. Recall from linear algebra and calculus [?, Sect. 4.3], [?,
Sect. 6.1]:
All norms on the vector space K^n, n ∈ ℕ, are equivalent in the sense that for any two norms ‖·‖₁ and ‖·‖₂ we can always find a constant C > 0 such that
‖v‖₁ ≤ C ‖v‖₂   ∀ v ∈ K^n .   (1.5.72)
Of course, the constant C will usually depend on n and the norms under consideration.
For the vector norms introduced above, explicit expressions for the constants “C” are available: for all x ∈ K^n
‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂ ,   (1.5.73)
‖x‖_∞ ≤ ‖x‖₂ ≤ √n ‖x‖_∞ ,   (1.5.74)
‖x‖_∞ ≤ ‖x‖₁ ≤ n ‖x‖_∞ .   (1.5.75)
The matrix space K m,n is a vector space, of course, and can also be equipped with various norms. Of
particular importance are norms induced by vector norms on K n and K m .
Given vector norms ‖·‖₁ and ‖·‖₂ on K^n and K^m, respectively, the associated matrix norm is defined by
M ∈ K^{m,n} :   ‖M‖ := sup_{x ∈ K^n \ {0}} ‖Mx‖₂ / ‖x‖₁ .
By virtue of the definition the matrix norms enjoy an important property: they are sub-multiplicative, ‖AB‖ ≤ ‖A‖‖B‖.
✎ notations for matrix norms of square matrices associated with the standard vector norms:
‖x‖₂ → ‖M‖₂ ,  ‖x‖₁ → ‖M‖₁ ,  ‖x‖_∞ → ‖M‖_∞
Rather simple formulas are available for the matrix norms induced by the vector norms ‖·‖₁ and ‖·‖_∞:
➢ matrix norm ↔ ‖·‖₁ = column sum norm   ‖M‖₁ := max_{j=1,…,n} Σ_{i=1}^{m} |m_ij| ,   (1.5.80)
➢ matrix norm ↔ ‖·‖_∞ = row sum norm     ‖M‖_∞ := max_{i=1,…,m} Σ_{j=1}^{n} |m_ij| .
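These formulas are easy to evaluate directly; a small Eigen sketch (not one of the lecture codes):

#include <Eigen/Dense>
#include <iostream>
int main() {
  Eigen::MatrixXd M(2, 3);
  M << 1, -2, 3,
      -4,  5, -6;
  double norm1   = M.cwiseAbs().colwise().sum().maxCoeff(); // ||M||_1   (max column sum)
  double normInf = M.cwiseAbs().rowwise().sum().maxCoeff(); // ||M||_inf (max row sum)
  std::cout << norm1 << " " << normInf << std::endl;        // prints 9 and 15
}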
Sometimes special formulas for the Euclidean matrix norm come in handy [?, Sect. 2.3.3]:
A ∈ K^{n,n}, A = A^H   ⇒   ‖A‖₂ = max_{x ≠ 0} |x^H A x| / ‖x‖₂² .
Proof. Recall from linear algebra: Hermitian matrices (a special class of normal matrices) enjoy unitary
similarity to diagonal matrices:
Hence, both expressions in the statement of the lemma agree with the largest modulus of eigenvalues of
A.
✷
For A ∈ K^{m,n} the Euclidean matrix norm ‖A‖₂ is the square root of the largest (in modulus) eigenvalue of A^H A.
For a normal matrix A ∈ K^{n,n} (that is, A satisfies A^H A = A A^H) the Euclidean matrix norm agrees with the largest modulus of its eigenvalues.
When we talk about an “algorithm” we have in mind a concrete code function in M ATLAB or C++; the
only way to describe an algorithm is through a piece of code. We assume that this function defines
another mapping F e : X → Y on the data space of the problem. Of course, we can only feed data to the
M ATLAB/C++-function, if they can be represented in the set M of machine numbers. Hence, implicit in the
e is the assumption that input data are subject to rounding before passing them to the code
definition of F
function proper.
Problem:   F : X ⊂ ℝ^n → Y ⊂ ℝ^m
Algorithm: F̃ : X̃ → Ỹ , with data and results consisting of machine numbers (X̃ ⊂ M).
[Stable algorithm]
We write w(x), x ∈ X , for the computational effort (→ Def. 1.4.1) required by the algorithm for input x.
An algorithm Fe for solving a problem F : X 7→ Y is numerically stable if for all x ∈ X its result Fe(x)
(possibly affected by roundoff) is the exact result for “slightly perturbed” data:
Here EPS should be read as machine precision according to the “Axiom” of roundoff analysis Ass. 1.5.32.
[Fig. 52: illustration of Def. 1.5.85 in the data space (X, ‖·‖_X) and the result space (Y, ‖·‖_Y): the exact data x is mapped by F to the exact result y and by F̃ to the computed result F̃(x); stability means F̃(x) = F(x̃) for some slightly perturbed data x̃.]
Terminology: Def. 1.5.85 introduces stability in the sense of backward error analysis.
Sloppily speaking, the impact of roundoff (∗) on a stable algorithm is of the same order of magnitude
as the effect of the inevitable perturbations due to rounding the input data.
(∗) In some cases the definition of Fe will also involve some approximations as in Ex. 1.5.65. Then the
above statement also includes approximation errors.
Now consider a given code function that purports to provide a stable implementation of x ↦ Ax for A ∈ K^{m,n}, x ∈ K^n, cf. Ex. 1.5.68. How can we verify this claim for particular data? Both K^{m,n} and K^n are equipped with the Euclidean norm.
The task is, given the vector y ∈ K^m returned by the function, to find conditions on y that ensure the existence of an Ã ∈ K^{m,n} such that
Ãx = y   and   ‖Ã − A‖₂ ≤ C·mn·EPS·‖A‖₂ ,   (1.5.87)
Choose
Ã = A + (z x^⊤)/‖x‖₂² ,   z := y − Ax ∈ K^m ,
so that Ãx = y, and we find
‖Ã − A‖₂ = (1/‖x‖₂²) · sup_{w ∈ K^n \ {0}} ‖z (x^⊤ w)‖₂ / ‖w‖₂ ≤ ‖z‖₂‖x‖₂ / ‖x‖₂² = ‖y − Ax‖₂ / ‖x‖₂ .
Hence, in principle, stability of an algorithm for computing Ax is confirmed if for every x ∈ X the computed result y = y(x) satisfies
‖y(x) − Ax‖₂ ≤ C·mn·EPS·‖A‖₂‖x‖₂
with a small constant C > 0 independent of data and problem size.
A problem shows sensitive dependence on the data, if small perturbations of input data lead to large
perturbations of the output. Such problems are also called ill-conditioned. For such problems stability of
an algorithm is easily accomplished.
Learning Outcomes
Contents
2.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
2.2 Theory: Linear systems of equations . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.2.1 Existence and uniqueness of solutions . . . . . . . . . . . . . . . . . . . . . . 128
2.2.2 Sensitivity of linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
2.3 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
2.3.1 Basic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
2.3.2 LU-Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.3.3 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
2.4 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.5 Survey: Elimination solvers for linear systems of equations . . . . . . . . . . . . 164
2.6 Exploiting Structure when Solving Linear Systems . . . . . . . . . . . . . . . . . . 168
2.7 Sparse Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
2.7.1 Sparse matrix storage formats . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
2.7.2 Sparse matrices in M ATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
2.7.3 Sparse matrices in E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
2.7.4 Direct Solution of Sparse Linear Systems of Equations . . . . . . . . . . . . 191
2.7.5 LU-factorization of sparse matrices . . . . . . . . . . . . . . . . . . . . . . . . 195
2.7.6 Banded matrices [?, Sect. 3.7] . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
2.8 Stable Gaussian elimination without pivoting . . . . . . . . . . . . . . . . . . . . 209
2.1 Preface
(Terminology: A ≙ system matrix / coefficient matrix, b ≙ right-hand-side vector)
Linear systems with rectangular system matrices A ∈ K^{m,n}, called “overdetermined” for m > n and “underdetermined” for m < n, are discussed later in the course.
Linear systems of equations are ubiquitous in computational science: they are encountered
• with discrete linear models in network theory (see Ex. 2.1.3), control, statistics;
• in the case of discretized boundary value problems for ordinary and partial differential equations (→
course “Numerical methods for partial differential equations”, 4th semester);
Example 2.1.3 (Nodal analysis of (linear) electric circuit [?, Sect. 4.7.1])
Now we study a very important application of numerical simulation, where (large, sparse) linear systems
of equations play a central role: Numerical circuit analysis. We begin with linear circuits in the frequency
domain, which are directly modelled by complex linear systems of equations. Later we tackle circuits
with non-linear elements, see Ex. 8.0.1, and, finally, will learn about numerical methods for computing the
transient (time-dependent) behavior of circuits, see Ex. 11.1.13.
Modeling of simple linear circuits takes only elementary physical laws as covered in any introductory
course of physics (or even in secondary school physics). There is no sophisticated physics or mathematics
involved.
Node (ger.: Knoten) ≙ junction of wires.  ☞ Number the nodes 1, …, n.
[Fig. 54: circuit diagram with numbered nodes ➀, ➁, ➂, … and circuit elements such as C₁, R₁, ….]
Unknowns: nodal potentials U_k, k = 1, …, n.
(Some may be known: grounded nodes (➅ in Fig. 54), voltage sources (➀ in Fig. 54).)
Constitutive relations for circuit elements (in the frequency domain with angular frequency ω > 0):
• Ohmic resistor:   I = U/R ,  [R] = 1 VA⁻¹   ➤  I_kj = R⁻¹ (U_k − U_j) ,
• capacitor:        I = ıωC·U ,  capacitance [C] = 1 AsV⁻¹   ➤  I_kj = ıωC (U_k − U_j) ,
• coil/inductor:    I = U/(ıωL) ,  inductance [L] = 1 VsA⁻¹   ➤  I_kj = −ı ω⁻¹ L⁻¹ (U_k − U_j) .
✎ notation: ı ≙ imaginary unit, ı := √−1, ı = exp(ı π/2).
Here we face the special case of a linear circuit: all relationships between branch currents and voltages are of the form
I_kj = α_kj (U_k − U_j)  with a coefficient α_kj ∈ ℂ .
The concrete value of α_kj is determined by the circuit element connecting node k and node j.
These constitutive relations are derived by assuming a harmonic time-dependence of all quantities, which is termed circuit analysis in the frequency domain (AC mode):
u(t) = Re{U exp(ıωt)} ,   i(t) = Re{I exp(ıωt)} .   (2.1.6)
Here U, I ∈ ℂ are called complex amplitudes. This implies for temporal derivatives (denoted by a dot):
du/dt (t) = Re{ıωU exp(ıωt)} ,   di/dt (t) = Re{ıωI exp(ıωt)} .   (2.1.7)
For a capacitor the total charge is proportional to the applied voltage: q(t) = C u(t); since i(t) = q̇(t), this gives i(t) = C u̇(t). For a coil the voltage is proportional to the rate of change of the current: u(t) = L (di/dt)(t). Combined with (2.1.6) and (2.1.7) this leads to the above constitutive relations.
No equations for nodes ➀ and ➅, because these nodes are connected to the “outside world” so that the
Kirchhoff current law (2.1.4) does not hold (from a local perspective). This is fitting, because the voltages
in these nodes are known anyway.
[ ıωC₁ + 1/R₁ + 1/R₂ − ı/(ωL)    −1/R₁            ı/(ωL)                     −1/R₂                      ] [U₂]   [ ıωC₁·U ]
[ −1/R₁                          1/R₁ + ıωC₂      0                          −ıωC₂                      ] [U₃] = [ 0      ]
[ ı/(ωL)                         0                1/R₅ − ı/(ωL) + 1/R₄       −1/R₄                      ] [U₄]   [ U/R₅   ]
[ −1/R₂                          −ıωC₂            −1/R₄                      1/R₂ + ıωC₂ + 1/R₄ + 1/R₃  ] [U₅]   [ 0      ]
This is a linear system of equations with complex coefficients: A ∈ C4,4 , b ∈ C4 . For the algorithms to
be discussed below this does not matter, because they work alike for real and complex numbers.
Known from linear algebra [?, Sect. 1.2], [?, Sect. 1.3]:
A ∈ K^{n,n} invertible / regular  :⇔  ∃₁ B ∈ K^{n,n} :  AB = BA = I .
B ≙ inverse of A  (✎ notation: B = A⁻¹)
New, recall a few concepts from linear algebra needed to state criteria for the invertibility of a matrix.
Given A ∈ K^{m,n}, the range/image (space) of A is the subspace of K^m spanned by the columns of A:
R(A) := {Ax : x ∈ K^n} ⊂ K^m .
The kernel/nullspace of A is
N(A) := {z ∈ K^n : Az = 0} .
Definition 2.2.3. Rank of a matrix → [?, Sect. 2.4], [?, Sect. 1.5]
The rank of a matrix M ∈ K m,n , denoted by rank(M), is the maximal number of linearly indepen-
dent rows/columns of M.
Equivalently, rank(A) = dim R(A).
Theorem 2.2.4. Criteria for invertibility of matrix → [?, Sect. 2.3 & Cor. 3.8]
A square matrix A ∈ K n,n is invertible/regular if one of the following equivalent conditions is satis-
fied:
1. ∃B ∈ K n,n : BA = AB = I,
2. x ↦ Ax defines a bijective linear mapping (automorphism) of K^n,
3. the columns of A are linearly independent (full column rank),
4. the rows of A are linearly independent (full row rank),
5. det A ≠ 0 (non-vanishing determinant),
6. rank(A) = n (full rank).
Now recall our notion of “problem” from § 1.5.67 as a function F mapping data in a data space X to a result in a result space Y. Concretely, for n × n linear systems of equations:
F : X := K^{n,n}_* × K^n → Y := K^n ,   F(A, b) := A⁻¹b ,
✎ notation: (open) set of regular matrices ⊂ K^{n,n}:
K^{n,n}_* := {A ∈ K^{n,n} : A regular/invertible → Def. 2.2.1} .
Always avoid computing the inverse of a matrix (which can almost always be avoided)!
In particular, never ever even contemplate using x = A.inverse()*b to solve the linear
! system of equations Ax = b, cf. Exp. 2.4.13. The next sections present a sound way to do this.
Before we examine sensitivity for linear systems of equations, we look at the simpler problem of matrix×vector
multiplication.
F : K n → K n , x 7→ Ax ,
that is, now we consider only the vector x as data.
Goal: Estimate relative perturbations in F(x) due to relative perturbations in x.
We assume that K^n is equipped with some vector norm (→ Def. 1.5.70) and we use the induced matrix norm (→ Def. 1.5.76) on K^{n,n}. Using linearity and the elementary estimate ‖Mx‖ ≤ ‖M‖‖x‖, which is a direct consequence of the definition of an induced matrix norm, we obtain for y = Ax, ∆y = A∆x (note that ‖x‖ = ‖A⁻¹y‖ ≤ ‖A⁻¹‖‖y‖):
‖∆y‖ / ‖y‖ ≤ ‖A‖‖∆x‖ / (‖A⁻¹‖⁻¹‖x‖) = ‖A‖‖A⁻¹‖ · ‖∆x‖ / ‖x‖ .   (2.2.8)
Now we study the sensitivity of the problem of finding the solution of a linear system of equations Ax = b, A ∈ ℝ^{n,n} regular, b ∈ ℝ^n, see § 2.1.1. We write x̃ for the solution of the perturbed linear system.
(normwise) relative error:   ǫ_r := ‖x − x̃‖ / ‖x‖
(‖·‖ ≙ suitable vector norm, e.g., maximum norm ‖·‖_∞)
Ax = b  ↔  (A + ∆A)x̃ = b + ∆b   ⇒   (A + ∆A)(x̃ − x) = ∆b − ∆A x .   (2.2.9)
‖(I + B)⁻¹‖ = sup_{x ∈ ℝⁿ\{0}} ‖(I + B)⁻¹x‖ / ‖x‖ = sup_{y ∈ ℝⁿ\{0}} ‖y‖ / ‖(I + B)y‖ ≤ 1/(1 − ‖B‖) .
Proof (of Thm. 2.2.10). Lemma 2.2.11 ➣ ‖(A + ∆A)⁻¹‖ ≤ ‖A⁻¹‖ / (1 − ‖A⁻¹∆A‖), and with (2.2.9):
‖∆x‖ ≤ ‖A⁻¹‖/(1 − ‖A⁻¹∆A‖) · (‖∆b‖ + ‖∆A x‖) ≤ ‖A⁻¹‖‖A‖/(1 − ‖A⁻¹‖‖∆A‖) · ( ‖∆b‖/(‖A‖‖x‖) + ‖∆A‖/‖A‖ ) · ‖x‖ .
Note that the term ‖A‖‖A⁻¹‖ occurs frequently. Therefore it has been given a special name, the condition number: cond(A) := ‖A⁻¹‖‖A‖. With it, Thm. 2.2.10 yields
ǫ_r := ‖x − x̃‖ / ‖x‖ ≤ cond(A)·δ_A / (1 − cond(A)·δ_A) ,   δ_A := ‖∆A‖ / ‖A‖ .   (2.2.13)
✦ If cond(A) ≫ 1, small perturbations in A can lead to large relative errors in the solution of the LSE.
✦ If cond(A) ≫ 1, a stable algorithm (→ Def. 1.5.85) can produce solutions with large relative error!
Recall Thm. 2.2.10: for regular A ∈ K^{n,n}, small ∆A, and a generic vector/matrix norm ‖·‖:
Ax = b ,  (A + ∆A)x̃ = b + ∆b   ⇒   ‖x − x̃‖/‖x‖ ≤ cond(A) / (1 − cond(A)‖∆A‖/‖A‖) · ( ‖∆b‖/‖b‖ + ‖∆A‖/‖A‖ ) .   (2.2.14)
cond(A) ≫ 1 ➣ small relative changes of the data A, b may effect huge relative changes in the solution.
Terminology: if cond(A) ≫ 1, the linear system of equations is called ill-conditioned.
Solving a 2 × 2 linear system of equations amounts to finding the intersection of two lines in the coordinate plane. This relationship allows a geometric view of the “sensitivity of a linear system”:
L_i = {x ∈ ℝ² : x^⊤ n_i = d_i} ,   n_i ∈ ℝ², d_i ∈ ℝ ,  i = 1, 2 .
LSE for finding the intersection:   [ n₁^⊤ ; n₂^⊤ ] x = [ d₁ ; d₂ ] ,  with A := [ n₁^⊤ ; n₂^⊤ ] and b := [ d₁ ; d₂ ] ,
n_i ≙ (unit) normal vectors, d_i ≙ distances of the lines from the origin.
The following code investigates the condition numbers of the matrix A = [ 1, cos φ ; 0, sin φ ] that arises when computing the intersection of two lines enclosing the angle φ.
In Line 10 we compute the condition number of A with respect to the Euclidean vector norm using special
E IGEN built-in functions.
Line 13 evaluates the condition number of a matrix for the maximum norm, recall Ex. 1.5.78.
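A hedged Eigen sketch of such a computation (not the lecture's code): the 2-norm condition number via singular values and the max-norm condition number via the row sum norms of A and A⁻¹.

#include <cmath>
#include <Eigen/Dense>
#include <iostream>
int main() {
  double phi = 0.1;                         // angle enclosed by the two lines
  Eigen::Matrix2d A;
  A << 1.0, std::cos(phi), 0.0, std::sin(phi);
  // cond_2(A) = sigma_max / sigma_min
  Eigen::JacobiSVD<Eigen::Matrix2d> svd(A);
  double cond2 = svd.singularValues()(0) / svd.singularValues()(1);
  // cond_inf(A) = ||A||_inf * ||A^{-1}||_inf  (row sum norms)
  auto rowSumNorm = [](const Eigen::Matrix2d &M) {
    return M.cwiseAbs().rowwise().sum().maxCoeff();
  };
  double condInf = rowSumNorm(A) * rowSumNorm(A.inverse());
  std::cout << cond2 << " " << condInf << std::endl;
}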
[Fig. 55: condition numbers of A (2-norm and max-norm) plotted against the angle enclosed by n₁, n₂ (angles 0…1.6, values up to ≈ 140).]
We observe a blow-up of the condition numbers (with respect to both norms) as the angle enclosed by the two lines shrinks. This corresponds to a large sensitivity of the location of the intersection point in the case of glancing incidence.
2.3 Gaussian Elimination
Supplementary reading. In case you cannot remember the main facts about Gaussian elimination, please consult your introductory linear algebra material.
Wikipedia: Although the method is named after mathematician Carl Friedrich Gauss, the earliest presentation of it can be found in the important Chinese mathematical text Jiuzhang suanshu or The Nine Chapters on the Mathematical Art, dated approximately 150 B.C., and commented on by Liu Hui in the 3rd century.
➀ (Forward) elimination:
[1  1  0] [x₁]   [ 4]          x₁ + x₂        =  4
[2  1 −1] [x₂] = [ 1]   ←→    2x₁ + x₂ − x₃   =  1
[3 −1 −1] [x₃]   [−3]          3x₁ − x₂ − x₃   = −3 .
Elimination steps on the augmented matrix (pivot row marked, pivot element bold):
[1  1  0 |  4]     [1  1  0 |  4]     [1  1  0 |  4]     [1  1  0 |  4]
[2  1 −1 |  1]  ➤  [0 −1 −1 | −7]  ➤  [0 −1 −1 | −7]  ➤  [0 −1 −1 | −7]
[3 −1 −1 | −3]     [3 −1 −1 | −3]     [0 −4 −1 |−15]     [0  0  3 | 13]  =: U (augmented)
➣ transformation of the LSE to upper triangular form.
More general:
a₁₁x₁ + a₁₂x₂ + ⋯ + a₁ₙxₙ = b₁
a₂₁x₁ + a₂₂x₂ + ⋯ + a₂ₙxₙ = b₂
  ⋮        ⋮              ⋮     ⋮
aₙ₁x₁ + aₙ₂x₂ + ⋯ + aₙₙxₙ = bₙ
[Sketch: successive elimination steps turn the matrix into upper triangular form; in each step all entries below the current pivot ∗ are annihilated, and the pivot row is highlighted.]
∗ ≙ pivot (necessarily ≠ 0 ➙ here: assumption),  highlighted row ≙ pivot row.
In the k-th step (starting from A ∈ K^{n,n}, 1 ≤ k < n, pivot row a_k·^⊤):
transformation:  Ax = b  ➤  A′x = b′ ,
with
a′_ij := { a_ij − (a_ik/a_kk)·a_kj ,  for k < i, j ≤ n ;   0 ,  for k < i ≤ n, j = k ;   a_ij  else } ,
b′_i  := { b_i − (a_ik/a_kk)·b_k ,   for k < i ≤ n ;   b_i  else } ,   (2.3.2)
and the multipliers  l_ik := a_ik / a_kk .
Here we give a direct E IGEN implementation of Gaussian elimination for LSE Ax = b (grossly inefficient!).
Line 9: right hand side vector set as last column of matrix, facilitates simultaneous row transformations of
matrix and r.h.s.
Variable fac =
ˆ multiplier
Line 24: extract solution from last column of transformed matrix.
Forward elimination: three nested loops (note: the compact vector operation in line 15 involves another loop from i + 1 to m).
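Since Code 2.3.4 is described above but not reproduced here, the following is a minimal sketch of such an augmented-matrix forward elimination, directly following the update formula (2.3.2) (no pivoting, all pivots assumed nonzero):

#include <Eigen/Dense>
// Forward elimination on the augmented matrix Ab = [A, b] of size n x (n+1).
void forwardElimination(Eigen::MatrixXd &Ab) {
  const int n = Ab.rows();
  for (int k = 0; k < n - 1; ++k) {              // loop over pivot rows
    for (int i = k + 1; i < n; ++i) {
      const double l_ik = Ab(i, k) / Ab(k, k);   // multiplier l_ik = a_ik / a_kk
      Ab.row(i).tail(n - k) -= l_ik * Ab.row(k).tail(n - k); // update a_ij and b_i
      Ab(i, k) = 0.0;                            // entry below the pivot is eliminated
    }
  }
}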
Computational cost (↔ number of elementary operations) of Gaussian elimination [?, Sect. 1.3]:
elimination:        Σ_{i=1}^{n−1} (n − i)(2(n − i) + 3) = n(n − 1)(⅔n + 7/6)  Ops. ,   (2.3.6)
back substitution:  Σ_{i=1}^{n} (2(n − i) + 1) = n²  Ops. .
➣ asymptotic complexity (→ Sect. 1.4) of Gaussian elimination (without pivoting) for a generic LSE Ax = b, A ∈ ℝ^{n,n}:   ⅔n³ + O(n²) = O(n³) .
C++11 code 2.3.8: Measuring runtimes of Code 2.3.4 vs. E IGEN lu()-operator vs. MKL ➺ GITLAB
//! Eigen code for timing numerical solution of linear systems
MatrixXd gausstiming() {
  std::vector<int> n = {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
  int nruns = 3;
  MatrixXd times(n.size(), 3);
  for (int i = 0; i < n.size(); ++i) {
    Timer t1, t2; // timer class
    MatrixXd A = MatrixXd::Random(n[i], n[i]) + n[i]*MatrixXd::Identity(n[i], n[i]);
    VectorXd b = VectorXd::Random(n[i]);
    VectorXd x(n[i]);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x = A.lu().solve(b); t1.stop(); // Eigen implementation
#ifndef EIGEN_USE_MKL_ALL // only test own algorithm without MKL
      if (n[i] <= 4096) { // prevent long runs (braces added so that the whole timed block is guarded)
        t2.start(); gausselimsolve(A, b, x); t2.stop(); // own Gauss elimination
      }
#endif
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
  }
  return times;
}
[Fig. 56: execution time [s] versus matrix size n (log-log scale) for the Eigen lu() solver, gausselimsolve, the MKL solver (sequential and parallel), and an O(n³) reference line. Platform: ubuntu 14.04 LTS, i7-3517U CPU @ 1.90GHz × 4.]
n Code 2.3.4 [s] E IGEN lu() [s] MKL sequential [s] MKL parallel [s]
8 6.340e-07 1.140e-06 3.615e-06 2.273e-06
16 2.662e-06 3.203e-06 9.603e-06 1.408e-05
32 1.617e-05 1.331e-05 1.603e-05 2.495e-05
64 1.214e-04 5.836e-05 5.142e-05 7.416e-05
128 2.126e-03 3.180e-04 2.041e-04 3.176e-04
256 3.464e-02 2.093e-03 1.178e-03 1.221e-03
512 3.954e-01 1.326e-02 7.724e-03 8.175e-03
1024 4.822e+00 9.073e-02 4.457e-02 4.864e-02
2048 5.741e+01 6.260e-01 3.347e-01 3.378e-01
4096 5.727e+02 4.531e+00 2.644e+00 1.619e+00
8192 - 3.510e+01 2.064e+01 1.360e+01
A concise list of libraries for numerical linear algebra and related problems can be found here.
In Code 2.3.4 the right-hand-side vector b was first appended to the matrix A as its rightmost column, and then forward elimination and back substitution were carried out on the resulting matrix. The same idea applies to a linear system with k right-hand sides collected in a matrix B ∈ K^{n,k}:
AX = B  ⇔  X = A⁻¹B ,
asymptotic complexity:  O(n²(n + k)) .
C++11 code 2.3.10: Gaussian elimination with multiple r.h.s. → Code 2.3.4 ➺ GITLAB
//! Gauss elimination without pivoting, X = A⁻¹B
//! A must be an n × n-matrix, B an n × m-matrix
//! Result is returned in matrix X
void gausselimsolvemult(const MatrixXd &A, const MatrixXd &B, MatrixXd &X) {
  int n = A.rows(), m = B.cols();
  MatrixXd AB(n, n + m); // Augmented matrix [A, B]
  AB << A, B;
  // Forward elimination, do not forget the B part of the matrix
  for (int i = 0; i < n - 1; ++i) {
    double pivot = AB(i, i);
    for (int k = i + 1; k < n; ++k) {
      double fac = AB(k, i) / pivot;
      AB.block(k, i + 1, 1, m + n - i - 1) -= fac * AB.block(i, i + 1, 1, m + n - i - 1);
    }
  }
  // Back substitution
  AB.block(n - 1, n, 1, m) /= AB(n - 1, n - 1);
  for (int i = n - 2; i >= 0; --i) {
    for (int l = i + 1; l < n; ++l) {
      AB.block(i, n, 1, m) -= AB.block(l, n, 1, m) * AB(i, l);
    }
    AB.block(i, n, 1, m) /= AB(i, i);
  }
  X = AB.rightCols(m);
}
Next two remarks: For understanding or analyzing special variants of Gaussian elimination, it is useful to
be aware of
• the effects of elimination steps on the level of matrix blocks, cf. Rem. 1.3.15,
• and of the recursive nature of Gaussian elimination.
Block perspective (first step of Gaussian elimination with pivot α 6= 0), cf. (2.3.2):
A := [ α  c^⊤ ; d  C ]   →   A′ := [ α  c^⊤ ; 0  C′ ] ,   C′ := C − (d c^⊤)/α .   (2.3.12)
(The update of C is a rank-1 modification.)
Adding a tensor product of two vectors to a matrix is called a rank-1 modification of that matrix.
In this code the Gaussian elimination is carried out in situ: the matrix A is replaced with the transformed
matrices during elimination. If the matrix is not needed later this offers maximum efficiency. Notice that
the recursive call is omitted!
Recall “principle” from Ex. 1.3.15: deal with block matrices (“matrices of matrices”) like regular matrices
(except for commutativity of multiplication!).
[A₁₁ A₁₂ | b₁]   ❶→   [A₁₁  A₁₂                  | b₁               ]   ❷→   [I 0 | A₁₁⁻¹(b₁ − A₁₂S⁻¹b_S)]
[A₂₁ A₂₂ | b₂]         [0    A₂₂ − A₂₁A₁₁⁻¹A₁₂    | b₂ − A₂₁A₁₁⁻¹b₁  ]         [0 I | S⁻¹b_S              ] ,
where S := A₂₂ − A₂₁A₁₁⁻¹A₁₂ (Schur complement, see Rem. 2.3.34), b_S := b₂ − A₂₁A₁₁⁻¹b₁.
We can read off the solution of the block-partitioned linear system from the above Gaussian elimination:
[A₁₁ A₁₂ ; A₂₁ A₂₂] [x₁ ; x₂] = [b₁ ; b₂]   ⇒   x₂ = S⁻¹b_S ,  x₁ = A₁₁⁻¹(b₁ − A₁₂S⁻¹b_S) .   (2.3.15)
2.3.2 LU-Decomposition
A matrix factorization (ger. Matrixzerlegung) expresses a general matrix A as product of two special
(factor) matrices. Requirements for these special matrices define the matrix factorization.
Matrix factorizations
☞ often capture the essence of algorithms in compact form (here: Gaussian elimination),
☞ are important building blocks for complex algorithms,
☞ are key theoretical tools for algorithm analysis.
In this section: forward elimination step of Gaussian elimination will be related to a special matrix factor-
ization, the so-called LU-decomposition or LU-factorization.
Supplementary reading. The LU-factorization should be well known from the introductory
linear algebra course. In case you need to refresh your knowledge, please consult one of the
following:
[Sketch: a single row transformation annihilates one entry below the diagonal.]
Ex. 1.3.13: row transformations can be realized by multiplication from the left with suitable transformation matrices. By multiplying these transformation matrices we can emulate the effect of successive row transformations through left multiplication with a single matrix T:
A  →(row transformations)→  A′   ⇔   TA = A′ .
Now we want to determine the T for the forward elimination step of Gaussian elimination.
Example 2.3.16 (Gaussian elimination and LU-factorization → [?, Sect. 2.4], [?, II.4], [?,
Sect. 3.1])
Multipliers are stored in place of the eliminated entries (L-part to the left of the bar, augmented U-part to the right):
[1     | 1  1  0 |  4]        [1     | 1  1  0 |  4]
[2 1   | 0 −1 −1 | −7]   ➤    [2 1   | 0 −1 −1 | −7]
[3 0 1 | 0 −4 −1 |−15]        [3 4 1 | 0  0  3 | 13]
so that L = [1 0 0 ; 2 1 0 ; 3 4 1] and U = [1 1 0 ; 0 −1 −1 ; 0 0 3].
(▪ ≙ pivot row, pivot element bold, multipliers printed in red in the lecture document.)
Details: link between Gaussian elimination and matrix factorization → Ex. 2.3.16
(row transformation = multiplication with elimination matrix)
[  1      0  ⋯  0 ] [a₁]   [a₁]
[−a₂/a₁   1       ] [a₂]   [0 ]
[−a₃/a₁      1    ] [a₃] = [0 ]      provided a₁ ≠ 0 .   (2.3.17)
[   ⋮           ⋱ ] [⋮ ]   [⋮ ]
[−aₙ/a₁         1 ] [aₙ]   [0 ]
Gaussian forward elimination thus amounts to
A = L₁ · ⋯ · L_{n−1} · U   with elimination matrices L_i, i = 1, …, n − 1, and an upper triangular matrix U ∈ ℝ^{n,n}.
[1           ] [1           ]   [1           ]
[l₂ 1        ] [0  1        ]   [l₂ 1        ]
[l₃ 0  1     ] [0  h₃ 1     ] = [l₃ h₃ 1     ]
[⋮        ⋱  ] [⋮         ⋱ ]   [⋮         ⋱ ]
[lₙ 0  …   1 ] [0  hₙ 0   1 ]   [lₙ hₙ 0   1 ]
Hence the forward elimination, if it does not break down, yields a factorization of A into a normalized lower triangular matrix L and an upper triangular matrix U, [?, Thm. 3.2.1], [?, Thm. 2.10], [?, Sect. 3.1].
Algebraically equivalent = ˆ when carrying out the forward elimination in situ as in Code 2.3.4 and storing
the multipliers in a lower triangular matrix as in Ex. 2.3.16, then the latter will contain the L-factor and the
original matrix will be replaced with the U-factor.
Definition 2.3.18. LU-decomposition / LU-factorization
Given a square matrix A ∈ K^{n,n}, an upper triangular matrix U ∈ K^{n,n} and a normalized lower triangular matrix L ∈ K^{n,n} (→ Def. 1.1.5) form an LU-decomposition / LU-factorization of A, if A = LU.
[Sketch: product of a normalized lower triangular matrix (unit diagonal) and an upper triangular matrix.]
n = 1: the assertion is trivial.
n − 1 → n: the induction hypothesis ensures the existence of a normalized lower triangular matrix L̃ and a regular upper triangular matrix Ũ such that à = L̃Ũ, where à is the upper left (n − 1) × (n − 1) block of A:
A = [ Ã    b ] = [ L̃    0 ] [ Ũ  y ] =: LU .
    [ a^⊤  α ]   [ x^⊤  1 ] [ 0  ξ ]
Then solve
➊ L̃y = b        → provides y ∈ K^{n−1} ,
➋ x^⊤Ũ = a^⊤    → provides x ∈ K^{n−1} ,
➌ x^⊤y + ξ = α  → provides ξ ∈ K .
Regularity of A implies ξ ≠ 0 (why?), so that U will be regular, too.
Regular upper triangular matrices and normalized lower triangular matrices form matrix groups (→ Lemma 1.3.9).
Their only common element is the identity matrix.
L1 U1 = L2 U2 ⇒ L2−1 L1 = U2 U1−1 = I .
A direct way to determine the factor matrices of the LU-decomposition [?, Sect. 3.1], [?, Sect. 3.3.3]: we study the entries of the product of a normalized lower triangular and an upper triangular matrix, see Def. 1.1.5:
(LU)_{ij} = Σ_{k=1}^{min{i,j}} l_ik u_kj ,  with l_ii = 1 .
This reveals how to compute the entries of L and U sequentially: we start with the top row of U, which agrees with that of A, and then work our way towards the bottom right corner, alternating between rows of U and columns of L (→ Fig. 57).
It is instructive to compare this code with a simple implementation of the matrix product of a normalized
lower triangular and an upper triangular matrix. From this perspective the LU-factorization looks like the
“inversion” of matrix multiplication:
  A = MatrixXd::Zero(n, n);
  for (int k = 0; k < n; ++k) {
    for (int j = k; j < n; ++j)
      A(k, j) = U(k, j) + (L.block(k, 0, 1, k) * U.block(0, j, k, 1))(0, 0);
    for (int i = k + 1; i < n; ++i)
      A(i, k) = (L.block(i, 0, 1, k) * U.block(0, k, k, 1))(0, 0) + L(i, k) * U(k, k);
  }
}
Observe: Solving for entries L(i,k) of L and U(k,j) of U in the multiplication of an upper tri-
angular and normalized lower triangular matrix (Code 2.3.24) yields the algorithm for LU-factorization
(Code 2.3.23).
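Since Code 2.3.23 itself is not reproduced above, here is a sketch in its spirit (an assumption, not the original listing): solving the identities of the multiplication code for U(k,j) and L(i,k), with the calling convention lufak(A, L, U) used later in this chapter and no pivoting required.

#include <Eigen/Dense>
// Doolittle-type computation of the LU-factors of A (no pivoting).
void lufak(const Eigen::MatrixXd &A, Eigen::MatrixXd &L, Eigen::MatrixXd &U) {
  const int n = A.rows();
  L = Eigen::MatrixXd::Identity(n, n);   // normalized lower triangular factor
  U = Eigen::MatrixXd::Zero(n, n);
  for (int k = 0; k < n; ++k) {
    for (int j = k; j < n; ++j)          // k-th row of U
      U(k, j) = A(k, j) - (L.block(k, 0, 1, k) * U.block(0, j, k, 1))(0, 0);
    for (int i = k + 1; i < n; ++i)      // k-th column of L
      L(i, k) = (A(i, k) - (L.block(i, 0, 1, k) * U.block(0, k, k, 1))(0, 0)) / U(k, k);
  }
}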
A  −→  combined in-situ storage of the factors: the entries of A are replaced with the entries of L (strict lower triangle) and U (upper triangle).
In light of the close relationship between Gaussian elimination and LU-factorization there will also be a
recursive version of LU-factorization.
Refer to (2.3.12) to understand lurec: the rank-1 modification of the lower (n − 1) × (n − 1)-block of
the matrix is done in lines Line 7-Line 8 of the code.
➣ asymptotic complexity: (in leading order) the same as for Gaussian elimination.
However, the perspective of LU-factorization reveals that the solution of linear systems of equations can be split into two separate phases with different asymptotic complexity in terms of the number n of unknowns:
• setup phase: compute the LU-decomposition A = LU — cost O(n³),
• solve phase: forward and backward substitution Lz = b, Ux = z — cost O(n²).
Gauss elimination and LU-factorization for the solution of a linear system of equations (→ § 2.3.30) are
equivalent and only differ in the ordering of the steps.
Because in the case of LU-factorization the expensive forward elimination and the less expensive (for-
ward/backward) substitutions are separated, which sometimes can be exploited to reduce computational
cost, as highlighted in Rem. 2.5.10 below.
Principal minor ≙ left upper block of a matrix.
The following “visual rule” helps identify the structure of the LU-factors of a matrix:
(2.3.33) [block sketch: A = LU partitioned so that the leading blocks of L and U multiply to the leading block of A]
The left-upper blocks of both L and U in the LU-factorization of A depend only on the corresponding
left-upper block of A!
Natural in light of the close connection between matrix multiplication and matrix factorization, cf. the
relationship between matrix factorization and matrix multiplication found in § 2.3.21:
Block matrix multiplication (1.3.16) ∼ block LU-decomposition: with A₁₁ ∈ K^{n,n} regular, A₁₂ ∈ K^{n,m}, A₂₁ ∈ K^{m,n}, A₂₂ ∈ K^{m,m}:
[ A₁₁ A₁₂ ]   [ I          0 ] [ A₁₁ A₁₂ ]
[ A₂₁ A₂₂ ] = [ A₂₁A₁₁⁻¹   I ] [ 0    S  ] ,   S := A₂₂ − A₂₁A₁₁⁻¹A₁₂  (Schur complement).   (2.3.35)
2.3.3 Pivoting
Idea (in linear algebra): Avoid zero pivot elements by swapping rows
MatrixXd A(2, 2);
A << 5.0e-17, 1.0,
     1.0,     1.0;
VectorXd b(2), x2(2);
b << 1.0, 2.0;
VectorXd x1 = A.fullPivLu().solve(b);
gausselimsolve(A, b, x2);   // see Code 2.3.10
MatrixXd L(2, 2), U(2, 2);
lufak(A, L, U);             // see Code 2.3.23
VectorXd z = L.lu().solve(b);
VectorXd x3 = U.lu().solve(z);
cout << "x1 =\n" << x1 << "\nx2 =\n" << x2 << "\nx3 =\n" << x3 << std::endl;

Output:
x1 = (1, 1)ᵀ ,  x2 = (0, 1)ᵀ ,  x3 = (0, 1)ᵀ .
For the exact data
A = [ ǫ  1 ; 1  1 ] ,  b = [ 1 ; 2 ]   ⇒   x = 1/(1 − ǫ) · [ 1 ; 1 − 2ǫ ] ≈ [ 1 ; 1 ]   for |ǫ| ≪ 1 .
So x1 agrees with the true solution, while x2 and x3 do not. What is going wrong here? Needed: insight into roundoff errors, which we already have → Section 1.5.3.
Armed with knowledge about the behavior of machine numbers and roundoff errors we can now under-
stand what is going on in Ex. 2.3.36
LU-factorization without row swapping:
A = [ ǫ  1 ; 1  1 ]   ⇒   L = [ 1  0 ; 1/ǫ  1 ] ,  U = [ ǫ  1 ; 0  1 − 1/ǫ ] ,  Ũ = [ ǫ  1 ; 0  −1/ǫ ]  in M .   (2.3.37)
Solution of L̃Ũx = b:  x = (2, 1 − 2ǫ)ᵀ  (meaningless result!)
LU-factorization after swapping rows:
A = [ 1  1 ; ǫ  1 ]   ⇒   L = [ 1  0 ; ǫ  1 ] ,  U = [ 1  1 ; 0  1 − ǫ ] = Ũ  in M .   (2.3.38)
Solution of L̃Ũx = b:  x = (1 + 2ǫ, 1 − 2ǫ)ᵀ  (sufficiently accurate result!)
no row swapping, → (2.3.37):  L̃Ũ = A + E  with  E = [ 0  0 ; 0  −1 ]   ➤ unstable!
after row swapping, → (2.3.38):  L̃Ũ = A + E  with  E = [ 0  0 ; 0  ǫ ]   ➤ stable!
Introduction to the notion of stability → Section 1.5.5, Def. 1.5.85, see also [?, Sect. 2.3].
A = [1  2  2]  ➊→ [2 −3  2]  ➋→ [2 −3    2]  ➌→ [2 −3    2]  ➍→ [2 −3    2     ]
    [2 −3  2]      [1  2  2]      [0  3.5  1]      [0 25.5 −1]      [0 25.5 −1    ]
    [1 24  0]      [1 24  0]      [0 25.5 −1]      [0  3.5  1]      [0  0    1.1373]
C++11 code 2.3.41: Gaussian elimination with pivoting: extension of Code 2.3.4 ➺ GITLAB
2  //! Solving an LSE Ax = b by Gaussian elimination with partial pivoting
3  //! A must be an n × n-matrix, b an n-vector
4  void gepiv(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
5    int n = A.rows();
6    MatrixXd Ab(n, n+1);
7    Ab << A, b;
8    // Forward elimination by rank-1 modification, see Rem. 2.3.11
9    for (int k = 0; k < n-1; ++k) {
10     int j; double p; // p = relatively largest pivot, j = pivot row index
       // ... (remaining steps in the GITLAB version)
Choice of pivot row index j (Line 11 of the code): relatively largest pivot [?, Sect. 2.5],
j ∈ {k, …, n}  such that  |a_jk| / max{|a_jl| , l = k, …, n}  →  max .   (2.3.42)
➣ LU-factorization with pivoting? Of course, just by rearranging the operations of Gaussian forward elim-
ination with pivoting.
Line 6: Find the relatively largest pivot element p and the index j of the corresponding row of the matrix,
see (2.3.42)
Line 7: If the pivot element is still very small relative to the norm of the matrix, then we have encountered
an entire column that is close to zero. The matrix is (close to) singular and LU-factorization does
not exist.
Line 9: Swap the first and the j-th row of the matrix.
Line 11: Call the routine for the lower right (n − 1) × (n − 1)-block of the matrix after subtracting suitable
multiples of the first row from the other rows, cf. Rem. 2.3.11 and Rem. 2.3.27.
Line 12: Reassemble the parts of the LU-factors. The vector of multipliers yields a column of L, see
Ex. 2.3.16.
Remark 2.3.45 (Rationale for partial pivoting policy (2.3.42) → [?, Page 47])
permutation (1, 2, 3, 4) ↦ (1, 3, 2, 4)   ≙   P = [1 0 0 0]
                                                  [0 0 1 0]
                                                  [0 1 0 0]
                                                  [0 0 0 1] .
Note:
✦ P⊤ = P−1 for any permutation matrix P (→ permutation matrices orthogonal/unitary)
✦ Pπ A effects π -permutation of rows of A ∈ K n,m
✦ APπ effects π -permutation of columns of A ∈ K m,n
Lemma 2.3.47. Existence of LU-factorization with pivoting → [?, Thm. 3.25], [?, Thm. 4.4]
For any regular A ∈ K n,n there is a permutation matrix (→ Def. 2.3.46) P ∈ K n,n , a normalized
lower triangular matrix L ∈ K n,n , and a regular upper triangular matrix U ∈ K n,n (→ Def. 1.1.5),
such that PA = LU .
Every regular matrix A ∈ K^{n,n} admits a row permutation, encoded by a permutation matrix, such that the upper left (n − 1) × (n − 1) block A′ of the permuted matrix is regular (why?).
By the induction assumption there is a permutation matrix P′ ∈ K^{n−1,n−1} such that P′A′ possesses an LU-factorization P′A′ = L′U′. With suitable x, y ∈ K^{n−1}, γ ∈ K, and P collecting both permutations, we can write
PA = [ P′  0 ] [ A′   x ] = [ L′U′  P′x ] = [ L′   0 ] [ U′  d ] ,
     [ 0   1 ] [ y^⊤  γ ]   [ y^⊤   γ   ]   [ c^⊤  1 ] [ 0   α ]
if we choose
d = (L′)⁻¹ P′x ,   c = (U′)⁻^⊤ y ,   α = γ − c^⊤ d ,
which is possible since L′ and U′ are regular.
A = [1  2  2]  ➊→ [2 −3  2]  ➋→ [2 −3    2]  ➌→ [2 −3    2]  ➍→ [2 −3    2     ]
    [2 −3  2]      [1  2  2]      [0  3.5  1]      [0 25.5 −1]      [0 25.5 −1    ]
    [1 24  0]      [1 24  0]      [0 25.5 −1]      [0  3.5  1]      [0  0    1.1373]
U = [2 −3    2     ] ,   L = [1    0      0] ,   P = [0 1 0]
    [0 25.5 −1    ]          [0.5  1      0]         [0 0 1]
    [0  0    1.1373]          [0.5  0.1373 1]         [1 0 0] .
Two permutations: in step ➊ swap rows #1 and #2, in step ➌ swap rows #2 and #3. Apply these swaps to
the identity matrix and you will recover P. See also [?, Ex. 3.30].
E IGEN provides various functions for computing the LU-decomposition of a given matrix. They all perform
the factorization in-situ → Rem. 2.3.26:
A  −→  packed storage of the factors (L in the strict lower triangle, U in the upper triangle). The resulting matrix can be retrieved and used to recover the LU-factors, as demonstrated in the next code snippet.
Note that for solving a linear system of equations by means of LU-decomposition (the standard algorithm)
we never have to extract the LU-factors.
Any kind of pivoting only involves comparisons and row/column permutations, but no arithmetic operations
on the matrix entries. This makes the following observation plausible:
The LU-factorization of A ∈ K n,n with partial pivoting by § 2.3.43 is numerically equivalent to the LU-
factorization of PA without pivoting (→ Code in § 2.3.21), when P is a permutation matrix gathering
the row swaps entailed by partial pivoting.
numerically equivalent =
ˆ same result when executed with the same machine arithmetic
The above statement means that whenever we study the impact of roundoff errors on LU-
factorization it is safe to consider only the basic version without pivoting, because we can always
assume that row swaps have been conducted beforehand.
It will turn out that when investigating the stability of algorithms meant to solve linear systems of equations,
a key quantity is the residual.
r := b − Ax̃ ,  where x̃ denotes the computed (approximate) solution of Ax = b.
Assume that you have downloaded a direct solver for a general (dense) linear system of equations Ax =
b, A ∈ K n,n regular, b ∈ K n . When given the data A and b it returns the perturbed solution x̃. How
can we tell that x̃ is the exact solution of a linear system with slightly perturbed data (in the sense of a
tiny relative error of size ≈ EPS, EPS the machine precision, see § 1.5.29)? That is, how can we tell that x̃
is an acceptable solution in the sense of backward error analysis, cf. Def. 1.5.85? A similar question was
explored in Ex. 1.5.86 for matrix×vector multiplication.
➊ x − x̃ accounted for by perturbation of the right hand side:
$$\mathbf{A}\mathbf{x} = \mathbf{b}\,,\quad \mathbf{A}\widetilde{\mathbf{x}} = \mathbf{b} + \Delta\mathbf{b} \;\Rightarrow\; \Delta\mathbf{b} = \mathbf{A}\widetilde{\mathbf{x}} - \mathbf{b} =: -\mathbf{r} \quad\text{(residual, Def. 2.4.1)}\,.$$
Hence, x̃ can be accepted as a solution, if ‖r‖ / ‖b‖ ≤ C·n³·EPS for some small constant C ≈ 1, see
Def. 1.5.85. Here, ‖·‖ can be any vector norm on K n .
➋ x − x̃ accounted for by perturbation of the system matrix:
$$\mathbf{A}\mathbf{x} = \mathbf{b}\,,\qquad (\mathbf{A} + \Delta\mathbf{A})\,\widetilde{\mathbf{x}} = \mathbf{b}\,.$$
Try a perturbation of the form ∆A = u x̃ H , u ∈ K n :
$$\mathbf{u} = \frac{\mathbf{r}}{\|\widetilde{\mathbf{x}}\|_2^2} \;\Rightarrow\; \Delta\mathbf{A} = \frac{\mathbf{r}\,\widetilde{\mathbf{x}}^H}{\|\widetilde{\mathbf{x}}\|_2^2}\,.$$
As in Ex. 1.5.86 we find
$$\frac{\|\Delta\mathbf{A}\|_2}{\|\mathbf{A}\|_2} = \frac{\|\mathbf{r}\|_2}{\|\mathbf{A}\|_2\,\|\widetilde{\mathbf{x}}\|_2} \le \frac{\|\mathbf{r}\|_2}{\|\mathbf{A}\widetilde{\mathbf{x}}\|_2}\,.$$
Thus, x̃ is acceptable in the sense of backward error analysis, if ‖r‖ / ‖Ax̃‖ ≤ C·n³·EPS.
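Both acceptance criteria are straightforward to check once a candidate solution x̃ is at hand. The following minimal EIGEN sketch implements the residual-based test; the function name, the default constant C and the use of n³ are illustrative choices following the heuristics above.

#include <Eigen/Dense>
#include <limits>
#include <cmath>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Backward-error acceptance test based on the relative residual:
// accept xt as solution of A*x = b if ||b - A*xt|| <= C * n^3 * EPS * ||b||.
bool acceptable(const MatrixXd& A, const VectorXd& b, const VectorXd& xt,
                double C = 1.0) {
  const double eps = std::numeric_limits<double>::epsilon();
  const double n = static_cast<double>(A.rows());
  const VectorXd r = b - A * xt;   // residual
  return r.norm() <= C * std::pow(n, 3) * eps * b.norm();
}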
The roundoff error analysis of Gaussian elimination based on Ass. 1.5.32 is rather involved. Here we merely
summarise the results: a profound roundoff analysis of Gaussian elimination/LU-factorization can be found
in [?, Sect. 3.3 & 3.5] and [?, Sect. 9.3]. A less rigorous, but more lucid discussion is given in [?, Lecture 22].
Let A ∈ R n,n be regular and let A(k) ∈ R n,n , k = 1, . . . , n − 1, denote the intermediate matrices arising
in the k-th step of § 2.3.43 (Gaussian elimination with partial pivoting) when carried out with exact
arithmetic.
For the approximate solution x̃ ∈ R n of the LSE Ax = b, b ∈ R n , computed as in § 2.3.43 (based
on machine arithmetic with machine precision EPS, → Ass. 1.5.32), there is a ∆A ∈ R n,n with
(A + ∆A)x̃ = b, whose size is controlled by n³·EPS and the growth factor ρ of the entries of the
intermediate matrices A(k) .
If ρ is “small”, the computed solution of an LSE can be regarded as the exact solution of an LSE with “slightly
perturbed” system matrix (perturbations of size O(n³ EPS)).
For $n = 10$:
$$a_{ij} = \begin{cases} 1, & \text{if } i = j \ \text{or}\ j = n,\\ -1, & \text{if } i > j,\\ 0 & \text{else,} \end{cases}
\qquad
\mathbf{A} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & 1 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & 1 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & 1 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & 1 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & 1
\end{pmatrix}.$$
Partial pivoting does not trigger row permutations!
$$\mathbf{A} = \mathbf{L}\mathbf{U}\,,\qquad
l_{ij} = \begin{cases} 1, & \text{if } i = j,\\ -1, & \text{if } i > j,\\ 0 & \text{else,}\end{cases}
\qquad
u_{ij} = \begin{cases} 1, & \text{if } i = j,\\ 2^{\,i-1}, & \text{if } j = n,\\ 0 & \text{else.}\end{cases}$$
C++11 code 2.4.6: Gaussian elimination for “Wilkinson system” in E IGEN ➺ GITLAB
MatrixXd res(100, 2);
for (int n = 10; n <= 100 * 10; n += 10) {
  MatrixXd A(n, n);  A.setIdentity();
  A.triangularView<StrictlyLower>().setConstant(-1);
  A.rightCols<1>().setOnes();
  VectorXd x = VectorXd::Constant(n, -1).binaryExpr(
      VectorXd::LinSpaced(n, 1, n),
      [](double x, double y) { return std::pow(x, y); });
  double relerr = (A.lu().solve(A * x) - x).norm() / x.norm();
  res(n / 10 - 1, 0) = n;  res(n / 10 - 1, 1) = relerr;
}
// ... different solver (e.g. colPivHouseholderQr()), plotting
(∗) If cond2 (A) is huge, then big errors in the solution of a linear system can be caused by small
perturbations of either the system matrix or the right hand side vector, see (2.4.12) and the message
of Thm. 2.2.10, (2.2.13). In this case, a stable algorithm can obviously produce a grossly “wrong”
solution, as was already explained after (2.2.13).
Hence, lack of stability of Gaussian elimination will only become apparent for linear systems with
well-conditioned system matrices.
Fig. 58: cond₂(A) as a function of n. Fig. 59: relative error (Euclidean norm) of the solutions computed by
Gaussian elimination and by QR-decomposition, together with the relative residual norm, as functions of n.
These observations match Thm. 2.4.4, because in this case we encounter an exponential growth of ρ =
ρ(n), see Ex. 2.4.5.
Observation: In practice ρ (almost) always grows only mildly (like O(√n)) with n.
Discussion in [?, Lecture 22]: growth factors larger than the order O(√n) are exponentially rare in certain
relevant classes of random matrices.
Fig. 60: relative error as a function of the matrix size n. Recall the statement made above about the
“improbability” of matrices for which Gaussian elimination with partial pivoting becomes unstable.
In the discussion of numerical stability (→ Def. 1.5.85, Rem. 1.5.88) we have seen that a stable algorithm
may produce results with large errors for ill-conditioned problems. The conditioning of the problem of
solving a linear system of equations is determined by the condition number (→ Def. 2.2.12) of the system
matrix, see Thm. 2.2.10.
Hence, for an ill-conditioned linear system, whose system matrix is beset with a huge condition number,
(stable) Gaussian elimination may return “solutions” with large errors. This will be demonstrated in this
experiment.
$$\mathbf{A} = \mathbf{u}\mathbf{v}^\top + \epsilon\,\mathbf{I}\,,\qquad
\mathbf{u} = \tfrac{1}{3}\,(1,2,3,\ldots,10)^\top,\qquad
\mathbf{v} = \bigl(-1,\tfrac12,-\tfrac13,\tfrac14,\ldots,\tfrac1{10}\bigr)^\top.$$
Fig. 61: cond(A) and the relative error of the computed solution as functions of ε.
The practical stability of Gaussian elimination is reflected by the size of a particular vector that can easily
be computed after the elimination solver has finished:
$$(\mathbf{A} + \Delta\mathbf{A})\,\widetilde{\mathbf{x}} = \mathbf{b} \;\Rightarrow\; \mathbf{r} = \mathbf{b} - \mathbf{A}\widetilde{\mathbf{x}} = \Delta\mathbf{A}\,\widetilde{\mathbf{x}} \;\Rightarrow\; \|\mathbf{r}\| \le \|\Delta\mathbf{A}\|\,\|\widetilde{\mathbf{x}}\|\,,$$
for any vector norm ‖·‖. This means that, if a direct solver for an LSE is stable in the sense of backward
error analysis, that is, if the perturbed solution could be obtained as the exact solution for a system matrix
with a small relative perturbation, then the residual will be (relatively) small.
Fig. 62: relative error and relative residual as functions of ε.
Observations (w.r.t. the ‖·‖∞-norm):
✦ for ε ≪ 1: large relative error in the computed solution x̃
✦ small residuals for any ε
How can a large relative error be reconciled with a small relative residual ?
$$\mathbf{A}\mathbf{x} = \mathbf{b} \;\leftrightarrow\; \mathbf{A}\widetilde{\mathbf{x}} \approx \mathbf{b}:$$
$$\mathbf{A}(\mathbf{x}-\widetilde{\mathbf{x}}) = \mathbf{r} \;\Rightarrow\; \|\mathbf{x}-\widetilde{\mathbf{x}}\| \le \bigl\|\mathbf{A}^{-1}\bigr\|\,\|\mathbf{r}\|\,,\qquad
\mathbf{A}\mathbf{x} = \mathbf{b} \;\Rightarrow\; \|\mathbf{b}\| \le \|\mathbf{A}\|\,\|\mathbf{x}\|$$
$$\Rightarrow\quad \frac{\|\mathbf{x}-\widetilde{\mathbf{x}}\|}{\|\mathbf{x}\|} \le \|\mathbf{A}\|\,\bigl\|\mathbf{A}^{-1}\bigr\|\,\frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}\,. \qquad (2.4.12)$$
➣ If cond(A) := ‖A‖ ‖A⁻¹‖ ≫ 1, then a small relative residual may not imply a small relative error.
Also recall the discussion in Exp. 2.4.9.
An important justification for Rem. 2.2.6 is conveyed by this experiment. We again consider the nearly
singular matrix from Ex. 2.4.10.
Fig. 63: relative residual as a function of ε for three ways of solving the LSE: Gaussian elimination,
multiplication with the inverse, and computation of the inverse.
All direct (∗) solver algorithms for square linear systems of equations Ax = b with given matrix A ∈
K n,n , right hand side vector b ∈ K n and unknown x ∈ K n rely on variants of Gaussian elimination
with pivoting, see Section 2.3.3. Sophisticated, optimised and verified implementations are available in
numerical libraries like LAPACK/MKL.
(∗): a direct solver terminates after a predictable finite number of elementary operations for every admis-
sible input.
Therefore, familiarity with details of Gaussian elimination is not required, but one must know when and
how to use the library functions and one must be able to assess the computational effort they involve.
We repeat the reasoning of § 2.3.5: Gaussian elimination for a general (dense) matrix invariably involves
three nested loops of length n, see Code 2.3.4, Code 2.3.41, which leads to an asymptotic complexity of O(n³).
The constant hidden in the Landau symbol can be expected to be rather small (≈ 1), as is clear from
(2.3.6).
The costs for solving are substantially lower if certain properties of the matrix A are known. This is clear
if A is diagonal or orthogonal/unitary. It is also true for triangular matrices (→ Def. 1.1.5), because the
corresponding systems can be solved by simple back substitution or forward elimination. We recall the
observation made in § 2.3.30.
Sometimes, the coefficient matrix of a linear system of equations is known to have certain analytic proper-
ties that a direct solver can exploit to perform elimination more efficiently. These properties may even be
impossible to detect by an algorithm, because matrix entries that should vanish exactly might have been
perturbed due to roundoff.
In this numerical experiment we study the gain in efficiency achievable by making the direct solver aware of
important matrix properties.
C++11 code 2.5.7: Direct solver applied to an upper triangular matrix ➺ GITLAB
//! Eigen code: assessing the gain from using special properties
//! of system matrices in Eigen
MatrixXd timing() {
  std::vector<int> n = {16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
  int nruns = 3;
  MatrixXd times(n.size(), 3);
  for (int i = 0; i < n.size(); ++i) {
    Timer t1, t2;  // timer class
    MatrixXd A = VectorXd::LinSpaced(n[i], 1, n[i]).asDiagonal();
    A += MatrixXd::Ones(n[i], n[i]).triangularView<Upper>();
    VectorXd b = VectorXd::Random(n[i]);
    VectorXd x1(n[i]), x2(n[i]);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x1 = A.lu().solve(b); t1.stop();
      t2.start(); x2 = A.triangularView<Upper>().solve(b); t2.stop();
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
  }
  return times;
}
Fig. 64: runtimes of the naive lu() solver and of the triangularView-based solver as functions of the matrix size n.
This can be reduced to one line, as the solvers can also be used as methods acting on matrices:
Eigen::VectorXd x = A.solverType().solve(b);
A full list of solvers can be found here, in the E IGEN documentation. The next code demonstrates a few of
the available decompositions that can serve as the basis for a linear solver:
  x = A.fullPivLu().solve(b); // total pivoting
}
The different decompositions trade speed for stability and accuracy: fully pivoted and QR-based decompositions
also work for nearly singular matrices, for which the standard LU-factorization may no longer be reliable.
Both E IGEN and M ATLAB provide functions that return decompositions of matrices, here the LU-decomposition
(→ Section 2.3.2):
EIGEN: MatrixXd A(n,n); auto ludec = A.lu();
MATLAB: [L,U] = lu(A)
Based on the precomputed decompositions, a linear system of equations with coefficient matrix A ∈ K n,n
can be solved with asymptotic computational effort O(n2 ), cf. § 2.3.30.
The following example illustrates a special situation, in which matrix decompositions can curb computa-
tional cost:
A concrete example is the so-called inverse power iteration, see Chapter 9, for which a skeleton code is sketched below.
x∗
x ∗ : = A − 1 x ( k ) , x ( k + 1) : = , k = 0, 1, 2, . . . , (2.5.13)
kx∗ k2
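The skeleton code itself is not spelled out here; the following minimal EIGEN sketch of (2.5.13) illustrates the point: the LU-decomposition of A is computed once, and every iteration then reuses it at a cost of only O(n²). Function name, argument list and the fixed iteration count are illustrative assumptions.

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Inverse power iteration (2.5.13): the factorization is computed once,
// each step only needs forward/backward substitution (O(n^2)).
VectorXd invpowit(const MatrixXd& A, VectorXd x, unsigned int maxit = 100) {
  Eigen::PartialPivLU<MatrixXd> ludec(A);  // LU-decomposition: O(n^3), done once
  for (unsigned int k = 0; k < maxit; ++k) {
    VectorXd xs = ludec.solve(x);          // x* = A^{-1} x^{(k)}
    x = xs / xs.norm();                    // normalization
  }
  return x;
}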
(A related caveat: storing an intermediate result like A+B in an auto variable will spawn an auxiliary object of a
“strange” type determined by the expression template mechanism, so such results should be assigned to explicitly
typed matrices.)
Triangular linear systems are linear systems of equations whose system matrix is a triangular matrix (→
Def. 1.1.5).
Thm. 2.5.3 tells us that (dense) triangular linear systems can be solved by backward/forward elimination
with O(n²) asymptotic computational effort (n ≙ number of unknowns), compared to an asymptotic
complexity of O(n³) for solving a generic (dense) linear system of equations (→ Thm. 2.5.2).
This is the simplest case where exploiting special structure of the system matrix leads to faster algorithms
for the solution of a special class of linear systems.
Remember that, thanks to the possibility to compute the matrix product in a block-wise fashion (→ § 1.3.15),
Gaussian elimination can be conducted on the level of matrix blocks. We recall Rem. 2.3.14 and Rem. 2.3.34.
Using block matrix multiplication (applied to the matrix×vector product in (2.6.3)) we find an equivalent
way to write the block partitioned linear system of equations:
$$\mathbf{A}_{11}\mathbf{x}_1 + \mathbf{A}_{12}\mathbf{x}_2 = \mathbf{b}_1\,,\qquad \mathbf{A}_{21}\mathbf{x}_1 + \mathbf{A}_{22}\mathbf{x}_2 = \mathbf{b}_2\,. \qquad (2.6.4)$$
We assume that A11 is regular (invertible) so that we can solve for x1 from the first equation.
The resulting ℓ × ℓ linear system of equations for the unknown vector x2 is called the Schur complement
system for (2.6.3).
Unless A has a special structure that allows the efficient solution of linear systems with system matrix
A11 , the Schur complement system is mainly of theoretical interest.
$$\mathbf{A} = \begin{pmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^\top & \alpha \end{pmatrix} \qquad (2.6.6)$$
Fig. 65: spy plot of the arrow matrix (nz = 31).
We can apply the block partitioning (2.6.3) with k = n and ℓ = 1 to a linear system Ax = y with system
matrix A and obtain A11 = D, which can be inverted easily, provided that all diagonal entries of D are
different from zero. In this case
$$\mathbf{A}\mathbf{x} = \begin{pmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^\top & \alpha \end{pmatrix}\begin{pmatrix} \mathbf{x}_1 \\ \xi \end{pmatrix} = \mathbf{y} := \begin{pmatrix} \mathbf{y}_1 \\ \eta \end{pmatrix}\,, \qquad (2.6.7)$$
$$\xi = \frac{\eta - \mathbf{b}^\top\mathbf{D}^{-1}\mathbf{y}_1}{\alpha - \mathbf{b}^\top\mathbf{D}^{-1}\mathbf{c}}\,,\qquad \mathbf{x}_1 = \mathbf{D}^{-1}(\mathbf{y}_1 - \xi\,\mathbf{c})\,. \qquad (2.6.8)$$
These formulas make sense, if D is regular and α − b⊤D⁻¹c ≠ 0, which is another condition for the
invertibility of A.
Using the formula (2.6.8) we can solve the linear system (2.6.7) with an asymptotic complexity O(n)!
This superior speed compared to Gaussian elimination applied to the (dense) linear system is evident in
runtime measurements.
C++11 code 2.6.9: Dense Gaussian elimination applied to arrow system ➺ GITLAB
VectorXd arrowsys_slow(const VectorXd& d, const VectorXd& c, const VectorXd& b,
                       const double alpha, const VectorXd& y) {
  int n = d.size();
  MatrixXd A(n + 1, n + 1); A.setZero();
  A.diagonal().head(n) = d;
  A.col(n).head(n) = c;
  A.row(n).head(n) = b;
  A(n, n) = alpha;
  return A.lu().solve(y);
}
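For comparison, a minimal sketch of an O(n) solver based directly on the formulas (2.6.8); it is referred to below under the name arrowsys_fast in the timing code, though the actual Code 2.6.10 may differ in details.

// Solving the arrow system via (2.6.8) with O(n) effort: only vector
// operations and divisions by the diagonal entries of D are needed.
VectorXd arrowsys_fast(const VectorXd& d, const VectorXd& c, const VectorXd& b,
                       const double alpha, const VectorXd& y) {
  int n = d.size();
  const VectorXd z = c.cwiseQuotient(d);          // z = D^{-1} c
  const VectorXd w = y.head(n).cwiseQuotient(d);  // w = D^{-1} y_1
  const double xi = (y(n) - b.dot(w)) / (alpha - b.dot(z));
  VectorXd x(n + 1);
  x.head(n) = w - xi * z;                          // x_1 = D^{-1}(y_1 - xi*c)
  x(n) = xi;
  return x;
}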
C++11 code 2.6.11: Runtime measurement of Code 2.6.9 vs. Code 2.6.10 vs. sparse tech-
niques ➺ GITLAB
MatrixXd arrowsystiming() {
  std::vector<int> n = {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
  int nruns = 3;
  MatrixXd times(n.size(), 6);
  for (int i = 0; i < n.size(); ++i) {
    Timer t1, t2, t3, t4;  // timer class
    double alpha = 2;
    VectorXd b = VectorXd::Ones(n[i], 1);
    VectorXd c = VectorXd::LinSpaced(n[i], 1, n[i]);
    VectorXd d = -b;
    VectorXd y = VectorXd::Constant(n[i] + 1, -1).binaryExpr(
        VectorXd::LinSpaced(n[i] + 1, 1, n[i] + 1),
        [](double x, double y) { return std::pow(x, y); });
    VectorXd x1(n[i] + 1), x2(n[i] + 1), x3(n[i] + 1), x4(n[i] + 1);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x1 = arrowsys_slow(d, c, b, alpha, y); t1.stop();
      t2.start(); x2 = arrowsys_fast(d, c, b, alpha, y); t2.stop();
      t3.start();
      x3 = arrowsys_sparse<SparseLU<SparseMatrix<double>>>(d, c, b, alpha, y);
      t3.stop();
      t4.start();
      x4 = arrowsys_sparse<BiCGSTAB<SparseMatrix<double>>>(d, c, b, alpha, y);
      t4.stop();
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
    times(i, 3) = t3.min(); times(i, 4) = t4.min(); times(i, 5) = (x4 - x3).norm();
  }
  return times;
}
Fig. 66: runtimes [s] of arrowsys_slow and arrowsys_fast as functions of the matrix size n
(Ubuntu 14.04 LTS, gcc 4.8.4, -O3). No comment!
The vector based implementation of the solver of Code 2.6.10 can be vulnerable to roundoff errors, be-
cause, upon closer inspection, the algorithm turns out to be equivalent to Gaussian elimination without
pivoting, cf. Section 2.3.3, Ex. 2.3.36.
Given a regular matrix A ∈ K n,n , let us assume that at some point in a code we are in a position to solve
any linear system Ax = b “fast”, because an LU-decomposition of A is already available. Now a single
entry of A is changed:
$$\widetilde{\mathbf{A}} = \mathbf{A} + z\,\mathbf{e}_{i^*}\mathbf{e}_{j^*}^\top\,. \qquad (2.6.15)$$
(Recall: e_i ≙ the i-th unit vector.) The question is whether we can reuse some of the computations spent on
solving Ax = b in order to solve Ãx̃ = b with less effort than entailed by a direct Gaussian elimination
from scratch.
We may also consider a matrix modification affecting a single row: given z ∈ K n and a row index i* ∈ {1, ..., n},
$$\widetilde{\mathbf{A}} \in \mathbb{K}^{n,n}:\quad \widetilde a_{ij} = \begin{cases} a_{ij}\,, & \text{if } i \ne i^*,\\ a_{ij} + (\mathbf{z})_j\,, & \text{if } i = i^*, \end{cases}
\qquad\text{that is,}\qquad \widetilde{\mathbf{A}} = \mathbf{A} + \mathbf{e}_{i^*}\mathbf{z}^\top\,. \qquad (2.6.16)$$
Both matrix modifications (2.6.14) and (2.6.16) represent rank-1-modifications of A. A generic rank-1-
modification reads
$$\mathbf{A} \in \mathbb{K}^{n,n} \;\mapsto\; \widetilde{\mathbf{A}} := \mathbf{A} + \mathbf{u}\mathbf{v}^H\,,\qquad \mathbf{u}, \mathbf{v} \in \mathbb{K}^n\,, \qquad (2.6.17)$$
with uv^H a general rank-1-matrix.
Idea:
Block elimination of an extended linear system, see § 2.6.2
$$(\mathbf{A} + \mathbf{u}\mathbf{v}^H)\,\widetilde{\mathbf{x}} = \mathbf{b} \;\Leftrightarrow\; \widetilde{\mathbf{A}}\widetilde{\mathbf{x}} = \mathbf{b}\,. \qquad (2.6.19)$$
Hence, we have solved the modified LSE, once we have found the component x̃ of the solution of the
extended linear system (2.6.18). We do block elimination again, now getting rid of x̃ first, which yields the
other Schur complement system
$$(1 + \mathbf{v}^H\mathbf{A}^{-1}\mathbf{u})\,\xi = \mathbf{v}^H\mathbf{A}^{-1}\mathbf{b}\,. \qquad (2.6.20)$$
$$\mathbf{A}\widetilde{\mathbf{x}} = \mathbf{b} - \frac{\mathbf{u}\mathbf{v}^H\mathbf{A}^{-1}}{1 + \mathbf{v}^H\mathbf{A}^{-1}\mathbf{u}}\,\mathbf{b}\,. \qquad (2.6.21)$$
The generalization of this formula to rank-k-perturbations is given by the Sherman-Morrison-Woodbury formula:
for U, V ∈ K n,k ,
$$(\mathbf{A} + \mathbf{U}\mathbf{V}^H)^{-1}\mathbf{b} = \mathbf{A}^{-1}\mathbf{b} - \mathbf{A}^{-1}\mathbf{U}\bigl(\mathbf{I} + \mathbf{V}^H\mathbf{A}^{-1}\mathbf{U}\bigr)^{-1}\mathbf{V}^H\mathbf{A}^{-1}\mathbf{b}\,,$$
if I + V^H A⁻¹ U is regular.
We use this result to solve Ãx̃ = b with Ã from (2.6.17) more efficiently than straightforward elimination
could deliver, provided that an LU-factorization A = LU is already available.
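A minimal EIGEN sketch of this reuse for a rank-1-modification (2.6.17) with real data: the precomputed PartialPivLU factorization of A is applied twice, so the modified system is solved with O(n²) effort. The function name and signature are illustrative.

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Solving (A + u*v^T) x = b via (2.6.21), reusing a precomputed
// LU-decomposition of A: only two O(n^2) substitutions are needed.
VectorXd smw_solve(const Eigen::PartialPivLU<MatrixXd>& ludec,
                   const VectorXd& u, const VectorXd& v, const VectorXd& b) {
  const VectorXd z = ludec.solve(b);    // z = A^{-1} b
  const VectorXd w = ludec.solve(u);    // w = A^{-1} u
  const double alpha = 1.0 + v.dot(w);  // must be != 0 for invertibility
  return z - w * (v.dot(z) / alpha);
}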
Many linear systems with system matrices that differ in a single entry only have to be solved when we
want to determine the dependence of the total impedance of a (linear) circuit on the parameters of a
single component.
Fig. 67: a large (linear) electric circuit (modelling → Ex. 2.1.3) built from resistors, capacitors and inductors,
one of which is a “continuously varying” resistance Rx.
Sought: dependence of (certain) branch currents on the resistance Rx
(➣ currents for many different values of Rx).
Only a few entries of the nodal analysis matrix A (→ Ex. 2.1.3) are affected by variation of Rx !
(If Rx connects nodes i & j ⇒ only entries aii , a jj , aij , a ji of A depend on Rx )
A ∈ K m,n , m, n ∈ N, is called sparse, if nnz(A) ≪ mn.
Sloppy parlance: matrix sparse :⇔ “almost all” entries = 0 / “only a few percent of” the entries ≠ 0.
A matrix with enough zeros that it pays to take advantage of them should be treated as sparse.
More precisely, for a family of matrices A(l) ∈ K n_l,m_l sparsity means
$$\lim_{l\to\infty}\frac{\operatorname{nnz}(\mathbf{A}^{(l)})}{n_l\, m_l} = 0\,.$$
See Ex. 2.1.3 for the description of a linear electric circuit by means of a linear system of equations for
nodal voltages. For large circuits the system matrices will invariably be huge and sparse.
Remark 2.7.5 (Sparse matrices from the discretization of linear partial differential equations)
☛ spatial discretization of linear boundary value problems for partial differential equations by means
of finite element (FE), finite volume (FV), or finite difference (FD) methods (→ 4th semester course
“Numerical methods for PDEs”).
Sparse matrix storage formats for storing a “sparse matrix” A ∈ K m,n are designed to achieve two objec-
tives:
➊ Amount of memory required is only slightly more than nnz(A) scalars.
➋ Computational effort for matrix×vector multiplication is proportional to nnz(A).
In this section we see a few schemes used by numerical libraries.
In the case of a sparse matrix A ∈ K m,n , the triplet (coordinate list, COO) format stores triplets (i, j, α_{i,j}), 1 ≤ i ≤ m, 1 ≤ j ≤ n:
struct TripletMatrix {
  std::size_t m, n;             // number of rows and columns
  std::vector<std::size_t> I;   // row indices
  std::vector<std::size_t> J;   // column indices
  std::vector<scalar_t> a;      // values associated with index pairs
};
We write “≥”, because repetitions of index pairs (i, j) are allowed. The matrix entry (A)_{i,j} is defined to
be the sum of all values α_{i,j} associated with the index pair (i, j). The next code clearly demonstrates this
summation.
Note that this code assumes that the result vector y has the appropriate length; no index checks are
performed.
Code 2.7.7: computational effort is proportional to the number of triplets. (This might be much larger than
nnz(A) in case of many repetitions of triplets.)
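Code 2.7.7 is not reproduced here; a matrix×vector product in triplet format consistent with the above description can be sketched as follows. It reuses the TripletMatrix struct from above with scalar_t = double; repeated index pairs are summed automatically.

#include <vector>
#include <cstddef>

// y = A*x for A in triplet format; values of repeated index pairs add up.
// Assumes y.size() == A.m and x.size() == A.n; no index checks performed.
void tripletMatVec(const TripletMatrix& A, const std::vector<scalar_t>& x,
                   std::vector<scalar_t>& y) {
  for (scalar_t& yi : y) yi = 0;                  // clear result vector
  for (std::size_t k = 0; k < A.a.size(); ++k)    // loop over all triplets
    y[A.I[k]] += A.a[k] * x[A.J[k]];
}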
The CRS format for a sparse matrix A = (aij ) ∈ K n,n keeps the data in three contiguous arrays:
val ↔ values a_ij of the non-zero entries, col_ind ↔ their column indices j, row_ptr ↔ positions in val
where the rows start. Example:
$$\mathbf{A} = \begin{pmatrix}
10 & 0 & 0 & 0 & -2 & 0\\
3 & 9 & 0 & 0 & 0 & 3\\
0 & 7 & 8 & 7 & 0 & 0\\
3 & 0 & 8 & 7 & 5 & 0\\
0 & 8 & 0 & 9 & 9 & 13\\
0 & 4 & 0 & 0 & 2 & -1
\end{pmatrix}$$
val-vector:     10 −2 3 9 3 7 8 7 3 8 7 5 8 9 9 13 4 2 −1
col_ind-array:  1 5 1 2 6 2 3 4 1 3 4 5 2 4 5 6 2 5 6
row_ptr-array:  1 3 6 9 13 17 20
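A CRS matrix×vector product traverses val row by row; a minimal sketch with 0-based indexing (the arrays in the example above are written with 1-based indices):

#include <vector>
#include <cstddef>

// y = A*x for A in CRS format (0-based indices): row_ptr has n+1 entries,
// row_ptr[i] .. row_ptr[i+1]-1 index row i inside val and col_ind.
void crsMatVec(const std::vector<double>& val,
               const std::vector<std::size_t>& col_ind,
               const std::vector<std::size_t>& row_ptr,
               const std::vector<double>& x, std::vector<double>& y) {
  const std::size_t n = row_ptr.size() - 1;
  y.assign(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
      y[i] += val[k] * x[col_ind[k]];
}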
Supplementary reading. A detailed discussion of sparse matrix formats and how to work with
them efficiently is given in [?]. An interesting related article on M ATLAB-central can be found here.
Matrices have to be created explicitly in sparse format by one of the following commands in order to let
M ATLAB know that the CCS format is to be used for internal representation.
The CCS internal data format used by M ATLAB has an impact on the speed of access operations, cf.
Exp. 1.2.29 for a similar effect.
Fig. 69: spy plot of the sparse test matrix (nz = 32). Fig. 70: access times [s] as functions of the size n of
the sparse quadratic matrix.
t = [];
for i=1:20
  n = 2^i; m = n/2;
  A = spdiags(repmat([-1 2 5],n,1),[-n/2,0,n/2],n,n);
  % ... (timing of row and column access)
end

figure;
loglog(t(:,1),t(:,3),'r+-', t(:,1),t(:,4),'b*-',...
       t(:,1),t(1,3)*t(:,1)/t(1,1),'k-');
xlabel('{\bf size n of sparse quadratic matrix}','fontsize',14);
ylabel('{\bf access time [s]}','fontsize',14);
legend('row access','column access','O(n)','location','northwest');
print -depsc2 '../PICTURES/sparseaccess.eps';
M ATLAB uses compressed column storage (CCS), which entails O(n) searches for index j in the index
array when accessing all elements of a matrix row. Conversely, access to a column does not involve any
search operations.
Note the use of the MATLAB command repmat in the above code. It can be used to
build structured matrices. Consult the MATLAB documentation for details.
We study different ways to set a few non-zero entries of a sparse matrix. The first code just uses the
()-operator to set matrix entries.
The second and third code rely on an intermediate triplet format (→ § 2.7.6) to build the sparse matrix
and finally pass this to M ATLAB’s sparse function.
for n=2.^(8:14)
  t1 = 1000; for k=1:K, fprintf('sparse1, %d, %d\n',n,k); tic; sparse1; t1 = min(t1,toc); end
  t2 = 1000; for k=1:K, fprintf('sparse2, %d, %d\n',n,k); tic; sparse2; t2 = min(t2,toc); end
  t3 = 1000; for k=1:K, fprintf('sparse3, %d, %d\n',n,k); tic; sparse3; t3 = min(t3,toc); end
  r = [r; n, t1, t2, t3];
end

loglog(r(:,1),r(:,2),'r*', r(:,1),r(:,3),'m+', r(:,1),r(:,4),'b^');
xlabel('{\bf matrix size n}','fontsize',14);
ylabel('{\bf time [s]}','fontsize',14);
legend('Initialization I','Initialization II','Initialization III',...
       'location','northwest');
print -depsc2 '../PICTURES/sparseinit.eps';
Fig. 71: runtimes [s] of the three initialization variants (Initialization I, II, III) as functions of the matrix size n.
☛ It is grossly inefficient to initialize a matrix in CCS format (→ Ex. 2.7.9) by setting individual entries
one after another, because this usually entails moving large chunks of memory to create space for
new non-zero entries.
Instead, calls like
sparse(dat(1:k,1),dat(1:k,2),dat(1:k,3),n,n);
where the matrix dat holds the entries in triplet format (→ § 2.7.6),
allow MATLAB to allocate memory and initialize the arrays in one sweep.
We study a sparse matrix A ∈ R n,n initialized by setting some of its (off-)diagonals with M ATLAB’s spdiags
function:
A = spdiags([(1:n)’,ones(n,1),(n:-1:1)’],...
[-floor(n/3),0,floor(n/3)],n,n);
Spy plots of A and of A*A; Fig. 74: runtime of the sparse matrix multiplication (tic/toc timing) as a function of n.
When extracting a single entry from a sparse matrix, this entry will be stored in sparse format though it is
a mere number! This will considerably slow down all operations on that entry.
Change in Indexing for Sparse Matrix Access. Now subscripted reference into a sparse matrix always
returns a sparse matrix. In previous versions of MATLAB, using a double scalar to index into a sparse
matrix resulted in full scalar output.
Eigen can handle sparse matrices in the standard Compressed Row Storage (CRS) and Compressed
Column Storage (CCS) format, see Ex. 2.7.9 and the documentation:
#include <Eigen/Sparse>
Eigen::SparseMatrix<int, Eigen::ColMajor> Asp(rows, cols);    // CCS format
Eigen::SparseMatrix<double, Eigen::RowMajor> Bsp(rows, cols); // CRS format
As already discussed in Exp. 2.7.13, sparse matrices must not be filled by setting entries through index-
pair access. As in MATLAB, also for E IGEN the matrix should first be assembled in triplet format, from
which a sparse matrix is built. E IGEN offers special facilities for handling triplets.
unsigned int row_idx = 2;
unsigned int col_idx = 4;
double value = 2.5;
Eigen::Triplet<double> triplet(row_idx, col_idx, value);
std::cout << '(' << triplet.row() << ',' << triplet.col()
          << ',' << triplet.value() << ')' << std::endl;
As shown, a Triplet object offers the access member functions row(), col(), and value() to
fetch the row index, column index, and scalar value stored in a Triplet.
The statement that entry-wise initialization of sparse matrices is not efficient has to be qualified in EIGEN.
Entries can be set, provided that enough space for each row (in RowMajor format) is reserved in advance.
This is done by the reserve() method, which takes an integer vector of maximal expected numbers
of non-zero entries per row:
insert(i,j) sets an entry of the sparse matrix, which is rather efficient, provided that enough space
has been reserved. coeffRef(i,j) gives l-value and r-value access to any matrix entry, creating a
non-zero entry, if needed: costly!
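A minimal sketch contrasting the two initialization routes just described (triplet list followed by setFromTriplets() versus reserve() plus insert()); the matrix contents and sizes are chosen only for illustration.

#include <Eigen/Sparse>
#include <vector>

// Two ways to initialize a sparse n x n matrix in EIGEN (RowMajor storage).
void initDemo(int n) {
  // Route 1: collect triplets, then build the matrix in one sweep.
  std::vector<Eigen::Triplet<double>> triplets;
  for (int i = 0; i < n; ++i) triplets.emplace_back(i, i, 2.0);
  Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
  A.setFromTriplets(triplets.begin(), triplets.end());

  // Route 2: reserve space per row, then insert entries directly.
  Eigen::SparseMatrix<double, Eigen::RowMajor> B(n, n);
  B.reserve(Eigen::VectorXi::Constant(n, 1)); // expect 1 non-zero per row
  for (int i = 0; i < n; ++i) B.insert(i, i) = 2.0;
  B.makeCompressed();
}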
The usual matrix operations are supported for sparse matrices; addition and subtraction may involve only
sparse matrices stored in the same format. These operations may incur large hidden costs and have to
be used with care!
We study the runtime behavior of the initialization of a sparse matrix in Eigen parallel to the tests from
Exp. 2.7.13. We use the methods described above.
Runtimes (in ms) for the initialization of a banded matrix (with 5 non-zero diagonals, that is, a maximum of
5 non-zero entries per row) using different techniques in EIGEN: triplet-based initialization, coeffRef with
space reserved, and coeffRef without space reserved.
Observation: insufficient advance allocation of memory massively slows down the set-up of a sparse
matrix in the case of direct entry-wise initialization.
Reason: massive internal copying of data is required to create space for “unexpected” entries.
This example demonstrates that sparse linear systems of equations naturally arise in the handling of
triangulations.
The points in N are also called the nodes of the mesh, the triangles the cells, and all line segments
connecting two nodes and occurring as a side of a triangle form the set of edges. We always assume a
consecutive numbering of the nodes and cells of the triangulation (starting from 1, M ATLAB’s convention).
Fig. 76, Fig. 77: examples of planar triangulations.
M ATLAB data structure for describing a triangulation with N nodes and M cells:
• column vector x ∈ R N : x-coordinates of nodes
• column vector y ∈ R N : y-coordinates of nodes
• M × 3-matrix T whose rows contain the index numbers of the vertices of the cells.
(This matrix is a so-called triangle-node incidence matrix.)
The Figure library provides the function triplot for drawing planar triangulations:
// These indices refer to the ordering of the coordinates as given in the
// vectors x and y.
Eigen::MatrixXi T(11, 3);
T << 7, 1, 2,   5, 6, 2,   4, 1, 7,   6, 7, 2,
     6, 4, 7,   6, 5, 0,   3, 6, 0,   8, 4, 3,
     3, 4, 6,   8, 1, 4,   9, 1, 8;
// Call the Figure plotting routine, draw mesh with blue edges,
// red vertices and a numbering/ordering
mgl::Figure fig1; fig1.setFontSize(8);
fig1.ranges(0.0, 1.05, 0.0, 1.05);
fig1.triplot(T, x, y, "b?");  // drawing triangulation with numbers
fig1.plot(x, y, "*r");        // mark vertices
fig1.save("meshplot_cpp");
Fig. 78: the triangulation drawn by the code above.
The cells of a mesh may be rather distorted triangles (with very large and/or small angles), which is usually
not desirable. We study an algorithm for smoothing a mesh without changing the planar domain covered
by it.
Every edge that is adjacent to only one cell is a boundary edge of the triangulation. Nodes that are
endpoints of boundary edges are boundary nodes.
✎ Notation: Γ ⊂ {1, ..., N} ≙ set of indices of boundary nodes.
✎ Notation: p^i = (p^i_1, p^i_2) ∈ R² ≙ coordinate vector of node ♯i, i = 1, ..., N
   (p^i_1 ↔ x(i), p^i_2 ↔ y(i) in MATLAB).
We define
S(i ) := { j ∈ {1, . . . , N } : nodes i and j are connected by an edge} , (2.7.27)
as the set of node indices of the “neighbours” of the node with index number i.
$$\mathbf{p}^i = \frac{1}{\sharp S(i)}\sum_{j\in S(i)}\mathbf{p}^j \quad\Leftrightarrow\quad \sharp S(i)\,p^i_d = \sum_{j\in S(i)} p^j_d\,,\ \ d = 1,2\,,\quad \text{for all } i \in \{1,\ldots,N\}\setminus\Gamma\,, \qquad (2.7.29)$$
that is, every interior node is located in the center of gravity of its neighbours.
The relations (2.7.29) correspond to the lines of a sparse linear system of equations! In order to state it,
we insert the coordinates of all nodes into a column vector z ∈ K2N , according to
$$z_i = \begin{cases} p^i_1\,, & \text{if } 1 \le i \le N\,,\\ p^{\,i-N}_2\,, & \text{if } N+1 \le i \le 2N\,. \end{cases} \qquad (2.7.30)$$
For the sake of ease of presentation, in the sequel we assume (which is not the case in usual triangulation
data) that interior nodes have index numbers smaller than that of boundary nodes.
From (2.7.27) we infer that the system matrix C ∈ R 2n,2N , n := N − ♯Γ, of that linear system has the
following structure:
$$\mathbf{C} = \begin{pmatrix} \mathbf{A} & \mathbf{O}\\ \mathbf{O} & \mathbf{A} \end{pmatrix}\,,\qquad
(\mathbf{A})_{i,j} = \begin{cases} \sharp S(i)\,, & \text{if } i = j\,,\\ -1\,, & \text{if } j \in S(i)\,,\\ 0 & \text{else,}\end{cases}
\qquad i \in \{1,\ldots,n\}\,,\ j \in \{1,\ldots,N\}\,. \qquad (2.7.31)$$
$$(2.7.29) \;\Leftrightarrow\; \mathbf{C}\mathbf{z} = \mathbf{0}\,. \qquad (2.7.32)$$
We partition the vector z into coordinates of nodes in the interior and of nodes on the boundary:
$$\mathbf{z} = \begin{pmatrix} \mathbf{z}^{\mathrm{int}}_1 \\ \mathbf{z}^{\mathrm{bd}}_1 \\ \mathbf{z}^{\mathrm{int}}_2 \\ \mathbf{z}^{\mathrm{bd}}_2 \end{pmatrix},\qquad
\mathbf{z}^\top = \bigl(\underbrace{z_1,\ldots,z_n}_{\mathbf{z}^{\mathrm{int}}_1},\ \underbrace{z_{n+1},\ldots,z_N}_{\mathbf{z}^{\mathrm{bd}}_1},\ \underbrace{z_{N+1},\ldots,z_{N+n}}_{\mathbf{z}^{\mathrm{int}}_2},\ \underbrace{z_{N+n+1},\ldots,z_{2N}}_{\mathbf{z}^{\mathrm{bd}}_2}\bigr).$$
This induces the following block partitioning of the linear system (2.7.32):
$$\begin{pmatrix} \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} & \mathbf{O} & \mathbf{O}\\ \mathbf{O} & \mathbf{O} & \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} \end{pmatrix}
\begin{pmatrix} \mathbf{z}^{\mathrm{int}}_1 \\ \mathbf{z}^{\mathrm{bd}}_1 \\ \mathbf{z}^{\mathrm{int}}_2 \\ \mathbf{z}^{\mathrm{bd}}_2 \end{pmatrix} = \mathbf{0}\,,
\qquad \mathbf{A}_{\mathrm{int}} \in \mathbb{R}^{n,n}\,,\quad \mathbf{A}_{\mathrm{bd}} \in \mathbb{R}^{n,N-n}\,,$$
which is equivalent to
$$\mathbf{A}_{\mathrm{int}}\mathbf{z}^{\mathrm{int}}_1 + \mathbf{A}_{\mathrm{bd}}\mathbf{z}^{\mathrm{bd}}_1 = \mathbf{0}\,,\qquad
\mathbf{A}_{\mathrm{int}}\mathbf{z}^{\mathrm{int}}_2 + \mathbf{A}_{\mathrm{bd}}\mathbf{z}^{\mathrm{bd}}_2 = \mathbf{0}\,. \qquad (2.7.33)$$
The linear system (2.7.33) holds the key to the algorithmic realization of mesh smoothing; when smoothing
the mesh
(i) the node coordinates belonging to interior nodes have to be adjusted to satisfy the equilibrium con-
dition (2.7.29), they are unknowns,
(ii) the coordinates of nodes located on the boundary are fixed, that is, their values are known.
This is a square linear system with an n × n system matrix, to be solved for two different right hand side
vectors. The matrix Aint is also known as the matrix of the combinatorial graph Laplacian.
We examine the sparsity pattern of the system matrices Aint for a sequence of triangulations created by
regular refinement.
We start from the triangulation of Fig. 78 and in turns perform regular refinement and smoothing (left ↔
Below we give spy plots of the system matrices Aint for the first three triangulations of the sequence:
Efficient Gaussian elimination for sparse matrices requires sophisticated algorithms that are encapsulated
in special types of solvers in E IGEN. Their calling syntax remains unchanged, however:
Eigen::SolverType<Eigen::SparseMatrix< double >> solver(A);
Eigen::VectorXd x = solver.solve(b);
C++-code 2.7.36: Function for solving a sparse LSE with E IGEN ➺ GITLAB
using SparseMatrix = Eigen::SparseMatrix<double>;
// Perform sparse elimination
void sparse_solve(const SparseMatrix& A, const VectorXd& b, VectorXd& x) {
  Eigen::SparseLU<SparseMatrix> solver(A);
  x = solver.solve(b);
}
The following codes initialize a sparse matrix, then perform an LU-factorization, and, finally, solve a sparse
linear system with a random right hand side vector.
  for (size_t l = 0; l < n; ++l)
    triplets.push_back(Eigen::Triplet<scalar_t>(l, l, 5.0));
  for (size_t l = 1; l < n; ++l) {
    triplets.push_back(Eigen::Triplet<scalar_t>(l - 1, l, 1.0));
    triplets.push_back(Eigen::Triplet<scalar_t>(l, l - 1, 1.0));
  }
  const size_t m = n / 2;
  for (size_t l = 0; l < m; ++l) {
    triplets.push_back(Eigen::Triplet<scalar_t>(l, l + m, 1.0));
    triplets.push_back(Eigen::Triplet<scalar_t>(l + m, l, 1.0));
  }
  SpMat M(n, n);
  M.setFromTriplets(triplets.begin(), triplets.end());
  M.makeCompressed();
  return M;
}
The compute method of the solver object triggers the actual sparse LU-decomposition. The solve
method then does forward and backward elimination, cf. § 2.3.30. It can be called multiple times, see
Rem. 2.5.10.
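A minimal sketch of this reuse pattern: the factorization is computed once by compute(), after which solve() is called for several right-hand sides (names and the way the right-hand sides are stored are illustrative).

#include <Eigen/Sparse>
using SpMat = Eigen::SparseMatrix<double>;

// One factorization, several right-hand sides: compute() triggers the sparse
// LU-decomposition once; each solve() only does forward/backward elimination.
void solveMany(const SpMat& A, const Eigen::MatrixXd& B, Eigen::MatrixXd& X) {
  Eigen::SparseLU<SpMat> solver;
  solver.compute(A);                    // sparse LU-decomposition of A
  X.resize(B.rows(), B.cols());
  for (int j = 0; j < B.cols(); ++j)
    X.col(j) = solver.solve(B.col(j));  // reuse the factorization
}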
In Ex. 2.6.5 we saw that applying the standard lu() solver to a sparse arrow matrix results in an extreme
waste of computational resources.
Yet, E IGEN can do much better! The main mistake was the creation of a dense matrix instead of storing
the arrow matrix in sparse format. There are E IGEN solvers which rely on particular sparse elimination
techniques. They still rely of Gaussian elimination with (partial) pivoting (→ Code 2.3.41), but take pains
to operate on non-zero entries only. This can greatly boost the speed of the elimination.
C++11 code 2.7.41: Invoking sparse elimination solver for arrow matrix ➺ GITLAB
template <class solver_t>
VectorXd arrowsys_sparse(const VectorXd& d, const VectorXd& c, const VectorXd& b,
                         const double alpha, const VectorXd& y) {
  int n = d.size();
  SparseMatrix<double> A(n + 1, n + 1);  // default: column-major
  VectorXi reserveVec = VectorXi::Constant(n + 1, 2);  // nnz per col
  reserveVec(n) = n + 1;  // last full col
  A.reserve(reserveVec);
  for (int j = 0; j < n; ++j) {  // initialize along cols for efficiency
    A.insert(j, j) = d(j);  // diagonal entries
    A.insert(n, j) = b(j);  // bottom row entries
  }
  for (int i = 0; i < n; ++i) {
    A.insert(i, n) = c(i);  // last col
  }
  A.insert(n, n) = alpha;  // bottom-right entry
  A.makeCompressed();
  return solver_t(A).solve(y);
}
Fig. 83: runtimes as functions of the matrix size n for arrowsys slow, arrowsys fast, arrowsys SparseLU and
arrowsys iterative.
Observation: the sparse elimination solver is much faster than lu() applied to the dense matrix.
The sparse solver is still slower than Code 2.6.10. The reason is that it is a
general algorithm that has to keep track of non-zero entries and has to be prepared to do pivoting.
Experiment 2.7.42 (Timing sparse elimination for the combinatorial graph Laplacian)
We consider a sequence of planar triangulations created by successive regular refinement (→ Def. 2.7.35)
of the planar triangulation of Fig. 78, see Ex. 2.7.23. We use different E IGEN and MKL sparse solver for
the linear system of equations (2.7.34) associated with each mesh.
Timing results: runtimes of Eigen SparseLU, Eigen SimplicialLDLT, Eigen ConjugateGradient, MKL PardisoLU,
and MKL PardisoLDLT, compared with the reference line O(n^1.5).
When solving linear systems of equations directly, dedicated sparse elimination solvers from
numerical libraries have to be used!
System matrices are passed to these algorithms in sparse storage formats (→ 2.7.1) to convey
information about zero entries.
STOP Never ever even think about implementing a general sparse elimination solver by yourself!
→ SuperLU (http://www.cs.berkeley.edu/~demmel/SuperLU.html),
→ UMFPACK (http://www.cise.ufl.edu/research/sparse/umfpack/), used by M ATLAB’s \,
C++11-code 2.7.44: Example code demonstrating the use of PARDISO with E IGEN ➺ GITLAB
void solveSparsePardiso(size_t n) {
  using SpMat = Eigen::SparseMatrix<double>;
  // Initialize a sparse matrix
  const SpMat M = initSparseMatrix<SpMat>(n);
  const Eigen::VectorXd b = Eigen::VectorXd::Random(n);
  Eigen::VectorXd x(n);
  // Initialization of the sparse direct solver based on the Pardiso library,
  // directly passing the matrix M to the solver.
  // Pardiso is part of the Intel MKL library, see also Ex. 1.3.24
  Eigen::PardisoLU<SpMat> solver(M);
  // The checks of Code 2.7.39 are omitted
  // solve the LSE
  x = solver.solve(b);
}
In Sect. 2.7.1 we have seen, how sparse matrices can be stored requiring O(nnz(A)) memory.
In Ex. 2.7.18 we found that (sometimes) matrix multiplication of sparse matrices can also be carried out
with optimal complexity, that is, with computational effort proportional to the total number of non-zero
entries of all matrices involved.
Does this carry over to the solution of linear systems of equations with sparse system matrices?
Section 2.7.4 says “Yes”, when sophisticated library routines are used. In this section, we examine some
aspects of Gaussian elimination ↔ LU-factorisation when applied in a sparse context.
We examine the following “sparse” matrix with a typical structure and inspect the pattern of the LU-factors
returned by E IGEN, see Code 2.7.46.
$$(\mathbf{A})_{ij} = \begin{cases} 3\,, & \text{if } i = j\,,\\ -1\,, & \text{if } |i-j| = 1 \ \text{or}\ |i-j| = n/2\,,\\ 0 & \text{else,}\end{cases}
\qquad \mathbf{A} \in \mathbb{R}^{n,n}\,,\ n \in \mathbb{N} \text{ even.}$$
Spy plots of A and of its LU-factors L and U: the factors contain many more non-zero entries than A itself.
Of course, in case the LU-factors of a sparse matrix possess many more non-zero entries than the matrix
itself, the effort for solving a linear system with direct elimination will increase significantly. This can be
quantified by means of the following concept:
A is called an “arrow matrix”, see the pattern of non-zero entries below and Ex. 2.6.5.
Recalling Rem. 2.3.32 it is easy to see that the LU-factors of A will be sparse and that their sparsity
patterns will be as depicted below. Observe that despite sparse LU-factors, A−1 will be densely populated.
Spy plots: pattern of A (nz = 31), pattern of A⁻¹ (nz = 121), pattern of L (nz = 21), pattern of U (nz = 21).
Recall the discussion in Ex. 2.6.5. Here we look at an arrow matrix in a slightly different form:
$$\mathbf{M} = \begin{pmatrix} \alpha & \mathbf{b}^\top\\ \mathbf{c} & \mathbf{D} \end{pmatrix}\,,\qquad
\alpha \in \mathbb{R}\,,\ \mathbf{b}, \mathbf{c} \in \mathbb{R}^{n-1}\,,\ \mathbf{D} \in \mathbb{R}^{n-1,n-1} \text{ a regular diagonal matrix (→ Def. 1.1.5).} \qquad (2.7.51)$$
Spy plots of the LU-factors of M (nz = 65 each): considerable fill-in occurs.
Now it comes as a surprise that the arrow matrix A from Ex. 2.6.5, (2.6.6), has sparse LU-factors!
Arrow matrix (2.6.6):
$$\mathbf{A} = \begin{pmatrix} \mathbf{D} & \mathbf{c}\\ \mathbf{b}^\top & \alpha \end{pmatrix}
= \underbrace{\begin{pmatrix} \mathbf{I} & 0\\ \mathbf{b}^\top\mathbf{D}^{-1} & 1 \end{pmatrix}}_{=:\mathbf{L}}
\cdot
\underbrace{\begin{pmatrix} \mathbf{D} & \mathbf{c}\\ 0 & \sigma \end{pmatrix}}_{=:\mathbf{U}}\,,
\qquad \sigma := \alpha - \mathbf{b}^\top\mathbf{D}^{-1}\mathbf{c}\,.$$
Idea: Transform A into the arrow form (2.6.6) (tip of the arrow at the bottom right) by row and column
permutations before performing the LU-decomposition.
Figs. 89, 90: spy plots of the matrix before and after the cyclic permutation (nz = 31 in both cases).
➣ Then LU-factorization (without pivoting) of the resulting matrix requires O(n) operations.
C++11 code 2.7.52: Permuting arrow matrix, see Figs. 89, 90 ➺ GITLAB
MatrixXd A(11, 11); A.setIdentity();
A.col(0).setOnes(); A.row(0) = RowVectorXd::LinSpaced(11, 11, 1);
// Permutation matrix (→ Def. 2.3.46) encoding cyclic permutation
MatrixXd P(11, 11); P.setZero();
P.topRightCorner(10, 10).setIdentity(); P(10, 0) = 1;
mgl::Figure fig1, fig2;
fig1.spy(A); fig1.setFontSize(4);
fig1.save("InvArrowSpy_cpp");
fig2.spy((P * A * P.transpose()).eval()); fig2.setFontSize(4);
fig2.save("ArrowSpy_cpp");
In Ex. 2.7.50 we found that permuting a matrix can make it amenable to Gaussian elimination/LU-decomposition
with much less fill-in (→ Def. 2.7.47). However, recall from Section 2.3.3 that pivoting, which may be essential
for achieving numerical stability, amounts to permuting the rows (or even columns) of the matrix.
Thus, we may face the awkward situation that pivoting tries to reverse the very permutation we applied to
minimize fill-in! The next example shows that this can happen for an arrow matrix.
Consider an arrow matrix of the type of Ex. 2.7.48, with small entries on the diagonal and (larger) entries 2
filling its dense last row: partial pivoting according to (2.3.42) will then trigger row swaps in every
elimination step.
The distributions of non-zero entries of the computed LU-factors (“spy-plots”) are as follows:
Spy plots: arrow matrix A (nz = 31), L factor (nz = 21), U factor (nz = 66, a fully populated upper triangle).
In this case the solution of an LSE with system matrix A ∈ R n,n of the above type by means of Gaussian
elimination with partial pivoting would incur costs of O(n³).
Banded matrices are a special class of sparse matrices (→ Notion 2.7.1) with extra structure: all non-zero
entries are confined to the main diagonal and a fixed number of super-diagonals and sub-diagonals
(in the sketched m×n example, the two bandwidths are 3 and 2).
We now examine a generalization of the concept of a banded matrix that is particularly useful in the context
of Gaussian elimination:
Definition 2.7.56. Matrix envelope
$$\mathbf{A} = \begin{pmatrix}
* & 0 & * & 0 & 0 & 0 & 0\\
0 & * & 0 & 0 & * & 0 & 0\\
* & 0 & * & 0 & 0 & 0 & *\\
0 & 0 & 0 & * & * & 0 & *\\
0 & * & 0 & * & * & * & 0\\
0 & 0 & 0 & 0 & * & * & 0\\
0 & 0 & * & * & 0 & 0 & *
\end{pmatrix}
\qquad
\begin{aligned}
\mathrm{bw}_1^R(\mathbf{A}) &= 0\\
\mathrm{bw}_2^R(\mathbf{A}) &= 0\\
\mathrm{bw}_3^R(\mathbf{A}) &= 2\\
\mathrm{bw}_4^R(\mathbf{A}) &= 0\\
\mathrm{bw}_5^R(\mathbf{A}) &= 3\\
\mathrm{bw}_6^R(\mathbf{A}) &= 1\\
\mathrm{bw}_7^R(\mathbf{A}) &= 4
\end{aligned}$$
env(A) ≙ the red entries; ∗ ≙ non-zero matrix entry a_ij ≠ 0.
Figs. 91, 92: spy plots (nz = 138 and nz = 121) illustrating the envelope of a sparse matrix and of its LU-factors.
Note: the envelope of the arrow matrix from Ex. 2.7.48 is just the set of index pairs of its non-zero entries.
Hence, the following theorem provides another reason for the sparsity of the LU-factors in that example.
Proof. (by induction, version II) Use block-LU-factorization, cf. Rem. 2.3.34 and the proof of Lemma 2.3.19:
$$\begin{pmatrix} \widetilde{\mathbf{A}} & \mathbf{b}\\ \mathbf{c}^\top & \alpha \end{pmatrix}
= \begin{pmatrix} \widetilde{\mathbf{L}} & 0\\ \mathbf{l}^\top & 1 \end{pmatrix}
\begin{pmatrix} \widetilde{\mathbf{U}} & \mathbf{u}\\ 0 & \xi \end{pmatrix}
\;\Rightarrow\; \widetilde{\mathbf{U}}^\top\mathbf{l} = \mathbf{c}\,,\quad \widetilde{\mathbf{L}}\mathbf{u} = \mathbf{b}\,. \qquad (2.7.59)$$
If $m^C_n(\mathbf{A}) = m$, then $b_1 = \cdots = b_{n-m} = 0$ (entries of b), and forward substitution in
$\widetilde{\mathbf{L}}\mathbf{u} = \mathbf{b}$ from (2.7.59) yields $u_1 = \cdots = u_{n-m} = 0$.
Thm. 2.7.58 immediately suggests a policy for saving computational effort when solving linear systems
whose system matrix A ∈ K n,n is sparse due to a small envelope, ♯env(A) ≪ n²:
Policy: Confine elimination to the envelope!
Envelope-aware LU-factorization:
      m(i) = i - j;
      break;
    }
  return m;
}
Asymptotic complexity of envelope aware forward substitution, cf. § 2.3.30, for Lx = y, L ∈ K n,n regular
lower triangular matrix is
O(# env(L)) !
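A minimal sketch of an envelope-aware forward substitution, assuming that the row bandwidths m_i(L) are available as an integer vector (as computed by the routine above); dense storage and 0-based indexing are used only for clarity of the sketch.

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd; using Eigen::VectorXi;

// Forward substitution L*x = y restricted to the envelope of L:
// in row i only columns j >= i - mr(i) can hold non-zero entries,
// so the inner loop visits O(#env(L)) entries in total.
VectorXd envForwardSubst(const MatrixXd& L, const VectorXi& mr, const VectorXd& y) {
  const int n = L.rows();
  VectorXd x(n);
  for (int i = 0; i < n; ++i) {
    double s = y(i);
    for (int j = i - mr(i); j < i; ++j) s -= L(i, j) * x(j);
    x(i) = s / L(i, i);
  }
  return x;
}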
Assumption: A ∈ K n,n is structurally symmetric.
The asymptotic complexity of envelope-aware LU-factorization is O(n · ♯env(A)).
Since by Thm. 2.7.58 fill-in is confined to the envelope, we need to store only the matrix entries a_ij,
(i, j) ∈ env(A), when computing the (in situ) LU-factorization of a structurally symmetric A ∈ K n,n .
Two arrays hold the envelope of A: scalar_t *val of size
$$P := n + \sum_{i=1}^{n} m_i(\mathbf{A})\,, \qquad (2.7.67)$$
and size_t *dptr of row pointers; row i of the envelope (ending with the diagonal entry a_ii) is stored
contiguously in val. For the 7×7 matrix A from above:
val:  a11 a22 a31 a32 a33 a44 a52 a53 a54 a55 a65 a66 a73 a74 a75 a76 a77   (indices 0 ... 16)
dptr: 0 1 2 5 6 10 12 17
Minimizing bandwidth/envelope:
Goal: Minimize mi (A),A = (aij ) ∈ R N,N , by permuting rows/columns of A
Recall: cyclic permutation of rows/columns of arrow matrix applied in Ex. 2.7.50. This can be viewed as
a drastic shrinking of the envelope:
Spy plots: envelope of the arrow matrix before and after the cyclic permutation (nz = 31 in both cases);
the permutation drastically shrinks the envelope.
Very desirable: a priori criteria, when Gaussian elimination/LU-factorization remains stable even
without pivoting. This can help avoid the extra work for partial pivoting and makes it possible to
exploit structure without worrying about stability.
This section will introduce classes of matrices that allow Gaussian elimination without pivoting. Fortu-
nately, linear systems of equations featuring system matrices from these classes are very common in
applications.
Example 2.8.1 (Diagonally dominant matrices from nodal analysis → Ex. 2.1.3)
Consider the nodal analysis (→ Ex. 2.1.3) of a linear resistive circuit with nodes ➀–➅ (node ➀ at potential U, node ➅ grounded):
$$\begin{aligned}
➁:&\quad R_{12}^{-1}(U_2-U_1) + R_{23}^{-1}(U_2-U_3) + R_{24}^{-1}(U_2-U_4) + R_{25}^{-1}(U_2-U_5) = 0,\\
➂:&\quad R_{23}^{-1}(U_3-U_2) + R_{35}^{-1}(U_3-U_5) = 0,\\
➃:&\quad R_{14}^{-1}(U_4-U_1) + R_{24}^{-1}(U_4-U_2) + R_{45}^{-1}(U_4-U_5) = 0,\\
➄:&\quad R_{25}^{-1}(U_5-U_2) + R_{35}^{-1}(U_5-U_3) + R_{45}^{-1}(U_5-U_4) + R_{56}^{-1}(U_5-U_6) = 0,\\
&\quad U_1 = U\,,\quad U_6 = 0\,.
\end{aligned}$$
$$\begin{pmatrix}
\frac{1}{R_{12}}+\frac{1}{R_{23}}+\frac{1}{R_{24}}+\frac{1}{R_{25}} & -\frac{1}{R_{23}} & -\frac{1}{R_{24}} & -\frac{1}{R_{25}}\\
-\frac{1}{R_{23}} & \frac{1}{R_{23}}+\frac{1}{R_{35}} & 0 & -\frac{1}{R_{35}}\\
-\frac{1}{R_{24}} & 0 & \frac{1}{R_{14}}+\frac{1}{R_{24}}+\frac{1}{R_{45}} & -\frac{1}{R_{45}}\\
-\frac{1}{R_{25}} & -\frac{1}{R_{35}} & -\frac{1}{R_{45}} & \frac{1}{R_{25}}+\frac{1}{R_{35}}+\frac{1}{R_{45}}+\frac{1}{R_{56}}
\end{pmatrix}
\begin{pmatrix} U_2\\ U_3\\ U_4\\ U_5 \end{pmatrix}
=
\begin{pmatrix} \frac{U}{R_{12}}\\ 0\\ \frac{U}{R_{14}}\\ 0 \end{pmatrix}$$
• $\sum_{j=1}^{n} a_{kj} \ge 0$, $k = 1, \ldots, n$, (2.8.3)
• A is regular. (2.8.4)
All these properties are obvious except for the fact that A is regular.
Proof of (2.8.4): By Thm. 2.2.4 it suffices to show that the nullspace of A is trivial: Ax = 0 ⇒ x = 0.
Pick x ∈ R n , Ax = 0, and i ∈ {1, . . . , n} so that
| xi | = max{| x j |, j = 1, . . . , n} .
Intermediate goal: show that all entries of x are the same
$$\mathbf{A}\mathbf{x} = \mathbf{0} \;\Rightarrow\; x_i = -\sum_{j\ne i}\frac{a_{ij}}{a_{ii}}\,x_j \;\Rightarrow\; |x_i| \le \sum_{j\ne i}\frac{|a_{ij}|}{|a_{ii}|}\,|x_j|\,. \qquad (2.8.5)$$
Hence, (2.8.6) combined with the estimate (2.8.5), which tells us that the maximum is smaller than or equal to
a mean, implies |x_j| = |x_i| for all j = 1, ..., n. Finally, the sign condition a_kj ≤ 0 for k ≠ j enforces
the same sign of all x_i. Thus we conclude, w.l.o.g., x_1 = x_2 = ⋯ = x_n. As
$$\exists\, i \in \{1,\ldots,n\}:\quad \sum_{j=1}^{n} a_{ij} > 0 \quad\text{(strict inequality)}\,,$$
the relation Ax = 0 then forces x_1 = ⋯ = x_n = 0, which proves the regularity of A.
A regular, diagonally dominant with positive diagonal
⟹ A has an LU-factorization ⇔ Gaussian elimination is feasible without pivoting(∗)
(∗): partial pivoting & diagonally dominant matrices ➣ triggers no row permutations !
$$\bigl|a^{(1)}_{ii}\bigr| - \sum_{\substack{j=2\\ j\ne i}}^{n}\bigl|a^{(1)}_{ij}\bigr|
= \Bigl|a_{ii} - \frac{a_{i1}}{a_{11}}a_{1i}\Bigr| - \sum_{\substack{j=2\\ j\ne i}}^{n}\Bigl|a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j}\Bigr|$$
$$\ge\; a_{ii} - \frac{|a_{i1}|\,|a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\ne i}}^{n}|a_{ij}| - \frac{|a_{i1}|}{a_{11}}\sum_{\substack{j=2\\ j\ne i}}^{n}|a_{1j}|$$
$$\ge\; a_{ii} - \frac{|a_{i1}|\,|a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\ne i}}^{n}|a_{ij}| - |a_{i1}|\,\frac{a_{11}-|a_{1i}|}{a_{11}}
\;\ge\; a_{ii} - \sum_{\substack{j=1\\ j\ne i}}^{n}|a_{ij}| \;\ge\; 0\,.$$
A regular, diagonally dominant ⇒ partial pivoting according to (2.3.42) selects i-th row in i-th step.
The class of symmetric positive definite (s.p.d.) matrices has been defined in Def. 1.1.8. They permit
stable Gaussian elimination without pivoting:
Equivalent to the assertion of the theorem: Gaussian elimination is feasible without pivoting
In fact, this theorem is a corollary of Lemma 2.3.19, because all principal minors of an s.p.d. matrix are
s.p.d. themselves.
$$\mathbf{A} = \begin{pmatrix} a_{11} & \mathbf{b}^\top\\ \mathbf{b} & \widetilde{\mathbf{A}} \end{pmatrix}
\;\xrightarrow[\text{Gaussian elimination}]{\text{1. step}}\;
\begin{pmatrix} a_{11} & \mathbf{b}^\top\\ \mathbf{0} & \widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}} \end{pmatrix}.$$
➣ to show: $\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}}$ s.p.d. (→ step of the induction argument)
✦ Evident: symmetry of $\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}} \in \mathbb{R}^{n-1,n-1}$
✦ As A is s.p.d. (→ Def. 1.1.8), for every $\mathbf{y} \in \mathbb{R}^{n-1}\setminus\{0\}$
$$0 < \begin{pmatrix} -\frac{\mathbf{b}^\top\mathbf{y}}{a_{11}}\\ \mathbf{y} \end{pmatrix}^{\!\top}
\begin{pmatrix} a_{11} & \mathbf{b}^\top\\ \mathbf{b} & \widetilde{\mathbf{A}} \end{pmatrix}
\begin{pmatrix} -\frac{\mathbf{b}^\top\mathbf{y}}{a_{11}}\\ \mathbf{y} \end{pmatrix}
= \mathbf{y}^\top\Bigl(\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}}\Bigr)\mathbf{y}\,.$$
➣ $\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}}$ is positive definite. ✷
The proof can also be based on the identities
$$\begin{pmatrix} (\mathbf{A})_{1:n-1,1:n-1} & (\mathbf{A})_{1:n-1,n}\\ (\mathbf{A})_{n,1:n-1} & (\mathbf{A})_{n,n} \end{pmatrix}
= \begin{pmatrix} \mathbf{L}_1 & \mathbf{0}\\ \mathbf{l}^\top & 1 \end{pmatrix}
\begin{pmatrix} \mathbf{U}_1 & \mathbf{u}\\ \mathbf{0} & \gamma \end{pmatrix}\,, \qquad (2.7.62)$$
$$\Rightarrow\quad (\mathbf{A})_{1:n-1,1:n-1} = \mathbf{L}_1\mathbf{U}_1\,,\quad \mathbf{L}_1\mathbf{u} = (\mathbf{A})_{1:n-1,n}\,,\quad \mathbf{U}_1^\top\mathbf{l} = (\mathbf{A})^\top_{n,1:n-1}\,,\quad \mathbf{l}^\top\mathbf{u} + \gamma = (\mathbf{A})_{n,n}\,,$$
noticing that the principal minor $(\mathbf{A})_{1:n-1,1:n-1}$ is also s.p.d. This allows a simple induction argument.
The next result gives a useful criterion for telling whether a given symmetric/Hermitian matrix is s.p.d.:
Proof. For A = A^H diagonally dominant, use the inequality between arithmetic and geometric mean (AGM),
ab ≤ ½(a² + b²):
$$\mathbf{x}^H\mathbf{A}\mathbf{x} = \sum_{i=1}^n a_{ii}|x_i|^2 + \sum_{i\ne j} a_{ij}\bar x_i x_j
\;\ge\; \sum_{i=1}^n a_{ii}|x_i|^2 - \sum_{i\ne j}|a_{ij}|\,|x_i|\,|x_j|$$
$$\overset{\mathrm{AGM}}{\ge} \sum_{i=1}^n a_{ii}|x_i|^2 - \tfrac12\sum_{i\ne j}|a_{ij}|\bigl(|x_i|^2 + |x_j|^2\bigr)$$
$$\ge\; \tfrac12\sum_{i=1}^n\Bigl\{a_{ii}|x_i|^2 - \sum_{j\ne i}|a_{ij}|\,|x_i|^2\Bigr\} + \tfrac12\sum_{j=1}^n\Bigl\{a_{jj}|x_j|^2 - \sum_{i\ne j}|a_{ij}|\,|x_j|^2\Bigr\}$$
$$\ge\; \sum_{i=1}^n |x_i|^2\Bigl(a_{ii} - \sum_{j\ne i}|a_{ij}|\Bigr) \;\ge\; 0\,.$$
Lemma 2.8.14. Cholesky decomposition for s.p.d. matrices → [?, Sect. 3.4], [?, Sect. II.5],
[?, Thm. 3.6]
For any s.p.d. A ∈ K n,n , n ∈ N, there is a unique upper triangular matrix R ∈ K n,n with rii > 0,
i = 1, . . . , n, such that A = RH R (Cholesky decomposition).
Proof: Consider the factorization $\mathbf{A} = \mathbf{L}\mathbf{D}\widetilde{\mathbf{U}}$, where D ≙ the diagonal of U
and $\widetilde{\mathbf{U}}$ ≙ a normalized upper triangular matrix (→ Def. 1.1.5). Then
$$\mathbf{A} = \mathbf{A}^\top \;\Rightarrow\; \mathbf{U} = \mathbf{D}\mathbf{L}^\top \;\Rightarrow\; \mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}^\top\,,$$
$$\mathbf{x}^\top\mathbf{A}\mathbf{x} > 0\ \ \forall\,\mathbf{x}\ne 0 \;\Rightarrow\; \mathbf{y}^\top\mathbf{D}\mathbf{y} > 0\ \ \forall\,\mathbf{y}\ne 0\,.$$
➤ D has a positive diagonal ➨ $\mathbf{R} = \sqrt{\mathbf{D}}\,\mathbf{L}^\top$. ✷
Formulas analogous to (2.3.22):
$$\mathbf{R}^H\mathbf{R} = \mathbf{A} \;\Rightarrow\; a_{ik} = \sum_{j=1}^{\min\{i,k\}} \bar r_{ji}\, r_{jk} =
\begin{cases}
\displaystyle \sum_{j=1}^{i-1} \bar r_{ji}\, r_{jk} + r_{ii}\, r_{ik}\,, & \text{if } i < k\,,\\[2mm]
\displaystyle \sum_{j=1}^{i-1} |r_{ji}|^2 + r_{ii}^2\,, & \text{if } i = k\,.
\end{cases} \qquad (2.8.15)$$
Computational costs (number of elementary arithmetic operations) of the Cholesky decomposition: $\tfrac16 n^3 + O(n^2)$
(➣ “half the costs” of LU-factorization, cf. the code in § 2.3.21, but this does not mean “twice as fast” in a
concrete implementation, because memory access patterns will have a crucial impact, see Rem. 1.4.8.)
Gains of efficiency hardly justify the use of Cholesky decomposition in modern numerical algorithms.
Savings in memory compared to standard LU-factorization (only one factor R has to be stored) offer a
stronger reason to prefer the Cholesky decomposition.
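The entry-wise relations (2.8.15) translate directly into an algorithm that computes the Cholesky factor R row by row. The following is only a sketch in the spirit of the lecture codes (real s.p.d. A assumed; the function name is not from the lecture):

#include <cmath>
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Cholesky factorization A = R^T R of a (real) s.p.d. matrix A,
// obtained by resolving the entry-wise relations (2.8.15); no pivoting.
MatrixXd choleskyfac(const MatrixXd &A) {
  const int n = A.rows();
  MatrixXd R = MatrixXd::Zero(n, n);
  for (int i = 0; i < n; ++i) {
    // diagonal entry: r_ii^2 = a_ii - sum_{j<i} |r_ji|^2
    R(i, i) = std::sqrt(A(i, i) - R.col(i).head(i).squaredNorm());
    // row i of R: r_ik = (a_ik - sum_{j<i} r_ji r_jk) / r_ii, k > i
    for (int k = i + 1; k < n; ++k)
      R(i, k) = (A(i, k) - R.col(i).head(i).dot(R.col(k).head(i))) / R(i, i);
  }
  return R;
}

In practice one relies on E IGEN's llt() method, which computes the same factorization with optimized memory access.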
The computation of Cholesky-factorization by means of the algorithm of Code 2.8.16 is numerically stable
(→ Def. 1.5.85)!
Reason: recall Thm. 2.4.4: Numerical instability of Gaussian elimination (with any kind of pivoting) mani-
fests itself in massive growth of the entries of intermediate elimination matrices A(k) .
Use the relationship between LU-factorization and Cholesky decomposition, which tells us that we only
have to monitor the growth of entries of intermediate upper triangular “Cholesky factorization matrices”
$A = (R^{(k)})^H R^{(k)}$.
Consider: Euclidean vector norm/matrix norm (→ Def. 1.5.76) ‖·‖₂
Computation of the Cholesky decomposition largely agrees with the computation of LU-factorization (with-
out pivoting). Using the latter together with forward and backward substitution (→ Sect. 2.3.2) to solve a
linear system of equations is algebraically and numerically equivalent to using Gaussian elimination with-
out pivoting.
From these equivalences we conclude:
solving an LSE via the Cholesky decomposition (plus forward and backward substitution) is numerically stable (→ Def. 1.5.85)
⇕
Gaussian elimination without pivoting is a numerically stable way to solve LSEs with s.p.d. system matrix.
Learning Outcomes
• A clear understanding of the algorithm of Gaussian elimination with and without pivoting (prerequisite
knowledge from linear algebra)
• Insight into the relationship between Gaussian elimination and LU-decomposition and the algorith-
mic relevance of LU-decomposition
Chapter 3
Direct Methods for Linear Least Squares Problems
In this chapter we study numerical methods for overdetermined linear systems of equations, that is, linear
systems with a “tall” rectangular system matrix
find x ∈ R^n :  “Ax = b” ,   b ∈ R^m ,  A ∈ R^{m,n} ,  m ≥ n .   (3.0.1)
In contrast to Chapter 1 we will mainly restrict ourselves to real linear systems in this chapter.
Note that the quotation marks in (3.0.1) indicate that this is not a well-defined problem in the sense of
§ 1.5.67; Ax = b does not define a mapping (A, b) ↦ x, because
Contents
3.0.1 Overdetermined Linear Systems of Equations: Examples . . . . . . . . . . . 215
3.1 Least Squares Solution Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.1.1 Least Squares Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.1.2 Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.1.3 Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . . . . . . . . . 225
3.1.4 Sensitivity of Least Squares Problem . . . . . . . . . . . . . . . . . . . . . . . 227
3.2 Normal Equation Methods [?, Sect. 4.2], [?, Ch. 11] . . . . . . . . . . . . . . . . . . 228
3.3 Orthogonal Transformation Methods [?, Sect. 4.4.2] . . . . . . . . . . . . . . . . . 232
3.3.1 Transformation Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.3.2 Orthogonal/Unitary Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
3.3.3 QR-Decomposition [?, Sect. 13], [?, Sect. 7.3] . . . . . . . . . . . . . . . . . . 233
3.3.3.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
You may think that overdetermined linear systems of equations are exotic, but this is not true. Rather they
are very common in mathematical models.
From first principles it is known that two physical quantities x ∈ R and y ∈ R (e.g., pressure and density
of an ideal gas) are related by a linear relationship y = αx + β with unknown parameters α, β ∈ R; plugging
in measured pairs (xᵢ, yᵢ), i = 1, . . . , m, yields the overdetermined linear system (3.0.4).
In practice, inevitable (“random”) measurement errors will thwart the solvability of (3.0.4), and for m > 2
the probability that a solution [α, β]⊤ exists is zero, see Rem. 3.1.2.
Known: without measurement errors the data would satisfy an affine linear relationship y = a⊤x + β, for
some a ∈ R^n, β ∈ R.
Plugging in the measured quantities gives yi = a⊤ xi + β, i = 1, . . . , m, a linear system of equations of
the form
$$\begin{pmatrix} \mathbf{x}_1^\top & 1\\ \vdots & \vdots\\ \mathbf{x}_m^\top & 1\end{pmatrix}
\begin{pmatrix} \mathbf{a}\\ \beta\end{pmatrix}
= \begin{pmatrix} y_1\\ \vdots\\ y_m\end{pmatrix}, \qquad (3.0.6)$$
overdetermined in case m > n + 1.
Gauss had already developed the foundations of his method in 1795, at the age of 18. It was based on
an idea of Pierre-Simon Laplace to sum up the absolute values of the errors so that the errors add up to
zero. Gauss instead took the squares of the errors and was able to drop this artificial additional requirement
on the errors.
Gauss then used the method intensively in his survey of the Kingdom of Hanover by triangulation. The
two-part work appeared in 1821 and 1823, followed by a supplement in 1826, as the Theoria combinationis
observationum erroribus minimis obnoxiae (theory of the combination of observations subject to the smallest
errors), in which Gauss was able to justify why his method was so successful compared with the others: the
method of least squares is optimal in a broad sense, that is, better than other methods.
We now extend Ex. 3.0.7 to planar triangulations, for which measured values for all internal angles are
available. We obtain an overdetermined system of equations by combining the following linear relations:
If the planar triangulation has N0 interior vertices and M cells, then we end up with 4M + N0 equations
for the 3M unknown angles.
Example 3.0.10 ((Relative) point locations from distances [?, Sect. 6.1])
Consider n points located on the real axis at unknown locations xi ∈ R , i = 1, . . . , n. At least we know
that xi < xi +1 , i = 1, . . . , n − 1.
Note that we can never expect a unique solution for x ∈ R^n, because adding a multiple of [1, 1, . . . , 1]⊤
to any solution will again yield a solution: A has a non-trivial kernel, N(A) = span{[1, 1, . . . , 1]⊤}.
Non-uniqueness can be cured by setting x1 := 0, thus removing one component of x.
If the measurements were perfect, we could then find x2 , . . . , xn from di −1,i , i = 2, . . . , n by solving a
standard (square) linear system of equations. However, as in Ex. 3.0.7, using much more information
through the overdetermined system (3.0.11) helps curb measurement errors.
Recall from linear algebra that Ax = b has a solution, if and only if the right hand side vector b lies in the
image (range space, → Def. 2.2.2) of the matrix A:
∃x ∈ R n : Ax = b ⇔ b ∈ R(A) . (3.1.1)
✎ Notation for important subspaces associated with a matrix A ∈ K m,n (→ Def. 2.2.2)
Remark 3.1.2 (Consistent right hand side vectors are highly improbable)
If R(A) 6= R m , then “almost all” perturbations of b (e.g., due to measurement errors) will destroy b ∈
R(A), because R(A) is a “set of measure zero” in R m .
For given A ∈ K^{m,n}, b ∈ K^m the vector x ∈ R^n is a least squares solution of the linear system of
equations Ax = b, if
$$\mathbf{x} \in \operatorname*{argmin}_{\mathbf{y}\in\mathbb{K}^n} \|A\mathbf{y}-\mathbf{b}\|_2
\quad\Leftrightarrow\quad
\|A\mathbf{x}-\mathbf{b}\|_2 = \inf_{\mathbf{y}\in\mathbb{K}^n}\|A\mathbf{y}-\mathbf{b}\|_2\;.$$
➨ A least squares solution is any vector x that minimizes the Euclidean norm of the residual r =
b − Ax, see Def. 2.4.1.
We write lsq(A, b) for the set of least squares solutions of the linear system of equations Ax = b,
A ∈ R m,n , b ∈ R m :
We consider the problem of parameter estimation for a linear model from Ex. 3.0.5:
y = a⊤ x + β , (3.1.6)
[?, Sect. 4.5]: In statistics we learn that the least squares estimate provides a maximum likelihood estimate,
if the measurement errors are uniformly and independently normally distributed.
Appealing to the geometric intuition gleaned from Fig. 96 we infer the orthogonality of b − Ax, x a least
squares solution of the overdetermined linear systems of equations Ax = b, to all columns of A:
Surprisingly, we have found a square linear system of equations satisfied by the least squares solution.
The next theorem gives the formal statement of this discovery. It also completely characterizes lsq(A, b)
and reveals a way to compute this set.
The vector x ∈ R n is a least squares solution (→ Def. 3.1.3) of the linear system of equations
Ax = b, A ∈ R m,n , b ∈ R m , if and only if it solves the normal equations
A⊤ Ax = A⊤ b . (3.1.11)
Note that the normal equations (3.1.11), A⊤A x = A⊤b, are an n × n square linear system of equations
with a symmetric positive semi-definite coefficient matrix A⊤A ∈ R^{n,n}.
➊: We first show that a least squares solution satisfies the normal equations. Let x ∈ R^n be a least
squares solution according to Def. 3.1.3. Pick an arbitrary d ∈ R^n \ {0} and define the function ϕ_d(τ) := ‖A(x + τd) − b‖₂², τ ∈ R.
Moreover, since every x ∈ lsq(A, b) is a minimizer of y ↦ ‖Ay − b‖₂², we conclude that τ ↦ ϕ_d(τ)
has a global minimum at τ = 0. Necessarily,
$$\frac{d\varphi_d}{d\tau}\Big|_{\tau=0} = 2\,\mathbf{d}^\top A^\top(A\mathbf{x}-\mathbf{b}) = 0\;.$$
Since this holds for any vector d 6= 0, we conclude (set d equal to all the Euclidean unit vectors in R n )
A⊤ (Ax − b) = 0 ,
➋: Let x be a solution of the normal equations. Then we find by tedious but straightforward computations
that ‖Ay − b‖₂ ≥ ‖Ax − b‖₂ for every y ∈ R^n. Since this holds for any y ∈ R^n, x must be a global minimizer of y ↦ ‖Ay − b‖₂!
✷
Example 3.1.13 (Normal equations for some examples from Section 3.0.1)
Given A and b it takes only elementary linear algebra operations to form the normal equations
A⊤ Ax = A⊤ b . (3.1.11)
• For § 5.7.7, A ∈ R^{m,2} given in (3.0.4) we obtain the normal equations linear system
$$\begin{pmatrix} x_1 & x_2 & \cdots & x_m\\ 1 & 1 & \cdots & 1\end{pmatrix}
\begin{pmatrix} x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1\end{pmatrix}
\begin{pmatrix} \alpha\\ \beta\end{pmatrix}
= \begin{pmatrix} \|\mathbf{x}\|_2^2 & \mathbf{1}^\top\mathbf{x}\\ \mathbf{1}^\top\mathbf{x} & m\end{pmatrix}
\begin{pmatrix} \alpha\\ \beta\end{pmatrix}
= \begin{pmatrix} \mathbf{x}^\top\mathbf{y}\\ \mathbf{1}^\top\mathbf{y}\end{pmatrix},$$
with 1 = [1, . . . , 1]⊤.
• In the case of Ex. 3.0.5 and the overdetermined m × (n + 1) linear system (3.0.6), the normal
equations read
$$\begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m\\ 1 & 1 & \cdots & 1\end{pmatrix}
\begin{pmatrix} \mathbf{x}_1^\top & 1\\ \vdots & \vdots\\ \mathbf{x}_m^\top & 1\end{pmatrix}
\begin{pmatrix} \mathbf{a}\\ \beta\end{pmatrix}
= \begin{pmatrix} XX^\top & X\mathbf{1}\\ \mathbf{1}^\top X^\top & m\end{pmatrix}
\begin{pmatrix} \mathbf{a}\\ \beta\end{pmatrix}
= \begin{pmatrix} X\mathbf{y}\\ \mathbf{1}^\top\mathbf{y}\end{pmatrix},
\qquad X := [\mathbf{x}_1, \dots, \mathbf{x}_m] \in \mathbb{R}^{n,m}.$$
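For illustration, the first (straight-line fit) system above can be assembled and solved directly; a minimal sketch with E IGEN (names are ad hoc, not from the lecture):

#include <Eigen/Dense>
using Eigen::VectorXd; using Eigen::Vector2d; using Eigen::Matrix2d;

// Fit y_i ≈ alpha*x_i + beta by solving the 2x2 normal equations above
Vector2d linefit(const VectorXd &x, const VectorXd &y) {
  Matrix2d N;                             // N = A^T A for A = [x 1]
  N << x.squaredNorm(), x.sum(),
       x.sum(),         double(x.size());
  const Vector2d rhs(x.dot(y), y.sum());  // A^T b
  return N.llt().solve(rhs);              // returns [alpha, beta]
}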
Thm. 3.1.10 together with Thm. 3.1.9 already confirms that the normal equations will always have a so-
lution and that lsq(A, b) is a subspace of R n parallel to N (A⊤ A). The next theorem gives even more
detailed information.
N(A⊤A) = N(A) ,   (3.1.19)
R(A⊤A) = R(A⊤) .   (3.1.20)
V⊥ := { x ∈ K^k : xᴴy = 0 ∀ y ∈ V } .
Az = 0 ⇒ A⊤ Az = 0 ⇔ z ∈ N (A⊤ A) .
If m ≥ n and N(A) = {0}, then the linear system of equations Ax = b, A ∈ R^{m,n}, b ∈ R^m, has
a unique least squares solution (→ Def. 3.1.3)
x = (A⊤A)⁻¹A⊤b ,   (3.1.23)
Hence the assumption N (A) = {0} of Cor. 3.1.22 is also called a full-rank condition, because the rank
of A is maximal.
that is, the manifest condition, that the points do not lie on a vertical line.
• In the case of Ex. 3.0.5 and the overdetermined m × (n + 1) linear system (3.0.6), we find
$$\operatorname{rank}\begin{pmatrix} \mathbf{x}_1^\top & 1\\ \vdots & \vdots\\ \mathbf{x}_m^\top & 1\end{pmatrix} = n + 1
\quad\Leftrightarrow\quad
\begin{minipage}{0.5\textwidth}there is a subset of $n+1$ points $\mathbf{x}_{i_1},\dots,\mathbf{x}_{i_{n+1}}$ such that $\{0,\mathbf{x}_{i_1},\dots,\mathbf{x}_{i_{n+1}}\}$ spans a non-degenerate $(n+1)$-simplex.\end{minipage}$$
If the system matrix A ∈ R^{m,n}, m ≥ n, of an overdetermined linear system arising from a mathematical
model fails to have full rank, this hints at inadequate modelling:
In this case parameters are redundant, because different sets of parameters yield the same output quan-
tities: the parameters are not “observable”.
J : R^n → R ,  J(y) := ‖Ay − b‖₂² .   (3.1.15)
From (3.1.15) and its explicit form as a polynomial in the vector components y_j we find the Hessian (→ Def. 8.4.11, [?,
Satz 7.5.3]) of J, which is the constant matrix 2A⊤A:
Thm. 3.1.18 implies that A⊤ A is positive definite (→ Def. 1.1.8) if and only if N (A) = {0}.
Therefore, by [?, Satz 7.5.3], under the full-rank condition J has a positive definite Hessian everywhere,
and a minimum at every stationary point of its gradient, that is, at every solution of the normal equations.
Another result from analysis tells us that real-valued C1 -functions on R n whose Hessian has positive
eigenvalues uniformly bounded away from zero are strictly convex. Hence, if A has full rank, the least
squares functional J from (3.1.15) is a strictly convex function.
Fig. 97
Now we are in a position to state precisely what we mean by solving an overdetermined (m ≥ n!) linear
system of equations Ax = b, A ∈ R m,n , b ∈ R m , provided that A has full (maximal) rank, cf. (3.1.25).
✎ A sloppy notation for the minimization problem (3.1.31) is ‖Ax − b‖₂ → min
As we have seen in Ex. 3.0.10, there can be many least squares solutions of Ax = b, in case N (A) 6=
{0}. We can impose another condition to single out a unique element of lsq(A, b):
➨ The generalized solution is the least squares solution with minimal norm.
Elementary geometry teaches that the minimal norm element of an affine subspace L (a plane) in Eu-
clidean space is the orthogonal projection of 0 onto L.
Visualization (Fig. 98): the minimal norm element x† of the affine space lsq(A, b) ⊂ R^n belongs to the
subspace lsq(A, b)⊥ of R^n that is orthogonal to lsq(A, b); it is the orthogonal projection of 0 onto lsq(A, b).
Since the space of least squares solutions of Ax = b is an affine subspace parallel to N (A)
lsq(A, b) = x0 + N (A) , x0 solves normal equations, (3.1.35)
the generalized solution x† of Ax = b is contained in N(A)⊥. Therefore, given a basis {v₁, . . . , v_k} ⊂
R^n of N(A)⊥, k := dim N(A)⊥, we can find y ∈ R^k such that x† = Vy, V := [v₁, . . . , v_k] ∈ R^{n,k}.
Plugging this representation into the normal equations and multiplying with V⊤ yields the reduced normal
equations
$$V^\top A^\top A V\, \mathbf{y} = V^\top A^\top \mathbf{b}\,. \qquad (3.1.36)$$
The very construction of V ensures N (AV) = {0} so that, by Thm. 3.1.18 the k × k linear system of
equations (3.1.36) has a unique solution. The next theorem summarizes our insights:
✎ notation: A† ∈ R^{n,m} =̂ pseudoinverse of A ∈ R^{m,n}
Note that the Moore-Penrose pseudoinverse does not depend on the choice of V.
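If a matrix V with linearly independent columns spanning N(A)⊥ is available, the reduced normal equations (3.1.36) yield the generalized solution directly. A minimal sketch (the computation of V itself, e.g. from an SVD of A, is not shown; names are ad hoc):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Generalized (minimal norm) least squares solution via the reduced
// normal equations (3.1.36): x = V*y with (AV)^T (AV) y = (AV)^T b
VectorXd lsqGeneralized(const MatrixXd &A, const MatrixXd &V,
                        const VectorXd &b) {
  const MatrixXd AV = A * V;   // N(AV) = {0} by construction of V
  const VectorXd y =
      (AV.transpose() * AV).llt().solve(AV.transpose() * b);
  return V * y;                // x lies in R(V) = N(A)^⊥
}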
Armed with the concept of generalized solution and the knowledge about its existence and uniqueness we
can state the most general linear least squares problem:
given: A ∈ R m,n , m, n ∈ N, b ∈ R m ,
find: x ∈ R n such that (3.1.38)
(i) k Ax − bk2 = inf{k Ay − bk2 : y ∈ R n },
(ii) k xk2 is minimal under the condition (i).
Recall Section 2.2.2, where we discussed the sensitivity of solutions of square linear systems, that is,
the impact of perturbations in the problem data on the result. Now we study how (small) changes in A
and b affect the unique (→ Cor. 3.1.22) least squares solution x of Ax = b in the case of A with full rank (⇔
N(A) = {0}).
Note: If the matrix A ∈ R m,n , m ≥ n, has full rank, then there is a c > 0 such that A + ∆A still has
full rank for all ∆A ∈ R m,n with k∆A k2 < c. Hence, “sufficiently small” perturbations will not destroy the
full-rank property of A. This is a generalization of the Perturbation Lemma 2.2.11.
For square linear systems the condition number of the system matrix (→ Def. 2.2.12) provided the key
gauge of sensitivity. To express the sensitivity of linear least squares problems we also generalize this
concept:
For a square regular matrix this agrees with its condition number according to Def. 2.2.12, which follows
from Cor. 1.5.82.
For m ≥ n, A ∈ R^{m,n}, rank(A) = n, let x ∈ R^n be the solution of the least squares problem
‖Ax − b‖₂ → min and x̂ the solution of the perturbed least squares problem ‖(A + ∆A)x̂ − b‖₂ → min. Then
$$\frac{\|\mathbf{x}-\hat{\mathbf{x}}\|_2}{\|\mathbf{x}\|_2}
\;\leq\; \Big(2\,\mathrm{cond}_2(A) + \mathrm{cond}_2^2(A)\,\frac{\|\mathbf{r}\|_2}{\|A\|_2\,\|\mathbf{x}\|_2}\Big)\,\frac{\|\Delta A\|_2}{\|A\|_2}\,,$$
where r = b − Ax is the residual. This means:
if ‖r‖₂ ≪ 1  ➤  condition of the least squares problem ≈ cond₂(A),
if ‖r‖₂ “large”  ➤  condition of the least squares problem ≈ cond₂²(A).
For instance, in a linear parameter estimation problem (→ Ex. 3.0.5) a small residual will be the conse-
quence of small measurement errors.
3.2 Normal Equation Methods [?, Sect. 4.2], [?, Ch. 11]
In fact, Cor. 3.1.22 suggests a simple algorithm for solving linear least squares problems of the form
(3.1.31) satisfying the full (maximal) rank condition rank(A) = n: it boils down to solving the normal
equations (3.1.11):
C++11 code 3.2.1: Solving a linear least squares problem via normal equations
//! Solving the overdetermined linear system of equations
//! Ax = b by solving the normal equations (3.1.11)
//! The least squares solution is returned by value
VectorXd normeqsolve(const MatrixXd &A, const VectorXd &b) {
  if (b.size() != A.rows()) throw runtime_error("Dimension mismatch");
  // Normal equations solved via Cholesky factorization
  VectorXd x = (A.transpose() * A).llt().solve(A.transpose() * b);
  return x;
}
By Thm. 2.8.11, for the s.p.d. matrix A⊤ A Gaussian elimination remains stable even without pivoting. This
is taken into account by requesting the Cholesky decomposition of A⊤ A by calling the method llt().
The problem size parameters for the linear least squares problem (3.1.31) are the matrix dimensions
m, n ∈ N, where n small & fixed, n ≪ m, is common.
In Section 1.4.2 and Thm. 2.5.2 we discussed the asymptotic complexity of the operations involved in steps
➊–➌ of the normal equation method:
step ➊ (form A⊤A): cost O(mn²),
step ➋ (form A⊤b): cost O(nm),
step ➌ (solve the n × n normal equations): cost O(n³),
giving a total cost of O(n²m + n³) for m, n → ∞.
Note that for small fixed n, n ≪ m, m → ∞ the computational effort scales linearly with m.
$$A = \begin{pmatrix} 1 & 1\\ \delta & 0\\ 0 & \delta\end{pmatrix}
\;\Rightarrow\;
A^\top A = \begin{pmatrix} 1+\delta^2 & 1\\ 1 & 1+\delta^2\end{pmatrix}.$$
Exp. 1.5.35: If δ ≈ √EPS, then 1 + δ² ≐ 1 in M. Hence the computed A⊤A will fail to
be regular, though rank(A) = 2 and cond₂(A) ≈ √2/δ ≈ √(2/EPS) is still moderate.
C++-code 3.2.5:
int main() {
  MatrixXd A(3, 2);
  // Inquire about machine precision → Ex. 1.5.33
  double eps = std::numeric_limits<double>::epsilon();
  // Initialization of the matrix → § 1.2.13
  A << 1, 1, sqrt(eps), 0, 0, sqrt(eps);
  // Output rank of A and of A^T A
  std::cout << "Rank of A: " << A.fullPivLu().rank() << std::endl
            << "Rank of A^T A: "
            << (A.transpose() * A).fullPivLu().rank() << std::endl;
  return 0;
}

Output:
Rank of A: 2
Rank of A^T*A: 1
A sparse ⇏ A⊤A sparse
The benefit of using (3.2.8) instead of the standard normal equations (3.1.11) is that sparsity is preserved.
However, the conditioning of the system matrix in (3.2.8) is not better than that of A⊤ A.
A more general substitution r := α−1 (Ax − b) with α > 0 may improve the conditioning for suitably
chosen parameter α
$$A^H A\mathbf{x} = A^H\mathbf{b}
\quad\Leftrightarrow\quad
B_\alpha\begin{pmatrix}\mathbf{r}\\\mathbf{x}\end{pmatrix}
:= \begin{pmatrix} -\alpha I & A\\ A^H & 0\end{pmatrix}
\begin{pmatrix}\mathbf{r}\\\mathbf{x}\end{pmatrix}
= \begin{pmatrix}\mathbf{b}\\ 0\end{pmatrix}. \qquad (3.2.9)$$
For m, n ≫ 1, A sparse, both (3.2.8) and (3.2.9) lead to large sparse linear systems of equations,
amenable to sparse direct elimination techniques, see Section 2.7.5.
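A minimal sketch of how the augmented system (3.2.9) can be set up as an E IGEN sparse matrix and handed to a sparse direct solver (the helper name and the choice of SparseLU are assumptions made here for illustration):

#include <Eigen/Sparse>
#include <vector>
using Eigen::SparseMatrix; using Eigen::VectorXd; using Eigen::Triplet;

// Solve the extended normal equations (3.2.9) for sparse A (m x n)
VectorXd extnormeqsolve(const SparseMatrix<double> &A,
                        const VectorXd &b, double alpha) {
  const int m = A.rows(), n = A.cols();
  std::vector<Triplet<double>> trp;
  for (int i = 0; i < m; ++i) trp.emplace_back(i, i, -alpha); // -alpha*I block
  for (int k = 0; k < A.outerSize(); ++k)
    for (SparseMatrix<double>::InnerIterator it(A, k); it; ++it) {
      trp.emplace_back(it.row(), m + it.col(), it.value());   // A block
      trp.emplace_back(m + it.col(), it.row(), it.value());   // A^H block
    }
  SparseMatrix<double> B(m + n, m + n);
  B.setFromTriplets(trp.begin(), trp.end());
  VectorXd rhs = VectorXd::Zero(m + n); rhs.head(m) = b;
  Eigen::SparseLU<SparseMatrix<double>> solver(B); // sparse direct solver
  const VectorXd z = solver.solve(rhs);
  return z.tail(n);  // the x-component of [r; x]
}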
In this example we explore empirically how the Euclidean condition number of the extended normal
equations (3.2.9) is influenced by the choice of α. Consider (3.2.8), (3.2.9) for
$$A = \begin{pmatrix} 1+\epsilon & 1\\ 1-\epsilon & 1\\ \epsilon & \epsilon\end{pmatrix}.$$
Fig. 99 plots cond₂(A), cond₂(AᴴA), cond₂(B) and cond₂(B_α) in dependence on ε (here α = ε‖A‖₂/√2).
Recall the rationale behind Gaussian elimination (→ Section 2.3, Ex. 2.3.1)
➥ By row transformations convert the LSE Ax = b to an equivalent (in terms of the set of solutions) LSE Ux = b̃,
which is easier to solve because it has triangular form.
Two questions: ➊ What linear least squares problems are “easy to solve” ?
➋ How can we arrive at them by equivalent transformations of (3.1.31) ?
Here we call two overdetermined linear systems Ax = b and Ãx = b̃ equivalent in the sense of (3.1.31),
if both have the same set of least squares solutions: lsq(A, b) = lsq(Ã, b̃), see (3.1.4).
Linear least squares problems (3.1.31) with upper triangular A are easy to solve!
$$\left\|\begin{pmatrix} R\\ 0\end{pmatrix}
\begin{pmatrix} x_1\\ \vdots\\ x_n\end{pmatrix}
- \begin{pmatrix} b_1\\ \vdots\\ b_m\end{pmatrix}\right\|_2 \to \min
\;\overset{(*)}{\Longrightarrow}\;
\mathbf{x} = R^{-1}\begin{pmatrix} b_1\\ \vdots\\ b_n\end{pmatrix} \;\hat{=}\; \text{least squares solution}$$
How can we draw the conclusion (∗)? Obviously, the components n + 1, . . . , m of the vector inside the
norm are fixed and do not depend on x. All we can do is to make the first components 1, . . . , n vanish, by
choosing a suitable x, see [?, Thm. 4.13]. Obviously, x = R−1 (b)1:n accomplishes this.
Note: since A has full rank n, the upper triangular part R ∈ R n,n of A is regular!
Answer to question ➋:
Idea: If we have a (transformation) matrix T ∈ R^{m,m} satisfying ‖Ty‖₂ = ‖y‖₂ for all y ∈ R^m, then
lsq(A, b) = lsq(Ã, b̃), where à = TA and b̃ = Tb.
The next section will characterize the class of eligible transformation matrices T.
From Thm. 3.3.5 we immediately conclude that, if a matrix Q ∈ K n,n is unitary/orthogonal, then
This section will answer the question whether and how it is possible to find orthogonal transformations that
convert any given matrix A ∈ R m,n , m ≥ n, rank(A) = n, to upper triangular form, as required for the
application of the “equivalence transformation idea” to full-rank linear least squares problems.
3.3.3.1 Theory
Input: {a1 , . . . , ak } ⊂ K n
Output: {q1 , . . . , qk } (assuming no premature termination!)
The span property (1.5.2) can be made more explicit in terms of the existence of linear combinations
$$\begin{aligned}
\mathbf{q}_1 &= t_{11}\mathbf{a}_1\\
\mathbf{q}_2 &= t_{12}\mathbf{a}_1 + t_{22}\mathbf{a}_2\\
\mathbf{q}_3 &= t_{13}\mathbf{a}_1 + t_{23}\mathbf{a}_2 + t_{33}\mathbf{a}_3\\
&\;\;\vdots\\
\mathbf{q}_k &= t_{1k}\mathbf{a}_1 + t_{2k}\mathbf{a}_2 + \cdots + t_{kk}\mathbf{a}_k
\end{aligned}
\qquad\Leftrightarrow\qquad
\exists\, T \in \mathbb{R}^{k,k} \text{ upper triangular: } Q = AT\,, \qquad (3.3.8)$$
where Q = [q₁, . . . , q_k] ∈ R^{m,k} (with orthonormal columns), A = [a₁, . . . , a_k] ∈ R^{m,k}. Note that
thanks to the linear independence of {a₁, . . . , a_k} and {q₁, . . . , q_k}, the matrix T = (t_{ij})_{i,j=1}^k ∈ R^{k,k} is regular.
Recall from Lemma 1.3.9 that inverses of regular upper triangular matrices are upper triangular again.
Thus, by (3.3.8), we have found an upper triangular R := T⁻¹ ∈ R^{k,k} such that A = QR.
Next “augmentation by zero”: add m − n zero rows at the bottom of R and complement columns of Q to
e ∈ R m,m :
an orthonormal basis of R m , which yields an orthogonal matrix Q
$$A = \widetilde Q\begin{pmatrix} R\\ 0\end{pmatrix}
\quad\Leftrightarrow\quad
\widetilde Q^\top A = \begin{pmatrix} R\\ 0\end{pmatrix}.$$
A = Q₀ · R₀ (“economical” QR-decomposition),
(ii) a unitary matrix Q ∈ K^{n,n} and a unique upper triangular R ∈ K^{n,k} with (R)_{i,i} > 0, i ∈
{1, . . . , n}, such that
A = QR ,  Q ∈ K^{n,n} ,  R ∈ K^{n,k} ,
A = QR .   (3.3.11)
Proof. We observe that R is regular, if A has full rank n. Since the regular upper triangular matrices form
a group under multiplication:
In theory, Gram-Schmidt orthogonalization (GS) can be used to compute the QR-factorization of a matrix
A ∈ R m,n , m ≥ n, rank(A) = n. However, as we saw in Exp. 1.5.5, Gram-Schmidt orthogonalization in
the form of Code 1.5.3 is not a stable algorithm.
There is a stable way to compute QR-decompositions, based on the accumulation of orthogonal transfor-
mations.
Corollary 3.3.13. Composition of orthogonal transformations
The product of two orthogonal/unitary matrices of the same size is again orthogonal/unitary.
Recall that this “annihilation of column entries” is the key operation in Gaussian forward elimination, where
it is achieved by means of non-unitary row transformations, see Sect. 2.3.2. Now we want to find a
counterpart of Gaussian elimination based on unitary row transformations on behalf of numerical stability.
In 2D there are two possible orthogonal transformations that make the 2nd component of a ∈ R² vanish, which,
in geometric terms, amounts to mapping the vector onto the x₁-axis.
Fig. 100: reflection at a line through the origin mapping a onto the x₁-axis.  Fig. 101: rotation by the angle ϕ,
$Q = \begin{pmatrix}\cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi\end{pmatrix}$, mapping a onto the x₁-axis.
Note that in each case we have two different length-preserving linear mappings at our disposal. This
flexibility will be important for curbing the impact of roundoff.
Both reflections and rotations are actually used in library routines and both are discussed in the sequel:
The following so-called Householder matrices effect the reflection of a vector into a multiple of the first unit
vector with the same length:
$$Q = H(\mathbf{v}) := I - 2\,\frac{\mathbf{v}\mathbf{v}^H}{\mathbf{v}^H\mathbf{v}}
\qquad\text{with}\qquad
\mathbf{v} = \tfrac12\big(\mathbf{a} \pm \|\mathbf{a}\|_2\,\mathbf{e}_1\big)\,. \qquad (3.3.16)$$
Orthogonality of these matrices can be established by direct computation.
With v := ½(a − b) (Fig. 102):
$$\mathbf{b} = \mathbf{a} - (\mathbf{a}-\mathbf{b})
= \mathbf{a} - 2\mathbf{v}\,\frac{\mathbf{v}^\top\mathbf{v}}{\mathbf{v}^\top\mathbf{v}}
= \mathbf{a} - 2\mathbf{v}\,\frac{\mathbf{v}^\top\mathbf{a}}{\mathbf{v}^\top\mathbf{v}}
= \mathbf{a} - 2\,\frac{\mathbf{v}\mathbf{v}^\top}{\mathbf{v}^\top\mathbf{v}}\,\mathbf{a}
= H(\mathbf{v})\,\mathbf{a}\,,$$
where we used (a − b)⊤(a − b) = (a − b)⊤(a − b + a + b) = 2(a − b)⊤a, valid because ‖a‖₂ = ‖b‖₂.
Suitable successive Householder transformations determined by the leftmost column (“target column”) of
shrinking bottom-right matrix blocks can be used to achieve upper triangular form R. Visualization of the
annihilation of the lower triangular matrix part for a square matrix: in each step all entries below the diagonal
of the current target column are mapped to zero, until the matrix is upper triangular.
Writing Q_ℓ for the Householder matrix used in the ℓ-th step we get
$$Q_{n-1}Q_{n-2}\cdots Q_1 A = R\,,$$
that is, the QR-factorization (QR-decomposition) of A ∈ C^{n,n}: A = QR with the orthogonal matrix
Q := Q₁ᴴ · · · · · Q_{n−1}ᴴ and R an upper triangular matrix.
We can also apply successive Householder transformation as outlined in § 3.3.15 to a matrix A ∈ R m,n
with m < n. If the first m columns of A are linearly independent, we obtain another variant of the QR-
decomposition:
A = QR ,  Q ∈ R^{m,m} ,  R ∈ R^{m,n} ,
In (3.3.16) the computation of the vector v can be prone to cancellation (→ Section 1.5.4), if the vector
a encloses a very small angle with the first unit vector, because in this case v can be very small and
beset with a huge relative error. This is a concern, because in the formula for the Householder matrix v is
2
normalized to unit length (division by k vk2 ).
Fortunately, two choices for v are possible in (3.3.16) and at most one can be affected by cancellation.
The right choice is
$$\mathbf{v} = \begin{cases}
\tfrac12\big(\mathbf{a} + \|\mathbf{a}\|_2\,\mathbf{e}_1\big), & \text{if } a_1 > 0\,,\\
\tfrac12\big(\mathbf{a} - \|\mathbf{a}\|_2\,\mathbf{e}_1\big), & \text{if } a_1 \leq 0\,.
\end{cases}$$
See [?, Sect. 19.1] and [?, Sect. 5.1.3] for a discussion.
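A minimal sketch of this choice and of applying the resulting reflection without ever forming H(v) (real case, ad-hoc function names):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Householder vector v = (a ± ||a||_2 e_1)/2 with the sign chosen
// according to the cancellation-avoiding rule above
VectorXd householdervec(const VectorXd &a) {
  VectorXd v = a;
  v(0) += (a(0) > 0 ? 1.0 : -1.0) * a.norm();
  return 0.5 * v;
}

// Apply H(v) = I - 2 v v^T / (v^T v) from (3.3.16) to a vector x
VectorXd applyhouseholder(const VectorXd &v, const VectorXd &x) {
  return x - 2.0 * (v.dot(x) / v.squaredNorm()) * v;
}

With these two helpers, applyhouseholder(householdervec(a), a) maps a onto ∓‖a‖₂e₁.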
The 2D rotation displayed in Fig. 101 can be embedded in an identity matrix. Thus, the following orthogonal
transformation, a Givens rotation, annihilates the k-th component of a vector a = [ a1 , . . . , an ]⊤ ∈ R n .
Here γ stands for cos( ϕ) and σ for sin( ϕ), ϕ the angle of rotation, see Fig. 101.
$$G_{1k}(a_1,a_k)\,\mathbf{a} :=
\begin{pmatrix}
\gamma & \cdots & \sigma & \cdots & 0\\
\vdots & \ddots & \vdots & & \vdots\\
-\sigma & \cdots & \gamma & \cdots & 0\\
\vdots & & \vdots & \ddots & \vdots\\
0 & \cdots & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} a_1\\ \vdots\\ a_k\\ \vdots\\ a_n\end{pmatrix}
= \begin{pmatrix} a_1^{(1)}\\ \vdots\\ 0\\ \vdots\\ a_n\end{pmatrix},
\quad\text{if}\quad
\gamma = \frac{a_1}{\sqrt{|a_1|^2+|a_k|^2}}\,,\;\;
\sigma = \frac{a_k}{\sqrt{|a_1|^2+|a_k|^2}}\,. \qquad (3.3.20)$$
Orthogonality (→ Def. 6.2.2) of G1k (a1 , ak ) is verified immediately. Again, we have two options for an
annihilating rotation, see Ex. 3.3.14. It will always be possible to choose one that avoids cancellation [?,
Sect. 5.1.8], see Code 3.3.21 for details.
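Code 3.3.21 (planerot) is used in the listings below; a minimal sketch of what such a routine can look like is given here (interface chosen to match the calls planerot(tmp, G, xDummy); for brevity it omits the cancellation-avoiding sign choice just mentioned):

#include <Eigen/Dense>
using Eigen::Vector2d; using Eigen::Matrix2d;

// Plane (Givens) rotation: G orthogonal with G*a = [ ||a||_2, 0 ]^T,
// cf. (3.3.20); the rotated vector is returned in x
void planerot(const Vector2d &a, Matrix2d &G, Vector2d &x) {
  if (a(1) != 0.0) {
    const double r = a.norm();       // sqrt(|a_1|^2 + |a_k|^2)
    G << a(0) / r, a(1) / r,         // [  gamma, sigma ]
        -a(1) / r, a(0) / r;         // [ -sigma, gamma ]
    x << r, 0.0;
  } else {                           // nothing to annihilate
    G.setIdentity();
    x = a;
  }
}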
So far, we know how to annihilate a single component of a vector by means of a Givens rotation that
targets that component and some other (the first in (3.3.20)). However, for the sake of QR-decomposition
we aim to map all components to zero except for the first.
☞ This can be achieved by n − 1 successive Givens rotations, see also Code 3.3.23
$$\begin{pmatrix} a_1\\ a_2\\ a_3\\ a_4\\ \vdots\\ a_n\end{pmatrix}
\xrightarrow{\;G_{12}(a_1,a_2)\;}
\begin{pmatrix} a_1^{(1)}\\ 0\\ a_3\\ a_4\\ \vdots\\ a_n\end{pmatrix}
\xrightarrow{\;G_{13}(a_1^{(1)},a_3)\;}
\begin{pmatrix} a_1^{(2)}\\ 0\\ 0\\ a_4\\ \vdots\\ a_n\end{pmatrix}
\xrightarrow{\;G_{14}(a_1^{(2)},a_4)\;}
\cdots
\xrightarrow{\;G_{1n}(a_1^{(n-2)},a_n)\;}
\begin{pmatrix} a_1^{(n-1)}\\ 0\\ \vdots\\ \vdots\\ 0\end{pmatrix}. \qquad (3.3.22)$$
C++11 code 3.3.23: Rotating a vector onto the x₁-axis by successive Givens transformations
// Orthogonal transformation of a (column) vector into a multiple of
// the first unit vector by successive Givens transformations
void givenscoltrf(const VectorXd& aIn, MatrixXd& Q, VectorXd& aOut) {
  unsigned int n = aIn.size();
  // Assemble the rotations in a dense matrix.
  // For (more efficient) alternatives see Rem. 3.3.25
  Q.setIdentity();
  Matrix2d G; Vector2d tmp, xDummy;
  aOut = aIn;
  for (int j = 1; j < n; ++j) {
    tmp(0) = aOut(0); tmp(1) = aOut(j);
    planerot(tmp, G, xDummy); // see Code 3.3.21
    // select 1st and j-th element of aOut and use the Map function
    // to prevent copying; equivalent to aOut([1,j]) in MATLAB
    Map<VectorXd, 0, InnerStride<>> aOutMap(aOut.data(), 2, InnerStride<>(j));
    aOutMap = G * aOutMap;
    // select 1st and j-th column of Q (Q(:,[1,j]) in MATLAB)
    Map<MatrixXd, 0, OuterStride<>> QMap(Q.data(), n, 2, OuterStride<>(j * n));
    QMap = QMap * G.transpose();
  }
}
Armed with these compound Givens rotations we can proceed as in the case of Householder reflections
to accomplish the orthogonal transformation of a full-rank matrix to upper triangular form, see
  for (int i = 0; i < n - 1; ++i) {
    for (int j = n - 1; j > i; --j) {
      tmp(0) = R(j - 1, i); tmp(1) = R(j, i);
      planerot(tmp, G, xDummy); // see Code 3.3.21
      R.block(j - 1, 0, 2, n) = G * R.block(j - 1, 0, 2, n);
      Q.block(0, j - 1, n, 2) = Q.block(0, j - 1, n, 2) * G.transpose();
    }
  }
}
The matrices for the orthogonal transformation are never built in codes!
The transformations are stored in a compressed format.
which means
ρ = 1  ⇒  γ = 0, σ = 1 ;
|ρ| < 1  ⇒  σ = 2ρ, γ = √(1 − σ²) ;
|ρ| > 1  ⇒  γ = 2/ρ, σ = √(1 − γ²) .
Then store Gij (a, b) as triple (i, j, ρ). The parameter ρ forgets the sign of the matrix Gij , so the signs of
the corresponding rows in the transformed matrix R have to be changed accordingly. The rationale behind
the above convention is to curb the impact of roundoff errors.
The advantage of Givens rotations is their selectivity, which can be exploited for banded matrices, see
Section 2.7.6, Def. 2.7.55.
Example: Orthogonal transformation of an n × n tridiagonal matrix to upper triangular form, that is, the
annihilation of the sub-diagonal, by means of successive Givens rotations:
(Sketch: an n × n tridiagonal matrix; the rotation G₁₂ annihilates the first subdiagonal entry and creates a
new non-zero entry above the superdiagonal, and the subsequent rotations G₂₃, . . . , G_{n−1,n} remove the
remaining subdiagonal entries in the same way.)
∗ =̂ entry set to zero by a Givens rotation, ∗ =̂ new non-zero entry (“fill-in” → Def. 2.7.47).
This is a manifestation of a more general result, see Def. 2.7.55 for notations:
A total of only n Givens rotations is required, involving an asymptotic total computational effort of O(nm)
for an m × n-matrix.
In numerical linear algebra orthogonal transformation methods usually give rise to reliable algorithms,
thanks to the norm-preserving property of orthogonal transformations.
We are interested in the sensitivity of F, that is, the impact of relative errors in the data vector x on the
output vector y := F(x).
We conclude, that unitary/orthogonal transformations do not involve any amplification of relative errors in
the data vectors.
Of course, this also applies to the “solution” of square linear systems with orthogonal coefficient matrix
Q ∈ R n,n , which, by Def. 6.2.2, boils down to multiplication of the right hand side vector with QH .
Gaussian elimination as presented in § 2.3.3 converts a matrix to upper triangular form by elementary
row transformations. These add a scalar multiple of one row of the matrix to another row and amount to
left-multiplication with matrices of the form T(µ) below.
However, these transformations can lead to a massive amplification of relative errors, which, by virtue of
Ex. 2.2.7 can be linked to large condition numbers of T.
This accounts for fact that the computation of LU-decompositions by means of Gaussian elimination might
not be stable, see Ex. 2.4.5.
Study in 2D: the elementary elimination matrices of Gaussian elimination,
$$T(\mu) = \begin{pmatrix} 1 & 0\\ \mu & 1\end{pmatrix},$$
have condition numbers that grow without bound as |µ| increases (see the plot).
The perfect conditioning of orthogonal transformations prevents the destructive build-up of roundoff errors.
E IGEN offers several classes dedicated to computing QR-type decompositions of matrices, for instance
HouseholderQR. Internally the QR-decomposition is stored in compressed format as explained in Rem. 3.3.25.
Its computation is triggered by the constructor.
Note that the method householderQ returns the Q-factor in compressed format → Rem. 3.3.25. Assignment
to a matrix will convert it into a (dense) matrix format, see Line 8; only then is the actual computation
of the matrix entries performed. It can also be multiplied with another matrix of suitable size, which is
used in Line 19 to extract the Q-factor Q₀ ∈ R^{m,n} of the economical QR-decomposition (3.3.3.1).
The matrix returned by the method matrixQR() gives access to a matrix storing the QR-factors in
compressed form. Its upper triangular part provides R, see Line 20.
A close inspection of the algorithm for the computation of QR-decompositions of A ∈ R m,n by successive
Householder reflections (→ § 3.3.15) reveals that n transformations costing ∼ mn operations each are
required.
Runtime measurements obtained with Code 3.4.13. Platform: ✦ ubuntu 14.04 LTS.
The QR-decomposition introduced in Section 3.3.3, Thm. 3.3.9, paves the way for the practical algorithmic
realization of the “equivalent orthonormal transformation to upper triangular form”-idea from Section 3.3.1.
We consider the full-rank linear least squares problem Eq. (3.1.31): Given A ∈ R m,n , m ≥ n, rank(A) =
n,
$$\|A\mathbf{x}-\mathbf{b}\|_2 = \big\|Q(R\mathbf{x} - Q^H\mathbf{b})\big\|_2 = \big\|R\mathbf{x} - \tilde{\mathbf{b}}\big\|_2\,, \qquad \tilde{\mathbf{b}} := Q^H\mathbf{b}\,.$$
2 2
e
b1
..
.
R0
x1
..
kAx − b k2 → min ⇔ . −
→ min .
xn
0 ..
.
e
bm
2
−1 0
..
e .
b1
.. 0
x=
. , with residual r = Q e
.
R0 bn + 1
e
bn ..
.
e
bm
q
Note: by Thm. 3.3.5 the norm of the residual is readily available: krk2 = eb2n+1 + · · · + e
b2m .
C++-code 3.3.38: QR-based solver for full-rank linear least squares problem (3.1.31)
// Solution of linear least squares problem (3.1.31) by means of QR-decomposition
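// The body of this listing is given here only as a reconstruction sketch
// (function name and use of HouseholderQR are assumptions):
VectorXd qrlsqsolve(const MatrixXd& A, const VectorXd& b) {
  const unsigned n = A.cols();
  // QR-decomposition of A, stored in compressed format (→ Rem. 3.3.25)
  Eigen::HouseholderQR<MatrixXd> qr(A);
  // transformed right-hand side b~ = Q^T b
  const VectorXd btilde = qr.householderQ().transpose() * b;
  // backward substitution with the upper triangular n x n block R_0
  const MatrixXd R0 = qr.matrixQR().topRows(n);
  return R0.triangularView<Eigen::Upper>().solve(btilde.head(n));
}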
C++11 code 3.3.39: E IGEN's built-in QR-based linear least squares solver
// Solving a full-rank least squares problem ‖Ax − b‖₂ → min in E IGEN
double lsqsolve_eigen(const MatrixXd& A, const VectorXd& b,
                      VectorXd& x) {
  x = A.householderQr().solve(b);
  return ((A * x - b).norm());
}
Applying the QR-based algorithm for full-rank linear least squares problems in the case m = n, that
is, to a square linear system of equations Ax = b with a regular coefficient matrix , will compute the
solution x = A−1 b. In a sense, the QR-decomposition offers an alternative to Gaussian elimination/LU-
decomposition discussed in § 2.3.30.
The steps for solving a linear system of equations Ax = b by means of QR-decomposition are as follows:
① QR-decomposition A = QR, computational cost ⅔n³ + O(n²) (about twice as expensive as LU-decomposition without pivoting);
② orthogonal transformation z = Qᴴb, computational cost 4n² + O(n) (in the case of compact storage of reflections/rotations);
③ backward substitution, solve Rx = z, computational cost ½n(n + 1).
Benefit: we can utterly dispense with any kind of pivoting:
✌ Computing the generalized QR-decomposition A = QR by means of Householder reflections or Givens rotations is numerically stable for any A ∈ C^{m,n}.
✌ For any regular system matrix an LSE can be solved in a stable manner by means of
QR-decomposition + orthogonal transformation + backward substitution.
Drawback: QR-decomposition can hardly ever avoid massive fill-in (→ Def. 2.7.47) also in situations,
where LU-factorization greatly benefits from Thm. 2.7.58.
From Rem. 3.3.26, Thm. 3.3.27, we know that the particular situation in which QR-decomposition can
avoid fill-in (→ Def. 2.7.47) is the case of banded matrices, see Def. 2.7.55. For banded n × n linear
systems of equations with small fixed bandwidth bw(A) ≤ O(1) we incur an
➣ asymptotic computational effort: O(n) for n → ∞
The following code uses a QR-decomposition computed by means of selective Givens rotations (→ § 3.3.19)
to solve a tridiagonal linear system of equations Ax = b with
$$A = \begin{pmatrix}
d_1 & c_1 & 0 & \cdots & 0\\
e_1 & d_2 & c_2 & & \vdots\\
0 & e_2 & d_3 & c_3 & \\
\vdots & \ddots & \ddots & \ddots & c_{n-1}\\
0 & \cdots & 0 & e_{n-1} & d_n
\end{pmatrix}.$$
The matrix is passed in the form of three vectors e, c, d giving the entries in the non-zero bands.
Aiming to confirm the claim of superior stability of QR-based approaches (→ Rem. 3.3.40, § 3.3.28) we
revisit Wilkinson’s counterexample from Ex. 2.4.5 for which Gaussian elimination with partial pivoting does
not yield an acceptable solution.
$$(A)_{i,j} := \begin{cases}
1 & \text{for } i = j\,,\\
-1 & \text{for } i > j,\ j < n\,,\\
0 & \text{for } i < j,\ j < n\,,\\
1 & \text{for } j = n\,.
\end{cases}$$
Fig. 105 plots the relative residual norms versus the matrix size n for Gaussian elimination and for
QR-decomposition: QR-decomposition produces a perfect solution, whereas Gaussian elimination with
partial pivoting fails for larger n.
Let us summarize the pros and cons of orthogonal transformation techniques for linear least squares
problems:
Use orthogonal transformation methods for least squares problems (3.1.38) whenever A ∈
R^{m,n} is dense and n is small.
SVD/QR-factorization cannot exploit sparsity:
Use normal equations in the expanded form (3.2.8)/(3.2.9), when A ∈ R m,n sparse (→
Notion 2.7.1) and m, n big.
In § 2.6.13 we faced the task of solving a square linear system of equations Ãx = b efficiently, whose
coefficient matrix à was a (rank-1) perturbation of A, for which an LU-decomposition was available.
Lemma 2.6.22 showed a way to reuse the information contained in the LU-decomposition.
A similar task can be posed, if the QR-decomposition of a matrix A ∈ R^{m,n}, m ≥ n, has already been
computed and we then have to solve a full-rank linear least squares problem ‖Ãx − b‖₂ → min with
à ∈ R^{m,n} a “slight” perturbation of A. If we aim to use orthogonalization techniques it would be desirable
to compute the QR-decomposition of à with recourse to the QR-decomposition of A.
For A ∈ R^{m,n}, m ≥ n, rank(A) = n, we consider the rank-1 modification, cf. Eq. (2.6.17),
A ⟶ Ã := A + uv⊤ ,  u ∈ R^m ,  v ∈ R^n .   (3.3.45)
We assume that still rank(Ã) = n.
Given the (unique) full QR-decomposition A = QR, Q ∈ R^{m,m} orthogonal, R ∈ R^{m,n} upper triangular,
according to Thm. 3.3.9, the goal is to find an efficient algorithm that yields the (unique) full QR-decomposition
of Ã: Ã = Q̃R̃.
Successive Givens rotations, ending with G_{n−2,n−1} and G_{n−1,n}, annihilate the remaining entries
below the diagonal and leave an upper triangular matrix =: R̃.   (3.3.46)
We need n Givens rotations acting on matrix rows of length n: ➣ computational effort O(n²).
$$A + \mathbf{u}\mathbf{v}^H = \widetilde Q\widetilde R
\qquad\text{with}\qquad
\widetilde Q := Q\,Q_1^H\,G_{n-1,n}^H\cdots G_{12}^H\,.$$
➣ Asymptotic total computational effort O(mn) for m, n → ∞.
For large n this is much cheaper than the cost O(n²m) for computing the QR-decomposition of à from
scratch.
k ↦ n + 1,  i ↦ i − 1 for i = k + 1, . . . , n + 1, realized by a permutation matrix P ∈ R^{n+1,n+1} that
moves the k-th column to the last position and shifts the columns k + 1, . . . , n + 1 one position to the left.
This effects the following transformation of Ã:
$$\widetilde A \;\longrightarrow\; A_1 = \widetilde A P = [\mathbf{a}_1, \dots, \mathbf{a}_n, \mathbf{v}] = Q\,\big[\,R \;\; Q^\top\mathbf{v}\,\big]\,,$$
that is, the appended column contributes the extra column Q⊤v.
Case m > n + 1:
① If m > n + 1 there is an orthogonal transformation Q₁ ∈ R^{m,m}, for instance realized by m − n − 1
Givens rotations or a single Householder reflection, such that the last m − n − 1 components of Q₁Q⊤v
vanish. Then Q₁Q⊤A₁ has non-zero entries only in its first n + 1 rows: an upper triangular block bordered
by a full last column, with m − n − 1 zero rows below.
(Sketch: successive Givens rotations, each acting on a pair of rows, restore the upper triangular form;
the sketch highlights the target rows of the Givens rotations and the new entries ≠ 0.)
➣ Computational effort for this step: O((n − k)²).
We are given a matrix A ∈ R m,n of which a full QR-decomposition (→ Thm. 3.3.9) A = QR, Q ∈ R m,m
orthogonal, R ∈ R m,n upper triangular, is already available, maybe only in encoded form (→ Rem. 3.3.25).
We add another row to the matrix A in arbitrary position k ∈ {1, . . . , m}
$$A \in \mathbb{R}^{m,n} \;\mapsto\; \widetilde A :=
\begin{pmatrix}
a_{1,\cdot}\\ \vdots\\ a_{k-1,\cdot}\\ \mathbf{v}^\top\\ a_{k,\cdot}\\ \vdots\\ a_{m,\cdot}
\end{pmatrix},
\qquad \mathbf{v} \in \mathbb{R}^n\,. \qquad (3.3.48)$$
With a permutation P that moves the new row to the bottom:
$$P\widetilde A = \begin{pmatrix} A\\ \mathbf{v}^\top\end{pmatrix}
\quad\Rightarrow\quad
\begin{pmatrix} Q^H & 0\\ 0 & 1\end{pmatrix} P\widetilde A
= \begin{pmatrix} R\\ \mathbf{v}^\top\end{pmatrix}.$$
Case m = n
Step ②: Restore upper triangular form through Givens rotations (→ § 3.3.19)
Successively target bottom row and rows from the top to turn leftmost entries of bottom row into zeros.
Successive Givens rotations G_{1,m}, G_{2,m}, . . . , G_{m−2,m}, G_{m−1,m} combine the bottom row with
rows 1, 2, . . . , m − 1 and annihilate its leftmost entries one after the other, until an upper triangular
matrix =: R̃ is obtained.   (3.3.49)
Beside the QR-decomposition of a matrix A ∈ R^{m,n} there are other factorizations based on orthogonal
transformations. The most important among them is the singular value decomposition (SVD), which can
be used to tackle linear least squares problems and many other optimization problems beyond, see [?].
Theorem 3.4.1. Singular value decomposition → [?, Thm. 9.6], [?, Thm. 11.1]
For any A ∈ K^{m,n} there are unitary matrices U ∈ K^{m,m}, V ∈ K^{n,n} and a (generalized) diagonal(∗)
matrix Σ = diag(σ₁, . . . , σ_p) ∈ R^{m,n}, p := min{m, n}, σ₁ ≥ σ₂ ≥ · · · ≥ σ_p ≥ 0, such that
A = UΣVᴴ .
➤ ∃x ∈ K^n, y ∈ K^m, ‖x‖₂ = ‖y‖₂ = 1:  Ax = σy,  σ = ‖A‖₂,
where we used the definition of the matrix 2-norm, see Def. 1.5.76. By Gram-Schmidt orthogonalization
or a similar procedure we can extend the single unit vectors x and y to orthonormal bases of K^n and K^m,
respectively: ∃Ṽ ∈ K^{n,n−1}, Ũ ∈ K^{m,m−1} such that
V = [x Ṽ] ∈ K^{n,n} ,  U = [y Ũ] ∈ K^{m,m}  are unitary.
$$U^H A V = \begin{pmatrix} \mathbf{y}^H\\ \widetilde U^H\end{pmatrix} A\,\begin{pmatrix}\mathbf{x} & \widetilde V\end{pmatrix}
= \begin{pmatrix} \mathbf{y}^H A\mathbf{x} & \mathbf{y}^H A\widetilde V\\ \widetilde U^H A\mathbf{x} & \widetilde U^H A\widetilde V\end{pmatrix}
= \begin{pmatrix} \sigma & \mathbf{w}^H\\ 0 & B\end{pmatrix} =: A_1\,.$$
we conclude
$$\|A_1\|_2^2 = \sup_{0\neq\mathbf{z}\in\mathbb{K}^n}\frac{\|A_1\mathbf{z}\|_2^2}{\|\mathbf{z}\|_2^2}
\;\geq\; \frac{\big\|A_1\binom{\sigma}{\mathbf{w}}\big\|_2^2}{\big\|\binom{\sigma}{\mathbf{w}}\big\|_2^2}
\;\geq\; \frac{(\sigma^2 + \mathbf{w}^H\mathbf{w})^2}{\sigma^2 + \mathbf{w}^H\mathbf{w}}
= \sigma^2 + \mathbf{w}^H\mathbf{w}\,. \qquad (3.4.2)$$
$$\sigma^2 = \|A\|_2^2 = \big\|U^H A V\big\|_2^2 = \|A_1\|_2^2
\;\overset{(3.4.2)}{\geq}\; \sigma^2 + \|\mathbf{w}\|_2^2
\;\Rightarrow\; \mathbf{w} = 0\,,
\qquad\text{hence}\qquad
A_1 = \begin{pmatrix} \sigma & 0\\ 0 & B\end{pmatrix}.$$
The decomposition A = UΣVH of Thm. 3.4.1 is called singular value decomposition (SVD) of A.
The diagonal entries σi of Σ are the singular values of A.
As in the case of the QR-decomposition, compare (3.3.3.1) and (3.3.3.1), we can also drop the bottom
zero rows of Σ and the corresponding columns of U in the case of m > n. Thus we end up with an
“economical” singular value decomposition of A ∈ K m,n :
with true diagonal matrices Σ, whose diagonals contain the singular values of A.
Visualization of the economical SVD for m > n:  A = UΣVᴴ with U ∈ K^{m,n}, Σ ∈ K^{n,n} diagonal, Vᴴ ∈ K^{n,n}.
Lemma 3.4.5.
The squares σi2 of the non-zero singular values of A are the non-zero eigenvalues of AH A, AAH
with associated eigenvectors (V):,1 , . . . , (V):,p , (U):,1 , . . . , (U):,p , respectively.
Proof. AAᴴ and AᴴA are similar (→ Lemma 9.1.6) to diagonal matrices with non-zero diagonal entries
σᵢ² (σᵢ ≠ 0), e.g.,
Remark 3.4.6 (SVD and additive rank-1 decomposition → [?, Cor. 11.2], [?, Thm. 9.8])
Recall from linear algebra: rank-1 matrices are tensor products of vectors
because rank(A) = 1 means that Ax = µ(x)u for some u ∈ K m and linear form x 7→ µ(x). By the
Riesz representation theorem the latter can be written as µ(x) = vH x.
$$A = U\Sigma V^H = \sum_{j=1}^{p} \sigma_j\, (U)_{:,j} (V)_{:,j}^H\,. \qquad (3.4.8)$$
The SVD from Def. 3.4.3 is not (necessarily) unique, but the singular values are.
Proof. Proof by contradiction: assume that A has two singular value decompositions
$$A = U_1\Sigma_1V_1^H = U_2\Sigma_2V_2^H
\;\Rightarrow\;
U_1\underbrace{\Sigma_1\Sigma_1^H}_{=\operatorname{diag}(\sigma_1^2,\dots,\sigma_m^2)}U_1^H
= AA^H
= U_2\underbrace{\Sigma_2\Sigma_2^H}_{=\operatorname{diag}(\sigma_1^2,\dots,\sigma_m^2)}U_2^H\,.$$
Two similar diagonal matrices with non-increasing diagonal entries are equal! ✷
(3.4.12)
The E IGEN class JacobiSVD is constructed from a matrix data type; it computes the SVD of its argument
during construction and offers the access methods matrixU(), singularValues(), and matrixV()
to request the SVD-factors and singular values.
The second argument in the constructor of JacobiSVD determines, whether the methods matrixU()
and matrixV() return the factor for the full SVD of Def. 3.4.3 or of the economical (thin) SVD (3.4.4):
Eigen::ComputeFull* will select the full versions, whereas Eigen::ComputeThin* picks the
economical versions → documentation.
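A minimal usage sketch (the random test matrix and the chosen flags are just an example):

#include <Eigen/Dense>
#include <iostream>
using Eigen::MatrixXd; using Eigen::VectorXd;

int main() {
  const MatrixXd A = MatrixXd::Random(5, 3);
  // economical (thin) SVD: U is 5x3, V is 3x3
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const MatrixXd U = svd.matrixU(), V = svd.matrixV();
  const VectorXd sv = svd.singularValues(); // sorted in decreasing order
  std::cout << "largest singular value = " << sv(0) << std::endl
            << "reconstruction error = "
            << (A - U * sv.asDiagonal() * V.transpose()).norm() << std::endl;
  return 0;
}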
Internally, the computation of the SVD is done by a sophisticated algorithm, for which key steps rely on
orthogonal/unitary transformations. Also there we reap the benefit of the exceptional stability brought
about by norm-preserving transformations → § 3.3.28.
According to E IGEN’s documentation the SVD of a general dense matrix involves the following asymptotic
complexity:
Based on Lemma 3.4.11, the SVD is the main tool for the stable computation of the rank of a matrix (→
Def. 2.2.3)
However, theory as reflected in Lemma 3.4.11 entails identifying zero singular values, which must rely
on a threshold condition in a numerical code, recall Rem. 1.5.36. Given the SVD A = UΣVH , Σ =
diag(σ1 , . . . , σmin{m,n} ), of a matrix A ∈ K m,n , A 6= 0 and a tolerance tol > 0, we define the numerical
rank
r := ♯ σi : |σi | ≥ tol max{|σj |} . (3.4.16)
j
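A minimal sketch of (3.4.16) in code; this mirrors, but is not identical to, what E IGEN's built-in rank() member (see below) does:

#include <Eigen/Dense>
using Eigen::MatrixXd;

// numerical rank according to (3.4.16); assumes A != 0
int numrank(const MatrixXd &A, double tol) {
  Eigen::JacobiSVD<MatrixXd> svd(A);      // singular values only
  const auto &sv = svd.singularValues();  // sorted decreasingly
  int r = 0;
  while (r < sv.size() && sv(r) >= tol * sv(0)) ++r;
  return r;
}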
E IGEN offers an equivalent built-in method rank() for objects representing singular value decomposi-
tions:
“Computing” a subspace of R k amounts to making available a (stable) basis of that subspace, ideally an
orthonormal basis.
Lemma 3.4.11 taught us how to glean orthonormal bases of N (A) and R(A) from the SVD of a matrix
A. This immediately gives a numerical method and its implementation is given in the next two codes.
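A minimal sketch of such routines, combining the full SVD with the numerical rank (3.4.16) (function names are ad hoc):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// ONB of the (numerical) kernel N(A): trailing columns of V
MatrixXd nullspaceONB(const MatrixXd &A, double tol = 1e-12) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeFullV);
  const auto &sv = svd.singularValues();
  int r = 0;                                    // numerical rank (3.4.16)
  while (r < sv.size() && sv(r) >= tol * sv(0)) ++r;
  return svd.matrixV().rightCols(A.cols() - r);
}

// ONB of the range R(A): leading r columns of U
MatrixXd rangeONB(const MatrixXd &A, double tol = 1e-12) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeFullU);
  const auto &sv = svd.singularValues();
  int r = 0;
  while (r < sv.size() && sv(r) >= tol * sv(0)) ++r;
  return svd.matrixU().leftCols(r);
}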
In a similar fashion as explained for the QR-decomposition in Section 3.3.4, the singular value decomposition
(SVD, → Def. 3.4.3) can be used to transform general linear least squares problems (3.1.38) into a simpler
form. In the case of the SVD-based orthogonal transformation method this simpler form involves merely a
diagonal matrix.
Here we consider the most general setting: A ∈ K^{m,n}, rank(A) = r ≤ min{m, n}, cf. (3.1.38). In
particular, we drop the assumption of full rank of A. This means that the minimum norm condition (ii) in
the definition (3.1.38) of a linear least squares problem may be required for singling out a unique solution.
We recall the SVD of A ∈ R^{m,n}:
$$A = \underbrace{[\,U_1\;\;U_2\,]}_{\in\mathbb{R}^{m,m}}\;
\underbrace{\begin{pmatrix}\Sigma_r & 0\\ 0 & 0\end{pmatrix}}_{\in\mathbb{R}^{m,n}}\;
\underbrace{\begin{pmatrix} V_1^H\\ V_2^H\end{pmatrix}}_{\in\mathbb{R}^{n,n}}\,,
\qquad \Sigma_r = \operatorname{diag}(\sigma_1,\dots,\sigma_r)\,. \qquad (3.4.22)$$
$$\|A\mathbf{x}-\mathbf{b}\|_2
= \left\|[\,U_1\;\;U_2\,]\begin{pmatrix}\Sigma_r & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix} V_1^H\\ V_2^H\end{pmatrix}\mathbf{x} - \mathbf{b}\right\|_2
= \left\|\begin{pmatrix}\Sigma_r V_1^H\mathbf{x} - U_1^H\mathbf{b}\\ -U_2^H\mathbf{b}\end{pmatrix}\right\|_2 \qquad (3.4.23)$$
To fix a unique solution in the case r < n we appeal to the minimal norm condition in (3.1.38): by the
considerations of § 3.1.34, the solution x of (3.4.24) is unique up to contributions from N(A) = R(V₂).
Since V is unitary, the minimal norm solution is obtained by setting contributions from R(V₂) to zero,
which amounts to choosing x ∈ R(V₁). This converts (3.4.24) into
$$\Sigma_r\underbrace{V_1^H V_1}_{=I}\,\mathbf{z} = U_1^H\mathbf{b}
\quad\Rightarrow\quad
\mathbf{z} = \Sigma_r^{-1}U_1^H\mathbf{b}\,,
\qquad\text{where } \mathbf{x} = V_1\mathbf{z}\,.$$
In a practical implementation, as in Code 3.4.17, we have to resort to the numerical rank from (3.4.16):
where we have assumed that the singular values σj are sorted according to decreasing modulus.
// (only the tail of the original listing survived; the signature and
//  first lines below are a reconstruction sketch)
VectorXd lsqsvd_eigen(const MatrixXd& A, const VectorXd& b) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  return svd.solve(b);
}
Remark 3.4.29 (Pseudoinverse and SVD → [?, Ch. 12], [?, Sect. 4.7])
From Thm. 3.1.37 we could conclude a general formula for the Moore-Penrose pseudoinverse of any matrix
A ∈ R m,n . Now, the solution formula (3.4.26) directly yields a concrete incarnation of the pseudoinverse
A+ .
Theorem 3.4.30. Pseudoinverse and SVD
If A ∈ K^{m,n} has the SVD A = UΣVᴴ partitioned as in (3.4.22), then its Moore-Penrose
pseudoinverse (→ Thm. 3.1.37) is given by A† = V₁Σ_r⁻¹U₁ᴴ.
For the general least squares problem (3.1.38) we have seen the use of the SVD for its numerical solution in
Section 3.4.3. There the SVD was a powerful tool for solving a minimization problem for a 2-norm. In many
other contexts the SVD is also a key component in numerical optimization.
We consider the following problem of finding the extrema of quadratic forms on the Euclidean unit sphere
{x ∈ K^n : ‖x‖₂ = 1}:
given A ∈ K^{m,n}, m ≥ n, find x ∈ K^n, ‖x‖₂ = 1, ‖Ax‖₂ → min .   (3.4.31)
Use that multiplication with orthogonal/unitary matrices preserves the 2-norm (→ Thm. 3.3.5) and resort
to the singular value decomposition A = UΣVH (→ Def. 3.4.3):
$$\min_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2^2
= \min_{\|\mathbf{x}\|_2=1}\big\|U\Sigma V^H\mathbf{x}\big\|_2^2
= \min_{\|V^H\mathbf{x}\|_2=1}\big\|U\Sigma (V^H\mathbf{x})\big\|_2^2
= \min_{\|\mathbf{y}\|_2=1}\|\Sigma\mathbf{y}\|_2^2\,.$$
Since the singular values are assumed to be sorted as σ1 ≥ σ2 ≥ · · · ≥ σn , the minimum with value σn2
is attained for VH x = y = en ⇒ minimizer x = Ven = (V):,n .
By similar arguments we can solve the corresponding norm constrained maximization problem
given A ∈ K^{m,n}, m ≥ n, find x ∈ K^n, ‖x‖₂ = 1, ‖Ax‖₂ → max ,
and obtain the solution based on the SVD A = UΣVᴴ of A:
$$\sigma_1 = \max_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2\,, \qquad (V)_{:,1} = \operatorname*{argmax}_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2\,. \qquad (3.4.33)$$
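Both extremal problems thus reduce to reading off the extreme singular values and the corresponding columns of V; a minimal sketch:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// minimizer and maximizer of ||Ax||_2 on the Euclidean unit sphere
void extremaOnSphere(const MatrixXd &A, VectorXd &xmin, VectorXd &xmax) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeFullV);
  const MatrixXd &V = svd.matrixV();
  xmin = V.col(V.cols() - 1);  // attains sigma_n = min ||Ax||_2, cf. (3.4.31)
  xmax = V.col(0);             // attains sigma_1 = max ||Ax||_2, cf. (3.4.33)
}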
Recall: The Euclidean matrix norm (2-norm) of the matrix A (→ Def. 1.5.76) is defined as the maximum
in (3.4.33). Thus we have proved the following theorem:
If A ∈ K^{m,n} has singular values σ₁ ≥ σ₂ ≥ · · · ≥ σ_{min{m,n}}, then its Euclidean matrix norm is
given by ‖A‖₂ = σ₁(A).
If m = n and A is regular/invertible, then its 2-norm condition number is cond₂(A) = σ₁/σₙ.
For an important application from computational geometry, this example studies the power and versatility
of orthogonal transformations in the context of (generalized) least squares minimization problems.
From school recall the Hesse normal form of a hyperplane H (an affine subspace of dimension d − 1) in R^d:
H = { x ∈ R^d : c + n⊤x = 0 } ,  ‖n‖₂ = 1 ,   (3.4.36)
where n is the unit normal to H and |c| gives the distance of H from 0. The Hesse normal form is
convenient for computing the distance of points from H, because the
Euclidean distance of y ∈ R^d from the plane is  dist(H, y) = |c + n⊤y| .   (3.4.37)
Goal: given the points y1 , . . . , ym , m > d, find H ↔ {c ∈ R, n ∈ R d , k nk2 = 1}, such that
m m
∑ dist(H, y j ) 2
= ∑ |c + n⊤ y j |2 → min . (3.4.38)
j =1 j =1
Note that (3.4.38) is not a linear least squares problem due to the constraint k nk2 = 1. However, it turns
out to be a minimization problem with almost the structure of (3.4.31):
$$(3.4.38) \quad\Longleftrightarrow\quad \Bigg\|\underbrace{\begin{bmatrix}1 & y_{1,1} & \cdots & y_{1,d}\\ 1 & y_{2,1} & \cdots & y_{2,d}\\ \vdots & \vdots & & \vdots\\ 1 & y_{m,1} & \cdots & y_{m,d}\end{bmatrix}}_{=:A}\begin{bmatrix}c\\ n_1\\ \vdots\\ n_d\end{bmatrix}\Bigg\|_2 \;\to\;\min \quad\text{under constraint } \|n\|_2 = 1\,.$$
Step ➊: To convert the minimization problem into the form (3.4.31) we start with a QR-decomposition
(→ Section 3.3.3)
$$A := \begin{bmatrix}1 & y_{1,1} & \cdots & y_{1,d}\\ 1 & y_{2,1} & \cdots & y_{2,d}\\ \vdots & \vdots & & \vdots\\ 1 & y_{m,1} & \cdots & y_{m,d}\end{bmatrix} = QR\,,\qquad R := \begin{bmatrix} r_{11} & r_{12} & \cdots & \cdots & r_{1,d+1}\\ 0 & r_{22} & \cdots & \cdots & r_{2,d+1}\\ \vdots & & \ddots & & \vdots\\ 0 & \cdots & & 0 & r_{d+1,d+1}\\ 0 & \cdots & \cdots & \cdots & 0\\ \vdots & & & & \vdots\\ 0 & \cdots & \cdots & \cdots & 0\end{bmatrix}\in\mathbb{R}^{m,d+1}\,.$$
Since multiplication with the orthogonal matrix Q leaves the 2-norm unchanged,
$$\|Ax\|_2 \to \min \quad\Longleftrightarrow\quad \Big\|\, R\,[c,\,n_1,\dots,n_d]^\top \Big\|_2 \to \min\,. \qquad (3.4.39)$$
Note: Since $r_{11} = \|(A)_{:,1}\|_2 = \sqrt{m}\neq 0$, the first component of the minimizer is determined by the others: $c = -r_{11}^{-1}\sum_{j=1}^{d} r_{1,j+1}\, n_j$.
This algorithm is implemented in the following code, making heavy use of E IGEN’s block access operations
and the built in QR-decomposition and SVD factorization.
Code 3.4.41 solves the general problem: for A ∈ K^{m,n} find n ∈ R^d, c ∈ R^{n−d} such that
$$\Big\|\,A\begin{bmatrix}c\\ n\end{bmatrix}\Big\|_2 \to \min \quad\text{with constraint } \|n\|_2 = 1\,. \qquad (3.4.42)$$
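Since Code 3.4.41 itself is not reproduced above, here is a compact Eigen sketch of the underlying two-step algorithm (QR first, then SVD of the trailing block); the function and variable names are our own, and we assume m > n and an invertible leading block R11:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Minimize ||A*[c;n]||_2 under ||n||_2 = 1, with n the last d unknowns, cf. (3.4.42).
void clsqSketch(const MatrixXd& A, unsigned d, VectorXd& c, VectorXd& n) {
  const unsigned p = A.cols() - d;          // number of unconstrained unknowns
  // Step 1: QR-decomposition A = QR; only the R-factor is needed
  MatrixXd R = A.householderQr().matrixQR().triangularView<Eigen::Upper>();
  // Step 2: minimize ||R22*n||_2 over ||n||_2 = 1 -> last right singular vector
  Eigen::JacobiSVD<MatrixXd> svd(R.block(p, p, A.rows() - p, d), Eigen::ComputeFullV);
  n = svd.matrixV().col(d - 1);
  // Step 3: back substitution R11*c = -R12*n fixes the unconstrained unknowns
  c = -R.topLeftCorner(p, p).triangularView<Eigen::Upper>()
        .solve(R.block(0, p, p, d) * n);
}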
Matrix compression addresses the problem of approximating a given "generic" matrix (of a certain class) by means of a matrix whose "information content", that is, the number of reals needed to store it, is significantly lower than the information content of the original matrix.
Sparse matrices (→ Notion 2.7.1) are a prominent class of matrices with “low information content”. Un-
fortunately, they cannot approximate dense matrices very well. Another type of matrices that enjoy “low
information content”, also called data sparse, are low-rank matrices.
Lemma 3.4.44.
If A ∈ R m,n has rank p ≤ min{m, n} (→ Def. 2.2.3), then there exist U ∈ R m,p and V ∈ R n,p ,
such that A = UV⊤ .
None of the columns of U and V can vanish. Hence, in addition, we may assume that the columns of U are normalized: ‖(U):,j‖2 = 1, j = 1, . . . , p.
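The gain in "information content" is easy to see in code: a rank-p matrix is kept in factored form A = U V^T and never assembled. A small sketch (our own helper, not a lecture code):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Rank-p matrix stored as factors U (m x p), V (n x p): (m+n)*p numbers instead
// of m*n; a matrix-vector product costs only O((m+n)*p) operations.
VectorXd applyLowRank(const MatrixXd& U, const MatrixXd& V, const VectorXd& x) {
  return U * (V.transpose() * x);   // never form U*V^T explicitly
}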
Thus approximating a given matrix A ∈ R m,n with a rank- p matrix, p ≪ min{m, n}, can be regarded as
an instance of matrix compression. The approximation error with respect to some matrix norm k·k will be
minimal if we choose the best approximation
Here we explore low-rank best approximation of general matrices with respect to the Euclidean matrix
norm k·k2 induced by the 2-norm for vectors (→ Def. 1.5.76), and the Frobenius norm k·k F .
It should be obvious that ‖A‖F is invariant under orthogonal/unitary transformations of A. Thus the Frobenius norm of a matrix A with rank(A) = p can be expressed through its singular values σj:
$$\text{Frobenius norm and SVD:}\qquad \|A\|_F^2 = \sum_{j=1}^{p}\sigma_j^2\,. \qquad (3.4.47)$$
The next profound result links best approximation in Rk (m, n) and the singular value decomposition (→
Def. 3.4.3).
Let A = UΣV^H be the SVD of A ∈ K^{m,n} (→ Thm. 3.4.1). For 1 ≤ k ≤ rank(A) set U_k := [u_{:,1}, . . . , u_{:,k}] ∈ K^{m,k}, V_k := [v_{:,1}, . . . , v_{:,k}] ∈ K^{n,k}, Σ_k := diag(σ1, . . . , σk) ∈ K^{k,k}. Then, for ‖·‖ = ‖·‖F and ‖·‖ = ‖·‖2, holds true
$$\big\|A - U_k\Sigma_k V_k^H\big\| \;\le\; \|A - F\| \qquad \forall\, F\in\mathcal{R}_k(m,n)\,.$$
This theorem teaches that the rank-k matrix that is closest to A (rank-k best approximation) in both the Euclidean matrix norm and the Frobenius norm (→ Def. 3.4.46) can be obtained by truncating the rank-1 sum expansion (3.4.8) obtained from the SVD of A after k terms.
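In Eigen the truncation can be realized directly from the thin SVD; a minimal sketch (function name ours):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Rank-k best approximation A_k = U_k * Sigma_k * V_k^H, cf. Thm. 3.4.48.
MatrixXd lowRankBestApprox(const MatrixXd& A, unsigned k) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  return svd.matrixU().leftCols(k) *
         svd.singularValues().head(k).asDiagonal() *
         svd.matrixV().leftCols(k).transpose();
}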
$$\|A-B\|_2^2 \;\ge\; \|(A-B)x\|_2^2 = \|Ax\|_2^2 = \Big\|\sum_{j=1}^{k+1}\sigma_j\,(v_j^H x)\,u_j\Big\|_2^2 = \sum_{j=1}^{k+1}\sigma_j^2\,(v_j^H x)^2 \;\ge\; \sigma_{k+1}^2\,,$$
because $\sum_{j=1}^{k+1}(v_j^H x)^2 = \|x\|_2^2 = 1$.
➋ Find an ONB $\{z_1,\dots,z_{n-k}\}$ of $\mathcal{N}(B)$ and assemble it into a matrix $Z=[z_1\;\dots\;z_{n-k}]\in\mathbb{K}^{n,n-k}$:
$$\|A-B\|_F^2 \;\ge\; \|(A-B)Z\|_F^2 = \|AZ\|_F^2 = \sum_{i=1}^{n-k}\|Az_i\|_2^2 = \sum_{i=1}^{n-k}\sum_{j=1}^{r}\sigma_j^2\,(v_j^H z_i)^2\,. \qquad \Box$$
Since the matrix norms k·k2 and k·k F are invariant under multiplication with orthogonal (unitary) matrices,
we immediately obtain expressions for the norms of the best approximation error:
$$\big\|A - U_k\Sigma_k V_k^H\big\|_2 = \sigma_{k+1}\,, \qquad (3.4.49)$$
$$\big\|A - U_k\Sigma_k V_k^H\big\|_F^2 = \sum_{j=k+1}^{\min\{m,n\}}\sigma_j^2\,. \qquad (3.4.50)$$
This provides precise information about the best approximation error for rank-k matrices.
A rectangular greyscale image composed of m × n pixels (greyscale, BMP format) can be regarded as a
matrix A ∈ R m,n , aij ∈ {0, . . . , 255}, cf. Ex. 9.3.24. Thus low-rank approximation of the image matrix is
a way to compress the image.
Thm. 3.4.48 ➣ best rank-k approximation of the image: Ã = U_k Σ_k V_k^⊤.
Of course, the matrices U_k, V_k, and Σ_k are available from the economical (thin) SVD (3.4.4) of A. This is the idea behind the MATLAB code used to produce the figures below (this example is coded in MATLAB, because MathGL lacks image rendering capabilities).
(Figures: view of ETH Zurich main building (original image); compressed image, 40 singular values used; difference image |original − approximated|; singular values of ETH view (log-scale), with marker k = 40 (0.08 mem).)
Note that there are better and faster ways to compress images than SVD (JPEG, Wavelets, etc.)
(Fig. 106: stock prices over days in past, logarithmic scale; legend: DTE, EOAN, FME, FRE3, HEI, HEN3, IFX, LHA, LIN, MAN, MEO, MRK, MUV2, RWE, SAP, SDF, SIE, TKA, VOW3.)
Are there underlying governing trends? Are there a few vectors u1, . . . , up, p ≪ n, such that, approximately, all other data vectors ∈ Span{u1, . . . , up}?
(Fig. 107) Possible ("synthetic") measured data for two types of diodes; measurement errors and manufacturing tolerances taken into account by (Gaussian) random perturbations.
(Fig. 108: measured U-I characteristics for some diodes; Fig. 109: measured U-I characteristics for all diodes; axes: voltage U, current I.)
Ex. 3.4.54 and Ex. 3.4.55 present typical tasks that can be tackled by principal component analysis.
In Ex. 3.4.54: n =̂ number of stocks, m =̂ number of days for which stock prices are recorded.
✦ Extreme case: all stocks follow exactly one trend ↔ a_j ∈ Span{u} ∀ j = 1, . . . , n.
More generally, we look for a few vectors u1, . . . , up such that a_j ∈ Span{u1, . . . , up} ∀ j = 1, . . . , m. (3.4.56)
Why is the extreme case unlikely? Small random fluctuations will be present in each stock price.
Why orthonormal? Trends should be as "independent as possible" (minimally correlated).
Now singular value decomposition (SVD) according to Def. 3.4.3 comes into play, because Lemma 3.4.11
tells us that it can supply an orthonormal basis of the image space of a matrix, cf. Code 3.4.21.
This already captures the case (3.4.56) and we see that the columns of U supply the trend vectors we are
looking for!
➊ no perturbations:
If there is a pronounced gap in the distribution of the singular values, which separates p large from min{m, n} − p relatively small singular values, this hints that R(A) has essentially dimension p. It depends on the application what one accepts as a "pronounced gap".
The j-th row of V (up to the p-th component) gives the weights with which the p identified trends
contribute to data set j.
(MATLAB excerpt: sv = diag(S(1:3,1:3)); print -depsc2 '../PICTURES/svdpca.eps')
Computed singular values: 3.1378, 1.8092, 0.1792.
Small third singular value ➣ the data points essentially lie in a 2D subspace (Fig. 110: 3D point cloud).
Example 3.4.60 (Principal component analysis for data classification → Ex. 3.4.55 cnt’d)
Sought: Number of different types of diodes in batch and reconstructed U - I characteristic for each type.
(Fig. 111: measured U-I characteristics for some diodes; Fig. 112: measured U-I characteristics for all diodes; axes: voltage U, current I.)
13 uvals = (0:1/(m-1):1);
14 D1 = (1+nm*randn(n,m)).*(i1(repmat(uvals,n,1)))+na*randn(n,m);
15 D2 = (1+nm*randn(n,m)).*(i2(repmat(uvals,n,1)))+na*randn(n,m);
16 A = ([D1;D2])'; A = A(1:size(A,1), randperm(size(A,2)));
(Fig. 113: distribution of singular values σ_i of the matrix vs. no. of singular value — two dominant singular values!)
21 f i g u r e (’name’,’singular values’);
22 sv = d i a g (S(1:2*n,1:2*n));
23 p l o t (1:2*n,sv,’r*’); g r i d on;
24 x l a b e l (’{\bf index i of singular value}’,’fontsize’,14);
25 y l a b e l (’{\bf singular value \sigma_i}’,’fontsize’,14);
26 title('{\bf singular values for diode measurement matrix}','fontsize',14);
27 p r i n t -depsc2 ’../PICTURES/diodepcasv.eps’;
28
29 f i g u r e (’name’,’trend vectors’);
30 p l o t (1:m,U(:,1:2),’+’);
31 x l a b e l (’{\bf voltage U}’,’fontsize’,14);
32 y l a b e l (’{\bf current I}’,’fontsize’,14);
33 title('{\bf principal components (trend vectors) for diode measurements}','fontsize',14);
34 legend('dominant principal component','second principal component','location','best');
35 p r i n t -depsc2 ’../PICTURES/diodepcau.eps’;
36
37 f i g u r e (’name’,’strength’);
38 p l o t (V(:,1),V(:,2),’mo’); g r i d on;
39 x l a b e l (’{\bf strength of singular component #1}’,’fontsize’,14);
40 y l a b e l (’{\bf strength of singular component #2}’,’fontsize’,14);
41 title('{\bf strengths of contributions of singular components}','fontsize',14);
42 p r i n t -depsc2 ’../PICTURES/diodepcav.eps’;
(Fig. 114: strengths of contributions of singular components, strength of singular component #2 vs. strength of singular component #1; Fig. 115: principal components (trend vectors) for diode measurements, dominant and second principal component, current I vs. voltage U.)
Observations:
✦ The first two columns of the V-matrix specify the strength of the contribution of the two leading principal components to each measurement.
➣ The points ((V)_{i,1}, (V)_{i,2}), i.e. the rows of (V)_{:,1:2}, which correspond to different diodes, are neatly clustered in R². To determine the type of diode i, we have to identify the cluster to which the point ((V)_{i,1}, (V)_{i,2}) belongs (→ cluster analysis, course "machine learning", next Rem. 3.4.65).
✦ The principal components themselves do not carry much useful information in this example.
Given m > 2 points x j ∈ R k , j = 1, . . . , m, in k-dimensional space, we ask what is the “longest” and
“shortest” diameter d+ and d− . This question can be stated rigorously in several different ways: here we
ask for directions for which the point cloud will have maximal/minimal variance, when projected onto that
direction:
$$d_+ := \operatorname*{argmax}_{\|v\|_2=1} Q(v)\,,\qquad d_- := \operatorname*{argmin}_{\|v\|_2=1} Q(v)\,,\qquad Q(v) := \sum_{j=1}^{m}\big|(x_j-c)^\top v\big|^2\,,\quad c = \frac{1}{m}\sum_{j=1}^{m}x_j\,. \qquad (3.4.64)$$
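A possible Eigen realization of (3.4.64) (a sketch with our own naming; the m points are assumed to be stored as the rows of X):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd; using Eigen::RowVectorXd;

// Principal axes of a point cloud: Q(v) = ||Y*v||_2^2 for the centered data
// matrix Y, so the extremal directions are right singular vectors of Y.
void principalAxes(const MatrixXd& X, VectorXd& dplus, VectorXd& dminus) {
  RowVectorXd c = X.colwise().mean();        // center of gravity
  MatrixXd Y = X.rowwise() - c;              // centered point coordinates
  Eigen::JacobiSVD<MatrixXd> svd(Y, Eigen::ComputeFullV);
  dplus  = svd.matrixV().col(0);             // direction of maximal variance
  dminus = svd.matrixV().col(X.cols() - 1);  // direction of minimal variance
}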
The directions d+, d− are called the principal axes of the point cloud, a term borrowed from mechanics and connected with the axes of inertia of an assembly of point masses. (Fig. 116: points with major and minor axis.)
The subsets {xi : i ∈ Il } are called the clusters. The points ml are their centers of gravity.
➋ Splitting of cluster by separation along its principal axis, see Rem. 3.4.63 and Code 3.4.70:
$$a_l := \operatorname*{argmax}_{\|v\|_2=1}\Big\{\sum_{i\in I_l}\big|(x_i - m_l)^\top v\big|^2\Big\} \qquad (3.4.69)$$
}
double sumd = mx.sum();              // sum of all squared distances
// Compute sum of squared distances within each cluster
VectorXd cds(C.cols()); cds.setZero();
for (int j = 0; j < idx.size(); ++j) // loop over all points
  cds(idx(j)) += mx(j);
return std::make_tuple(sumd, idx, cds);
}
void lloydmax(const MatrixXd &X, MatrixBase<Derived> &&C, VectorXi &idx,
              VectorXd &cds, const double tol = 0.0001) {
  lloydmax(X, C, idx, cds);
}
Example 3.4.73 (Principal component analysis for data analysis → Ex. 3.4.54 cnt’d)
A ∈ R m,n , m ≫ n:
Columns of A → series of measurements at different times/locations etc.
Rows of A → measured values corresponding to one time/location etc.
Goal: detect linear correlations
MATLAB excerpt generating the synthetic data (x is assumed to be defined analogously to y):
4 y = cos(pi*(1:m)'/m);
5 A = [];
6 for i = 1:n
7   A = [A, x.*rand(m,1),...
8        y+0.1*rand(m,1)];
9 end
(Figures: measurements 1–4 plotted over the sample index; distribution of the singular values of A over the no. of singular value.)
The measurements display linear correlation with two principal components.
(Fig. 119: cluster plot — weight of second singular component vs. weight of first singular component.)
(Fig. 120: stock prices over days in past, logarithmic scale; legend: DTE, EOAN, FME, FRE3, HEI, HEN3, IFX, LHA, LIN, MAN, MEO, MRK, MUV2, RWE, SAP, SDF, SIE, TKA, VOW3. Fig. 121: singular value vs. no. of singular value, logarithmic scale.)
We observe a pronounced decay of the singular values (≈ exponential decay, logarithmic scale in Fig. 121)
➣ a few trends (corresponding to a few of the largest singular values) govern the time series.
(Fig. 122: five most important stock price trends (normalized), columns U(:,1)–U(:,5), vs. days in past; Fig. 123: five most important stock price trends, weighted columns U*S(:,1)–U*S(:,5), vs. days in past.)
Columns of U (→ Fig. 122) in SVD A = UΣV⊤ provide trend vectors, cf. Ex. 3.4.54 & Ex. 3.4.73.
When weighted with the corresponding singular value, the importance of a trend contribution emerges,
see Fig. 123
(Fig. 124: trends in BMW stock, 1.1.2008 – 29.10.2010, relative strength vs. no. of singular vector; Fig. 125: trends in Daimler stock, 1.1.2008 – 29.10.2010, relative strength vs. no. of singular vector.)
Stocks of companies from the same sector of the economy should display similar contributions of major
trend vectors, because their prices can be expected to be more closely correlated than stock prices in
general.
Data obtained from Yahoo Finance:
#!/bin/csh
foreach i (ADS ALV BAYN BEI BMW CBK DAI DBK DB1 LHA DPW DTE EOAN FRE3 \
           FME HEI HEN3 IFX SDF LIN MAN MRK MEO MUV2 RWE SAP SIE TKA VOW3)
  wget -O "$i".csv "http://ichart.finance.yahoo.com/table.csv?s=$i.DE&a=00&b=1&c=2008&d=09&e=30&f=2010&g=d&ignore=.csv"
  sed -i -e 's/-/,/g' "$i".csv
end
46 f o r j= s i z e (A,1):-1:2
47 zidx = f i n d (A(j-1,:) == 0);
48 A(j-1,zidx) = A(j,zidx);
49 end
50 f o r j=2: s i z e (A,1)
51 zidx = f i n d (A(j,:) == 0);
52 A(j,zidx) = A(j-1,zidx);
53 end
54
55 f i g u r e (’name’,’DAX’);
71 f i g u r e (’name’,’trend vectors’);
72 p l o t (U(:,1:5));
73 x l a b e l (’days in past’,’fontsize’,14);
74 t i t l e (’Five most important stock price trends (normalized)’);
75 legend('U(:,1)','U(:,2)','U(:,3)','U(:,4)','U(:,5)','location','south');
76 p r i n t -depsc2 ’../../PICTURES/stocktrendsn.eps’;
77
78 f i g u r e (’name’,’trend vectors’);
79 p l o t (U(:,1:5)*S(1:5,1:5));
80 x l a b e l (’days in past’,’fontsize’,14);
81 t i t l e (’Five most important stock price trends’);
82 legend('U*S(:,1)','U*S(:,2)','U*S(:,3)','U*S(:,4)','U*S(:,5)','location','south');
83 p r i n t -depsc2 ’../../PICTURES/stocktrends.eps’;
84
In the examples of Section 3.0.1 we generally considered overdetermined linear systems of equations
Ax = b, for which only the right hand side vector b was affected by measurement errors. However, also
the entries of the coefficient matrix A may have been obtained by measurement. This is the case, for
instance, in the nodal analysis of electric circuits → Ex. 2.1.3. Then, it may be legitimate to seek a “better”
matrix based on information contained in the whole linear system. This is the gist of the total least squares
approach.
☞ least squares problem “turned upside down”: now we are allowed to tamper with system matrix and
right hand side vector!
$$\widehat{b}\in\mathcal{R}(\widehat{A}) \;\;(3.5.1)\quad\Rightarrow\quad \operatorname{rank}\big([\widehat{A}\;\widehat{b}]\big) = n \quad\Rightarrow\quad [\widehat{A}\;\widehat{b}] = \operatorname*{argmin}_{\operatorname{rank}(X)=n}\big\|[A\;b]-X\big\|_F\,.$$
☞ [Â b̂] is the rank-n best approximation of [A b]!
We face the problem to compute the best rank-n approximation of the given matrix [A b], a problem already treated in Section 3.4.4.2: Thm. 3.4.48 tells us how to use the SVD of [A b]:
$$[A\;b] = U\Sigma V^\top = \sum_{j=1}^{n+1}\sigma_j\,(U)_{:,j}(V)_{:,j}^\top \;\overset{\text{Thm. 3.4.48}}{\Longrightarrow}\; [\widehat{A}\;\widehat{b}] = \sum_{j=1}^{n}\sigma_j\,(U)_{:,j}(V)_{:,j}^\top\,. \qquad (3.5.3)$$
Since V is orthogonal,
$$[\widehat{A}\;\widehat{b}]\,(V)_{:,n+1} = \widehat{A}\,(V)_{1:n,n+1} + \widehat{b}\,(V)_{n+1,n+1} = 0\,. \qquad (3.5.4)$$
(3.5.4) also provides the solution x of Âx = b̂:
$$x := -\widehat{A}^{-1}\widehat{b} = -(V)_{1:n,n+1}\big/(V)_{n+1,n+1}\,. \qquad (3.5.5)$$
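A minimal Eigen sketch of formula (3.5.5) (the function name is ours; it assumes (V)_{n+1,n+1} ≠ 0):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Total least squares: solve Ahat*x = bhat with [Ahat bhat] the rank-n best
// approximation of [A b], obtained from the SVD of the extended matrix, cf. (3.5.5).
VectorXd lsqtotal(const MatrixXd& A, const VectorXd& b) {
  const unsigned n = A.cols();
  MatrixXd Ab(A.rows(), n + 1);
  Ab << A, b;                          // extended matrix [A b]
  Eigen::JacobiSVD<MatrixXd> svd(Ab, Eigen::ComputeThinV);
  VectorXd v = svd.matrixV().col(n);   // right singular vector for sigma_{n+1}
  return -v.head(n) / v(n);
}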
In the examples of Section 3.0.1 we expected all components of the right hand side vectors to be possibly
affected by measurement errors. However, it might happen that some data are very reliable and in this
case we would like the corresponding equation to be satisfied exactly.
Linear constraint
Here the constraint matrix C collects all the coefficients of those p equations that are to be satisfied exactly,
and the vector d the corresponding components of the right hand side vector. Conversely, the m equations
of the (overdetermined) LSE Ax = b cannot be satisfied and are treated in a least squares sense.
Recall important technique from multidimensional calculus for tackling constrained minimization problems:
Lagrange multipliers, see [?, Sect. 7.9].
L as defined in (3.6.4) is called a Lagrange function. The simple heuristics behind Lagrange multipliers is
the observation:
The solution of the constrained minimization problem (3.6.3) corresponds to a saddle point of the Lagrange function (→ Fig. 126: saddle surface over the state x / multiplier m plane), that is, both the derivative with respect to x and with respect to m vanish there.
In a saddle point the Lagrange function is “flat”, that is, all its partial derivatives have to vanish there.
Necessary (and sufficient) conditions for the solution x of (3.6.3)
(For a similar technique employing multi-dimensional calculus see Rem. 3.1.14)
$$\frac{\partial L}{\partial x}(x,m) = A^\top(Ax-b) + C^\top m \overset{!}{=} 0\,, \qquad (3.6.6a)$$
$$\frac{\partial L}{\partial m}(x,m) = Cx - d \overset{!}{=} 0\,. \qquad (3.6.6b)$$
$$\begin{bmatrix} A^\top A & C^\top\\ C & 0\end{bmatrix}\begin{bmatrix} x\\ m\end{bmatrix} = \begin{bmatrix} A^\top b\\ d\end{bmatrix} \qquad (3.6.7)\qquad \text{augmented normal equations (matrix saddle point problem)}$$
As we know, a direct elimination solution algorithm for (3.6.7) amounts to finding an LU-decomposition of
the coefficient matrix. Here we opt for its symmetric variant, the Cholesky decomposition, see Section 2.8.
The same caveats as those discussed for the regular normal equations in Rem. 3.2.3, Ex. 3.2.4, and
Rem. 3.2.6, apply to the direct use of the augmented normal equations (3.6.7):
1. their condition number can be much bigger than that of the matrix A,
2. forming A⊤ A may be vulnerable to roundoff,
3. the matrix A⊤ A may not be sparse, though A is.
As in Rem. 3.2.7 also in the case of the augmented normal equations (3.6.7) switching to an extended
version by introducing the residual r = Ax − b as a new unknown is a remedy, cf. (3.2.8). This leads to
the following linear system of equations.
$$\begin{bmatrix} -I & A & 0\\ A^\top & 0 & C^\top\\ 0 & C & 0\end{bmatrix}\begin{bmatrix} r\\ x\\ m\end{bmatrix} = \begin{bmatrix} b\\ 0\\ d\end{bmatrix} \qquad (3.6.9)\qquad \hat{=}\ \text{extended augmented normal equations.}$$
Idea: Identify the subspace in which the solution can vary without violating the constraint.
Since C has full rank, this subspace agrees with the nullspace/kernel of C.
From Lemma 3.4.11 and Ex. 3.4.19 we have learned that the SVD can be used to compute (an orthonormal
basis of) the nullspace N (C). The suggests the following method for solving the constrained linear least
squares problem (3.6.1).
➀ Compute an orthonormal basis of N (C) using SVD (→ Lemma 3.4.11, (3.4.22)):
$$C = U\,[\Sigma\;\;0]\begin{bmatrix}V_1^\top\\ V_2^\top\end{bmatrix},\quad U\in\mathbb{R}^{p,p}\,,\ \Sigma\in\mathbb{R}^{p,p}\,,\ V_1\in\mathbb{R}^{n,p}\,,\ V_2\in\mathbb{R}^{n,n-p} \quad\Rightarrow\quad \mathcal{N}(C) = \mathcal{R}(V_2)\,,$$
and the particular solution of the constraint equation
$$x_0 := V_1\,\Sigma^{-1}\,U^\top d\,.$$
➁ Insert the representation x = x0 + V2 y, y ∈ R^{n−p}, into (3.6.1). This yields a standard linear least squares problem with coefficient matrix A V2 ∈ R^{m,n−p} and right hand side vector b − A x0 ∈ R^m.
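The two steps ➀ and ➁ can be condensed into a short Eigen sketch (our own naming; C is assumed to have full rank p and A*V2 full column rank):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Constrained least squares: minimize ||A*x-b||_2 subject to C*x = d,
// via an ONB of N(C) obtained from the SVD of C.
VectorXd clsqNullspace(const MatrixXd& A, const VectorXd& b,
                       const MatrixXd& C, const VectorXd& d) {
  const unsigned p = C.rows(), n = C.cols();
  Eigen::JacobiSVD<MatrixXd> svd(C, Eigen::ComputeThinU | Eigen::ComputeFullV);
  MatrixXd V1 = svd.matrixV().leftCols(p);       // spans R(C^T)
  MatrixXd V2 = svd.matrixV().rightCols(n - p);  // ONB of N(C)
  VectorXd x0 = V1 * (svd.singularValues().cwiseInverse().asDiagonal() *
                      (svd.matrixU().transpose() * d));  // particular solution
  VectorXd y = (A * V2).colPivHouseholderQr().solve(b - A * x0);
  return x0 + V2 * y;                            // x = x0 + V2*y
}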
Learning Outcomes
After having studied the contents of this chapter you should be able to
• give a rigorous definition of the least squares solution of an (overdetermined) linear system of equa-
tions,
• state the (extended) normal equations for any overdetermined linear system of equations,
• tell uniqueness and existence of solutions of the normal equations,
• define (economical) QR-decomposition and SVD of a matrix,
• explain the use of QR-decomposition and, in particular, Givens rotations, for solving (overdeter-
mined) linear systems of equations (in least squares sense),
• use SVD to solve (constrained) optimization and low-rank best approximation problems
• formulate the augmented (extended) normal equations for a linearly constrained least squares prob-
lem.
Chapter 4
Filtering Algorithms
This chapter continues the theme of numerical linear algebra, also covered in Chapter 1, 2, 10. We will
come across very special linear transformations (↔ matrices) and related algorithms. Surprisingly, these
form the basis of a host of very important numerical methods for signal processing.
X = X(t) =̂ time-continuous signal, 0 ≤ t ≤ T.
"Sampling": x_j = X(j∆t), j = 0, . . . , n − 1, n ∈ N, n∆t ≤ T, with ∆t > 0 =̂ time between samples.
(Fig. 127: sampled values x0, x1, x2, . . . , x_{n−2}, x_{n−1} of X(t) at times t0, t1, t2, . . . , t_{n−2}, t_{n−1}.)
As already indicated by the indexing the sampled values can be arranged in a vector x = [x0, . . . , x_{n−1}]^⊤ ∈ R^n.
Note that in this chapter, as is customary in signal processing, we adopt a C++-style indexing from 0: the
components of a vector with length n carry indices ∈ {0, . . . , n − 1}.
As an idealization one sometimes considers a signal of infinite duration X = X (t), −∞ < t < ∞. In
this case sampling yields a bi-infinite time-discrete signal, represented by a sequence ( xk )k∈Z . If this
sequence has a finite number of non-zero terms only, then we write (0, . . . , xℓ , xℓ+1, . . . , xn−1, xn , 0, . . .).
Contents
4.1 Discrete Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
4.2 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
4.2.1 Discrete Convolution via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
4.2.2 Frequency filtering via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
4.2.3 Real DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Now we study a finite linear time-invariant causal channel (filter), which is a widely used model for digital communication channels, e.g. in wireless communication theory. Mathematically speaking, a (discrete) channel/filter is a mapping F : ℓ^∞(Z) → ℓ^∞(Z) from the vector space ℓ^∞(Z) of bounded input sequences {x_j}_{j∈Z} to bounded output sequences {y_j}_{j∈Z}.
(Fig. 128: input signal (x_k) over time → channel/filter → output signal (y_k) over time.)
$$\text{Channel/filter:}\qquad F:\ \ell^\infty(\mathbb{Z})\to\ell^\infty(\mathbb{Z})\,,\qquad \big(y_j\big)_{j\in\mathbb{Z}} = F\big((x_j)_{j\in\mathbb{Z}}\big)\,. \qquad (4.1.2)$$
For the description of filters we rely on special input signals, analogous to the description of a linear
mapping R n 7→ R m through a matrix, that is, its action on unit vectors.
Visualization of the (finite!) impulse response (. . . , 0, h0, . . . , h_{n−1}, 0, . . .) of a (causal, see Def. 4.1.11 below) channel/filter: Fig. 130 shows the values h0, h1, h2, . . . , h_{n−2}, h_{n−1}.
➣ The impulse response of a finite filter can be described by a vector h of finite length n.
It should not matter when exactly a signal is fed into the channel. To express this intuition more rigorously we introduce the time shift operator for signals: for m ∈ Z
$$S_m:\ \ell^\infty(\mathbb{Z})\to\ell^\infty(\mathbb{Z})\,,\qquad S_m\big((x_j)_{j\in\mathbb{Z}}\big) = \big(x_{j-m}\big)_{j\in\mathbb{Z}}\,. \qquad (4.1.6)$$
A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called time-invariant, if shifting the input in time leads to the same
output shifted in time by the same amount; it commutes with the time shift operator from (4.1.6):
$$\forall\,(x_j)_{j\in\mathbb{Z}}\in\ell^\infty(\mathbb{Z})\,,\ \forall\, m\in\mathbb{Z}:\quad F\big(S_m((x_j)_{j\in\mathbb{Z}})\big) = S_m\big(F((x_j)_{j\in\mathbb{Z}})\big)\,. \qquad (4.1.8)$$
Of course, a signal should not trigger an output before it arrives at the filter; output may depend only on
past and present inputs, not on the future.
A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called causal (or physical, or nonanticipative), if the output does not
start before the input
$$\forall\, M\in\mathbb{N}:\quad (x_j)_{j\in\mathbb{Z}}\in\ell^\infty(\mathbb{Z})\,,\ x_j = 0\ \forall\, j\le M \quad\Rightarrow\quad F\big((x_j)_{j\in\mathbb{Z}}\big)_k = 0\ \forall\, k\le M\,. \qquad (4.1.12)$$
Acronym: LT-FIR =̂ finite (→ Def. 4.1.4), linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), and causal (→ Def. 4.1.11) filter F : ℓ^∞(Z) → ℓ^∞(Z).
The impulse response of a finite and causal filter is a sequence of the form (. . . , 0, h0, h1, . . . , h_{n−1}, 0, . . .), n ∈ N. Such an impulse response is depicted in Fig. 130.
Let (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N, be the impulse response (→ 4.1.3) of a finite (→ Def. 4.1.4),
linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), and causal (→ Def. 4.1.11) filter (LT-FIR) F : ℓ∞ (Z ) →
ℓ ∞ (Z ):
$$F\big((\delta_{j,0})_{j\in\mathbb{Z}}\big) = (\dots,0,h_0,h_1,\dots,h_{n-1},0,\dots)\,.$$
A finite input signal can be decomposed into shifted unit impulses,
$$(x_j)_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\,(\delta_{j,k})_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\, S_k\big((\delta_{j,0})_{j\in\mathbb{Z}}\big)\,,$$
where S_k is the time-shift operator from (4.1.6). Applying the filter on both sides of this equation and using linearity leads to the general formula for the output signal (y_j)_{j∈Z}:
$$\begin{bmatrix} y_0\\ y_1\\ \vdots\\ \vdots\\ y_{m+n-2}\end{bmatrix} = x_0\begin{bmatrix} h_0\\ h_1\\ \vdots\\ h_{n-1}\\ 0\\ \vdots\\ 0\end{bmatrix} + x_1\begin{bmatrix} 0\\ h_0\\ h_1\\ \vdots\\ h_{n-1}\\ \vdots\\ 0\end{bmatrix} + x_2\begin{bmatrix} 0\\ 0\\ h_0\\ \vdots\\ \vdots\\ h_{n-1}\\ \vdots\end{bmatrix} + \cdots + x_{m-1}\begin{bmatrix} 0\\ \vdots\\ 0\\ h_0\\ h_1\\ \vdots\\ h_{n-1}\end{bmatrix}\,.$$
Thus, in compact notation we can write the non-zero components of the output signal (y_j)_{j∈Z} as
$$y_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,,\quad k = 0,\dots,m+n-2 \qquad (h_j := 0\ \text{for}\ j<0\ \text{and}\ j\ge n)\,. \qquad (4.1.14)$$
(The channel is causal and finite!)
The output (. . . , 0, y0 , y1 , y2 , . . .) of a finite, time-invariant, linear, and causal channel for finite length
input x = (. . . , 0, x0, . . . , xn−1, 0, . . .) ∈ ℓ∞ (Z ) is
a superposition of x j -weighted j∆t time-shifted impulse responses.
The following diagrams give a visual display of considerations of § 4.1.13, namely of the superposition
of impulse responses for a particular finite, time-invariant, linear, and causal filter (LT-FIR), and an input
signal of duration 3∆t, ∆t =
ˆ time between samples.
(Fig. 131: input signal x, values x_i vs. index i of sampling instance t_i; Fig. 132: impulse response h, values h_i vs. index i of sampling instance t_i; Figs. 133–136: the x_j-weighted, time-shifted copies of the impulse response; Figs. 137–138: their superposition, i.e. the output signal.)
We consider a finite (→ Def. 4.1.4), linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), and causal (→
Def. 4.1.11) filter (LT-FIR) with impulse response (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N. From (4.1.14)
we learn that the output signal has at most m + n − 1 non-zero components.
Therefore, if we know that all input signals are of the form (. . . , x0, x1, . . . , x_{m−1}, 0, . . .), we can model them as vectors x = [x0, . . . , x_{m−1}]^⊤ ∈ R^m, cf. § 4.0.1, and the filter can be viewed as a linear mapping F : R^m → R^{m+n−1}, which takes us to the realm of linear algebra.
Thus, for the filter we have a matrix representation of (4.1.14). Writing y = [y0, . . . , y_{2n−2}]^⊤ ∈ R^{2n−1} for the vector of the output signal, we find in the case m = n
$$\begin{bmatrix} y_0\\ \vdots\\ \vdots\\ \vdots\\ y_{2n-2}\end{bmatrix} = \begin{bmatrix} h_0 & 0 & \cdots & 0\\ h_1 & h_0 & \ddots & \vdots\\ \vdots & h_1 & \ddots & 0\\ h_{n-1} & \vdots & \ddots & h_0\\ 0 & h_{n-1} & & h_1\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & h_{n-1}\end{bmatrix}\begin{bmatrix} x_0\\ \vdots\\ x_{n-1}\end{bmatrix}\,. \qquad (4.1.18)$$
“Surprisingly” the bilinear operation (4.1.14) that takes two input vectors and produces an output vector
with double the number of entries (−1) also governs the multiplication of polynomials:
$$p(z) = \sum_{k=0}^{n-1} a_k z^k\,,\quad q(z) = \sum_{k=0}^{n-1} b_k z^k \quad\Longrightarrow\quad (pq)(z) = \sum_{k=0}^{2n-2}\underbrace{\Big(\sum_{j=0}^{k} a_j\, b_{k-j}\Big)}_{=:c_k} z^k\,. \qquad (4.1.20)$$
➣ the coefficients of the product polynomial can be obtained through an operation similar to finite, time-
invariant, linear, and causal (LT-FIR) filtering!
Both in (4.1.14) and (4.1.20) we recognize the same pattern of a particular bi-linear combination of
• discrete signals in § 4.1.13,
• polynomial coefficient sequences in Ex. 4.1.19.
Definition 4.1.22. Discrete convolution
Given x = [x0, . . . , x_{n−1}]^⊤ ∈ K^n, h = [h0, . . . , h_{n−1}]^⊤ ∈ K^n, their discrete convolution is the vector y ∈ K^{2n−1} with components
$$y_k = \sum_{j=0}^{n-1} h_{k-j}\,x_j\,,\quad k = 0,\dots,2n-2 \qquad (h_j := 0\ \text{for}\ j<0)\,. \qquad (4.1.23)$$
Setting x_j := 0 for j < 0 and j ≥ n, we can rewrite
$$y_k = \sum_{j=0}^{n-1} h_{k-j}\,x_j = \sum_{l=0}^{n-1} h_l\,x_{k-l}\,,\quad k = 0,\dots,2n-2\,,\qquad\text{that is,}\quad h*x = x*h$$
(the LT-FIR filter with impulse response h0, . . . , h_{n−1} applied to the signal x yields the same output as the LT-FIR filter with impulse response x0, . . . , x_{n−1} applied to h).
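A direct Eigen implementation of (4.1.23) (a sketch, with our own function name; cost O(n²)):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Discrete convolution y = h * x of two n-vectors, y in R^{2n-1}, cf. (4.1.23).
VectorXd dconv(const VectorXd& h, const VectorXd& x) {
  const int n = h.size();
  VectorXd y = VectorXd::Zero(2 * n - 1);
  for (int k = 0; k < 2 * n - 1; ++k)
    for (int j = 0; j < n; ++j)
      if (k - j >= 0 && k - j < n) y(k) += h(k - j) * x(j);  // h_j := 0 outside
  return y;
}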
The formula (4.1.23) for the discrete convolution also occurs in a context completely detached from signal
processing.
Consider two polynomials in t of degree n − 1, n ∈ N,
$$p(t) = \sum_{j=0}^{n-1} a_j t^j\,,\qquad q(t) = \sum_{j=0}^{n-1} b_j t^j\,,\qquad a_j, b_j\in\mathbb{K}\,.$$
Let us introduce dummy coefficients for p(t) and q(t), a_j, b_j, j = n, . . . , 2n − 2, all set to 0. This can be easily done in a computer code by resizing the coefficient vectors of p and q and filling the new entries with zeros.
Moreover, this provides another proof for the commutativity of discrete convolution.
The notion of a discrete convolution of Def. 4.1.22 naturally extends to sequences ∈ ℓ∞ (N0 ), that is,
bounded mappings N0 7→ K: the (discrete) convolution of two sequences ( x j ) j∈N0 , (y j ) j∈N0 is the
sequence (z j ) j∈N0 defined by
$$z_k := \sum_{j=0}^{k} x_{k-j}\,y_j = \sum_{j=0}^{k} x_j\,y_{k-j}\,,\qquad k\in\mathbb{N}_0\,.$$
In this context recall the product formula for power series, the Cauchy product, which can be viewed as a multiplication rule for "infinite polynomials" = power series.
An n-periodic signal (n ∈ N) is a sequence x j j∈Z satisfying x j+n = x j ∀ j ∈ Z .
➣ An n-periodic signal (x_j)_{j∈Z} is uniquely determined by the values x0, . . . , x_{n−1} and can be associated with a vector x = [x0, . . . , x_{n−1}]^⊤ ∈ R^n.
Whenever the input signal of a finite, time-invariant filter is n-periodic, so will be the output signal. Thus,
in the n-periodic setting, a causal, linear, and time-invariant filter (LT-FIR) will give rise to a linear mapping
R n 7→ R n according to
n −1
yk = ∑ pk− j x j for some p0 , . . . , pn−1 ∈ R , pk := pk−n for all k ∈ Z . (4.1.30)
j =0
The following special variant of a discrete convolution operation is motivated by the preceding Rem. 4.1.29.
The discrete periodic convolution of two n-periodic sequences (p_k)_{k∈Z}, (x_k)_{k∈Z} yields the n-periodic sequence
$$(y_k) := (p_k) *_n (x_k)\,,\qquad y_k := \sum_{j=0}^{n-1} p_{k-j}\,x_j = \sum_{j=0}^{n-1} x_{k-j}\,p_j\,,\quad k\in\mathbb{Z}\,.$$
Since n-periodic sequences can be identified with vectors in K n (see above), we can also introduce the
discrete periodic convolution of vectors:
Beyond signal processing discrete periodic convolutions occur in many mathematical models:
An engineering problem (sketch: pipe cross-section, heated on one part of the boundary, cooled on the rest):
✦ cylindrical pipe,
✦ heated on part Γ_H of its perimeter (→ prescribed heat flux),
✦ cooled on remaining perimeter Γ_K (→ constant heat flux).
Task: compute local heat fluxes.
Modeling (discretization):
• approximation by a regular n-polygon, edges Γ_j,
• isotropic radiation of each edge Γ_j (power I_j),
• radiative heat flow Γ_j → Γ_i: P_{ji} := (α_{ij}/π) I_j, with opening angle α_{ij} = π γ_{|i−j|}, 1 ≤ i, j ≤ n,
• power balance:
$$\underbrace{\sum_{i=1,\,i\neq j}^{n} P_{ji}}_{=I_j} - \sum_{i=1,\,i\neq j}^{n} P_{ij} = Q_j\,. \qquad (4.1.35)$$
Q_j =̂ heat flux through Γ_j, satisfies
$$Q_j := \int_{2\pi(j-1)/n}^{2\pi j/n} q(\varphi)\,\mathrm{d}\varphi\,,\qquad q(\varphi) := \begin{cases}\text{local heating}\,, & \text{if }\varphi\in\Gamma_H\,,\\[2pt] -\dfrac{1}{|\Gamma_K|}\displaystyle\int_{\Gamma_H} q(\varphi)\,\mathrm{d}\varphi\ \text{(const.)}\,, & \text{if }\varphi\in\Gamma_K\,.\end{cases}$$
$$(4.1.35)\quad\Rightarrow\quad \text{LSE:}\qquad I_j - \sum_{i=1,\,i\neq j}^{n}\frac{\alpha_{ij}}{\pi}\, I_i = Q_j\,,\quad j = 1,\dots,n\,.$$
e.g. n = 8:
$$\begin{bmatrix} 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1\\ -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2\\ -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3\\ -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4\\ -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3\\ -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2\\ -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1\\ -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1\end{bmatrix}\begin{bmatrix} I_1\\ I_2\\ I_3\\ I_4\\ I_5\\ I_6\\ I_7\\ I_8\end{bmatrix} = \begin{bmatrix} Q_1\\ Q_2\\ Q_3\\ Q_4\\ Q_5\\ Q_6\\ Q_7\\ Q_8\end{bmatrix}\,. \qquad (4.1.36)$$
This is a linear system of equations with symmetric, singular, and (by Lemma 9.1.5, ∑ γi ≤ 1) positive
semidefinite (→ Def. 1.1.8) system matrix.
Note that the matrices from (4.1.31) and (4.1.36) have the same structure!
Also observe that the LSE from (4.1.36) can be written by means of the discrete periodic convolution (→
Def. 4.1.33) of vectors y = (1, −γ1 , −γ2 , −γ3 , −γ4 , −γ3 , −γ2 , −γ1 ), x = ( I1 , . . . , I8 )
(4.1.36) ↔ y ∗8 x = [ Q 1 , . . . , Q 8 ] ⊤ .
In Ex. 4.1.34 we have already seen a coefficient matrix of a special form, which is common enough to
warrant giving it a particular name:
✎ Notation: We write circul(p) ∈ K n,n for the circulant matrix generated by the periodic sequence/vector
p = [ p0 , . . . , p n −1 ] ⊤ ∈ K n
☞ Circulant matrix has constant (main, sub- and super-) diagonals (for which indices j − i = const.).
Write Z((uk )) ∈ K n,n for the circulant matrix generated by the n-periodic sequence (uk )k∈Z . Denote by
y := (y0 , . . . , yn−1 )⊤ , x = ( x0 , . . . , xn−1 )⊤ the vectors associated to n-periodic sequences.
Then the commutativity of the discrete periodic convolution (→ Def. 4.1.33) involves
circul(x)y = circul(y)x . (4.1.39)
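For illustration, a small Eigen helper (our own, not a lecture code) that assembles circul(p) entrywise:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Circulant matrix generated by p in R^n: (C)_{i,j} = p_{(i-j) mod n},
// so that C*x equals the discrete periodic convolution p *_n x.
MatrixXd circul(const VectorXd& p) {
  const int n = p.size();
  MatrixXd C(n, n);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      C(i, j) = p(((i - j) % n + n) % n);   // periodic index wrapping
  return C;
}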
Recall discrete convolution (→ Def. 4.1.22) of two vectors a = (a0 , . . . , an−1 )⊤ ∈ K n , b = (b0 , . . . , bn−1 )⊤ ∈
Kn .
n −1
zk := (a ∗ b)k = ∑ a j bk − j , k = 0, . . . , 2n − 2 .
j =0
(Fig. 139: periodic extension of the zero-padded vector a over the index range −n, 0, n, 2n, 3n, 4n.)
In the spirit of (4.1.18) we can switch to a matrix view of the reduction to periodic convolution:
$$\begin{bmatrix} z_0\\ \vdots\\ z_{2n-2}\end{bmatrix} = \underbrace{\operatorname{circul}\big([y_0,\,y_1,\dots,y_{n-1},\,0,\dots,0]^\top\big)}_{\text{a }(2n-1)\times(2n-1)\text{ circulant matrix!}}\;\begin{bmatrix} x_0\\ \vdots\\ x_{n-1}\\ 0\\ \vdots\\ 0\end{bmatrix}\,. \qquad (4.1.43)$$
Algorithms dealing with circulant matrices make use of their very special spectral properties. Full un-
derstanding requires familiarity with the theory of eigenvalues and eigenvectors of matrices from linear
algebra, see [?, Ch. 7], [?, Ch. 9].
(Experiment: eigenvalues and eigenvectors of circulant matrices with random entries generated by VectorXd::Random(n). Eigenvalue plots (legend: C1: real(ev), C1: imag(ev)) show little relationship between the (complex!) eigenvalues, as can be expected from random entries. Further plots: "Circulant matrix 1, eigenvector 5–8", "Circulant matrix 2, eigenvector 1–8", "random 256×256 circulant matrix, eigenvector 2, 3, 5, 8".)
An abstract result from linear algebra puts the surprising observation made in Exp. 4.2.1 in a wider context.
If A, B ∈ K n,n commute, that is, AB = BA, and A has n distinct eigenvalues, then the
eigenspaces of A and B coincide.
In Exp. 4.2.1 we saw that we get complex eigenvalues/eigenvectors for general circulant matrices. More
generally, in many cases real matrices can be diagonalized only in C, which is the ultimate reason for the
importance of complex numbers.
Complex numbers also allow an elegant handling of trigonometric functions: recall from analysis the unified treatment of trigonometric functions via the complex exponential function.
The field of complex numbers C is the natural framework for the analysis of linear, time-invariant
C! filters, and the development of algorithms for circulant matrices.
Now we verify by direct computations that circulant matrices all have a particular set of eigenvectors. This
will entail computing in C, cf. Rem. 4.2.15.
✎ notation: nth root of unity ωn := exp(−2πi/n ) = cos(2π/n) − i sin(2π/n ), n ∈ N
$$\sum_{k=0}^{n-1} q^k = \frac{1-q^n}{1-q}\qquad\forall\, q\in\mathbb{C}\setminus\{1\}\,,\ n\in\mathbb{N}\,. \qquad (4.2.9)$$
$$\Rightarrow\quad \sum_{k=0}^{n-1}\omega_n^{kj} = \frac{1-\omega_n^{nj}}{1-\omega_n^{j}} = \frac{1-\exp(-2\pi i j)}{1-\exp(-2\pi i j/n)} = 0\,,$$
because $\exp(-2\pi i j) = \omega_n^{nj} = (\omega_n^n)^j = 1$ for all $j\in\mathbb{Z}$.
! In expressions like ωnkl the term “kl ” will always designate an exponent and will never play
the role of a superscript.
Now we want to confirm the conjecture gleaned from Exp. 4.2.1 that vectors with powers of roots of unity
are eigenvectors for any circulant matrix. We do this by simple and straightforward computations:
Consider a circulant matrix C ∈ C^{n,n} (→ Def. 4.1.38) with c_{ij} = u_{i−j} for an n-periodic sequence (u_k)_{k∈Z}, u_k ∈ C.
We "guess" an eigenvector: $v_k := \big[\omega_n^{jk}\big]_{j=0}^{n-1}\in\mathbb{C}^n$, $k\in\{0,\dots,n-1\}$.
Since $(u_{j-l}\,\omega_n^{lk})_{l\in\mathbb{Z}}$ is n-periodic,
$$(Cv_k)_j = \sum_{l=0}^{n-1} u_{j-l}\,\omega_n^{lk} = \sum_{l=j-n+1}^{j} u_{j-l}\,\omega_n^{lk} = \sum_{l=0}^{n-1} u_l\,\omega_n^{(j-l)k} = \omega_n^{jk}\sum_{l=0}^{n-1} u_l\,\omega_n^{-lk} = \lambda_k\cdot\omega_n^{jk} = \lambda_k\cdot(v_k)_j\,, \qquad (4.2.10)$$
with $\lambda_k := \sum_{l=0}^{n-1} u_l\,\omega_n^{-lk}$.
The set {v0 , . . . , vn−1 } ⊂ C n provides the so-called orthogonal trigonometric basis of C n = eigen-
vector basis for circulant matrices
$$\{v_0,\dots,v_{n-1}\} = \left\{\begin{bmatrix}\omega_n^0\\ \omega_n^0\\ \vdots\\ \omega_n^0\end{bmatrix},\;\begin{bmatrix}\omega_n^0\\ \omega_n^1\\ \vdots\\ \omega_n^{n-1}\end{bmatrix},\;\begin{bmatrix}\omega_n^0\\ \omega_n^2\\ \vdots\\ \omega_n^{2(n-1)}\end{bmatrix},\;\dots,\;\begin{bmatrix}\omega_n^0\\ \omega_n^{n-1}\\ \vdots\\ \omega_n^{(n-1)^2}\end{bmatrix}\right\}\,. \qquad (4.2.11)$$
From (4.2.8) we can conclude orthogonality of the basis vectors by straightforward computations:
$$v_k := \big(\omega_n^{jk}\big)_{j=0}^{n-1}\in\mathbb{C}^n:\qquad v_k^H v_m = \sum_{j=0}^{n-1}\omega_n^{-jk}\,\omega_n^{jm} = \sum_{j=0}^{n-1}\omega_n^{(m-k)j} \overset{(4.2.8)}{=} 0\,,\quad\text{if } k\neq m\,. \qquad (4.2.12)$$
The matrix effecting the change of basis from the trigonometric basis to the standard basis is called the Fourier matrix
$$F_n = \begin{bmatrix}\omega_n^0 & \omega_n^0 & \cdots & \omega_n^0\\ \omega_n^0 & \omega_n^1 & \cdots & \omega_n^{n-1}\\ \omega_n^0 & \omega_n^2 & \cdots & \omega_n^{2(n-1)}\\ \vdots & \vdots & & \vdots\\ \omega_n^0 & \omega_n^{n-1} & \cdots & \omega_n^{(n-1)^2}\end{bmatrix} = \big[\omega_n^{lj}\big]_{l,j=0}^{n-1}\in\mathbb{C}^{n,n}\,. \qquad (4.2.13)$$
$$\big(F_nF_n^H\big)_{l,j} = \sum_{k=0}^{n-1}\omega_n^{(l-1)k}\,\overline{\omega_n^{(j-1)k}} = \sum_{k=0}^{n-1}\omega_n^{(l-1)k}\,\omega_n^{-(j-1)k} = \sum_{k=0}^{n-1}\omega_n^{k(l-j)}\,,\qquad 1\le l,j\le n\,.$$
For any circulant matrix C ∈ K^{n,n}, c_{ij} = u_{i−j}, with (u_k)_{k∈Z} an n-periodic sequence, holds true: C = F_n^{−1} diag(F_n u) F_n, where u := [u0, . . . , u_{n−1}]^⊤. (4.2.17)
Lemma 4.2.16, (4.2.17) ➣ multiplication with Fourier-matrix will be crucial operation in algorithms for
circulant matrices and discrete convolutions.
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj}\,,\qquad k = 0,\dots,n-1\,. \qquad (4.2.19)$$
Recall the convention also adopted for the discussion of the DFT: vector indexes range from 0 to n − 1!
From $F_n^{-1} = \frac{1}{n}\overline{F_n}$ (→ Lemma 4.2.14) we find the inverse discrete Fourier transform:
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,. \qquad (4.2.20)$$
#include <unsupported/Eigen/FFT>
Coding the formula from Def. 4.1.33 one would code discrete periodic convolution as follows:
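The original listing is not reproduced here; a straightforward sketch of such a direct implementation (with our own function name; cost O(n²)) could read:

#include <Eigen/Dense>
using Eigen::VectorXd;

// Discrete periodic convolution (Def. 4.1.33): z_k = sum_j u_{(k-j) mod n} * x_j.
VectorXd pconvBasic(const VectorXd& u, const VectorXd& x) {
  const int n = x.size();
  VectorXd z = VectorXd::Zero(n);
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      z(k) += u(((k - j) % n + n) % n) * x(j);  // periodic index wrapping
  return z;
}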
In § 4.1.37 we have seen that periodic convolution (→ Def. 4.1.33) amounts to multiplication with a cir-
culant matrix. In addition, (4.2.17) reduces multiplication with a circulant matrix to two multiplications with
the Fourier matrix Fn (= DFT) and (componentwise) scaling operations.
Summary:
$$\text{discrete periodic convolution}\quad z_k = \sum_{j=0}^{n-1} u_{k-j}\,x_j\ \ (\to\text{Def. 4.1.33})\,,\ k = 0,\dots,n-1$$
⇕
$$\text{multiplication with circulant matrix}\ (\to\text{Def. 4.1.38}):\quad z = Cx\,,\quad C := \big[u_{i-j}\big]_{i,j=1}^{n}\,.$$
Idea: (4.2.17) ➣ $z = F_n^{-1}\operatorname{diag}(F_n u)\,F_n x$, that is,
$$(u) *_n (x) := \sum_{j=0}^{n-1} u_{k-j}\,x_j = F_n^{-1}\Big[\big(F_n u\big)_j\big(F_n x\big)_j\Big]_{j=1}^{n}\,.$$
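This idea translates into a few lines of Eigen code based on the FFT module (a sketch with our own function name; by default Eigen's FFT returns the full spectrum for real input and scales the inverse transform):

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
using Eigen::VectorXd; using Eigen::VectorXcd;

// DFT-based discrete periodic convolution: z = F_n^{-1}((F_n u).*(F_n x)).
VectorXd pconvfft(const VectorXd& u, const VectorXd& x) {
  Eigen::FFT<double> fft;
  VectorXcd uh = fft.fwd(u), xh = fft.fwd(x);
  VectorXcd zh = uh.cwiseProduct(xh);
  VectorXcd z = fft.inv(zh);     // scaled inverse DFT
  return z.real();
}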
In Rem. 4.1.40 we learned that the discrete convolution of n-vectors (→ Def. 4.1.22) can be accomplished
by the periodic discrete convolution of 2n − 1-vectors (obtained by zero padding, see Rem. 4.1.40):
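A sketch of this reduction in Eigen (our own naming): zero-pad both vectors to length 2n−1 and reuse the FFT-based periodic convolution.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
using Eigen::VectorXd; using Eigen::VectorXcd;

// Discrete convolution (Def. 4.1.22) of two n-vectors via zero padding
// to length 2n-1 and DFT-based periodic convolution.
VectorXd fastconv(const VectorXd& h, const VectorXd& x) {
  const int n = h.size();
  VectorXd hp = VectorXd::Zero(2 * n - 1), xp = VectorXd::Zero(2 * n - 1);
  hp.head(n) = h; xp.head(n) = x;            // zero padding
  Eigen::FFT<double> fft;
  VectorXcd hh = fft.fwd(hp), xh = fft.fwd(xp);
  VectorXcd yh = hh.cwiseProduct(xh);
  VectorXcd y = fft.inv(yh);
  return y.real();
}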
The trigonometric basis vectors, when interpreted as time-periodic signals, represent harmonic oscilla-
tions. This is illustrated when plotting some vectors of the trigonometric basis (n = 16):
(Plots: Fourier-basis vectors for n = 16, j = 1, 7, 15; values vs. component index.)
Dominant coefficients of a signal after transformation to trigonometric basis indicate dominant fre-
quency components.
Terminology: coefficients of a signal w.r.t. trigonometric basis = signal in frequency domain, original
signal = time domain.
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,. \qquad (4.2.20)$$
Consider $y_k\in\mathbb{R}$ $\Rightarrow$ $c_k = \overline{c_{n-k}}$, because $\omega_n^{kj} = \overline{\omega_n^{(n-k)j}}$, and $n = 2m+1$:
$$n\,y_j = c_0 + \sum_{k=1}^{m} c_k\,\omega_n^{-kj} + \sum_{k=m+1}^{2m} c_k\,\omega_n^{-kj} = c_0 + \sum_{k=1}^{m}\Big(c_k\,\omega_n^{-kj} + c_{n-k}\,\omega_n^{(k-n)j}\Big) = c_0 + 2\sum_{k=1}^{m}\Big(\operatorname{Re}(c_k)\cos(2\pi kj/n) - \operatorname{Im}(c_k)\sin(2\pi kj/n)\Big)\,,$$
Extraction of characteristic frequencies from a distorted discrete periodic signal, generated by the following C++ code:
(Fig. 142: signal vs. sampling points (time); Fig. 143: power spectrum |c_k|² vs. coefficient index k.)
Fig. 143 was generated by the following M ATLAB code. A corresponding C++ code is also available
➺ GITLAB.
3 f i g u r e (’Name’,’power spectrum’);
4 bar (0:31,p(1:32),’r’);
5 s e t ( gca ,’fontsize’,14);
6 a x i s ([-1 32 0 max(p)+1]);
7 x l a b e l (’{\bf index k of Fourier coefficient}’,’Fontsize’,14);
8 y l a b e l (’{\bf |c_k|^2}’,’Fontsize’,14);
We observe that frequencies present in unperturbed signal become evident in frequency domain, whereas
it is hard to tell them from the time-domain signal.
The following C++ code processes actual web search data and performs a frequency analysis using DFT:
mgl::Figure trend;
trend.plot(x, "r");
trend.title("Google: 'Vorlesungsverzeichnis'");
trend.grid();
trend.xlabel("week (1.1.2004 - 31.12.2010)");
trend.ylabel("relative no. of searches");
trend.save("searchdata");

Eigen::FFT<double> fft;
VectorXcd c = fft.fwd(x);
VectorXd p = c.cwiseProduct(c.conjugate()).real()
              .segment(2, std::floor((n + 1.)/2));
Pronounced peaks in the power spectrum point to periodic structure of the data. Location of peaks tells
lengths of dominant periods.
Plots of real parts of trigonometric basis vectors (Fn ):,j (= columns of Fourier matrix), n = 16.
(Plots of the real parts of the trigonometric basis vectors (F_n)_{:,j}, n = 16, for various j, against the vector component index; Fig. 146.)
Slow oscillations/low frequencies ↔ j ≈ 1 and j ≈ n.
Fast oscillations/high frequencies ↔ j ≈ n/2.
Frequency filtering of real discrete periodic signals by suppressing certain “Fourier coefficients”.
VectorXcd clow = c;
// Set high frequency coefficients to zero, Fig. 146
for (int j = -k; j <= +k; ++j) clow(m + j) = 0;
// (Complementary) vector of high frequency coefficients
VectorXcd chigh = c - clow;
Noisy signal:
n = 256; y = exp(sin(2*pi*((0:n-1)’)/n)) + 0.5*sin(exp(1:n)’);
Frequency filtering by Code 4.2.33 with k = 120.
(Plots: signal, noisy signal, low pass filter output, and high pass filter output vs. time; power spectrum |c_k|² vs. no. of Fourier coefficient.)
Low pass filtering can be used for denoising, that is, the removal of high frequency perturbations of a
signal.
Frequency filtering is ubiquitous in sound processing. Here we demonstrate it in M ATLAB, which offers
tools for audio processing.
7 n = l e n g t h (y);
8 f p r i n t f (’Read wav File: %d samples, rate = %d/s, nbits = %d\n’,
n,Fs,nbits);
9 k = 1; s{k} = y; leg{k} = ’Sampled signal’;
10
11 c = fft(y);
12
13 f i g u r e (’name’,’sound signal’);
14 p l o t ((22000:44000)/Fs,s{1}(22000:44000),’r-’);
15 t i t l e (’samples sound signal’,’fontsize’,14);
16 x l a b e l (’{\bf time[s]}’,’fontsize’,14);
17 y l a b e l (’{\bf sound pressure}’,’fontsize’,14);
18 g r i d on;
19
22 f i g u r e (’name’,’sound frequencies’);
23 p l o t (1:n, abs (c).^2,’m-’);
24 t i t l e (’power spectrum of sound signal’,’fontsize’,14);
25 x l a b e l (’{\bf index k of Fourier coefficient}’,’fontsize’,14);
26 y l a b e l (’{\bf |c_k|^2}’,’fontsize’,14);
27 g r i d on;
28
31 f i g u r e (’name’,’sound frequencies’);
32 p l o t (1:3000, abs (c(1:3000)).^2,’b-’);
33 t i t l e (’low frequency power spectrum’,’fontsize’,14);
34 x l a b e l (’{\bf index k of Fourier coefficient}’,’fontsize’,14);
35 y l a b e l (’{\bf |c_k|^2}’,’fontsize’,14);
36 g r i d on;
37
39
40 f o r m=[1000,3000,5000]
41
53 k = k+1;
54 s{k} = r e a l (yf);
55 leg{k} = sprintf('cut-off = %d',m);
56 end
57
(Fig. 147: sampled sound signal, sound pressure vs. time [s]; Fig. 148: power spectrum of sound signal, |c_k|² vs. index k of Fourier coefficient; further plots: low frequency power spectrum and filtered sound signals.)
The power spectrum of a signal $y\in\mathbb{C}^n$ is the vector $\big(|c_j|^2\big)_{j=0}^{n-1}$, where $c = F_n y$ is the discrete Fourier transform of y.
Every time-discrete signal obtained from sampling a time-dependent physical quantity will yield a real vector. Of course, a real vector contains only half the information compared to a complex vector of the same length. We aim to exploit this for a more efficient implementation of the DFT.
Task: Efficient implementation of DFT (Def. 4.2.18) (c0 , . . . , cn−1 ) for real coefficients (y0 , . . . , yn−1 )⊤ ∈
R n , n = 2m, m ∈ N.
If $y_j\in\mathbb{R}$ in the DFT formula (4.2.19), we obtain redundant output: since $\omega_n^{(n-k)j} = \overline{\omega_n^{kj}}$, $k = 0,\dots,n-1$,
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} = \overline{\sum_{j=0}^{n-1} y_j\,\omega_n^{(n-k)j}} = \overline{c_{n-k}}\,,\qquad k = 1,\dots,n-1\,.$$
With $h_k$ the DFT of the complex vector $\big(y_{2j}+i\,y_{2j+1}\big)_{j=0}^{m-1}$ we get
$$\overline{h_{m-k}} = \overline{\sum_{j=0}^{m-1}\big(y_{2j}+i\,y_{2j+1}\big)\omega_m^{j(m-k)}} = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} - i\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,, \qquad (4.2.39)$$
$$\Rightarrow\quad \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} = \tfrac{1}{2}\big(h_k + \overline{h_{m-k}}\big)\,,\qquad \sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk} = -\tfrac{1}{2}\,i\,\big(h_k - \overline{h_{m-k}}\big)\,.$$
Eigen::FFT<double> fft;
VectorXcd d = fft.fwd(yc), h(m + 1);
h << d, d(0);

c.resize(n);
// Step II: implementation of (4.2.41)
for (unsigned k = 0; k < m; ++k) {
  c(k) = (h(k) + std::conj(h(m-k)))/2. -
         i/2.*std::exp(-2.*k/n*M_PI*i)*(h(k) - std::conj(h(m-k)));
}
c(m) = std::real(h(0)) - std::imag(h(0));
for (unsigned k = m+1; k < n; ++k) c(k) = std::conj(c(n-k));
}
In this section we study the frequency decomposition of matrices, exploiting the natural analogy between vectors (one-dimensional signals) and matrices (two-dimensional signals).
Let a matrix C ∈ C m,n be given as a linear combination of these basis matrices with coefficients y j1 ,j2 ∈ C,
0 ≤ j1 < m, 0 ≤ j2 < n:
$$C = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(F_m)_{:,j_1}\big((F_n)_{:,j_2}\big)^\top\,. \qquad (4.2.45)$$
Then the entries of C can be computed by two nested discrete Fourier transforms:
$$(C)_{k_1,k_2} = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,\omega_m^{j_1 k_1}\,\omega_n^{j_2 k_2} = \sum_{j_1=0}^{m-1}\omega_m^{j_1 k_1}\Big(\sum_{j_2=0}^{n-1}\omega_n^{j_2 k_2}\,y_{j_1,j_2}\Big)\,,\qquad 0\le k_1<m\,,\ 0\le k_2<n\,.$$
The coefficients can also be regarded as entries of a matrix Y ∈ C^{m,n}. Thus we can rewrite the above expressions: for all $0\le k_1<m$, $0\le k_2<n$
$$(C)_{k_1,k_2} = \sum_{j_1=0}^{m-1}\big(F_n\,(Y)_{j_1,:}^\top\big)_{k_2}\,\omega_m^{j_1 k_1} \qquad\Longleftrightarrow\qquad C = F_m\big(F_n Y^\top\big)^\top = F_m\,Y\,F_n\,. \qquad (4.2.46)$$
This formula defines the two-dimensional discrete Fourier transform of the matrix Y ∈ C^{m,n}. By Lemma 4.2.14 we immediately get the inversion formula:
$$C = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(F_m)_{:,j_1}\big((F_n)_{:,j_2}\big)^\top \quad\Rightarrow\quad Y = F_m^{-1}\,C\,F_n^{-1} = \tfrac{1}{mn}\,\overline{F_m}\,C\,\overline{F_n}\,. \qquad (4.2.47)$$
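Using only one-dimensional transforms from Eigen's FFT module, a two-dimensional DFT can be sketched as column transforms followed by row transforms (the function name is ours; this mirrors the role of Code 4.2.48, which is referenced but not reproduced here):

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
using Eigen::MatrixXcd; using Eigen::VectorXcd;

// Two-dimensional DFT C = F_m * Y * F_n, cf. (4.2.46): 1D DFTs of all
// columns (F_m * Y), then 1D DFTs of all rows (multiplication by F_n).
MatrixXcd fft2(const MatrixXcd& Y) {
  const int m = Y.rows(), n = Y.cols();
  MatrixXcd tmp(m, n), C(m, n);
  Eigen::FFT<double> fft;
  for (int j = 0; j < n; ++j) {          // transform all columns
    VectorXcd col = Y.col(j);
    VectorXcd ct = fft.fwd(col);
    tmp.col(j) = ct;
  }
  for (int i = 0; i < m; ++i) {          // transform all rows
    VectorXcd row = tmp.row(i).transpose();
    VectorXcd rt = fft.fwd(row);
    C.row(i) = rt.transpose();
  }
  return C;
}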
The corresponding bilinear operation on matrices,
$$\big(B(X,Y)\big)_{k,\ell} := \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,(Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n}\,,\qquad 0\le k<m\,,\ 0\le\ell<n\,, \qquad (4.2.52)$$
defines the two-dimensional discrete periodic convolution, cf. Def. 4.1.33.
template <typename EigenMatrix>
void pmconv(const EigenMatrix &X, const EigenMatrix &Y, EigenMatrix &Z) {
  using idx_t = typename EigenMatrix::Index;
  using val_t = typename EigenMatrix::Scalar;
  const idx_t n = X.cols(), m = X.rows();
  if ((m != Y.rows()) || (n != Y.cols()))
    throw std::runtime_error("pmconv: size mismatch");
  Z.resize(m, n);
  // Implementation of (4.2.52)
  auto idxwrap = [](const idx_t L, int i) {
    if (i >= L) i -= L; else if (i < 0) i += L;
    return i;
  };
  for (int i = 0; i < m; i++) for (int j = 0; j < n; j++) {
    val_t s = 0;
    for (int k = 0; k < m; k++) for (int l = 0; l < n; l++)
      s += X(k, l) * Y(idxwrap(m, i - k), idxwrap(n, j - l));
    Z(i, j) = s;
  }
}
The 2D discrete periodic convolution admits a diagonalization by switching to the trigonometric basis of
C m,n , analogous to (4.2.17), see Section 4.2.1.
In (4.2.52) set $Y = (F_m)_{:,r}\,(F_n)_{s,:}\in\mathbb{C}^{m,n}$, i.e. $(Y)_{i,j} = \omega_m^{ri}\,\omega_n^{sj}$, $0\le i<m$, $0\le j<n$:
$$\big(B(X,Y)\big)_{k,\ell} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,(Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,\omega_m^{r(k-i)}\,\omega_n^{s(\ell-j)} = \Big(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,\overline{\omega_m^{ri}}\,\overline{\omega_n^{sj}}\Big)\cdot\omega_m^{rk}\,\omega_n^{s\ell}\,.$$
$$B\big(X,\underbrace{(F_m)_{:,r}(F_n)_{s,:}}_{\text{“eigenvector”}}\big) = \underbrace{\Big(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,\overline{\omega_m^{ri}}\,\overline{\omega_n^{sj}}\Big)}_{\text{see Eq. (4.2.45)}}\,(F_m)_{:,r}(F_n)_{s,:}\,. \qquad (4.2.54)$$
Hence, the (complex conjugated) two-dimensional discrete Fourier transform of X according to (4.2.45) provides the eigenvalues of the linear mapping Y ↦ B(X, Y), X ∈ C^{m,n} fixed.
This suggests the following DFT-based algorithm for evaluating the periodic convolution of matrices:
➊ Compute Ŷ by inverse 2D DFT of Y, see Code 4.2.49
➋ Compute X̂ by 2D DFT of X, see Code 4.2.48.
➌ Component-wise multiplication of X̂ and Ŷ: Ẑ = X̂. ∗ Ŷ.
➍ Compute Z through inverse 2D DFT of Ẑ.
2D discrete convolutions are important for image processing. Let a Gray-scale pixel image be stored in
the matrix P ∈ R m,n , actually P ∈ {0, . . . , 255}m,n , see also Ex. 9.3.24.
Blurring = pixel values get replaced by weighted averages of near-by pixel values
(effect of distortion in optical transmission systems)
$$c_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,p_{l+k,j+q}\,,\qquad 0\le l<m\,,\ 0\le j<n\,,\quad L\in\{1,\dots,\min\{m,n\}\}\,. \qquad (4.2.57)$$
Does this ring a bell? Hidden in (4.2.57) is a 2D discrete periodic convolution, see Eq. (4.2.52). In light of
the algorithm implemented in Code 4.2.55 it is hardly surprising that DFT comes handy for reversing the
effect of the blurring!
Note that usually: L is small, s_{k,q} ≥ 0, and ∑_{k=−L}^{L} ∑_{q=−L}^{L} s_{k,q} = 1 (an averaging).
In the experiments we used L = 5 and the PSF s_{k,q} = 1/(1 + k² + q²).
MatrixXd C(m, n);
for (long l = 1; l <= m; ++l) {
  for (long j = 1; j <= n; ++j) {
    double s = 0;
    for (long k = 1; k <= (2*L+1); ++k) {
      for (long q = 1; q <= (2*L+1); ++q) {
        double kl = l + k - L - 1;
        if (kl < 1) kl += m;
        else if (kl > m) kl -= m;
        double jm = j + q - L - 1;
        if (jm < 1) jm += n;
        else if (jm > n) jm -= n;
        s += P(kl-1, jm-1)*S(k-1, q-1);
      }
    }
    C(l-1, j-1) = s;
  }
}
return C;
}
Now we revisit the considerations of § 4.2.43 and recall the derivation of (4.2.10) and Lemma 4.2.16.
$$\Big(B\big((\omega_m^{\nu k}\omega_n^{\mu q})_{k,q\in\mathbb{Z}}\big)\Big)_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu(l+k)}\,\omega_n^{\mu(j+q)} = \omega_m^{\nu l}\,\omega_n^{\mu j}\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}\,.$$
➣ $V_{\nu,\mu} := \big(\omega_m^{\nu k}\,\omega_n^{\mu q}\big)_{k,q\in\mathbb{Z}}$, $0\le\nu<m$, $0\le\mu<n$, are the eigenvectors of B:
$$B\,V_{\nu,\mu} = \lambda_{\nu,\mu}\,V_{\nu,\mu}\,,\qquad \text{eigenvalue}\quad \lambda_{\nu,\mu} = \underbrace{\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}}_{\text{2-dimensional DFT of the point spread function!}} \qquad (4.2.60)$$
Thus the inversion of the blurring operator boils down to componentwise scaling in the "Fourier domain"; see also Code 4.2.55 for the same idea.
Starting from Rem. 4.1.29 we mainly looked at time-discrete n-periodic signals, which can be mapped to
vectors ∈ R n . This led to discrete periodic convolution (→ Def. 4.1.33) and the discrete Fourier transform
(DFT) (→ Def. 4.2.18) as (bi-)linear mappings in C n .
In this section we are concerned with non-periodic signals of infinite duration as introduced in § 4.0.1.
Idea: Study the limit n → ∞ for the n-periodic setting and DFT.
Now we associate a point t_k ∈ [0, 1[ with each index k of the components of the transformed signal (c_k)_{k=0}^{n−1}:
$$k\in\{0,\dots,n-1\}\quad\longleftrightarrow\quad t_k := \frac{k}{n}\,. \qquad (4.2.63)$$
"Squeezing" a vector ∈ R^n into [0, 1[ (Fig. 153): we can read the values c_k as sampled values of a function defined on [0, 1[,
$$c_k \;\leftrightarrow\; c(t_k)\,,\qquad t_k = \frac{k}{n}\,,\quad k = 0,\dots,n-1\,.$$
This makes it possible to pass from a discrete finite signal to a continuous signal. The notation indicates that we read c_k as the value of a function c : [0, 1[ → C for the argument t_k.
Bi-infinite discrete signal, "concentrated around 0": $y_j = \dfrac{1}{1+j^2}$, $j\in\mathbb{Z}$.
We examine the DFT of the (2m+1)-periodic signal obtained by periodic extension of $(y_k)_{k=-m}^{m}$.
C++11 code 4.2.66: Plotting a periodically truncated signal and its DFT ➺ GITLAB
Now we pass to the limit m → ∞ (and keep the function perspective c_k = c(t_k)), which leads to

c(t) = ∑_{k∈Z} y_k exp(−2πıkt) .   (4.2.67)

Terminology: The series (= infinite sum) on the right-hand side of (4.2.67) is called a Fourier series.
The function c : [0, 1[ → C defined by (4.2.67) is called the Fourier transform of the sequence (y_k)_{k∈Z} (if the series converges).
Fourier transform = weighted sum of Fourier modes t ↦ exp(−2πıkt), k ∈ Z.
(Fig. 162: the Fourier transform as a superposition of individual Fourier modes, shown in separate panels.)
c(t) = ∑_{k∈Z} 1/(1 + k²) · exp(−2πıkt) = π/(e^π − e^{−π}) · ( e^{π−2πt} + e^{2πt−π} ) ∈ C^∞([0, 1]) .

Note that when considered as a 1-periodic function on R, this c(t) is merely continuous.
(4.2.71) ⇒ the Fourier series (4.2.67) converges uniformly [?, Def. 4.8.1]
⇒ c : [0, 1[ → C is continuous [?, Thm. 4.8.1].
Assuming sufficiently fast decay of the signal (yk )k∈Z for k → ∞ (→ Rem. 4.2.69), we can approximate
the Fourier series (4.2.67) by a Fourier sum
c(t) ≈ c_M(t) := ∑_{k=−M}^{M} y_k exp(−2πikt) ,   M ≫ 1 .   (4.2.73)
Task: Approximate evaluation of c(t) at N equidistant points t_j := j/N, j = 0, …, N − 1 (e.g., for plotting it).

c(t_j) = lim_{M→∞} ∑_{k=−M}^{M} y_k exp(−2πikt_j) ≈ ∑_{k=−M}^{M} y_k exp(−2πi kj/N) ,   (4.2.74)

for j = 0, …, N − 1.
C++11 code 4.2.75: DFT-based evaluation of Fourier sum at equidistant points ➺ GITLAB
2  #include "feval.hpp" // evaluate scalar function with a vector
3  // DFT based approximate evaluation of Fourier series
4  // signal is a handle to a function providing the yk
5  // M specifies truncation of series according to (4.2.73)
6  // N is the number of equidistant evaluation points for c in [0, 1[.
7  template <class Function>
8  VectorXcd foursum(const Function &signal, int M, int N) {
9    const int m = 2*M + 1; // length of the signal
10   // sample signal
11   VectorXd y = feval(signal, VectorXd::LinSpaced(m, -M, M));
12   // Ensure that there are more sampling points than terms in series
13   int l; if (m > N) { l = ceil(double(m)/N); N *= l; } else l = 1;
14   // Zero padding and wrapping of signal, see Code 4.2.33
15   VectorXd y_ext = VectorXd::Zero(N);
16   y_ext.head(M+1) = y.tail(M+1);
17   y_ext.tail(M) = y.head(M);
18   // Perform DFT and decimate output vector
19   Eigen::FFT<double> fft;
20   Eigen::VectorXcd k = fft.fwd(y_ext), c(N/l);
21   for (int i = 0; i < N/l; ++i) c(i) = k(i*l);
22   return c;
23 }
Infinite signal, satisfying the decay condition (4.2.71):  y_k = 1/(1 + k²) , see Ex. 4.2.65.
Monitored: approximation of the Fourier transform c(t) by Fourier sums c_m(t), see (4.2.73).
(Fig. 163: Fourier transform c(t) of (1/(1+k²))_{k∈Z}.  Fig. 164: Fourier sum approximations c_m(t) with 2m+1 terms, y_k = 1/(1+k²), for m = 2, 4, 8, 16, 32.)
Observation: Convergence of Fourier sums in “eyeball norm”; quantitative statements about convergence
can be deduced from Thm. 4.2.89.
y_j = (1/n) ∑_{k=0}^{n−1} c_k exp(2πi jk/n) ,   j = −m, …, m .   (4.2.77)

y_j = (1/n) ∑_{k=0}^{n−1} c(t_k) exp(2πij t_k) ,   j = −m, …, m .   (4.2.78)

Idea: the right-hand side of (4.2.78) = Riemann sum, cf. [?, Sect. 6.2]

y_j = ∫_0^1 c(t) exp(2πijt) dt .   (4.2.79)
The formula (4.2.79) allows us to recover the signal (y_k)_{k∈Z} from its Fourier transform c(t).
Terminology: y_j from (4.2.79) is called the j-th Fourier coefficient of the function c.
✎ Notation: ĉ_j := y_j with y_j defined by (4.2.79) ≙ j-th Fourier coefficient of c : [0, 1[ → C
Summary of a fundamental correspondence between a (continuous) function c : [0, 1[ → C and a (bi-infinite) sequence (ĉ_j)_{j∈Z}:

ĉ_j = ∫_0^1 c(t) exp(2πıjt) dt   (Fourier coefficients of c) ,
c(t) = ∑_{k∈Z} ĉ_k exp(−2πıkt)   (Fourier series / Fourier transform of the sequence) .
What happens to the Fourier transform of a bi-infinite signal, if it passes through a channel?
Consider a (bi-)infinite signal (x_k)_{k∈Z} sent through a finite (→ Def. 4.1.4), linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), causal (→ Def. 4.1.11) channel with impulse response (→ § 4.1.3) (…, 0, h_0, …, h_{n−1}, 0, …) (→ § 4.1.1).
c(t) = ∑_{k∈Z} y_k exp(−2πıkt) = ∑_{k∈Z} ∑_{j=0}^{n−1} h_j x_{k−j} exp(−2πıkt)
     [shift summation index k]  = ∑_{j=0}^{n−1} ∑_{k∈Z} h_j x_k exp(−2πıjt) exp(−2πıkt)   (4.2.82)
     = ( ∑_{j=0}^{n−1} h_j exp(−2πıjt) ) b(t) ,

where b(t) = ∑_{k∈Z} x_k exp(−2πıkt) is the Fourier transform of the input signal, and the factor in brackets is a trigonometric polynomial of degree n − 1.
Lemma 4.2.14 ➣ for the Fourier matrix F_n, see (4.2.13), (1/√n) F_n is unitary (→ Def. 6.2.2). By Thm. 3.3.5,

‖ (1/√n) F_n y ‖_2 = ‖y‖_2 .   (4.2.86)

Since the DFT boils down to multiplication with F_n (→ Def. 4.2.18), we conclude from (4.2.86), with c_k from (4.2.62),

(1/n) ∑_{k=0}^{n−1} |c_k|² = ∑_{j=−m}^{m} |y_j|² .   (4.2.87)
Now we adopt the function perspective again and associate c_k ↔ c(t_k). Then we pass to the limit m → ∞, appeal to Riemann summation (see above), and conclude

(4.2.87)  ⟹ (m → ∞)  ∫_0^1 |c(t)|² dt = ∑_{j∈Z} |y_j|² .   (4.2.88)
Recalling the concept of the L2 -norm of a function, see (5.2.67), the theorem can be stated as follows:
Thm. 4.2.89 ↔ The L2 -norm of a Fourier transform agrees with the Euclidean norm of the
corresponding signal.
Note: the Euclidean norm of a sequence is  ‖(y_k)_{k∈Z}‖_2² := ∑_{k∈Z} |y_k|² .
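A quick stand-alone numerical check of the discrete identity (4.2.86)/(4.2.87) using Eigen's FFT module (illustrative sketch only):

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

int main() {
  const int n = 64;
  Eigen::VectorXd y = Eigen::VectorXd::Random(n);
  Eigen::VectorXcd c;
  Eigen::FFT<double> fft;
  fft.fwd(c, y);                      // unscaled forward DFT, c = F_n y
  std::cout << "lhs = " << c.squaredNorm() / n
            << ", rhs = " << y.squaredNorm() << std::endl;  // should agree
  return 0;
}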
You might have been wondering why the reduction to DFTs received so much attention in Section 4.2.1. An explanation is given now.
Supplementary reading. [?, Sect. 8.7.3], [?, Sect. 53], [?, Sect. 10.9.2]
At first glance (at (4.2.19)): DFT in C n seems to require asymptotic computational effort of O(n2 ) (matrix×vector
multiplication with dense matrix).
(Figure: timings in MATLAB — runtime of the DFT plotted against the vector length n, up to n = 3000, on a logarithmic scale.)
For n = 2m the DFT (4.2.19) can be split into even- and odd-indexed terms:

c_k = ∑_{j=0}^{n−1} y_j e^{−2πı jk/n}
    = ∑_{j=0}^{m−1} y_{2j} e^{−2πı jk/m} + e^{−2πı k/n} · ∑_{j=0}^{m−1} y_{2j+1} e^{−2πı jk/m}   (4.3.4)
    =: c̃_k^{even} + e^{−2πı k/n} · c̃_k^{odd}     (note e^{−2πı jk/m} = ω_m^{jk}) .

Note: c̃_k^{even}, c̃_k^{odd} come from DFTs of length m!

with y_even := (y_0, y_2, …, y_{n−2})^⊤ ∈ C^m :   (c̃_k^{even})_{k=0}^{m−1} = F_m y_even ,
with y_odd  := (y_1, y_3, …, y_{n−1})^⊤ ∈ C^m :   (c̃_k^{odd})_{k=0}^{m−1} = F_m y_odd .

(4.3.4):  DFT of length 2m = 2× DFT of length m + 2m additions & multiplications
FFT-algorithm: apply (4.3.4) recursively; for n = 2^L the recursion terminates with 2^L DFTs of length 1.
In the recursive implementation (Code 4.3.5) each level of the recursion requires O(2^L) elementary operations, and there are L levels.
Asymptotic complexity of the FFT algorithm: O(n log₂ n) for n = 2^L. A recursive sketch is given below.
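A minimal recursive sketch implementing the splitting (4.3.4) for n = 2^L (illustrative only; the lecture's Code 4.3.5 may differ in details, and the name fft_recursive is chosen here for illustration):

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Recursive radix-2 FFT realizing (4.3.4); requires n = 2^L, complexity O(n log n).
std::vector<std::complex<double>>
fft_recursive(const std::vector<std::complex<double>> &y) {
  const std::size_t n = y.size();
  if (n == 1) return y;                  // DFT of length 1 is the identity
  const std::size_t m = n / 2;
  std::vector<std::complex<double>> yeven(m), yodd(m);
  for (std::size_t j = 0; j < m; ++j) {  // split into even/odd samples
    yeven[j] = y[2 * j];
    yodd[j] = y[2 * j + 1];
  }
  const auto ceven = fft_recursive(yeven);   // two DFTs of length m
  const auto codd = fft_recursive(yodd);
  std::vector<std::complex<double>> c(n);
  const double pi = std::acos(-1.0);
  for (std::size_t k = 0; k < n; ++k) {      // combine according to (4.3.4)
    const double arg = -2.0 * pi * double(k) / double(n);
    const std::complex<double> w(std::cos(arg), std::sin(arg));
    c[k] = ceven[k % m] + w * codd[k % m];
  }
  return c;
}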
For n = 2m, m ∈ N, as ω_n^{2j} = ω_m^j, the splitting (4.3.4) can be expressed as a matrix factorization:
P_m^{OE} F_n = [ F_m                                          F_m                                             ]
               [ F_m · diag(ω_n^0, ω_n^1, …, ω_n^{n/2−1})      F_m · diag(ω_n^{n/2}, ω_n^{n/2+1}, …, ω_n^{n−1}) ]

             = [ F_m   0   ] · [ I                                       I                                        ]
               [ 0     F_m ]   [ diag(ω_n^0, ω_n^1, …, ω_n^{n/2−1})      −diag(ω_n^0, ω_n^1, …, ω_n^{n/2−1})      ] ,

where P_m^{OE} denotes the permutation of the rows of F_n that puts the even-indexed rows first (and we used ω_n^{n/2+j} = −ω_n^j).
P_5^{OE} F_10 =   (ω := ω_10)
ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0
ω0 ω2 ω4 ω6 ω8 ω0 ω2 ω4 ω6 ω8
ω0 ω4 ω8 ω2 ω6 ω0 ω4 ω8 ω2 ω6
ω0 ω6 ω2 ω8 ω4 ω0 ω6 ω2 ω8 ω4
ω0 ω8 ω6 ω4 ω2 ω0 ω8 ω6 ω4 ω2
ω0 ω1 ω2 ω3 ω4 ω5 ω6 ω7 ω8 ω9
ω0 ω3 ω6 ω9 ω2 ω5 ω8 ω1 ω4 ω7
ω0 ω5 ω0 ω5 ω0 ω5 ω0 ω5 ω0 ω5
ω0 ω7 ω4 ω1 ω8 ω5 ω2 ω9 ω6 ω3
ω0 ω9 ω8 ω7 ω6 ω5 ω4 ω3 ω2 ω1
To compute an n-point DFT when n is composite (that is, when n = pq), the FFTW library decomposes the
problem using the Cooley-Tukey algorithm, which first computes p transforms of size q, and then computes
q transforms of size p. The decomposition is applied recursively to both the p- and q-point DFTs until the
problem can be solved using one of several machine-generated fixed-size "codelets." The codelets in turn
use several algorithms in combination, including a variation of Cooley-Tukey, a prime factor algorithm, and
a split-radix algorithm. The particular factorization of n is chosen heuristically.
The execution time for fft depends on the length of the transform. It is fastest for powers of two. It is
almost as fast for lengths that have only small prime factors. It is typically several times slower for
lengths that are prime or which have large prime factors → Ex. 4.3.12.
c_k = ∑_{j=0}^{n−1} y_j ω_n^{jk}  [j =: lp + m]  = ∑_{m=0}^{p−1} ∑_{l=0}^{q−1} y_{lp+m} e^{−2πı (lp+m)k/(pq)} = ∑_{m=0}^{p−1} ω_n^{mk} ∑_{l=0}^{q−1} y_{lp+m} ω_q^{l(k mod q)} .   (4.3.9)

Step I: perform p DFTs of length q:  z_{m,k} := ∑_{l=0}^{q−1} y_{lp+m} ω_q^{lk} ,  0 ≤ m < p, 0 ≤ k < q.
Step II: multiply by the twiddle factors ω_n^{mk} and perform q DFTs of length p (the original figure illustrates the p×q data layouts used in Step I and Step II).
When n ≠ 2^L, even the Cooley-Tukey algorithm of Rem. 4.3.8 will eventually lead to a DFT for a vector with prime length.
Quoted from the M ATLAB manual:
When n is a prime number, the FFTW library first decomposes an n-point problem into three (n − 1)-point
problems using Rader’s algorithm [?]. It then uses the Cooley-Tukey decomposition described above to
compute the (n − 1)-point DFTs.
For the Fourier matrix F_p = (f_{ij})_{i,j=1}^{p}, the lower-right block (f_{ij})_{i,j=2}^{p} becomes a circulant matrix after a suitable permutation P_{p,g} of its rows and columns (g a generator of the multiplicative group modulo p).
F_13 −→   (after permutation, ω := ω_13)
ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0
ω0 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1
ω0 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7
ω0 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10
ω0 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5
ω0 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9
ω0 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11
ω0 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12
ω0 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6
ω0 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3
ω0 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8
ω0 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4
ω0 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2
Then apply fast algorithms for multiplication with circulant matrices (= discrete periodic convolution, see § 4.1.37) to the lower-right (n − 1) × (n − 1) block of the permuted Fourier matrix. These fast algorithms rely on DFTs of length n − 1, see Code 4.2.25.
(← Section 4.2.1)
Asymptotic complexity of discrete periodic convolution, see Code 4.2.25:
Cost(pconvfft(u,x), u, x ∈ C^n) = O(n log n).
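A minimal sketch of the idea behind such a discrete periodic convolution routine (illustrative only, not necessarily identical to the lecture's Code 4.2.25): a circulant matrix is diagonalized by the Fourier basis, so the product reduces to three FFTs.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

// z_k = sum_j u_j x_{(k-j) mod n}, computed as z = ifft( fft(u) .* fft(x) )
Eigen::VectorXcd pconvfft(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x) {
  Eigen::FFT<double> fft;
  Eigen::VectorXcd uh, xh, z;
  fft.fwd(uh, u);
  fft.fwd(xh, x);
  Eigen::VectorXcd zh = uh.cwiseProduct(xh);
  fft.inv(z, zh);                   // Eigen's inv includes the 1/n scaling
  return z;
}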
From FFTW homepage: FFTW is a C subroutine library for computing the dis-
crete Fourier transform (DFT) in one or more dimensions, of arbitrary input size,
and of both real and complex data.
FFTW will perform well on most architectures without modification. Hence
the name, "FFTW," which stands for the somewhat whimsical title of “Fastest
Fourier Transform in the West.”
Supplementary reading. [?] offers a comprehensive presentation of the design and implemen-
tation of the FFTW library (version 3.x). This paper also conveys the many tricks it takes to achieve
satisfactory performance for DFTs of arbitrary length.
FFTW can be installed from source following the instructions from the installation page after downloading
the source code of FFTW 3.3.5 from the download page. Precompiled binaries for various linux distribu-
tions are available in their main package repositories:
Platform:
✦ Linux (Ubuntu 16.04 64bit)
✦ Intel(R) Core(TM) i7-4600U CPU @
2.10GHz
✦ L2 256KB, L3 4 MB, 8 GB DDR3 @ 1.60GHz
✦ Clang 3.8.0, -O3
For reasonably high input sizes the FFTW backend
gives, compared to E IGEN’s default backend (Kiss
FFT), a speedup of 2-4x.
Supplementary reading. [?, Sect. 55], see also [?] for an excellent presentation of various
Keeping in mind exp(2πix ) = cos(2πx ) + ı sin(2πx ) we may also consider the real/imaginary parts
of the Fourier basis vectors (Fn ):,j as bases of R n and define the corresponding basis transformation.
They can all be realized by means of fft with an asymptotic computational effort of O(n log n). These
transformations avoid the use of complex numbers.
Basis transform matrix (sine basis → standard basis):  S_n := (sin(jkπ/n))_{j,k=1}^{n−1} ∈ R^{n−1,n−1} .

Sine transform of y = [y_1, …, y_{n−1}]^⊤ ∈ R^{n−1} :   s_k = ∑_{j=1}^{n−1} y_j sin(πjk/n) ,  k = 1, …, n − 1 .   (4.4.2)
By elementary considerations we can devise a DFT-based algorithm for the sine transform (≙ S_n × vector).
Tool: "wrap around" to ỹ ∈ R^{2n} (ỹ "odd"):

ỹ_j = y_j  if j = 1, …, n − 1 ,    ỹ_j = 0  if j = 0, n ,    ỹ_j = −y_{2n−j}  if j = n + 1, …, 2n − 1 .
(Figure: the vector (y_j) on {1, …, n−1} and its odd periodic extension (ỹ_j) on {0, …, 2n−1}.)
Next we use sin(x) = (1/2ı)(exp(ıx) − exp(−ıx)) to identify the DFT of the wrapped-around vector as a sine transform:

(F_{2n} ỹ)_k  =(4.2.19)=  ∑_{j=1}^{2n−1} ỹ_j e^{−2πı kj/(2n)} = ∑_{j=1}^{n−1} y_j e^{−πı kj/n} − ∑_{j=n+1}^{2n−1} y_{2n−j} e^{−πı kj/n}
            = ∑_{j=1}^{n−1} y_j ( e^{−πı kj/n} − e^{πı kj/n} ) = −2ı (S_n y)_k ,   k = 1, …, n − 1 .
9   Eigen::VectorXcd ct;
10  Eigen::FFT<double> fft; // DFT helper class
11  fft.SetFlag(Eigen::FFT<double>::Flag::Unscaled);
12  fft.fwd(ct, yt);
13
The simple Code 4.4.3 relies on a DFT for vectors of length 2n, which may be a waste of computational
resources in some applications. A DFT of length n is sufficient as demonstrated by the following manipu-
lations.
Step ➀: transform the coefficients: set ỹ_0 := 0 and ỹ_j := sin(πj/n)·(y_j + y_{n−j}) + ½·(y_j − y_{n−j}), j = 1, …, n − 1 (cf. the code below).

Step ➁: real DFT (→ Section 4.2.3) of (ỹ_0, …, ỹ_{n−1}) ∈ R^n :   c_k := ∑_{j=0}^{n−1} ỹ_j e^{−2πı jk/n} .

Hence
Re{c_k} = ∑_{j=0}^{n−1} ỹ_j cos(2πjk/n) = ∑_{j=1}^{n−1} (y_j + y_{n−j}) sin(πj/n) cos(2πjk/n)
        = ∑_{j=1}^{n−1} 2 y_j sin(πj/n) cos(2πjk/n) = ∑_{j=1}^{n−1} y_j ( sin((2k+1)πj/n) − sin((2k−1)πj/n) )
        = s_{2k+1} − s_{2k−1} ,

Im{c_k} = ∑_{j=0}^{n−1} ỹ_j sin(−2πjk/n) = −∑_{j=1}^{n−1} ½ (y_j − y_{n−j}) sin(2πjk/n) = −∑_{j=1}^{n−1} y_j sin(2πjk/n)
        = −s_{2k} .

Step ➂: extraction of the s_k:
s_{2k+1}, k = 0, …, n/2 − 1  ➤ from the recursion s_{2k+1} − s_{2k−1} = Re{c_k}, started with s_1 = ∑_{j=1}^{n−1} y_j sin(πj/n) ,
s_{2k}, k = 1, …, n/2 − 1  ➤ s_{2k} = −Im{c_k} .
11  // Transform coefficients
12  Eigen::VectorXd yt(n);
13  yt(0) = 0;
14  yt.tail(n-1) = sinevals.array() * (y + y.reverse()).array() +
                   0.5 * (y - y.reverse()).array();
15
16  // FFT
17  Eigen::VectorXcd c;
18  Eigen::FFT<double> fft;
19  fft.fwd(c, yt);
20
21  s.resize(n);
22  s(0) = sinevals.dot(y);
23
Matrix X ∈ R^{n,n}  ↔  grid function {1, …, n}² → R.
(Figure: visualization of a grid function on an n × n tensor grid.)

The identification R^{n,n} ≅ R^{n²}, x_{ij} ∼ x̃_{(j−1)n+i} (row-wise numbering), gives a matrix representation T ∈ R^{n²,n²} of T:

T = [ C       c_y·I   0       ···     0     ]
    [ c_y·I   C       c_y·I           ⋮     ]
    [ 0       ⋱       ⋱       ⋱      0     ]
    [ ⋮               c_y·I   C       c_y·I ]
    [ 0       ···     ···     c_y·I   C     ]   ∈ R^{n²,n²} ,

C = [ c     c_x   0     ···   0   ]
    [ c_x   c     c_x         ⋮   ]
    [ 0     ⋱     ⋱     ⋱    0   ]
    [ ⋮           c_x   c     c_x ]
    [ 0     ···   ···   c_x   c   ]   ∈ R^{n,n} .

(The accompanying sketch shows the 5-point coupling stencil with weights c, c_x, c_y on the grid, numbered row-wise 1, 2, 3, …, n+1, n+2, ….)
The key observation is that the elements of the sine basis are eigenvectors of T:

(T(B_{kl}))_{ij} = c · sin(π/(n+1)·ki) sin(π/(n+1)·lj)
                 + c_y · sin(π/(n+1)·ki) ( sin(π/(n+1)·l(j−1)) + sin(π/(n+1)·l(j+1)) )
                 + c_x · sin(π/(n+1)·lj) ( sin(π/(n+1)·k(i−1)) + sin(π/(n+1)·k(i+1)) )
               = sin(π/(n+1)·ki) sin(π/(n+1)·lj) · ( c + 2 c_y cos(π/(n+1)·l) + 2 c_x cos(π/(n+1)·k) ) .

Hence B_{kl} is an eigenvector of T (or of T after row-wise numbering) and the corresponding eigenvalue is c + 2 c_y cos(π/(n+1)·l) + 2 c_x cos(π/(n+1)·k). Recall the very similar considerations for discrete (periodic) convolutions in 1D (→ § 4.2.6) and 2D (→ § 4.2.51).
The basis transform can be implemented efficiently based on the 1D sine transform:

X = ∑_{k=1}^{n} ∑_{l=1}^{n} y_{kl} B_{kl}   ⇒   x_{ij} = ∑_{k=1}^{n} sin(π/(n+1)·ki) ∑_{l=1}^{n} y_{kl} sin(π/(n+1)·lj) .

Hence nested sine transforms (→ Section 4.2.4) for the rows/columns of Y = (y_{kl})_{k,l=1}^{n}.
Here: implementation of sine transform (4.4.2) with “wrapping”-technique.
7   Eigen::VectorXcd c;
8   Eigen::FFT<double> fft;
9   std::complex<double> i(0, 1);
10
8   // Eigen's meshgrid
9   Eigen::MatrixXd I =
        Eigen::RowVectorXd::LinSpaced(n, 1, n).replicate(m, 1);
10  Eigen::MatrixXd J =
        Eigen::VectorXd::LinSpaced(m, 1, m).replicate(1, n);
11
12  // FFT
13  Eigen::MatrixXd X_;
14  sinetransform2d(B, X_);
15
16  // Translation
17  Eigen::MatrixXd T;
18  T = c + 2*cx*(M_PI/(n+1)*I).array().cos() +
19      2*cy*(M_PI/(m+1)*J).array().cos();
20  X_ = X_.cwiseQuotient(T);
21
22  sinetransform2d(X_, X);
23  X = 4*X/((m+1)*(n+1));
24 }
Thus the diagonalization of T via the 2D sine transform yields an efficient algorithm for solving the linear system of equations T(X) = B: computational cost O(n² log n).
In the experiment we test the gain in runtime obtained by using DFT-based algorithms for solving linear
systems of equations with coefficient matrix T induced by the operator T from (4.4.7)
MATLAB test code (excerpt):
A = gallery('poisson',n);
B = magic(n);
b = reshape(B,n*n,1);
tic; C = fftsolve(B,4,-1,-1); t1 = toc;

(Figure: runtime [s] versus n, comparing the FFT-based solver with MATLAB's backslash solver.)
Cosine transform of y = [y_0, …, y_{n−1}]^⊤ :

c_k = ∑_{j=0}^{n−1} y_j cos( k (2j+1)/(2n) π ) ,  k = 1, …, n − 1 ,   (4.4.13)
c_0 = (1/√2) ∑_{j=0}^{n−1} y_j .
6   Eigen::VectorXd y_(2*n);
7   y_.head(n) = y;
8   y_.tail(n) = y.reverse();
9
10  // FFT
11  Eigen::VectorXcd z;
12  Eigen::FFT<double> fft;
13  fft.fwd(z, y_);
14
Implementation of C_n^{−1} y ("wrapping" technique):
18  // FFT
19  Eigen::VectorXd z;
20  Eigen::FFT<double> fft;
21  fft.inv(z, c_2);
22
29  y = 2*y_.head(n);
30 }
This task reminds us of the parameter estimation problem from Ex. 3.0.5, which we tackled with least squares techniques. We employ similar ideas for the current problem.
(Figure: input signal (x_k) and output signal (y_k) plotted over time.)
If the yk were exact, we could retrieve h0 , . . . , hn−1 by examining only y0 , . . . , yn−1 and inverting the
discrete periodic convolution (→ Def. 4.1.33) using (4.2.17).
However, in case the yk are affected by measurements errors it is advisable to use all available yk for a
least squares estimate of the impulse response.
We can now formulate the least squares parameter identification problem: seek h = (h_0, …, h_{n−1})^⊤ ∈ R^n with

‖A h − y‖_2 → min ,   y := [y_0, …, y_{m−1}]^⊤ ,

where A ∈ R^{m,n} has entries (A)_{ij} = x_{i−j}:

A = [ x_0       x_{−1}    ···      ···    x_{1−n}  ]
    [ x_1       x_0       x_{−1}          ⋮       ]
    [ ⋮        x_1       x_0      ⋱      ⋮       ]
    [ ⋮                  ⋱        ⋱      x_{−1}   ]
    [ x_{n−1}             x_1      x_0             ]
    [ x_n       x_{n−1}            x_1             ]
    [ ⋮                            ⋮              ]
    [ x_{m−1}   ···       ···      x_{m−n}         ] .
➣ Linear least squares problem, → Chapter 3 with a coefficient matrix A that enjoys the property that
(A)ij = xi − j (constant entries of diagonals).
The coefficient matrix for the normal equations (→ Section 3.1.2, Thm. 3.1.10) corresponding to the above linear least squares problem is

M := A^H A ,   (M)_{ij} = ∑_{k=1}^{m} x_{k−i} x_{k−j} = z_{i−j}   due to the periodicity of (x_k)_{k∈Z} .
We consider a sequence of scalar random variables: (Yk )k∈Z , a so-called Markov chain. These can be
thought of as values for a random quantity sampled at equidistant points in time.
Assume: stationary (time-independent) correlation, that is, with (A, Ω, dP) denoting the underlying probability space,

E(Y_{i−j} Y_{i−k}) = ∫_Ω Y_{i−j}(ω) Y_{i−k}(ω) dP(ω) = u_{k−j}   ∀ i, j, k ∈ Z ,   u_i = u_{−i} .
Estimator:   x = argmin_{x ∈ R^n} E | Y_i − ∑_{j=1}^{n} x_j Y_{i−j} |²   (4.5.4)
By definition A is a so-called covariance matrix and, as such, has to be symmetric and positive definite
(→ Def. 1.1.8). Also note that
x^⊤ A x − 2 b^⊤ x = (x − x∗)^⊤ A (x − x∗) − (x∗)^⊤ A x∗ , with x∗ := A^{−1} b. Therefore x∗ is the unique minimizer of x^⊤ A x − 2 b^⊤ x. The problem is reduced to solving the linear system of equations A x = b (Yule-Walker equation, see below).
Matrices with constant diagonals occur frequently in mathematical models, see Ex. 4.5.1, ??. They gen-
eralize circulant matrices (→ Def. 4.1.38).
Note: “Information content” of a matrix M ∈ K m,n with constant diagonals, that is, (M)i,j = mi − j , is
m + n − 1 numbers ∈ K.
Definition 4.5.8. Toeplitz matrix
T = (t_{ij}) ∈ K^{m,n} is a Toeplitz matrix, if there is a vector u = [u_{−m+1}, …, u_{n−1}] ∈ K^{m+n−1} such that t_{ij} = u_{j−i}, 1 ≤ i ≤ m, 1 ≤ j ≤ n:

T = [ u_0       u_1     ···     ···      u_{n−1} ]
    [ u_{−1}    u_0     u_1              ⋮      ]
    [ ⋮        ⋱       ⋱       ⋱       ⋮      ]
    [ ⋮                ⋱       ⋱       u_1     ]
    [ u_{1−m}   ···     ···     u_{−1}   u_0     ] .
Given: T = (u_{j−i}) ∈ K^{m,n}, a Toeplitz matrix with generating vector u = [u_{−m+1}, …, u_{n−1}]^⊤ ∈ K^{m+n−1}, see Def. 4.5.8.
To motivate the approach we realize that we have already encountered Toeplitz matrices in the convolution of finite signals discussed in Rem. 4.1.17, see (4.1.18). The trick introduced in Rem. 4.1.40 was to extend the generating vector to

c_j = u_j  for j = −m + 1, …, n − 1 ,   c_j = 0  for j = n ,   + periodic extension.
From (4.5.9) it is clear how to implement the matrix×vector product for the Toeplitz matrix T: pad x with zeros and multiply with the extended circulant matrix C; the leading components of C·[x; 0] give Tx (a code sketch follows below).
Computational effort for computing Tx: O((n + m) log(m + n)) (FFT based, Section 4.3).
This is almost optimal in light of the data complexity O(m + n) of a Toeplitz matrix.
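A minimal sketch of this circulant-embedding matrix×vector product, reusing the pconvfft helper from above (the function name toepmatvec and the storage convention for u are chosen here for illustration):

#include <Eigen/Dense>

// y = T*x for the Toeplitz matrix T in K^{m,n} given by its generating vector
// u = [u_{-m+1},...,u_{n-1}] (length m+n-1, u_0 stored at position m-1), via
// embedding into a circulant matrix of size N = m+n; cost O((m+n) log(m+n)).
Eigen::VectorXcd toepmatvec(const Eigen::VectorXcd &u, long m, long n,
                            const Eigen::VectorXcd &x) {
  const long N = m + n;
  Eigen::VectorXcd c = Eigen::VectorXcd::Zero(N);        // first column of C
  for (long l = 0; l < m; ++l) c(l) = u(m - 1 - l);      // c_l = u_{-l}
  for (long k = 1; k < n; ++k) c(N - k) = u(m - 1 + k);  // c_{N-k} = u_k
  Eigen::VectorXcd xpad = Eigen::VectorXcd::Zero(N);
  xpad.head(n) = x;                                      // zero padding
  return pconvfft(c, xpad).head(m);                      // first m entries = T*x
}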
Note that the symmetry of a Toeplitz matrix is induced by the property u−k = uk of its generating vector.
Task: Find an efficient solution algorithm for the LSE Tx = b, b ∈ C n , the Yule-Walker problem
from 4.5.3.
Define:
✦ T_k := (u_{j−i})_{i,j=1}^{k} ∈ K^{k,k} (left upper block of T)  ➣ T_k is an s.p.d. Toeplitz matrix,
✦ x^k ∈ K^k :  T_k x^k = [b_1, …, b_k]^⊤  ⇔  x^k = T_k^{−1} b^k ,
✦ u^k := (u_1, …, u_k)^⊤ .
Thus we can block-partition the LSE T_{k+1} x^{k+1} = b^{k+1}:

T_{k+1} x^{k+1} =
[ T_k              (u_k, …, u_1)^⊤ ] [ x̃^{k+1}       ]   [ b̃^{k+1} ]
[ (u_k, …, u_1)    1                ] [ x^{k+1}_{k+1}  ] = [ b_{k+1}  ] ,   b̃^{k+1} := [b_1, …, b_k]^⊤ ,   (4.5.10)

where x̃^{k+1} ∈ K^k collects the first k components of x^{k+1} (here the diagonal entry is normalized, u_0 = 1).
Now recall block Gaussian elimination/block-LU decomposition from Rem. 2.3.14, Rem. 2.3.34. They teach us how to eliminate x̃^{k+1} and obtain an expression for x^{k+1}_{k+1}.
To state the formulas concisely, we introduce the reversing permutation P_k : K^k → K^k, (P_k x)_i := x_{k+1−i}, and the auxiliary vector y^k := T_k^{−1} P_k u^k. Block elimination yields

x̃^{k+1} = T_k^{−1}( b̃^{k+1} − x^{k+1}_{k+1} P_k u^k ) = x^k − x^{k+1}_{k+1} T_k^{−1} P_k u^k ,   (4.5.12)
x^{k+1}_{k+1} = b_{k+1} − P_k u^k · x̃^{k+1} = b_{k+1} − P_k u^k · x^k + x^{k+1}_{k+1} P_k u^k · T_k^{−1} P_k u^k ,

and hence

x^{k+1} = [ x̃^{k+1} ; x^{k+1}_{k+1} ]   with   x^{k+1}_{k+1} = (b_{k+1} − P_k u^k · x^k)/σ_k ,   x̃^{k+1} = x^k − x^{k+1}_{k+1} y^k ,   σ_k := 1 − P_k u^k · y^k .   (4.5.13)
Below: Levinson algorithm for the solution of the Yule-Walker problem Tx = b with an s.p.d. Toeplitz
matrix described by its generating vector u (recursive, un+1 not used!)
Linear recursion: Computational cost ∼ (n − k) on level k, k = 0, . . . , n − 1
➣ Asymptotic complexity O ( n2 )
12  Eigen::VectorXd xk, yk;
13  levinson(u.head(k), b.head(k), xk, yk);
14
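For reference, a self-contained (iterative rather than recursive) sketch of the recursion (4.5.13): it assumes, as above, an s.p.d. Toeplitz matrix with normalized diagonal u_0 = 1 described by u = [u_1, …, u_n]^⊤; the name levinson_it is chosen for illustration and this is not one of the lecture codes.

#include <Eigen/Dense>

// Solve T x = b for (T)_{ij} = u_{|i-j|}, u_0 = 1, u = [u_1,...,u_n]; O(n^2).
// Besides x^k the auxiliary vector y^k = T_k^{-1} P_k u^k is updated as well.
Eigen::VectorXd levinson_it(const Eigen::VectorXd &u, const Eigen::VectorXd &b) {
  const long n = b.size();
  Eigen::VectorXd x(1), y(1);
  x(0) = b(0);                 // x^1 = b_1
  y(0) = u(0);                 // y^1 = u_1
  for (long k = 1; k < n; ++k) {
    const Eigen::VectorXd uk = u.head(k);
    const double sigma = 1.0 - uk.reverse().dot(y);       // sigma_k
    const double xi = (b(k) - uk.reverse().dot(x)) / sigma; // new last entry of x
    Eigen::VectorXd xnew(k + 1), ynew(k + 1);
    xnew.head(k) = x - xi * y;                             // (4.5.13)
    xnew(k) = xi;
    const double alpha = (u(k) - uk.dot(y)) / sigma;       // update of y^k
    ynew(0) = alpha;
    ynew.tail(k) = y - alpha * y.reverse();
    x = xnew; y = ynew;
  }
  return x;
}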
FFT-based algorithms for solving Tx = b with asymptotic complexity O(n log³ n) exist [?] !
Supplementary reading. [?, Sect. 8.5]: Very detailed and elementary presentation, but the
discrete Fourier transform through trigonometric interpolation, which is not covered in this chapter.
Hardly addresses discrete convolution.
[?, Ch. IX] presents the topic from a mathematical point of view, stressing approximation and trigonometric interpolation. Good reference for algorithms for circulant and Toeplitz matrices.
[?, Ch. 10] also discusses the discrete Fourier transform with emphasis on interpolation and (least
squares) approximation. The presentation of signal processing differs from that of the course.
There is a vast number of books and survey papers dedicated to discrete Fourier transforms, see,
for instance, [?, ?]. Issues and technical details way beyond the scope of the course are discussed
in these monographs.
Contents
5.1 Abstract interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
5.2.1 Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
5.2.2 Polynomial Interpolation: Theory . . . . . . . . . . . . . . . . . . . . . . . . 366
5.2.3 Polynomial Interpolation: Algorithms . . . . . . . . . . . . . . . . . . . . . . 370
5.2.3.1 Multiple evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . 370
5.2.3.2 Single evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
5.2.3.3 Extrapolation to zero . . . . . . . . . . . . . . . . . . . . . . . . . . 378
5.2.3.4 Newton basis and divided differences . . . . . . . . . . . . . . . . 381
5.2.4 Polynomial Interpolation: Sensitivity . . . . . . . . . . . . . . . . . . . . . . 386
5.3 Shape preserving interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
5.3.1 Shape properties of functions and data . . . . . . . . . . . . . . . . . . . . . 390
5.3.2 Piecewise linear interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
5.4 Cubic Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
5.4.1 Definition and algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
5.4.2 Local monotonicity preserving Hermite interpolation . . . . . . . . . . . . . 399
5.5 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
5.5.1 Cubic spline interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
5.5.2 Structural properties of cubic spline interpolants . . . . . . . . . . . . . . . . 407
5.5.3 Shape Preserving Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . 410
5.6 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
5.6.1 Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.6.2 Reduction to Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . 419
5.6.3 Equidistant Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . 421
5.7 Least Squares Data Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
The task of (one-dimensional, scalar) data interpolation (point interpolation) can be described as follows: given data points (t_i, y_i), i = 0, …, n, with mutually different nodes t_i ∈ I ⊂ R and values y_i ∈ R, find a function f : I → R satisfying the interpolation conditions
f(t_i) = y_i ,  i = 0, …, n.
The function f we find is called the interpolant of the given data set {(t_i, y_i)}_{i=0}^{n}.
For ease of presentation we will usually assume that the nodes are ordered: t_0 < t_1 < ··· < t_n and [t_0, t_n] ⊂ I. However, algorithms must often not take sorted nodes for granted.
A natural generalization is data interpolation with vector-valued data values, seeking a function f : I →
R d , d ∈ N, such that, for given data points (ti , yi ), ti ∈ I mutually different, yi ∈ R d , it satisfies the
interpolation conditions f(ti ) = yi , i = 0, . . . , n.
In this case all methods available for scalar data can be applied component-wise.
An important application is curve reconstruction, that is, the interpolation of points y_0, …, y_n ∈ R² in the plane (Fig. 165). A particular aspect of this problem is that the nodes t_i also have to be found, usually from the location of the y_i in a preprocessing step.
In many applications (computer graphics, computer vision, numerical method for partial differential equa-
tions, remote sensing, geodesy, etc.) one has to reconstruct functions of several variables.
Significant additional challenges arise in a genuine multidimensional setting. A treatment is beyond the
scope of this course. However, the one-dimensional techniques presented in this chapter are relevant
even for multi-dimensional data interpolation, if the points xi ∈ R m are points of a finite lattice also called
tensor product grid.
For instance, for m = 2 this is the case, if
{x_i}_i = { [t_k, s_l]^⊤ ∈ R² : k ∈ {0, …, K}, l ∈ {0, …, L} } ,   (5.1.3)
(Fig. 166: interactive demo comparing different interpolants — piecewise linear, polynomial, spline, pchip — of the same data.)
Interpolants can have vastly different properties; the various methods to build interpolants and their different properties will become apparent.
Imagine that t, y correspond to the voltage U and current I measured for a 2-port non-linear circuit
element (like a diode). This element will be part of a circuit, which we want to simulate based on nodal
analysis as in Ex. 8.0.1. In order to solve the resulting non-linear system of equations F(u) = 0 for the
nodal potentials (collected in the vector u) by means of Newton’s method (→ Section 8.4) we need the
voltage-current relationship for the circuit element as a continuously differentiable function I = f (U ).
(∗) Meaning of attribute “accurate”: justification for interpolation. If measured values yi were affected by
considerable errors, one would not impose the interpolation conditions (??), but opt for data fitting (→
Section 5.7).
Rather, in the context of numerical methods, “function” should be read as “subroutine”, a piece of code that
can, for any x ∈ I , compute f ( x ) in finite time. Even this has to be qualified, because we can only pass
machine numbers x ∈ I ∩ M (→ § 1.5.12) and, of course, in most cases, f ( x ) will be an approximation.
In a C++ code a simple real valued function can be incarnated through a function object of a type as given
in Code 5.1.7, see also Section 0.2.3.
4  public:
5    // Constructor: expects information for specifying the function
6    Function( /* ... */ );
7    // Evaluation operator
8    double operator()(double t) const;
9  };
Of course, the basis functions b j should be “simple” in the sense that b j ( x ) can be computed efficiently for
every x ∈ I and every j = 0, . . . , m.
Note that the basis functions may depend on the nodes ti , but they must not depend on the values yi .
➙ The internal representation of f (in the data member section of the class Function from Code 5.1.7)
will then boil down to storing the coefficients/parameters c j , j = 0, . . . , m.
Note: The focus in this chapter will be on the special case that the data interpolants belong to a finite-
dimensional space of functions spanned by “simple” basis functions.
Recall: A linear function in 1D is a function of the form x 7→ a + bx, a, b ∈ R (polynomial of degree 1).
Piecewise linear interpolation  ➣  interpolating polygon.
(Fig. 168: the polygon through data points at the nodes t_0, …, t_4.)
What could be a convenient set of basis functions {b j }nj=0 for representing the piecewise linear interpolant
through n + 1 data points?
(Fig. 169: the "tent" basis functions b_0, b_1, …, b_n associated with the nodes t_0, …, t_n.)
Note: in Fig. 169 the basis functions have to be extended by zero outside the t-range where they are
drawn.
Explicit formulas for these basis functions can be given and bear out that they are really “simple”:
b_0(t) = { 1 − (t − t_0)/(t_1 − t_0)   for t_0 ≤ t < t_1 ,    0   for t ≥ t_1 } ,

b_j(t) = { 1 − (t_j − t)/(t_j − t_{j−1})   for t_{j−1} ≤ t < t_j ,
           1 − (t − t_j)/(t_{j+1} − t_j)   for t_j ≤ t < t_{j+1} ,
           0   elsewhere in [t_0, t_n] } ,   j = 1, …, n − 1 ,   (5.1.11)

b_n(t) = { 1 − (t_n − t)/(t_n − t_{n−1})   for t_{n−1} ≤ t < t_n ,    0   for t < t_{n−1} } .
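A minimal sketch of evaluating the piecewise linear interpolant ∑_j y_j b_j at a single point, assuming sorted nodes (the function name lerpeval is chosen here for illustration):

#include <algorithm>
#include <Eigen/Dense>

// Evaluate the piecewise linear interpolant of (t_i, y_i), i = 0..n, at x.
double lerpeval(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const long N = t.size();
  // locate the interval containing x by binary search: first node > x
  const double *up = std::upper_bound(t.data(), t.data() + N, x);
  const long j = std::max<long>(1, std::min<long>(N - 1, up - t.data()));
  // on [t_{j-1}, t_j] only b_{j-1} and b_j are nonzero, cf. (5.1.11)
  const double lambda = (x - t(j - 1)) / (t(j) - t(j - 1));
  return (1.0 - lambda) * y(j - 1) + lambda * y(j);
}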
We consider the setting for interpolation that the interpolant belongs to a finite-dimensional space V_m of functions spanned by basis functions b_0, …, b_m, see Rem. 5.1.6. Then the interpolation conditions imply the linear system of equations

∑_{j=0}^{m} c_j b_j(t_i) = y_i ,  i = 0, …, n   ⟺   A c = y ,  (A)_{ij} := b_j(t_i) .   (5.1.15)
The interpolation problem in Vm and the linear system (5.1.15) are really equivalent in the sense that
(unique) solvability of one implies (unique) solvability of the other.
If m = n and A from (5.1.15) is regular (→ Def. 2.2.1), then for any values y_j, j = 0, …, n, we can find coefficients c_j, j = 0, …, n, and from them build the interpolant according to (5.1.9):

f = ∑_{j=0}^{n} (A^{−1} y)_j b_j .   (5.1.16)
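A minimal sketch of this setup-then-evaluate pattern for an arbitrary set of basis functions (the basis is passed as a callable b(j, x); all names are illustrative):

#include <functional>
#include <Eigen/Dense>

// Setup phase: assemble (A)_{ij} = b_j(t_i) and solve A c = y, cf. (5.1.15)
Eigen::VectorXd interpCoeffs(const Eigen::VectorXd &t, const Eigen::VectorXd &y,
                             const std::function<double(int, double)> &b) {
  const int n = t.size() - 1;
  Eigen::MatrixXd A(n + 1, n + 1);
  for (int i = 0; i <= n; ++i)
    for (int j = 0; j <= n; ++j) A(i, j) = b(j, t(i));
  return A.lu().solve(y);
}

// Evaluation phase: f(x) = sum_j c_j b_j(x), cf. (5.1.16)
double interpEval(const Eigen::VectorXd &c,
                  const std::function<double(int, double)> &b, double x) {
  double s = 0.0;
  for (int j = 0; j < c.size(); ++j) s += c(j) * b(j, x);
  return s;
}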
For fixed nodes t_i the interpolation problem (5.1.14) defines a linear mapping I : R^{n+1} → V_n, y ↦ f.

An interpolation operator I : R^{n+1} → C^0([t_0, t_n]) for the given nodes t_0 < t_1 < ··· < t_n is called linear, if
I(αy + βz) = α I(y) + β I(z)   for all y, z ∈ R^{n+1}, α, β ∈ R .
✎ Notation: C^0([t_0, t_n]) ≙ vector space of continuous functions on [t_0, t_n]
If a constitutive relationship for a circuit element is needed in a C++ simulation code (→ Ex. 5.1.5), the
following data type could be used to represent it:
✦ Constructor: “setup phase”, e.g. building and solving linear system of equations (5.1.15)
✦ Evaluation operator, e.g., implemented as evaluation of linear combination (5.1.9)
Crucial issue:
computational effort for evaluation of interpolant at single point: O(1) or O(n) (or in
between)?
(Global) polynomial interpolation, that is, interpolation into spaces of functions spanned by polynomials
up to a certain degree, is the simplest interpolation scheme and of great importance as building block for
more complex algorithms.
5.2.1 Polynomials
P_k := { t ↦ α_k t^k + α_{k−1} t^{k−1} + ··· + α_1 t + α_0 ,  α_j ∈ R }   (5.2.1)
(α_k is the leading coefficient).
Obvious: Pk is a vector space, see [?, Sect. 4.2, Bsp. 4]. What is its dimension?
dim Pk = k + 1 and P k ⊂ C ∞ (R ).
Polynomials (of degree k) in monomial representation are stored as a vector of their coefficients a j , j =
0, . . . , k. A convention for the ordering has to be fixed. For instance, M ATLAB functions expect a monomial
representation through a vector of their monomial coefficients in descending order:
The following code gives an implementation based on vector data types of E IGEN. The function is vector-
ized in the sense that many evaluation points are processed in parallel.
Supplementary reading. This topic is also presented in [?, Sect. 8.2.1], [?, Sect. 8.1], [?,
Ch. 10].
Now we consider the interpolation problem introduced in Section 5.1 for the special case that the sought
interpolant belongs to the polynomial space Pk (with suitable degree k).
Given the simple nodes t_0, …, t_n, n ∈ N, −∞ < t_0 < t_1 < ··· < t_n < ∞ and the values y_0, …, y_n ∈ R, compute p ∈ P_n such that p(t_j) = y_j for j = 0, …, n.
Is this a well-defined problem? Obviously, it fits the framework developed in Rem. 5.1.6 and § 5.1.13,
because Pn is a finite-dimensional space of functions, for which we already know a basis, the monomi-
als. Thus, in principle, we could examine the matrix A from (5.1.15) to decide, whether the polynomial
interpolant exists and is unique. However, there is a shorter way.
Recall the Kronecker symbol  δ_{ij} = 1 if i = j, and δ_{ij} = 0 else.
From this relationship we infer that the Lagrange polynomials are linearly independent. Since there are
n + 1 = dim Pn different Lagrange polynomials, we conclude that they form a basis of Pn , which is a
cardinal basis for the node set {ti }in=0 .
Consider the equidistant nodes in [−1, 1]:   T := { t_j = −1 + (2/n) j ,  j = 0, …, n } .
(Fig. 170: the Lagrange polynomials L_0, L_2 and L_5 for these nodes, associated with the nodes t_0, t_2 and t_5, respectively.)
The Lagrange polynomial interpolant p for data points (ti , yi )in=0 allows a straightforward representation
with respect to the basis of Lagrange polynomials for the node set {ti }in=0 :
p(t) = ∑_{i=0}^{n} y_i L_i(t)   ⟺   p ∈ P_n and p(t_i) = y_i , i = 0, …, n .   (5.2.13)
Known from linear algebra: for a linear mapping T : V 7→ W between finite-dimensional vector spaces
with dim V = dim W holds the equivalence
T surjective ⇔ T bijective ⇔ T injective.
Applying this equivalence to evalT yields the assertion of the theorem
✷
Lagrangian polynomial interpolation leads to linear systems of equations also for the representation coefficients of the polynomial interpolant in the monomial basis, see § 5.1.13:

p(t_j) = y_j   ⟺   ∑_{i=0}^{n} a_i t_j^i = y_j ,  j = 0, …, n
        ⟺   solution of the (n + 1) × (n + 1) linear system V a = y with the (Vandermonde) matrix

V = [ 1   t_0   t_0²   ···   t_0^n ]
    [ 1   t_1   t_1²   ···   t_1^n ]
    [ 1   t_2   t_2²   ···   t_2^n ]
    [ ⋮   ⋮    ⋮     ⋱    ⋮     ]
    [ 1   t_n   t_n²   ···   t_n^n ]   ∈ R^{n+1,n+1} .   (5.2.18)
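A minimal sketch of monomial-basis interpolation via (5.2.18) (illustrative only; the well-known ill-conditioning of V for larger n is a caveat):

#include <Eigen/Dense>

// Monomial coefficients a of the interpolating polynomial: solve V a = y,
// with the Vandermonde matrix from (5.2.18); a(i) multiplies t^i.
Eigen::VectorXd polycoeffs(const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  const int n = t.size() - 1;
  Eigen::MatrixXd V = Eigen::MatrixXd::Ones(n + 1, n + 1);
  for (int j = 1; j <= n; ++j)
    V.col(j) = V.col(j - 1).cwiseProduct(t);   // column j holds the powers t_i^j
  return V.lu().solve(y);
}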
Remark 5.2.21 (Generalized polynomial interpolation → [?, Sect. 8.2.7], [?, Sect. 8.4])
The following generalization of Lagrange interpolation is possible: We still seek a polynomial interpolant,
but beside function values also prescribe derivatives up to a certain order for interpolating polynomial at
given nodes.
Convention: indicate occurrence of derivatives as interpolation conditions by multiple nodes.
Generalized polynomial interpolation problem
Given the (possibly multiple) nodes t_0, …, t_n, n ∈ N, −∞ < t_0 ≤ t_1 ≤ ··· ≤ t_n < ∞ and the values y_0, …, y_n ∈ R compute p ∈ P_n such that

(d^k/dt^k) p(t_j) = y_j   for k = 0, …, l_j and j = 0, …, n ,   (5.2.22)

where l_j := max{ i − i′ : t_j = t_i = t_{i′} , i, i′ = 0, …, n } is the multiplicity of the node t_j.
The most important case of generalized Lagrange interpolation is when all the multiplicities are equal to 2. It is called Hermite interpolation (or osculatory interpolation) and the generalized interpolation conditions read for nodes t_0 = t_1 < t_2 = t_3 < ··· < t_{n−1} = t_n (note the double nodes!) [?, Ex. 8.6]:
The generalized polynomial interpolation problem Eq. (5.2.22) admits a unique solution p ∈ Pn .
The generalized Lagrange polynomials for the nodes T = {t j }nj=0 ⊂ R (multiple nodes allowed)
are defined as Li := IT (ei +1 ), i = 0, . . . , n, where ei = (0, . . . , 0, 1, 0, . . . , 0)T ∈ R n+1 are the
unit vectors.
Note: The linear interpolation operator IT in this definition refers to generalized Lagrangian interpolation.
Its existence is guaranteed by Thm. 5.2.23.
T = { t_0 = 0, t_1 = 0, t_2 = 1, t_3 = 1 } .
(Figure "Cubic Hermite Polynomials": the plot shows the four unique generalized Lagrange polynomials p_0, p_1, p_2, p_3 of degree n = 3 for these nodes.)
More details are given in Section 5.4. For explicit formulas for the polynomials see (5.4.5).
Now we consider the algorithmic realization of Lagrange interpolation as introduced in Section 5.2.2. The setting is as follows:
When used in a numerical code, different demands can be made for a routine that implements Lagrange
interpolation. They determine, which algorithm is most suitable.
The member function eval(y,x) expects n data values in y and (any number of) evaluation points in x (↔ [x_1, …, x_N]^⊤) and returns the vector [p(x_1), …, p(x_N)]^⊤, where p is the Lagrange polynomial interpolant.
An implementation directly based on the evaluation of Lagrange polynomials (5.2.11) and (5.2.13) would
incur an asymptotic computational effort of O(n2 N ) for every single invocation of eval and large n, N .
By means of pre-calculations the asymptotic effort for eval can be reduced substantially. Writing the Lagrange polynomials as L_i(t) = λ_i ∏_{j≠i} (t − t_j) with

λ_i = 1 / ( (t_i − t_0) ··· (t_i − t_{i−1})(t_i − t_{i+1}) ··· (t_i − t_n) ) ,  i = 0, …, n ,

and using that the L_i sum up to 1, one obtains the barycentric interpolation formula

p(t) = ( ∑_{i=0}^{n} (λ_i/(t − t_i)) y_i ) / ( ∑_{i=0}^{n} λ_i/(t − t_i) ) ,

with the weights λ_i independent of the evaluation point t and of the data values y_i → precompute!

The following C++ class demonstrates the use of the barycentric interpolation formula for efficient multiple point evaluation of a Lagrange interpolation polynomial:
9   for (unsigned i = 0; i < N; ++i) {
10    nodeVec_t z = (x(i) * nodeVec_t::Ones(n) - t);
11
As an exception, the test for equality with zero in Line 13 is admissible here. If x(i) is almost equal to a node t_j, then the corresponding entry of the vector mu will be huge and the value of the barycentric sum will almost agree with p(i), the same value that is returned if the corresponding entry of z is exactly zero. Hence, it does not matter if the test returns false because of small perturbations of the values.
Task: Given a set of interpolation points (t j , y j ), j = 0, . . . , n, with pairwise different interpolation nodes
t j , perform a single point evaluation of the Lagrange polynomial interpolant p at x ∈ R.
We discuss the efficient implementation of the following function for n ≫ 1. It is meant for a single
evaluation of a Lagrange interpolant.
double eval( const Eigen::VectorXd &t, const Eigen::VectorXd &y,
double x);
The starting point is a recursion formula for partial Lagrange interpolants: for 0 ≤ k ≤ ℓ ≤ n define p_{k,ℓ} ∈ P_{ℓ−k} as the unique polynomial interpolating the data points (t_j, y_j), j = k, …, ℓ. Then

p_{k,ℓ}(x) = ( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k) ,   (5.2.34)

because the left and right hand sides represent polynomials of degree ℓ − k through the points (t_j, y_j), j = k, …, ℓ.
Thus the values of the partial Lagrange interpolants can be computed sequentially and their dependencies
can be expressed by the following so-called Aitken-Neville scheme:
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x ) (ANS)
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Here, the arrows indicate contributions to the convex linear combinations of (5.2.34). The computation
can advance from left to right, which is done in following C++ code.
The vector y contains the columns of the above triangular tableaux in turns from left to right.
Asymptotic complexity of ANipoleval in terms of the number of data points: O(n²) (two nested loops).
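A minimal sketch of such an Aitken-Neville single-point evaluation (an illustrative version, not necessarily identical to the lecture's Code 5.2.35):

#include <Eigen/Dense>

// Value of the Lagrange interpolant through (t_i, y_i) at x, computed via the
// Aitken-Neville scheme (ANS); y is copied and overwritten in place, cost O(n^2).
double ANipoleval(const Eigen::VectorXd &t, Eigen::VectorXd y, double x) {
  const int n = y.size() - 1;
  for (int i = 1; i <= n; ++i)
    for (int k = i - 1; k >= 0; --k)
      // p_{k,i}(x) from p_{k+1,i}(x) and p_{k,i-1}(x), cf. (5.2.34)
      y(k) = y(k + 1) + (y(k + 1) - y(k)) * (x - t(i)) / (t(i) - t(k));
  return y(0);
}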
The Aitken-Neville algorithm has another interesting feature, when we run through the Aitken-Neville
scheme from the top left corner:
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x )
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Thus, the values of partial polynomial interpolants at x can be computed before all data points are even
processed. This results in an “update-friendly” algorithm that can efficiently supply the point values p0,k ( x ),
k = 0, . . . , n, while being supplied with the data points (ti , yi ). It can be used for the efficient implemen-
tation of the following interpolator class:
1  #include <cmath>
2  #include <vector>
3  #include <Eigen/Dense>
4  #include <figure/figure.hpp>
5  // ----- include timer library
6  #include "timer.h"
7  // ----- includes for Interpolation functions
8  #include "ANipoleval.hpp"
9  #include "ipolyeval.hpp"
10 #include "intpolyval.hpp"
11 #include "intpolyval_lag.hpp"
12
13 /**
14  * Benchmarking 4 different interpolation attempts:
15  * - Aitken-Neville       - Barycentric formula
16  * - Polyfit + Polyval    - Lagrange polynomials
17  **/
18 int main() {
19   // function to interpolate
20   auto f = [](const Eigen::VectorXd &x) { return x.cwiseSqrt(); };
21
22   const unsigned min_deg = 3, max_deg = 200;
23
24   Eigen::VectorXd buffer;
25   std::vector<double> t1, t2, t3, t4, N;
26
27   // Number of repeats for each eval
28   const int repeats = 100;
29
30   // n = increasing polynomial degree
31   for (unsigned n = min_deg; n <= max_deg; n++) {
32
33     const Eigen::VectorXd t = Eigen::VectorXd::LinSpaced(n, 1, n),
34                           y = f(t);
35
36     // ANipoleval takes a double as argument
37     const double x = n * drand48(); // drand48 returns random double in [0, 1]
38     // all other functions take a vector as argument
39     const Eigen::VectorXd xv = n * Eigen::VectorXd::Random(1);
40
41     std::cout << "Degree = " << n << "\n";
42     Timer aitken, ipol, intpol, intpol_lag;
43
44     // do the same many times and choose the best result
45     // Aitken-Neville -----------------------
46     aitken.start();
47     for (unsigned i = 0; i < repeats; ++i) {
This uses functions given in Code 5.2.32, Code 5.2.35 and the function polyfit (with a clearly greater
computational effort !)
polyfit is the equivalent to M ATLAB’s built-in polyfit. The implementation can be found on GitLab.
C++-code 5.2.40: Polynomial evaluation using polyfit
Extrapolation is the same as interpolation but the evaluation point t is outside the interval
[inf j=0,...,n t j , sup j=0,...,n t j ]. In the sequel we assume t = 0, ti > 0.
Of course, Lagrangian interpolation can also be used for extrapolation. In this section we give a very
important application of this “Lagrangian extrapolation”.
Task: compute the limit limh→0 ψ(h) with prescribed accuracy, though the evaluation of the function
ψ = ψ(h) (maybe given in procedural form only) for very small arguments |h| ≪ 1 is difficult,
usually because of numerically instability (→ Section 1.5.5).
The extrapolation technique introduced below works well, if
In Ex. 1.5.45 we have already seen a situation, where we wanted to compute the limit of a function ψ(h) for
h → 0, but could not do it with sufficient accuracy. In this case ψ(h) was a one-sided difference quotient
with span h, meant to approximate f ′ ( x ) for a differentiable function f . The cause of numerical difficulties
was cancellation → § 1.5.43.
Now we will see how to dodge cancellation in difference quotients and how to use extrapolation to zero to
computes derivatives with high accuracy:
df/dx (x) ≈ ( f(x + h) − f(x − h) ) / (2h) .   (5.2.44)

A straightforward implementation fails due to cancellation in the numerator, see also Ex. 1.5.45.
  h       f(x) = arctan(x)      f(x) = √x             f(x) = exp(x)
          relative error        relative error        relative error
 2^-1     0.20786640808609      0.09340033543136      0.29744254140026
 2^-6     0.00773341103991      0.00352613693103      0.00785334954789
 2^-11    0.00024299312415      0.00011094838842      0.00024418036620
 2^-16    0.00000759482296      0.00000346787667      0.00000762943394
 2^-21    0.00000023712637      0.00000010812198      0.00000023835113
 2^-26    0.00000001020730      0.00000001923506      0.00000000429331
 2^-31    0.00000005960464      0.00000001202188      0.00000012467100
 2^-36    0.00000679016113      0.00000198842224      0.00000495453865
Recall the considerations elaborated in Ex. 1.5.45. Owing to the impact of roundoff errors amplified by
cancellation, h → 0 does not achieve arbitrarily high accuracy. Rather, we observe fewer correct digits for
very small h!
Extrapolation offers a numerically stable (→ Def. 1.5.85) alternative, because for a 2(n + 1)-times con-
tinuously differentiable function f : I ⊂ R 7→ R , x ∈ I we find that the symmetric difference quotient
behaves like a polynomial in h2 in the vicinity of h = 0. Consider Taylor sum of f in x with Lagrange
remainder term:

ψ(h) := ( f(x + h) − f(x − h) ) / (2h) ∼ f′(x) + ∑_{k=1}^{n} (1/(2k+1)!) f^{(2k+1)}(x) h^{2k} + (1/(2n+2)!) f^{(2n+2)}(ξ(x)) h^{2n+1} .
While the extrapolation table (→ § 5.2.36) is computed, more and more accurate approximations of f ′ ( x )
become available. Thus, the difference between the two last approximations can be used to gauge the
error of the current approximation, it provides an error indicator, which can be used to decide when the
level of extrapolation is sufficient, see Line 25.
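The extrapolation code referred to above is not reproduced here; the following is a minimal illustrative sketch of the same idea (extrapolation of the symmetric difference quotient (5.2.44) to h = 0 via the Aitken-Neville update (5.2.34), with a simple error indicator), not the lecture's listing:

#include <cmath>
#include <functional>
#include <vector>

// Approximate f'(x): extrapolate psi(h_i), h_i = h0*2^{-i}, to h = 0, using
// h_i^2 as interpolation nodes (psi expands in powers of h^2). Stop when the
// two best entries of the current extrapolation column agree.
double diffex(const std::function<double(double)> &f, double x, double h0,
              double rtol = 1e-11, int maxit = 12) {
  std::vector<double> s, t;                // extrapolation column and nodes h_i^2
  for (int i = 0; i < maxit; ++i) {
    const double h = h0 * std::pow(2.0, -i);
    t.push_back(h * h);
    s.push_back((f(x + h) - f(x - h)) / (2.0 * h));  // psi(h_i)
    for (int k = i - 1; k >= 0; --k)       // Aitken-Neville update at 0
      s[k] = s[k + 1] - (s[k + 1] - s[k]) * t[i] / (t[i] - t[k]);
    if (i > 0 && std::abs(s[0] - s[1]) < rtol * std::abs(s[0]))
      break;                               // error indicator small enough
  }
  return s[0];
}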
Supplementary reading. We also refer to [?, Sect. 8.2.4], [?, Sect. 8.2].
In § 5.2.33 we have seen a method to evaluate partial polynomial interpolants for a single or a few evalua-
tion points efficiently. Now we want to do this for many evaluation points that may not be known when we
receive information about the first interpolation points.
The challenge: Both addPoint() and the evaluation operator may be called many times and the imple-
mentation has to remain efficient under these circumstances.
Why not use the techniques from § 5.2.27? Drawback of the Lagrange basis or barycentric formula: adding
another data point affects all basis polynomials/all precomputed values!
The abstract considerations of § 5.1.13 still apply and we get a linear system of equations for the coeffi-
cients a j of the polynomial interpolant in Newton basis:
a j ∈ R: a0 N0 (t j ) + a1 N1 (t j ) + · · · + an Nn (t j ) = y j , j = 0, . . . , n .
a_0 = y_0 ,
a_1 = (y_1 − a_0)/(t_1 − t_0) = (y_1 − y_0)/(t_1 − t_0) ,
a_2 = ( y_2 − a_0 − (t_2 − t_0) a_1 ) / ( (t_2 − t_0)(t_2 − t_1) )
    = ( y_2 − y_0 − (t_2 − t_0)·(y_1 − y_0)/(t_1 − t_0) ) / ( (t_2 − t_0)(t_2 − t_1) )
    = ( (y_2 − y_0)/(t_2 − t_0) − (y_1 − y_0)/(t_1 − t_0) ) / (t_2 − t_1) ,
  ⋮
In order to reveal the pattern, we turn to a new interpretation of the coefficients a_j of the interpolating polynomials in Newton basis:

a_j is the leading coefficient of the interpolating polynomial p_{0,j} .
(the notation pℓ,m for partial polynomial interpolants through the data points (tℓ , yℓ ), . . . , (tm , ym ) was
introduced in Section 5.2.3.2, see (5.2.34))
➣ Recursion (5.2.34) implies a recursion for the leading coefficients a_{ℓ,m} of the interpolating polynomials p_{ℓ,m}, 0 ≤ ℓ ≤ m ≤ n:

a_{ℓ,m} = ( a_{ℓ+1,m} − a_{ℓ,m−1} ) / (t_m − t_ℓ) .   (5.2.51)
Hence, instead of using elimination for a triangular linear system, we find a simpler and more efficient algorithm using the so-called divided differences:

y[t_i] = y_i ,
y[t_i, …, t_{i+k}] = ( y[t_{i+1}, …, t_{i+k}] − y[t_i, …, t_{i+k−1}] ) / (t_{i+k} − t_i)   (recursion)   (5.2.52)
Recursive calculation by divided differences scheme, cf. Aitken-Neville scheme, Code 5.2.35:
t0 y [ t0 ]
> y [ t0 , t1 ]
t1 y [ t1 ] > y [ t0 , t1 , t2 ]
> y [ t1 , t2 ] > y [ t0 , t1 , t2 , t3 ] , (5.2.54)
t2 y [ t2 ] > y [ t1 , t2 , t3 ]
> y [ t2 , t3 ]
t3 y [ t3 ]
The elements can be computed from left to right, every “>” indicates the evaluation of the recursion
formula (5.2.52).
However, we can again resort to the idea of § 5.2.36 and traverse (5.2.54) along the diagonals from top to
bottom: If a new datum (tn+1 , yn+1 ) is added, it is enough to compute the n + 2 new terms
y [ t n + 1 ] , y [ t n , t n + 1 ] , . . . , y [ t0 , . . . , t n + 1 ] .
The following MATLAB code computes divided differences for data points (t_i, y_i), i = 0, …, n, in this fashion. It is implemented by recursion to elucidate the successive use of data points. The divided differences y[t_0], y[t_0, t_1], …, y[t_0, …, t_n] are accumulated in the vector y.
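A C++ sketch in the same recursive fashion (illustrative, not necessarily identical to the lecture's Code 5.2.55):

#include <Eigen/Dense>

// Recursive computation of divided differences: on return, y(j) holds
// y[t_0,...,t_j], the j-th Newton coefficient; only the first j+1 data points
// enter y(j). Cost O(n^2).
void divdiff(const Eigen::VectorXd &t, Eigen::VectorXd &y) {
  const int n = y.size() - 1;
  if (n <= 0) return;
  Eigen::VectorXd th = t.head(n), yh = y.head(n);
  divdiff(th, yh);            // transform the first n entries recursively
  y.head(n) = yh;
  // incorporate the last data point (t_n, y_n)
  for (int j = 0; j < n; ++j) y(n) = (y(n) - y(j)) / (t(n) - t(j));
}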
By derivation: the computed divided differences are the coefficients of the interpolating polynomial in Newton basis:

p(t) = a_0 + a_1 (t − t_0) + a_2 (t − t_0)(t − t_1) + ··· + a_n ∏_{j=0}^{n−1} (t − t_j)   (5.2.56)
a0 = y [ t0 ] , a1 = y [ t0 , t1 ] , a2 = y [ t0 , t1 , t2 ] , . . . .
Thus, Code 5.2.55 computes the coefficients a j , j = 0, . . . , n, of the polynomial interpolant with respect
to the Newton basis. It uses only the first j + 1 data points to find a j .
“Backward evaluation” of p(t) in the spirit of Horner’s scheme (→ Rem. 5.2.5, [?, Alg. 8.20]):
p ← an , p ← (t − tn −1 ) p + an −1 , p ← (t − tn −2 ) p + an −2 , ....
13  // evaluate
14  VectorXd ones = VectorXd::Ones(x.size());
15  p = coeffs(n) * ones;
16  for (int j = n - 1; j >= 0; --j) {
17    p = (x - t(j) * ones).cwiseProduct(p) + coeffs(j) * ones;
18  }
19 }
Computational effort:
✦ O(n2 ) for computation of divided differences (“setup phase”),
✦ O(n) for every single evaluation of p(t).
(both operations can be interleaved, see Code 5.2.47)
Implementation of a C++ class supporting the efficient update and evaluation of an interpolating polynomial
making use of
• presentation in Newton basis (5.2.49),
• computation of representation coefficients through divided difference scheme (5.2.54), see Code 5.2.55,
• evaluation by means of Horner scheme, see Code 5.2.58.
7   void PolyEval::divdiff() {
8     int n = t.size();
9     for (int j = 0; j < n-1; j++) y[n-1] = ((y[n-1] - y[j]) / (t[n-1] - t[j]));
10  }
11
If y_0, …, y_n are the values of a smooth function f in the points t_0, …, t_n, that is, y_j := f(t_j), then

y[t_i, …, t_{i+k}] = f^{(k)}(ξ) / k!

for a certain ξ ∈ [t_i, t_{i+k}], see [?, Thm. 8.21].
This section addresses a major shortcoming of polynomial interpolation in case the interpolation knots ti
are imposed, which is usually the case when given data points have to be interpolated, cf. Ex. 5.1.5.
Nodes: T := { t_j = −5 + (10/n) j }_{j=0}^{n} (equidistant in [−5, 5]), values y_j = 1/(1 + t_j²), j = 0, …, n.
(Figure: data points y_j and polynomial interpolant p(t).)
In Section 2.2.2 we introduced the concept of sensitivity to describe how perturbations of the data affect the
output for a problem map as defined in § 1.5.67. Concretely, in Section 2.2.2 we discussed the sensitivity
for linear systems of equations. Motivated by Ex. 5.2.63, we now examine the sensitivity of Lagrange
interpolation with respect to perturbations in the data values.
Thus, the (pointwise) sensitivity of polynomial interpolation will tell us to what extent perturbations in the y-
data will affect the values of the interpolating function somewhere else. In the case of high sensitivity small
perturbations in the data can cause big variations in some function values, which is clearly undesirable.
Necessary for studying sensitivity of polynomial interpolation in quantitative terms are norms (→ Def. 1.5.70)
on the vector space of continuous functions C( I ), I ⊂ R . The following norms are the most relevant:
In § 5.1.13 we have learned that (polynomial) interpolation gives rise to a linear problem map, see
Def. 5.1.17. For this class of problem maps the investigation of sensitivity has to study operator norms, a
generalization of matrix norms (→ Def. 1.5.76).
Let L : X → Y be a linear problem map between two normed spaces, the data space X (with norm k·k X )
and the result space Y (with norm k·kY ). Thanks to linearity, perturbations of the result y := L(x) for the
input x ∈ X can be expressed as follows:
L(x + δx) = L(x) + L(δx) = y + L(δx) .
Hence, the sensitivity (in terms of propagation of absolute errors) can be measured by the operator norm
‖L‖_{X→Y} := sup_{δx ∈ X∖{0}} ‖L(δx)‖_Y / ‖δx‖_X .   (5.2.70)
This can be read as the “matrix norm of L”, cf. Def. 1.5.76.
It seems challenging to compute the operator norm (5.2.70) for L = IT (IT the Lagrange interpolation
operator for node set T ⊂ I ), X = R n+1 (equipped with a vector norm), and Y = C( I ) (endowed with a
norm from § 5.2.65). The next lemma will provide surprisingly simple concrete formulas.
‖I_T‖_{∞→∞} := sup_{y ∈ R^{n+1}∖{0}} ‖I_T(y)‖_{L^∞(I)} / ‖y‖_∞ = ‖ ∑_{i=0}^{n} |L_i| ‖_{L^∞(I)} ,   (5.2.72)

‖I_T‖_{2→2} := sup_{y ∈ R^{n+1}∖{0}} ‖I_T(y)‖_{L^2(I)} / ‖y‖_2 ≤ ( ∑_{i=0}^{n} ‖L_i‖²_{L^2(I)} )^{1/2} .   (5.2.73)
‖I_T(y)‖_{L^∞(I)} = ‖ ∑_{j=0}^{n} y_j L_j ‖_{L^∞(I)} ≤ sup_{t∈I} ∑_{j=0}^{n} |y_j| |L_j(t)| ≤ ‖y‖_∞ ‖ ∑_{i=0}^{n} |L_i| ‖_{L^∞(I)} .   ✷

Terminology: Lebesgue constant of T:   λ_T := ‖ ∑_{i=0}^{n} |L_i| ‖_{L^∞(I)} = ‖I_T‖_{∞→∞}
Lebesgue constant for uniformly spaced nodes:  λ_T ≥ C e^{n/2} .
(Fig. 174: Lebesgue constant λ_T versus polynomial degree n for Chebychev nodes and for equidistant nodes, semi-logarithmic plot; the values for equidistant nodes grow dramatically faster.)
Note: In Code 5.2.75 the norm ‖L_i‖_{L∞(I)} can be computed only approximately, by taking the maximum modulus of function values in many sampling points.
In Ex. 5.1.5 we learned that interpolation is an important technique for obtaining a mathematical (and al-
gorithmic) description of a constitutive relationship from measured data.
If the interpolation operator is poorly conditioned, tiny measurement errors will lead to big (local) deviations
of the interpolant from its “true” form.
Since measurement errors are inevitable, poorly conditioned interpolation procedures are useless for de-
termining constitutive relationships from measurements.
When reconstructing a quantitative dependence of quantities from measurements, first principles from
physics often stipulate qualitative constraints, which translate into shape properties of the function f , e.g.,
when modelling the material law for a gas:
Fig. 175
The section is about “shape preservation”. In the previous example we have already seen a few properties
that constitute the “shape” of a function: sign, monotonicity and curvature. Now we have to identify
analogous properties of data sets in the form of sequences of interpolation points (t j , y j ), j = 0, . . . , n, t j
pairwise distinct.
Convex (concave) data:
Δ_j ≤ Δ_{j+1}  (≥) ,  j = 1, ..., n−1 ,   where  Δ_j := (y_j − y_{j−1}) / (t_j − t_{j−1}) ,  j = 1, ..., n .
(Fig. 176: convex data; Fig. 177: convex function.)
Shape preservation:  convex data −→ convex interpolant f .
More ambitious goal: local shape preserving interpolation, i.e., the above implications should hold on each subinterval I' = (t_i, t_{i+j}) for the data contained in it.
We perform Lagrange interpolation for the following positive and monotonic data:
  t_i : −1.0000  −0.6400  −0.3600  −0.1600  −0.0400  0.0000  0.0770  0.1918  0.3631  0.6187  1.0000
  y_i :  0.0000   0.0000   0.0039   0.1355   0.2871  0.3455  0.4639  0.6422  0.8678  1.0000  1.0000
created by taking points on the graph of
  f(t) = 0                              if t < −2/5 ,
  f(t) = ½ (1 + cos(π(t − 3/5)))        if −2/5 < t < 3/5 ,
  f(t) = 1                              otherwise.
(Figure: interpolating polynomial of degree 10 through the measurement points, together with the natural f.)
Observations:
• Oscillations at the endpoints of the interval (see Fig. 173)
• No locality
• No positivity
• No monotonicity
• No local conservation of the curvature
There is a very simple method of achieving perfect shape preservation by means of a linear (→ § 5.1.13)
interpolation operator into the space of continuous functions:
Then the piecewise linear interpolant s : [t0 , tn ] → R is defined as, cf. Ex. 5.1.10:
s(t) = ( (t_{i+1} − t) y_i + (t − t_i) y_{i+1} ) / (t_{i+1} − t_i)   for  t ∈ [t_i, t_{i+1}] .   (5.3.8)
(Fig. 178: piecewise linear interpolant through data points over the nodes t_0, t_1, t_2, t_3, t_4.)
Piecewise linear interpolation means simply “connect the data points in R 2 using straight lines”.
Obvious: linear interpolation is linear (as mapping y 7→ s, see Def. 5.1.17) and local in the following
sense:
Equally obvious are the properties asserted in the following theorem. The local preservation of curvature is a straightforward consequence of Def. 5.3.4.
Bad news: none of these properties carries over to local polynomial interpolation of higher polynomial degree d > 1.
From Thm. 5.2.14 we know that a parabola (polynomial of degree 2) is uniquely determined by 3 data
points. Thus, the idea is to form groups of three adjacent data points and interpolate each of these triplets
by a 2nd-degree polynomial (parabola).
Assume: n = 2m even
piecewise quadratic interpolant q : [min{ti }, max{ti }] 7→ R is defined by
(Fig. 179: nodes as in Exp. 5.3.7, piecewise linear interpolant and piecewise quadratic interpolant.)
No shape preservation for the piecewise quadratic interpolant.
However: the interpolant usually serves as input for other numerical methods, like Newton's method for solving non-linear systems of equations, see Section 8.4, which requires derivatives.
Aim: construct a local, shape-preserving (→ Section 5.3) (linear?) interpolation operator that fixes the shortcoming of piecewise linear interpolation by ensuring C¹-smoothness of the interpolant.
✎ notation: C1 ([ a, b]) =
ˆ space of continuously differentiable functions [ a, b] 7→ R.
Given data points (t j , y j ) ∈ R × R , j = 0, . . . , n, with pairwise distinct ordered nodes t j , and slopes
c j ∈ R, the piecewise cubic Hermite interpolant s : [t0 , tn ] → R is defined by the requirements
s|[ti−1,ti ] ∈ P3 , i = 1, . . . , n , s ( ti ) = y i , s ′ ( ti ) = c i , i = 0, . . . , n .
Piecewise cubic Hermite interpolants are continuously differentiable on their interval of definition.
Proof. The assertion of the corollary follows from the agreement of function values and first derivative
values on nodes shared by two intervals, on each of which the piecewise cubic Hermite interpolant is a
polynomial of degree 3.
✷
Locally, we can write a piecewise cubic Hermite interpolant as a linear combination of generalized cardinal basis functions with coefficients supplied by the data values y_j and the slopes c_j:
H_1(t) := φ( (t_i − t)/h_i ) ,   H_2(t) := φ( (t − t_{i−1})/h_i ) ,
H_3(t) := −h_i ψ( (t_i − t)/h_i ) ,   H_4(t) := h_i ψ( (t − t_{i−1})/h_i ) ,
h_i := t_i − t_{i−1} ,   φ(τ) := 3τ² − 2τ³ ,   ψ(τ) := τ³ − τ² .   (5.4.5)
(Fig. 180: the local basis polynomials H_1, H_2, H_3, H_4 on [0, 1].)
By tedious, but straightforward computations using the chain rule we find the following values for Hk and
Hk′ at the endpoints of the interval [ti −1 , ti ].
        H(t_{i−1})   H(t_i)   H′(t_{i−1})   H′(t_i)
  H_1        1          0           0           0
  H_2        0          1           0           0
  H_3        0          0           1           0
  H_4        0          0           0           1
This amounts to a proof for (5.4.4) (why?).
The formula (5.4.4) is handy for the local evaluation of piecewise cubic Hermite interpolants. The function hermloceval in Code 5.4.6 performs the efficient evaluation (in multiple points) of the piecewise cubic polynomial s on [t_1, t_2] uniquely defined by the constraints s(t_1) = y_1, s(t_2) = y_2, s′(t_1) = c_1, s′(t_2) = c_2:
5. Data Interpolation and Data Fitting in 1D, 5.4. Cubic Hermite Interpolation 404
NumCSE, AT’15, Prof. Ralf Hiptmair c SAM, ETH Zurich, 2015
                   double c1, double c2) {
  const double h = t2 - t1, a1 = y2 - y1, a2 = a1 - h*c1, a3 = h*c2 - a1 - a2;
  t = ((t.array() - t1)/h).matrix();
  return (y1 + (a1 + (a2 + a3*t.array())*(t.array() - 1))*t.array()).matrix();
}
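A hypothetical invocation might look as follows; the full signature of hermloceval is an assumption here, since only the tail of its declaration is reproduced above:

// assumed signature:
// Eigen::VectorXd hermloceval(Eigen::VectorXd t, double t1, double t2,
//                             double y1, double y2, double c1, double c2);
Eigen::VectorXd tau = Eigen::VectorXd::LinSpaced(100, 0.0, 1.0); // evaluation points in [t1, t2]
Eigen::VectorXd s = hermloceval(tau, 0.0, 1.0, 1.0, 2.0, 0.0, -1.0); // y1=1, y2=2, c1=0, c2=-1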
However, the data for an interpolation problem (→ Section 5.1) are merely the interpolation points (t_j, y_j), j = 0, ..., n, but not the slopes of the interpolant at the nodes. Thus, in order to define an interpolation operator into the space of piecewise cubic Hermite functions, we have to supply a mapping R^{n+1} × R^{n+1} → R^{n+1} computing the slopes c_j from the data points.
Since this mapping should be local it is natural to rely on (weighted) averages of the local slopes ∆ j (→
Def. 5.3.4) of the data, for instance
c_i =  Δ_1   for i = 0 ,
       (t_{i+1} − t_i)/(t_{i+1} − t_{i−1}) · Δ_i + (t_i − t_{i−1})/(t_{i+1} − t_{i−1}) · Δ_{i+1}   if 1 ≤ i < n ,
       Δ_n   for i = n ,
with  Δ_j := (y_j − y_{j−1}) / (t_j − t_{j−1}) ,  j = 1, ..., n .   (5.4.8)
“Local” means, that, if the values y j are non-zero for only a few adjacent data points with indices j =
k, . . . , k + m, m ∈ N small, then the Hermite interpolant s is supported on [tk−ℓ , tk+m+ℓ ] for small ℓ ∈ N
independent of k and m.
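For illustration, the slope choice (5.4.8) can be transcribed directly into code (a sketch under the stated indexing conventions, not library code):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Slopes c_i from the weighted averages (5.4.8) of the data slopes Delta_j.
VectorXd slopesWeightedAvg(const VectorXd& t, const VectorXd& y) {
  const int n = t.size() - 1;
  VectorXd delta(n), c(n + 1);
  for (int j = 1; j <= n; ++j)
    delta(j - 1) = (y(j) - y(j - 1)) / (t(j) - t(j - 1));   // Delta_j
  c(0) = delta(0);                                          // c_0 = Delta_1
  c(n) = delta(n - 1);                                      // c_n = Delta_n
  for (int i = 1; i < n; ++i) {
    const double w = t(i + 1) - t(i - 1);
    c(i) = (t(i + 1) - t(i)) / w * delta(i - 1) + (t(i) - t(i - 1)) / w * delta(i);
  }
  return c;
}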
Data points:
✦ 11 equispaced nodes
t j = −1 + 0.2 j, j = 0, . . . , 10.
f ( x ) := sin(5x ) e x .
Fig. 181
Invocation:
auto f = [](double x) { return sin(5*x)*exp(x); };
Eigen::VectorXd t = Eigen::VectorXd::LinSpaced(11, -1, 1); // 11 equispaced nodes
hermintp(f, t);
From Ex. 5.4.9 we learn that, if the slopes are chosen according to Eq. (5.4.8), then the resulting Hermite interpolation does not preserve monotonicity.
Consider the situation sketched on the right ✄: the red circles (•) represent data points, the blue line (—) the piecewise linear interpolant → Section 5.3.2.
From the discussion of Fig. 182 and Fig. 183 it is clear that local monotonicity preservation entails that the
local slopes ci of a cubic Hermite interpolant (→ Def. 5.4.1) have to fulfill
c_i =  0                                   if sgn(Δ_i) ≠ sgn(Δ_{i+1}) ,
       some “average” of Δ_i, Δ_{i+1}      otherwise ,
  i = 1, ..., n−1 .   (5.4.12)
✎ notation: sign function  sgn(ξ) = 1 if ξ > 0 ,  0 if ξ = 0 ,  −1 if ξ < 0 .
A slope selection rule that enforces (5.4.12) is called a limiter.
Of course, testing for equality with zero does not make sense for data that may be affected by measure-
ment or roundoff errors. Thus, the “average” in (5.4.12) must be close to zero already when either ∆i ≈ 0
A suitable “average” is a weighted harmonic mean
c_i = ( w_a/Δ_i + w_b/Δ_{i+1} )^{−1} ,
because the harmonic mean acts as a “smoothed min(·,·)-function”: if Δ_i → 0 or Δ_{i+1} → 0, then also c_i → 0. (Fig. 184: level lines of the harmonic mean of a and b for w_a = w_b = 1/2.)
A good choice of the weights is:
w_a = (2h_{i+1} + h_i) / (3(h_{i+1} + h_i)) ,   w_b = (h_{i+1} + 2h_i) / (3(h_{i+1} + h_i)) .
This yields the following local slopes, unless (5.4.12) enforces c_i = 0:
c_0 = Δ_1 ,   c_n = Δ_n ,
c_i = 3(h_{i+1} + h_i) / ( (2h_{i+1} + h_i)/Δ_i + (2h_i + h_{i+1})/Δ_{i+1} )   for i ∈ {1, ..., n−1} with sgn(Δ_i) = sgn(Δ_{i+1}) ,   h_i := t_i − t_{i−1} .   (5.4.14)
Piecewise cubic Hermite interpolation with local slopes chosen according to (5.4.12) and (5.4.14) is available through the MATLAB function v = pchip(t,y,x);, where t passes the interpolation nodes, y the corresponding data values, and x is a vector of evaluation points, see doc pchip for details.
(Figure: data points from Exp. 5.3.7 and the piecewise cubic interpolant s(t); plot created with the MATLAB function call v = pchip(t,y,x); — t: data nodes t_j, x: evaluation points x_i, v: vector of values s(x_i).)
Note that the mapping y := [y_0, ..., y_n]^⊤ ↦ c := [c_0, ..., c_n]^⊤ defined by (5.4.12) and (5.4.14) is not linear.
➣ The “pchip interpolation operator” does not provide a linear mapping from the data space R^{n+1} into C¹([t_0, t_n]) (in the sense of Def. 5.1.17).
In fact, the non-linearity of the piecewise cubic Hermite interpolation operator is necessary even for merely global monotonicity preservation:
If, for a fixed node set {t_j}_{j=0}^n, n ≥ 2, an interpolation scheme I : R^{n+1} → C¹(I) is linear as a mapping from data values to continuous functions on the interval covered by the nodes (→ Def. 5.1.17), and monotonicity preserving, then I(y)′(t_j) = 0 for all y ∈ R^{n+1} and j = 1, ..., n−1.
Of course, an interpolant that is flat in all data points, as stipulated by Thm. 5.4.17 for a linear, monotonicity preserving, C¹-smooth interpolation scheme, does not make much sense.
At least, the piecewise cubic Hermite interpolation operator is local (in the sense discussed in § 5.4.7).
The cubic Hermite interpolation polynomial with slopes as in Eq. (5.4.14) provides a local
monotonicity-preserving C1 -interpolant.
Proof. See F. Fritsch and R. Carlson, Monotone piecewise cubic interpolation, SIAM J. Numer. Anal., 17 (1980), pp. 238–246.
✷
The next code demonstrates the calculation of the slopes c_i in MATLAB's pchip (details in [?]):
delta = (y.tail(n - 1) - y.head(n - 1)).cwiseQuotient(h); // linear slopes
c = VectorXd::Zero(n);
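The complete slope selection can be sketched as follows; this is only a transcription of the formulas (5.4.12) and (5.4.14), not MATLAB's actual pchip routine (which, for instance, treats the end slopes differently):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Slopes for monotonicity-preserving cubic Hermite interpolation:
// limited weighted harmonic mean as in (5.4.12), (5.4.14).
VectorXd pchipSlopes(const VectorXd& t, const VectorXd& y) {
  const int n = t.size() - 1;
  VectorXd h(n), delta(n), c = VectorXd::Zero(n + 1);
  for (int i = 1; i <= n; ++i) {
    h(i - 1) = t(i) - t(i - 1);                      // h_i
    delta(i - 1) = (y(i) - y(i - 1)) / h(i - 1);     // Delta_i
  }
  for (int i = 1; i < n; ++i) {
    if (delta(i - 1) * delta(i) > 0) {               // sgn(Delta_i) == sgn(Delta_{i+1})
      const double w1 = 2 * h(i) + h(i - 1), w2 = 2 * h(i - 1) + h(i);
      c(i) = 3 * (h(i) + h(i - 1)) / (w1 / delta(i - 1) + w2 / delta(i));
    }                                                 // otherwise the limiter keeps c_i = 0
  }
  c(0) = delta(0); c(n) = delta(n - 1);               // end slopes as stated in (5.4.14)
  return c;
}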
5.5 Splines
Piecewise cubic Hermite interpolation as presented in Section 5.4 entailed determining reconstruction slopes c_i. Now we learn about a way to do piecewise polynomial interpolation that results in C^k-interpolants, k > 0, and dispenses with auxiliary slopes. The idea is to obtain the missing conditions implicitly from extra continuity conditions.
Obviously, spline spaces are mapped onto each other by differentiation & integration:
s ∈ S_{d,M}   ⇒   s′ ∈ S_{d−1,M}   ∧   t ↦ ∫_a^t s(τ) dτ ∈ S_{d+1,M} .
The dimension of a spline space can be found by a (heuristic) counting argument: we count the number of “degrees of freedom” (d.o.f.) possessed by an M-piecewise polynomial of degree d, and subtract the number of linear constraints implicitly contained in Def. 5.5.1:
dim S_{d,M} = n · dim P_d − #{C^{d−1} continuity constraints} = n · (d+1) − (n−1) · d = n + d .
dim S_{d,M} = n + d .
We already know the special case of interpolation in S1,M , when the interpolation nodes are the knots of
M, because this boils down to simple piecewise linear interpolation, see Section 5.3.2.
Supplementary reading. More details in [?, XIII, 46], [?, Sect. 8.6.1].
Cognitive psychology teaches us that the human eye perceives C²-functions as “smooth”, while it can still spot the abrupt change of curvature at the possible discontinuities of the second derivative of a cubic Hermite interpolant (→ Def. 5.4.1).
For this reason the simplest spline functions featuring C2 -smoothness are of great importance in computer
aided design (CAD). They are the cubic splines, M-piecewise polynomials of degree 3 contained in S3,M
(→ Def. 5.5.1).
In this section we study cubic spline interpolation (related to cubic Hermite interpolation, Section 5.4)
Task: Given a mesh M := {t0 < t1 < · · · < tn }, n ∈ N, “find” a cubic spline s ∈ S3,M that complies
with the interpolation conditions
s(t j ) = y j , j = 0, . . . , n . (5.5.4)
≙ interpolation at knots!
From dimensional considerations it is clear that the interpolation conditions will fail to fix the interpolating
cubic spline uniquely:
“two conditions are missing” ➣ interpolation problem is not yet well defined!
We opt for a linear interpolation scheme (→ Def. 5.1.17) into the spline space S3,M . As explained in
§ 5.1.13, this will lead to an equivalent linear system of equations for expansion coefficients with respect
to a suitable basis.
We reuse the local representation of a cubic spline through the cubic Hermite cardinal basis polynomials from (5.4.5), cf. (5.4.4):
s|_{[t_{j−1},t_j]}(t) = s(t_{j−1}) · (1 − 3τ² + 2τ³) + s(t_j) · (3τ² − 2τ³)
                       + h_j s′(t_{j−1}) · (τ − 2τ² + τ³) + h_j s′(t_j) · (−τ² + τ³) ,   (5.5.6)
with τ := (t − t_{j−1})/h_j and h_j := t_j − t_{j−1}.
Once these slopes are known, the efficient local evaluation of a cubic spline function can be done as for a
cubic Hermite interpolant, see Section 5.4.1, Code 5.4.6.
Note: if s(t j ), s′ (t j ), j = 0, . . . , n, are fixed, then the representation Eq. (5.5.6) already guarantees
s ∈ C1 ([t0 , tn ]), cf. the discussion for cubic Hermite interpolation, Section 5.4.
However, the slopes s′(t_j) are not part of the data; they have to be determined from the requirement s ∈ C²([t_0, t_n]).
From s ∈ C2 ([t0 , tn ]) we obtain n − 1 continuity constraints for s′′ (t) at the internal nodes
s′′|[t j−1,t j ] (t j ) = s′′|[t j ,t j+1] (t j ) , j = 1, . . . , n − 1 . (5.5.7)
with
b_i := 1/h_{i+1} ,   a_i := 2/h_i + 2/h_{i+1} ,   i = 0, 1, ..., n−1 ,   [ b_i, a_i > 0 ,  a_i = 2(b_i + b_{i−1}) ] .
➙ two additional constraints are required, as already noted in § 5.5.3.
To saturate the remaining two degrees of freedom the following three approaches are popular:
Then the first and last column can be removed from the system matrix of (5.5.10). Their products with
c0 and cn , respectively, have to be subtracted from the right hand side of (5.5.10).
(2/h_1) c_0 + (1/h_1) c_1 = 3 (y_1 − y_0)/h_1² ,   (1/h_n) c_{n−1} + (2/h_n) c_n = 3 (y_n − y_{n−1})/h_n² .
Combining these two extra equations with (5.5.10), we arrive at a linear system of equations with
tridiagonal s.p.d. (→ Def. 1.1.8, Lemma 2.8.12) system matrix and unknowns c0 , . . . , cn . Due to
Thm. 2.7.58 it can be solved with an asymptotic computational effort of O(n).
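The O(n) solution of such tridiagonal systems is achieved, e.g., by the Thomas algorithm (tridiagonal Gaussian elimination without pivoting, adequate for the s.p.d. matrices arising here); a generic sketch, not tied to the particular right-hand side of (5.5.10):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Solve a tridiagonal system with subdiagonal lo (length n-1), diagonal di (length n),
// superdiagonal up (length n-1) and right-hand side rhs in O(n) operations.
VectorXd thomasSolve(VectorXd lo, VectorXd di, VectorXd up, VectorXd rhs) {
  const int n = di.size();
  for (int i = 1; i < n; ++i) {          // forward elimination
    const double m = lo(i - 1) / di(i - 1);
    di(i) -= m * up(i - 1);
    rhs(i) -= m * rhs(i - 1);
  }
  VectorXd x(n);
  x(n - 1) = rhs(n - 1) / di(n - 1);
  for (int i = n - 2; i >= 0; --i)       // back substitution
    x(i) = (rhs(i) - up(i) * x(i + 1)) / di(i);
  return x;
}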
➂ Periodic cubic spline interpolation: s′ (t0 ) = s′ (tn ) (➣ c0 = cn ), s′′ (t0 ) = s′′ (tn )
This removes one unknown and adds another equations so that we end up with an n × n-linear
system with s.p.d. (→ Def. 1.1.8) system matrix
A :=  [ a_1    b_1    0     ···    0      b_0
        b_1    a_2    b_2                 0
        0      ⋱      ⋱      ⋱            ⋮
        ⋮             ⋱      ⋱      ⋱     0
        0                    ⋱   a_{n−1}  b_{n−1}
        b_0    0      ···    0   b_{n−1}  a_0 ] ,
with  b_i := 1/h_{i+1} ,  a_i := 2/h_i + 2/h_{i+1} ,  i = 0, 1, ..., n−1 .
This linear system can be solved with rank-1 modification techniques (see § 2.6.13, Lemma 2.6.22) plus tridiagonal elimination: asymptotic computational effort O(n).
MATLAB provides many tools for computing and dealing with splines.
(5.5.14) Extremal properties of natural cubic spline interpolants → [?, Sect. 8.6.1, Property 8.2]
The functional E_bd(f) := ½ ∫_a^b |f″(t)|² dt models the elastic bending energy of a rod whose shape is described by the graph of f (soundness check: zero bending energy for a straight rod). We will show that cubic spline interpolants have minimal bending energy among all C²-smooth interpolating functions.
The natural cubic spline interpolant s minimizes the elastic curvature energy among all interpolating functions in C²([a, b]), that is, E_bd(s) ≤ E_bd(f) for every f ∈ C²([a, b]) with f(t_j) = y_j, j = 0, ..., n.
We show that any small perturbation of s such that the perturbed spline still satisfies the interpolation
conditions leads to an increase in elastic energy.
E_bd(s + k) = ½ ∫_a^b |s″ + k″|² dt   (5.5.16)
            = E_bd(s) + ∫_a^b s″(t) k″(t) dt + ½ ∫_a^b |k″|² dt ,
where the middle term is denoted by I and the last term is ≥ 0.
Scrutiny of I: split into interval contributions, integrate by parts twice, and use s^{(4)} ≡ 0:
I = Σ_{j=1}^{n} ∫_{t_{j−1}}^{t_j} s″(t) k″(t) dt
  = − Σ_{j=1}^{n} ( s‴(t_j^−) k(t_j) − s‴(t_{j−1}^+) k(t_{j−1}) ) + s″(t_n) k′(t_n) − s″(t_0) k′(t_0) = 0 ,
since k(t_j) = 0 for all j (the perturbed function still interpolates) and s″(t_0) = s″(t_n) = 0 for the natural spline.
In light of (5.5.16): no perturbation compatible with the interpolation conditions can make the bending energy of s decrease!
§ 5.5.14: (Natural) cubic spline interpolant provides C2 -curve of minimal elastic bending energy that travels
through prescribed points.
⇕
Nature: a thin elastic rod fixed at certain points attains a shape that minimizes its potential bending energy (virtual work principle of statics).
c_0 := (y_1 − y_0)/(t_1 − t_0) ,   c_n := (y_n − y_{n−1})/(t_n − t_{n−1}) .
(Figure: resulting cubic spline interpolant s(t).)
Remember:
• Lagrange polynomials satisfying (5.2.11) provide cardinal interpolants for polynomial interpolation
→ § 5.2.10. As is clear from Fig. 170, they do not display any decay away from their “base node”.
Rather, they grow strongly. Hence, there is no locality in global polynomial interpolation.
• Tent functions (→ Fig. 169) are the cardinal basis functions for piecewise linear interpolation, see
Ex. 5.1.10. Hence, this scheme is perfectly local, see (5.3.9).
Given a grid M := {t_0 < t_1 < ··· < t_n}, the i-th natural cardinal spline L_i ∈ S_{3,M} is defined by L_i(t_j) = δ_{ij}, j = 0, ..., n (together with the natural end conditions), so that the natural spline interpolant is
s(t) = Σ_{j=0}^{n} y_j L_j(t) .
(Figures: a cardinal cubic spline function (left); its values at the midpoints of the intervals on a semi-logarithmic scale (right).)
Exponential decay of the cardinal splines ➞ cubic spline interpolation is weakly local.
According to Rem. 5.5.18, cubic spline interpolation is neither monotonicity preserving nor curvature preserving. Necessarily so, because it is a linear interpolation scheme, see Thm. 5.4.17.
This section presents a non-linear quadratic spline (→ Def. 5.5.1, C1 -functions) based interpolation
scheme that manages to preserve both monotonicity and curvature of data even in a local sense, cf.
Section 5.3.
Sought:
✦ extended knot set M ⊂ [t0 , tn ] (→ Def. 5.5.1),
✦ an interpolating quadratic spline function s ∈ S2,M , s(ti ) = yi , i = 0, . . . , n
that preserves the “shape” of the data in the sense of § 5.3.2.
Notice that here M ≠ {t_j}_{j=0}^{n}: s interpolates the data in the points t_i but is piecewise polynomial with respect to M! The interpolation nodes will usually not belong to M.
Recall Eq. (5.4.12) and Eq. (5.4.14): we fix the slopes c_i in the nodes using the harmonic mean of the data slopes Δ_j; the final interpolant will be tangent to the corresponding line segments in the points (t_i, y_i). If (t_i, y_i) is a local maximum or minimum of the data, c_i is set to zero (→ § 5.4.11):
Limiter:  c_i := 2 / ( Δ_i^{−1} + Δ_{i+1}^{−1} )   if sign(Δ_i) = sign(Δ_{i+1}) ,   c_i := 0   otherwise ,   i = 1, ..., n−1 ,
c_0 := 2Δ_1 − c_1 ,   c_n := 2Δ_n − c_{n−1} ,
where Δ_j = (y_j − y_{j−1}) / (t_j − t_{j−1}).
(Figures: slopes according to the limited harmonic mean formula for three typical data configurations.)
Rule: Let T_i be the unique straight line through (t_i, y_i) with slope c_i (— in the figure ✄).
☞ If the intersection of T_{i−1} and T_i is non-empty and has a t-coordinate in ]t_{i−1}, t_i], then p_i := t-coordinate of T_{i−1} ∩ T_i;
☞ otherwise p_i := ½(t_{i−1} + t_i).
(Fig. 187: the construction on [t_{i−1}, t_i] with the auxiliary points ½(p_i + t_{i−1}), p_i, ½(p_i + t_i).)
These points will be used to build the knot set for the final quadratic spline:
M′ = { t_0 < ½(t_0 + p_1) < ½(p_1 + t_1) < ½(t_1 + p_2) < ··· < ½(t_{n−1} + p_n) < ½(p_n + t_n) < t_n } .
On M′ we construct a piecewise linear auxiliary spline l with l(t_i) = y_i, l′(t_i) = c_i:
In each interval ( ½(p_j + t_j), ½(t_j + p_{j+1}) ), l corresponds to the segment of slope c_j passing through the data point (t_j, y_j).
In each interval ( ½(t_j + p_{j+1}), ½(p_{j+1} + t_{j+1}) ), l corresponds to the segment connecting the previous ones, see Fig. 187.
l “inherits” local monotonicity and curvature from the data.
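The rule above for the points p_i translates into a few lines of code (a sketch; the tolerance used to detect parallel tangents is an arbitrary choice):

#include <Eigen/Dense>
#include <cmath>
using Eigen::VectorXd;

// Intermediate points p_i: intersection of the tangents T_{i-1}, T_i if it
// lies in ]t_{i-1}, t_i], midpoint of the interval otherwise.
VectorXd tangentIntersections(const VectorXd& t, const VectorXd& y, const VectorXd& c) {
  const int n = t.size() - 1;
  VectorXd p(n);                                    // p_1, ..., p_n stored as p(0..n-1)
  for (int i = 1; i <= n; ++i) {
    double pi = 0.5 * (t(i - 1) + t(i));            // default: midpoint
    const double dc = c(i - 1) - c(i);
    if (std::abs(dc) > 1e-14) {                     // tangents not parallel
      const double s = (y(i) - y(i - 1) + c(i - 1) * t(i - 1) - c(i) * t(i)) / dc;
      if (s > t(i - 1) && s <= t(i)) pi = s;        // accept only if inside ]t_{i-1}, t_i]
    }
    p(i - 1) = pi;
  }
  return p;
}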
Example 5.5.24 (Auxiliary construction for shape preserving quadratic spline interpolation)
(Fig. 188: local slopes c_i, i = 0, ..., n.  Fig. 189: linear auxiliary spline l.)
Lemma 5.5.25.
If g is a linear spline through the three points
(a, y_a) ,  (½(a + b), w) ,  (b, y_b)   with  a < b ,  y_a, y_b, w ∈ R ,
then the parabola
p(t) := y_a (1 − τ)² + 2wτ(1 − τ) + y_b τ² ,   τ := (t − a)/(b − a)
(the quadratic with Bézier control points (a, y_a), (½(a+b), w), (b, y_b)) satisfies
1. p(a) = y_a ,  p(b) = y_b ,  p′(a) = g′(a) ,  p′(b) = g′(b),
2. g monotonically increasing / decreasing ⇒ p monotonically increasing / decreasing,
3. g convex / concave ⇒ p convex / concave.
The proof boils down to discussing many cases, as indicated in the following plots:
(Fig. 190–192: linear spline g and parabola p on [a, b] with midpoint ½(a + b) for different configurations of y_a, w, y_b.)
Lemma 5.5.25 implies that the final quadratic spline that passes through the points (t j , y j ) with slopes c j
can be built locally as the parabola p using the linear spline l that plays the role of g in the lemma.
(Fig. 193: the interpolating shape-preserving quadratic spline.)
We examine the shape-preserving quadratic spline that interpolates the data values y_j = 0 for j ≠ i and some y_i ≠ 0, i ∈ {0, ..., n}, on an equidistant node set.
(Fig. 194–196: data and slopes, linear auxiliary spline l, and quadratic spline for this single-spike data set.)
Data from [?]:
  t_i : 0    1    2    3    4    5    6    7   8   9   10  11  12
  y_i : 0   0.3  0.5  0.2  0.6  1.2  1.3   1   1   1    1   0  −1
(Fig. 200–202: the corresponding plots for this data set.)
We assume the period T > 0 to be known and ti ∈ [0, T [ for all interpolation nodes ti , i = 0, . . . , n.
Task: Given data points (t_i, y_i), y_i ∈ K, t_i ∈ [0, T[, find a T-periodic function f : R → K (the interpolant), f(t + T) = f(t) ∀ t ∈ R, that satisfies the interpolation conditions
f(t_i) = y_i ,  i = 0, ..., n .   (5.6.2)
The most fundamental periodic functions are derived from the trigonometric functions sin and cos and
dilations of them (A dilation of a function t 7→ ψ(t) is a function of the form t 7→ ψ(ct) with some c > 0).
The terminology is natural after recalling expressions for trigonometric functions via complex exponentials
(“Euler’s formula”)
q(t) = α_0 + ½ Σ_{j=1}^{n} { (α_j − ıβ_j) e^{2πıjt} + (α_j + ıβ_j) e^{−2πıjt} }
     = α_0 + ½ Σ_{j=−n}^{−1} (α_{−j} + ıβ_{−j}) e^{2πıjt} + ½ Σ_{j=1}^{n} (α_j − ıβ_j) e^{2πıjt}
     = e^{−2πınt} Σ_{j=0}^{2n} γ_j e^{2πıjt} ,   with   γ_j =  ½(α_{n−j} + ıβ_{n−j})   for j = 0, ..., n−1 ,
                                                               α_0                     for j = n ,            (5.6.6)
                                                               ½(α_{j−n} − ıβ_{j−n})   for j = n+1, ..., 2n .
(After scaling) a trigonometric polynomial of degree 2n is a regular polynomial ∈ P2n (in C) re-
stricted to the unit circle S1 ⊂ C.
Corollary 5.6.8. Dimension of P^T_{2n}
The vector space P^T_{2n} has dimension dim P^T_{2n} = 2n + 1.
We observed that trigonometric polynomials are standard (complex) polynomials in disguise. Next we
can relate trigonometric interpolation to well-known standard Lagrangian interpolation discussed in Sec-
tion 5.2.2. In fact, we slightly extend the method, because now we admit complex interpolation nodes. All
results obtained earlier carry over to this setting.
The key tool is a smooth bijective mapping between I := [0, 1[ and S¹, given by
Φ_{S¹} : [0, 1[ → S¹ ,   Φ_{S¹}(t) := z = exp(−2πıt) .
(Fig. 203: the interval [0, 1[ is wrapped onto the unit circle S¹ in the complex plane.)
Here we deal with a non-affine pullback, but the definition is the same as the one given in (6.1.20) for an affine pullback:
(Φ_{S¹}^{−1})* : C⁰([0, 1[) → C⁰(S¹) ,   ((Φ_{S¹}^{−1})* f)(z) := f(Φ_{S¹}^{−1}(z)) ,  z ∈ S¹ .   (5.6.10)
All theoretical results and algorithms from polynomial interpolation carry over to trigonometric
interpolation
The next code finds the coefficients α j , β j ∈ R of a trigonometric interpolation polynomial in the real-valued
representation (5.6.5) for real-valued data y j ∈ R by simply solving the linear system of equations arising
from the interpolation conditions (5.6.2).
The asymptotic computational effort of this implementation is dominated by the cost for Gaussian elimina-
tion applied to a fully populated (dense) matrix, see Thm. 2.5.2: O(n3 ) for n → ∞.
Often time-series data for a time-periodic quantity are measured with a constant rhythm over the entire (known) period of duration T > 0, that is, t_j = jΔt, Δt = T/(n+1), j = 0, ..., n. In this case, the formulas for computing the coefficients of the interpolating trigonometric polynomial (→ Def. 5.6.3) become special versions of the discrete Fourier transform (DFT, see Def. 4.2.18) studied in Section 4.2. An efficient implementation can thus harness the speed of the FFT introduced in Section 4.3.
Now: 1-periodic setting, uniformly distributed interpolation nodes t_k = k/(2n+1), k = 0, ..., 2n.
(2n+1)×(2n+1) linear system of equations:
Σ_{j=0}^{2n} γ_j exp( 2πı jk/(2n+1) ) = (b)_k := exp( 2πı nk/(2n+1) ) y_k ,  k = 0, ..., 2n .   (5.6.13)
⇕
F_{2n+1} c = b ,  c = [γ_0, ..., γ_{2n}]^⊤   ⇒ (Lemma 4.2.14)   c = (1/(2n+1)) F_{2n+1} b ,   (5.6.14)
with F_{2n+1} ≙ (2n+1)×(2n+1) (conjugate) Fourier matrix, see (4.2.13).
Fast solution by means of FFT: O(n log n) asymptotic complexity, see Section 4.3
// Computes expansion coefficients of trigonometric polynomials (5.6.4) through
// the interpolation points (j/(2n+1), y_j), j = 0, ..., 2n.
// IN : y has to be a row vector of odd length, return values are column vectors
// OUT: vectors of expansion coefficients alpha_j, beta_j
//      with respect to the trigonometric basis from Def. 5.6.3
std::pair<VectorXd, VectorXd> trigipequid(const VectorXd& y) {
  using index_t = VectorXcd::Index;
  const index_t N = y.size(), n = (N - 1)/2;
  if (N % 2 != 1) throw "Number of points must be odd!";
  // prepare data for fft
  std::complex<double> M_I(0, 1); // imaginary unit
  // right hand side vector b from (5.6.14)
  VectorXcd b(N);
  for (index_t k = 0; k < N; ++k)
    b(k) = y(k) * std::exp(2*M_PI*M_I*(double(n)/N*k));
  Eigen::FFT<double> fft;   // DFT helper class
  VectorXcd c = fft.fwd(b); // means that "c = fft(b)"
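A hypothetical call of this routine, assuming it is completed so that it returns the coefficient vectors of (5.6.5) as a pair:

Eigen::VectorXd y = Eigen::VectorXd::Random(2*5 + 1); // 2n+1 = 11 data values
auto ab = trigipequid(y); // ab.first = alpha coefficients, ab.second = beta coefficients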
(Figure: tic-toc timings, runtime [s] versus n, for trigpolycoeff and trigipequid.)
function trigipequidtiming
% Runtime comparison between efficient (→ Code ??) and direct computation
% ...
figure; loglog(times(:,1),times(:,2),'b+',...
               times(:,1),times(:,3),'r*');
xlabel('{\bf n}','fontsize',14);
ylabel('{\bf runtime[s]}','fontsize',14);
legend('trigpolycoeff','trigipequid','location','best');

print -depsc2 '../PICTURES/trigipequidtiming.eps';
Same observation as in Ex. 4.3.12: massive gain in efficiency through relying on FFT.
Evaluation of the trigonometric interpolation polynomial at the equidistant points k/N, N > 2n, k = 0, ..., N−1:
q(k/N) = e^{−2πınk/N} Σ_{j=0}^{2n} γ_j exp( 2πı kj/N ) ,  k = 0, ..., N−1 ,   (by (5.6.6))
q(k/N) = e^{−2πıkn/N} (v)_k   with   v = F_N c̃ ,   (5.6.19)
where c̃ ∈ C^N is obtained from c = [γ_0, ..., γ_{2n}]^⊤ by zero padding.
The next code merges the steps of computing the coefficients of the trigonometric interpolation polynomial in equidistant points and of its evaluation in another set of equidistant points.
C++11 code 5.6.21: Equidistant points: fast on-the-fly evaluation of trigonometric interpolation polynomial
// Evaluation of the trigonometric interpolation polynomial through (j/(2n+1), y_j), j = 0, ..., 2n,
// in the equidistant points k/N, k = 0, ..., N-1
// IN : y = vector of values to be interpolated
//      q (COMPLEX!) will be used to save the return values
void trigpolyvalequid(const VectorXd y, const int M, VectorXd& q) {
  const int N = y.size();
  if (N % 2 == 0) {
    std::cerr << "Number of points must be odd!\n";
    return;
  }
  const int n = (N - 1) / 2;
  // computing coefficients gamma_j, see (5.6.14)
  VectorXcd a, b;
  trigipequid(y, a, b);
  // ...
  // zero padding
  VectorXcd ch(M); ch << gamma, VectorXcd::Zero(M - (2*n + 1));
As remarked in Ex. 5.1.5, the basic assumption underlying the reconstruction of the functional dependence of two quantities by means of interpolation is that of accurate data. In case of data uncertainty or measurement errors the exact satisfaction of interpolation conditions ceases to make sense, and we are better off reconstructing a fitting function that is merely “close to the data” in a sense to be made precise next.
The task of (multidimensional, vector-valued) least squares data fitting can be described as follows:
The function f is called the (best) least squares fit for the data in S.
Consider a special variant of the general least squares data fitting problem: The set S of admissible
continuous functions is now chosen as a finite-dimensional vector space Vn ⊂ C0 ( D ), dim Vn = n ∈ N,
cf. the discussion in § 5.1.13 for interpolation.
The best least squares fit f ∈ Vn can be represented by a finite linear combination of the basis
functions b j :
f(t) = Σ_{j=1}^{n} x_j b_j(t) ,   x_j ∈ R .   (5.7.4)
V_n = W × ··· × W   (d factors) ,   (5.7.5)
dim V_n = d · dim W .
The vector-valued functions obtained by multiplying each basis function of W with a unit vector e_i, i = 1, ..., d, form a basis of V_n (e_i ≙ i-th unit vector).
We adopt the setting of § 5.7.3 of an n-dimensional space V_n of admissible functions with basis {b_1, ..., b_n}. Then the least squares data fitting problem can be recast as follows.
Given:
✦ data points (ti , yi ) ∈ R k × R d , i = 1, . . . , m
✦ basis functions b j : D ⊂ R k 7→ R, j = 1, . . . , n, n < m
(x_1, ..., x_n) = argmin_{z_j ∈ R^d} Σ_{i=1}^{m} ‖ Σ_{j=1}^{n} z_j b_j(t_i) − y_i ‖_2² .   (5.7.8)
Special cases:
• If Vn is a product space according to (5.7.5) with basis (5.7.6), then (5.7.8) amounts to finding
vectors x j ∈ R d , j = 1, . . . , ℓ with
(x_1, ..., x_ℓ) = argmin_{z_j ∈ R^d} Σ_{i=1}^{m} ‖ Σ_{j=1}^{ℓ} z_j q_j(t_i) − y_i ‖_2² .   (5.7.10)
Example 5.7.11 (Linear parameter estimation = linear data fitting → Ex. 3.0.5, Ex. 3.1.5)
The linear parameter estimation/linear regression problem presented in Ex. 3.0.5 can be recast as a linear
data fitting problem with
Linear (least squares) data fitting leads to an overdetermined linear system of equations for which we seek
a least squares solution (→ Def. 3.1.3) as in Section 3.1.1. To see this rewrite
Σ_{i=1}^{m} ‖ Σ_{j=1}^{n} z_j b_j(t_i) − y_i ‖_2²  =  Σ_{i=1}^{m} Σ_{r=1}^{d} ( Σ_{j=1}^{n} b_j(t_i) (z_j)_r − (y_i)_r )² .
In the one-dimensional, scalar case (k = 1, d = 1) of (5.7.9) the related overdetermined linear system of equations is
[ b_1(t_1) ... b_n(t_1) ]         [ y_1 ]
[    ⋮           ⋮      ]   x  =  [  ⋮  ]  .   (5.7.15)
[ b_1(t_m) ... b_n(t_m) ]         [ y_m ]
Having reduced the linear least squares data fitting problem to finding the least squares solution of an
overdetermined linear system of equations, we can now apply theoretical results about least squares
solutions, for instance, Cor. 3.1.22. The key issue is, whether the coefficient matrix of (5.7.15) has full rank
n. Of course, this will depend on the location of the ti .
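In code, this reduction can be sketched as follows (a hypothetical helper, solving the overdetermined system (5.7.15) in the least squares sense with a QR decomposition from Eigen):

#include <Eigen/Dense>
#include <functional>
#include <vector>

// Scalar, one-dimensional linear least squares fit: coefficients x minimizing
// sum_i ( sum_j x_j b_j(t_i) - y_i )^2, cf. (5.7.15).
Eigen::VectorXd lsqfit(const Eigen::VectorXd& t, const Eigen::VectorXd& y,
                       const std::vector<std::function<double(double)>>& b) {
  const int m = t.size(), n = b.size();
  Eigen::MatrixXd A(m, n);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      A(i, j) = b[j](t(i));                    // coefficient matrix of (5.7.15)
  return A.colPivHouseholderQr().solve(y);     // least squares solution
}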
Lemma 5.7.16. Unique solvability of linear least squares fitting problem
The scalar one-dimensional linear least squares fitting problem (5.7.9) with dim Vn = n, Vn the
vector space of admissible functions, has a unique solution, if and only if there are ti1 , . . . , tin such
that
[ b_1(t_{i_1}) ... b_n(t_{i_1}) ]
[      ⋮              ⋮         ]  ∈ R^{n,n}   is invertible,   (5.7.17)
[ b_1(t_{i_n}) ... b_n(t_{i_n}) ]
Equivalent to (5.7.17) is the requirement that there is an n-subset of {t_1, ..., t_m} such that the corresponding interpolation problem for V_n has a unique solution for any data values y_i.
Special variant of scalar (d = 1), one-dimensional (k = 1) linear data fitting (→ § 5.7.7): we choose the space of admissible functions as polynomials of degree n − 1,
V_n = P_{n−1} ,  e.g. with basis b_j(t) = t^{j−1} (monomial basis, Section 5.2.1).
The coefficient matrix of (5.7.15) then has the entries t_i^{j−1} and, for m ≥ n, has full rank, because it contains invertible Vandermonde matrices (5.2.18), Rem. 5.2.17.
The next code demonstrates the computation of the fitting polynomial with respect to the monomial basis
of Pn−1 :
The function polyfit returns a vector [ x1 , x2 , . . . , xn ]⊤ describing the fitting polynomial according to the
convention
p(t) = x_1 t^{n−1} + x_2 t^{n−2} + ··· + x_{n−1} t + x_n .   (5.7.21)
///////////////////////////////////////////////////////////////////////////
/// Demonstration code for lecture "Numerical Methods for CSE" @ ETH Zurich
/// (C) 2016 SAM, D-MATH
/// Author(s): Xiaolin Guo, Julien Gacon
/// Repository: https://gitlab.math.ethz.ch/NumCSE/NumCSE/
/// Do not remove this header.
//////////////////////////////////////////////////////////////////////////

#include <Eigen/Dense>
#include <figure/figure.hpp>
#include <polyfit.hpp> // NCSE's polyfit (equivalent to Matlab's)
#include <polyval.hpp> // NCSE's polyval (equivalent to Matlab's)

// Comparison of polynomial interpolation and polynomial fitting
// ("Quick and dirty", see 5.2.3)
int main() {
  // use C++ lambda functions to define the Runge function f(x) = 1/(1+x^2)
  auto f = [](const Eigen::VectorXd& x) {
    return (1. / (1 + x.array() * x.array())).matrix();
  };

  const unsigned d = 10;            // polynomial degree
  Eigen::VectorXd tip(d + 1);       // d+1 nodes for interpolation
  for (unsigned i = 0; i <= d; ++i)
    tip(i) = -5 + i * 10. / d;

  Eigen::VectorXd tft(3 * d + 1);   // 3d+1 nodes for polynomial fitting
  for (unsigned i = 0; i <= 3 * d; ++i)
    tft(i) = -5 + i * 10. / (3 * d);

  Eigen::VectorXd ftip = f(Eigen::VectorXd::Ones(2));
  Eigen::VectorXd pip = polyfit(tip, f(tip), d), // interpolating polynomial (deg = d)
                  pft = polyfit(tft, f(tft), d); // fitting polynomial (deg = d)

  Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(1000, -5, 5);
  mgl::Figure fig;
  fig.plot(x, f(x), "g|").label("Function f");
  fig.plot(x, polyval(pip, x), "b").label("Interpolating polynomial");
  fig.plot(x, polyval(pft, x), "r").label("Fitting polynomial");
  fig.plot(tip, f(tip), " b*");
  fig.save("interpfit");

  return 0;
}
(Fig. 205: data from the function f(t) = 1/(1 + t²) on [−5, 5]; the function f, the interpolating polynomial, and the fitting polynomial.)
Learning outcomes
• know the details of cubic Hermite interpolation and how to ensure that it is monotonicity preserving.
• know what splines are and how cubic spline interpolation with different endpoint constraints works.
Chapter 6
Approximation of Functions in 1D
‖f − f̃‖ is small for some norm ‖·‖ on the space C⁰(D) of (piecewise) continuous functions, for instance
✦ the supremum norm ‖g‖_∞ := ‖g‖_{L∞(D)} := max_{x∈D} |g(x)|, see (5.2.66).
Below we consider only the case n = d = 1: approximation of scalar-valued functions defined on an interval. The techniques can be applied componentwise in order to cope with the case of vector-valued functions (d > 1).
A faster alternative is the advance approximation of the function U ↦ I(U) based on a few computed values I(U_i), i = 0, ..., n, followed by the fast evaluation of the approximant U ↦ Ĩ(U) during actual circuit simulations. This is an example of model reduction by approximation of functions: a complex subsystem in a mathematical model is replaced by a surrogate function.
In this example we also encounter a typical situation: we have nothing at our disposal but, possibly expensive, point evaluations of the function U ↦ I(U) (U ↦ I(U) in “procedural form”, see Rem. 5.1.6). The number of evaluations of I(U) will largely determine the cost of building Ĩ.
This application displays a fundamental difference compared to the reconstruction of constitutive relation-
ships from a priori measurements → Ex. 5.1.5: Now we are free to choose the number and location of the
data points, because we can simply evaluate the function U 7→ I (U ) for any U and as often as needed.
C++11 code 6.0.4: Class describing a 2-port circuit element for circuit simulation
class CircuitElement {
 private:
  // internal data describing U -> I~(U)
 public:
  // Constructor taking some parameters and building I~
  CircuitElement(const Parameters& P);
  // Point evaluation operators for I~ and d/dU I~
  double I(double U) const;
  double dIdU(double U) const;
};
We define an abstract concept for the sake of clarity: When in this chapter we talk about an “approximation
scheme” (in 1D) we refer to a mapping A : X 7→ V , where X and V are spaces of functions I 7→ K,
I ⊂ R an interval.
Examples are
• X = Ck ( I ), the spaces of functions I 7→ K that are k times continuously differentiable, k ∈ N.
• V = P_m(I), the space of polynomials of degree ≤ m, see Section 5.2.1
• V = Sd,M , the space of splines of degree d on the knot set M ⊂ I , see Def. 5.5.1.
In Chapter 5 we discussed ways to construct functions whose graph runs through given data points, see Section 5.1. We can hope that the interpolant will approximate the function, if the data points are also located on the graph of that function. Thus every interpolation scheme, see § 5.1.4, spawns a corresponding approximation scheme:
f : I ⊂ R → K   —sampling→   (t_i, y_i := f(t_i))_{i=0}^{m}   —interpolation→   f̃ := I_T y   ( f̃(t_i) = y_i ) .
In this chapter we will mainly study approximation by interpolation relying on the interpolation schemes
(→ § 5.1.4) introduced in Section 5.2, Section 5.4, and Section 5.5.
There is additional freedom compared to data interpolation: we can choose the interpolation nodes in a smart way in order to obtain an accurate interpolant f̃.
Approximation and interpolation (→ Chapter 5) are key components of many numerical methods, like for
integration, differentiation and computation of the solutions of differential equations, as well as for computer
graphics and generation of smooth curves and surfaces.
Contents
6.1 Approximation by Global Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 434
6.1.1 Polynomial approximation: Theory . . . . . . . . . . . . . . . . . . . . . . . 435
6.1.2 Error estimates for polynomial interpolation . . . . . . . . . . . . . . . . . . 441
6.1.3 Chebychev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
6.1.3.1 Motivation and definition . . . . . . . . . . . . . . . . . . . . . . . 451
6.1.3.2 Chebychev interpolation error estimates . . . . . . . . . . . . . . . 456
6.1.3.3 Chebychev interpolation: computational aspects . . . . . . . . . . 461
6.2 Mean Square Best Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.2.1 Abstract theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.2.1.1 Mean square norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.2.1.2 Normal equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
6.2.1.3 Orthonormal bases . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
6.2.2 Polynomial mean square best approximation . . . . . . . . . . . . . . . . . . 470
The space Pk of polynomials of degree ≤ k has been introduced in Section 5.2.1. For reasons listed in
§ 5.2.3 polynomials are the most important theoretical and practical tool for the approximation of functions.
The next example presents an important case of approximation by polynomials.
The local approximation of sufficiently smooth functions by polynomials is a key idea in calculus, which
manifests itself in the importance of approximation by Taylor polynomials: For f ∈ Ck ( I ), k ∈ N, I ⊂ R
an interval, we approximate
f(t) ≈ T_k(t) := Σ_{j=0}^{k} ( f^{(j)}(t_0) / j! ) (t − t_0)^j ,   for some t_0 ∈ I .
✎ Notation: f (k) =
ˆ k-th derivative of function f : I ⊂ R → K
f(t) − T_k(t) = ∫_{t_0}^{t} f^{(k+1)}(τ) (t − τ)^k / k!  dτ   (6.1.2a)
             = f^{(k+1)}(ξ) (t − t_0)^{k+1} / (k+1)! ,   ξ = ξ(t, t_0) ∈ ] min(t, t_0), max(t, t_0) [ ,   (6.1.2b)
which shows that for f ∈ Ck+1 ( I ) the Taylor polynomial Tk is pointwise close to f ∈ Ck+1 ( I ), if the
interval I is small and f (k+1) is bounded pointwise.
Approximation by Taylor polynomials is easy and direct but inefficient: a polynomial of lower degree often
gives the same accuracy. Moreover, when f is available only in procedural form as double f(double),
(approximations of) higher order derivatives are difficult to obtain.
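As a simple illustration of (6.1.2b) (a standard example, not taken from the lecture): for f(t) = e^t and t_0 = 0 the remainder formula yields
\[
\Bigl| e^{t} - \sum_{j=0}^{k} \frac{t^{j}}{j!} \Bigr| \;=\; e^{\xi}\,\frac{|t|^{k+1}}{(k+1)!} \;\le\; e^{|t|}\,\frac{|t|^{k+1}}{(k+1)!} ,
\]
so on every fixed bounded interval the Taylor polynomials converge uniformly, and faster than any algebraic rate, as k → ∞.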
Obviously, for every interval I ⊂ R , the spaces of polynomials are nested in the following sense:
P_0 ⊂ P_1 ⊂ ··· ⊂ P_m ⊂ P_{m+1} ⊂ ··· ⊂ C^∞(I) ,   (6.1.4)
With this family of nested spaces of polynomials at our disposal, it is natural to study associated families
of approximation schemes, one for each degree, mapping into Pm , m ∈ N0 .
Sloppily speaking, according to (6.1.2b) the Taylor polynomials from Ex. 6.1.1 provide uniform (→ § 6.0.1)
approximation of a smooth function f in (small) intervals, provided that its derivatives do not blow up “too
fast” (We do not want to make this precise here).
The question is, whether polynomials still offer uniform approximation on arbitrary bounded closed inter-
vals and for functions that are merely continuous, but not any smoother. The answer is YES and this
profound result is known as the Weierstrass Approximation Theorem. Here we give an extended version
with a concrete formula due to Bernstein, see [?, Section 6.2].
✎ Notation: g(k) =
ˆ k-th derivative of a function g : I ⊂ R → K.
Plots of the Bernstein polynomials B_j^n of degree n = 7, j = 0, ..., 7 ✄ (Fig. 207).
Σ_{j=0}^{n} B_j^n(t) ≡ 1 ,   (6.1.9)
0 ≤ B_j^n(t) ≤ 1   ∀ 0 ≤ t ≤ 1 .   (6.1.10)
(Fig. 208: Bernstein polynomials B_j^n for various degrees n and indices j.)
✁ Since (d/dt) B_j^n(t) = B_j^n(t) ( j/t − (n−j)/(1−t) ), B_j^n has its unique local maximum in [0, 1] at the site t_max := j/n. As n → ∞ the Bernstein polynomials become more and more concentrated around the maximum.
Proof. (of Thm. 6.1.6, first part) Fix t ∈ [0, 1]. Using the notations from (6.1.7) and the identity (6.1.9) we find
f(t) − p_n(t) = Σ_{j=0}^{n} ( f(t) − f(j/n) ) B_j^n(t) .   (6.1.11)
As we see from Fig. 208, for large n the bulk of the sum is contributed by the Bernstein polynomials with index j such that j/n ≈ t, because for every δ > 0
Σ_{|j/n−t|>δ} B_j^n(t) ≤ (1/δ²) Σ_{|j/n−t|>δ} (j/n − t)² B_j^n(t) ≤ (1/δ²) Σ_{j=0}^{n} (j/n − t)² B_j^n(t) =(∗) n t(1−t)/(δ² n²) ≤ 1/(4nδ²) .
Here Σ_{|j/n−t|>δ} means summation over j ∈ N₀ with summation indices confined to the set { j : |j/n − t| > δ }. The identity (∗) can be established by direct but tedious computations.
Hence
|f(t) − p_n(t)| ≤ Σ_{|j/n−t|>δ} |f(t) − f(j/n)| B_j^n(t) + Σ_{|j/n−t|≤δ} |f(t) − f(j/n)| B_j^n(t) ,
where the first sum can be estimated with the help of the bound 1/(4nδ²) derived above. Since f is uniformly continuous on [0, 1], given ε > 0 we can choose δ > 0 independently of t such that |f(s) − f(t)| < ε if |s − t| < δ. Then, if we choose n > (εδ²)^{−1}, we can bound
The following plots display the sequences of the polynomials pn for n = 2, . . . , 25.
(Fig. 209, Fig. 210: Bernstein approximants on [0, 1] for n = 2, ..., 25, together with the function f; left f = f_1, right f = f_2.)
We see that the Bernstein approximants “slowly” edge closer and closer to f . Apparently it takes a very
large degree to get really close to f .
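The Bernstein approximant used in these plots can be evaluated with a few lines of code; the explicit form B_j^n(t) = (n choose j) t^j (1−t)^{n−j} of the Bernstein polynomials is assumed here (it is consistent with (6.1.9) and (6.1.10)):

#include <Eigen/Dense>
#include <cmath>
#include <functional>
using Eigen::VectorXd;

// Evaluate the Bernstein approximant p_n(t) = sum_j f(j/n) B_j^n(t) at the points t in [0,1].
VectorXd bernsteinApprox(const std::function<double(double)>& f, int n, const VectorXd& t) {
  VectorXd binom(n + 1);                 // binomial coefficients C(n,j)
  binom(0) = 1.0;
  for (int j = 1; j <= n; ++j) binom(j) = binom(j - 1) * (n - j + 1) / j;
  VectorXd p = VectorXd::Zero(t.size());
  for (int k = 0; k < t.size(); ++k)
    for (int j = 0; j <= n; ++j)
      p(k) += f(double(j) / n) * binom(j) *
              std::pow(t(k), j) * std::pow(1.0 - t(k), n - j);
  return p;
}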
Now we introduce a concept needed to gauge how close an approximation scheme gets to the best
possible performance.
dist_{‖·‖}(f, P_k) := inf_{p ∈ P_k} ‖f − p‖ .
The notation distk·k is motivated by the notation of “distance” as distance to the nearest point in a set.
For the L2 -norm k·k2 and the supremum norm k·k ∞ the best approximation error is well defined for
C = C 0 ( I ).
The polynomial realizing best approximation w.r.t. k·k may neither be unique nor computable with reason-
able effort. Often one is content with rather sharp upper bounds like those asserted in the next theorem,
due to Jackson [?, Thm. 13.3.7].
If f ∈ Cr ([−1, 1]) (r times continuously differentiable), r ∈ N, then, for any polynomial degree
n ≥ r,
inf_{p ∈ P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r ( (n − r)! / n! ) ‖f^{(r)}‖_{L∞([−1,1])} .
with C(r ) dependent on r, but independent of f and, in particular, the polynomial degree n. Using the
Landau symbol from Def. 1.4.5 we can rewrite the statement of (6.1.17) in asymptotic form
What if a polynomial approximation scheme is defined only on a special interval, say [−1, 1]? Then, by the following trick, it can be transferred to any interval [a, b] ⊂ R.
Assume that an interval [a, b] ⊂ R, a < b, and a polynomial approximation scheme Â : C⁰([−1, 1]) → P_n are given. Based on the affine linear mapping
Φ : [−1, 1] → [a, b] ,   Φ(t̂) := a + ½(t̂ + 1)(b − a) = ½(1 − t̂)a + ½(t̂ + 1)b ,   −1 ≤ t̂ ≤ 1 ,   (6.1.19)
(Fig. 211: the reference interval [−1, 1] with coordinate t̂ is mapped onto [a, b] with coordinate t.)
We add the important observations that affine pullbacks are linear and bijective, they are isomorphisms of
the involved vector spaces of functions (what is the inverse?).
If Φ∗ : C0 ([ a, b]) → C0 ([−1, 1]) is an affine pullback according to (6.1.19) and (6.1.20), then
Φ∗ : Pn → Pn is a bijective linear mapping for any n ∈ N0 .
Proof. This is a consequence of the fact that translations and dilations take polynomials to polynomials of
the same degree: for monomials we find
The lemma tells us that the spaces of polynomials of some maximal degree are invariant under affine
pullback. Thus, we can define a polynomial approximation scheme A on C0 ([ a, b]) by
A : C⁰([a, b]) → P_n ,   A := (Φ*)^{−1} ∘ Â ∘ Φ* ,   (6.1.22)
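The construction (6.1.22) is easy to mimic in code if both schemes are represented simply as maps taking a function to a function (an illustrative assumption, not the lecture's data structures):

#include <functional>

using Fn = std::function<double(double)>;   // scalar function of one variable
using Scheme = std::function<Fn(const Fn&)>; // approximation scheme: function -> approximant

// Build an approximation scheme on [a,b] from a reference scheme Ahat on [-1,1]
// according to (6.1.22): A(f) = (Phi^*)^{-1}( Ahat( Phi^* f ) ).
Scheme transplant(const Scheme& Ahat, double a, double b) {
  return [=](const Fn& f) -> Fn {
    // pullback Phi^* f : [-1,1] -> K,  (Phi^* f)(that) = f(Phi(that))
    Fn fhat = [=](double that) { return f(a + 0.5 * (that + 1.0) * (b - a)); };
    Fn phat = Ahat(fhat);               // approximate on the reference interval
    // push forward: (A f)(t) = phat( Phi^{-1}(t) ),  Phi^{-1}(t) = 2(t-a)/(b-a) - 1
    return [=](double t) { return phat(2.0 * (t - a) / (b - a) - 1.0); };
  };
}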
Thm. 6.1.15 targets only the special interval [−1, 1]. What does it imply for polynomial best approximation
on a general interval [ a, b]? To answer this question we apply techniques from Rem. 6.1.18, in particular
the pullback (6.1.20).
We first have to study the change of norms of functions under the action of affine pullbacks:
Proof. The first estimate should be evident, and the second is a consequence of the transformation formula for integrals [?, Satz 6.1.5]:
((b − a)/2) ∫_{−1}^{1} (Φ*f)(t̂) dt̂ = ∫_a^b f(t) dt .   (6.1.26)
Thus, for the norms of the approximation errors of polynomial approximation schemes defined by affine transformation as in (6.1.22), we find for all f ∈ C⁰([a, b]):
‖f − A f‖_{L∞([a,b])} = ‖Φ*f − Â(Φ*f)‖_{L∞([−1,1])} ,
‖f − A f‖_{L²([a,b])} = ( (b − a)/2 )^{1/2} ‖Φ*f − Â(Φ*f)‖_{L²([−1,1])} .   (6.1.27)
Equipped with approximation error estimates for Â, we can infer corresponding estimates for A.
The bounds for approximation errors often involve norms of derivatives as in Thm. 6.1.15. Hence, it is
important to understand the interplay of pullback and differentiation: By the 1D chain rule
(d/dt̂)(Φ*f)(t̂) = (df/dt)(Φ(t̂)) · (dΦ/dt̂)(t̂) = (df/dt)(Φ(t̂)) · ½(b − a) ,
which implies a simple scaling rule for derivatives of arbitrary order r ∈ N₀:
(Φ*f)^{(r)} = ( (b − a)/2 )^r Φ*( f^{(r)} ) .   (6.1.28)
Together with Lemma 6.1.24 this gives
‖(Φ*f)^{(r)}‖_{L∞([−1,1])} = ( (b − a)/2 )^r ‖f^{(r)}‖_{L∞([a,b])} ,   f ∈ C^r([a, b]), r ∈ N₀ .   (6.1.29)
The estimate (6.1.28) together with Thm. 6.1.15 paves the way for bounding the polynomial best approxi-
mation error on arbitrary intervals [ a, b], a, b ∈ R . Based on the affine mapping Φ : [−1, 1] → [ a, b] from
(6.1.19) and writing Φ∗ for the pullback according to (6.1.20) we can chain estimates. If f ∈ Cr ([ a, b])
and n ≥ r, then
inf_{p∈P_n} ‖f − p‖_{L∞([a,b])} =(∗) inf_{p∈P_n} ‖Φ*f − p‖_{L∞([−1,1])}
  ≤ (Thm. 6.1.15)  (1 + π²/2)^r ( (n − r)!/n! ) ‖(Φ*f)^{(r)}‖_{L∞([−1,1])}
  = ((6.1.28))  (1 + π²/2)^r ( (n − r)!/n! ) ( (b − a)/2 )^r ‖f^{(r)}‖_{L∞([a,b])} .
In step (∗) we used the result of Lemma 6.1.21 that Φ∗ p ∈ Pn for all p ∈ Pn . Invoking the arguments
that gave us (6.1.17), we end up with the simpler bound
inf_{p∈P_n} ‖f − p‖_{L∞([a,b])} ≤ C(r) ( (b − a)/n )^r ‖f^{(r)}‖_{L∞([a,b])} .   (6.1.31)
Observe that the length of the interval enters the bound in r-th power.
Already Thm. 6.1.15 considered the size of the best approximation error in P_n as a function of the polynomial degree n. In the same vein, we may study a family of Lagrange interpolation schemes {L_{T_n}}_{n∈N₀} on I ⊂ R induced by a family of node sets {T_n}_{n∈N₀}, T_n ⊂ I, according to Def. 6.1.32.
An example for such a family of node sets on I := [a, b] are the equidistant or equispaced nodes
T_n := { t_j^{(n)} := a + (b − a) j/n : j = 0, ..., n } ⊂ I .   (6.1.34)
For families of Lagrange interpolation schemes {L_{T_n}}_{n∈N₀} we can shift the focus onto estimating the asymptotic behavior of the norm of the interpolation error for n → ∞.
In the numerical experiment the norms of the interpolation errors can be computed only approximately as follows:
• L∞-norm: approximated by sampling on a grid of meshsize π/1000.
• L²-norm: numerical quadrature (→ Chapter 7) with the trapezoidal rule (7.4.4) on a grid of meshsize π/1000.
(Fig. 212: approximate error norms ‖f − L_{T_n} f‖_*, * = 2, ∞, plotted against the polynomial degree n, semi-logarithmic scale.)
In the previous experiment we observed a clearly visible regular behavior of k f − LTn f k as we increased
the polynomial degree n. The prediction of the decay law for k f − LTn f k for n → ∞ is one goal in the
study of interpolation errors.
Often this goal can be achieved even if a rigorous quantitative bound for a norm of the interpolation error remains elusive. In other words, in many cases no bound for ‖f − L_{T_n} f‖ can be given, but its decay for increasing n can be described precisely.
Now we introduce some important terminology for the qualitative description of the behavior of ‖f − L_{T_n} f‖ as a function of the polynomial degree n. We assume a bound of the form
‖f − L_{T_n} f‖ ≤ T(n) .   (6.1.37)
Writing T(n) for this bound of the norm of the interpolation error, we distinguish the following types of asymptotic behavior:
• Algebraic convergence: T(n) ≤ C n^{−p} with rate p > 0,
• Exponential convergence: T(n) ≤ C q^n with 0 ≤ q < 1,
with C > 0 independent of n. The bounds are assumed to be sharp in the sense that no bounds with larger rate p (for algebraic convergence) or smaller q (for exponential convergence) can be found.
Convergence behavior of norms of the interpolation error is often expressed by means of the Landau O-notation, cf. Def. 1.4.5:
Algebraic convergence: ‖f − I_T f‖ = O(n^{−p}) ,   Exponential convergence: ‖f − I_T f‖ = O(q^n) ,   for n → ∞ (“asymptotic!”).
Apply linear regression from Ex. 3.1.5 for data points (log ni , log ǫi ) ➣ least squares estimate for rate p.
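Such a least squares estimate of the rate p can be computed, e.g., as follows (a sketch; the model log ε ≈ c − p log n is fitted with a QR-based least squares solve):

#include <Eigen/Dense>
#include <cmath>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Estimate the algebraic convergence rate p from error samples eps_i measured
// for polynomial degrees n_i: fit log(eps) ~ c - p*log(n) in the least squares sense.
double estimateRate(const VectorXd& n, const VectorXd& eps) {
  const int m = n.size();
  MatrixXd A(m, 2);
  VectorXd b(m);
  for (int i = 0; i < m; ++i) {
    A(i, 0) = 1.0;                 // constant term c
    A(i, 1) = std::log(n(i));      // coefficient multiplying -p
    b(i) = std::log(eps(i));
  }
  VectorXd x = A.colPivHouseholderQr().solve(b);   // least squares fit
  return -x(1);                                    // estimated rate p
}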
C++11 code 6.1.42: Computing the interpolation error for Runge’s example
(Figure: the Runge function 1/(1 + x²) and its interpolating polynomial on [−5, 5].)
Observation: strong oscillations of I_T f near the endpoints of the interval, which seem to cause
‖f − L_{T_n} f‖_{L∞(]−5,5[)} → ∞   for n → ∞ .
Though polynomials possess great power to approximate functions, see Thm. 6.1.15 and Thm. 6.1.6, here
polynomial interpolants fail completely. Approximation theorists even discovered the following “negative
result”:
Given a sequence of meshes of increasing size {T_n}_{n=1}^{∞}, T_n = {t_0^{(n)}, ..., t_n^{(n)}} ⊂ [a, b], a ≤ t_0^{(n)} < t_1^{(n)} < ··· < t_n^{(n)} ≤ b, there exists a continuous function f such that the sequence of interpolating polynomials (L_{T_n} f)_{n=1}^{∞} does not converge to f uniformly as n → ∞.
Now we aim to establish bounds for the supremum norm of the interpolation error of Lagrangian interpo-
lation similar to the result of Thm. 6.1.15.
Theorem 6.1.44. Representation of interpolation error [?, Thm. 8.22], [?, Thm. 37.4]
We consider f ∈ Cn+1 ( I ) and the Lagrangian interpolation approximation scheme (→ Def. 6.1.32)
for a node set T := {t0 , . . . , tn } ⊂ I . Then,
for every t ∈ I there exists a τ_t ∈ ] min{t, t_0, ..., t_n}, max{t, t_0, ..., t_n} [ such that
f(t) − L_T(f)(t) = ( f^{(n+1)}(τ_t) / (n+1)! ) · ∏_{j=0}^{n} (t − t_j) .   (6.1.45)
Proof. Write w_T(t) := ∏_{j=0}^{n} (t − t_j) ∈ P_{n+1} and fix t ∈ I ∖ T.
(with m := n+1)  ⇒  ∃ τ_t ∈ I :  φ^{(n+1)}(τ_t) = f^{(n+1)}(τ_t) − c (n+1)! = 0 .
This fixes the value of c = f^{(n+1)}(τ_t)/(n+1)!, and by (6.1.46) this amounts to the assertion of the theorem.
✷
For f ∈ C^{n+1}(I) let I_T ∈ P_n stand for the unique Lagrange interpolant (→ Thm. 5.2.14) of f in the node set T := {t_0, ..., t_n} ⊂ I. Then for all t ∈ I the interpolation error is
f(t) − I_T(f)(t) = ∫_0^1 ∫_0^{τ_1} ··· ∫_0^{τ_{n−1}} ∫_0^{τ_n} f^{(n+1)}(. . .) dτ dτ_n ··· dτ_1 · ∏_{j=0}^{n} (t − t_j) .
Proof. By induction on n, use (5.2.34) and the fundamental theorem of calculus [?, Sect. 3.1].
✷
A result analogous top Lemma 6.1.48 holds also for general polynomial interpolation with multiple nodes
as defined in (5.2.22).
Lemma 6.1.48 provides an exact formula (6.1.45) for the interpolation error. From it we can derive esti-
mates for the supremum norm of the interpolation error on the interval I as follows:
➊ first bound the derivative factor via |f^{(n+1)}(τ_t)| ≤ \|f^{(n+1)}\|_{L^∞(I)} (the resulting bound no longer depends on the unknown intermediate point τ_t),
➋ then increase the right hand side further by switching to the maximum (in modulus) w.r.t. t (the resulting bound no longer depends on t!),
➌ and, finally, take the maximum w.r.t. t on the left of ≤.
This yields the following interpolation error estimate for degree-n Lagrange interpolation on the node set \{t_0, ..., t_n\}:

Thm. 6.1.44 \;⇒\; \|f - L_T f\|_{L^∞(I)} \le \frac{\|f^{(n+1)}\|_{L^∞(I)}}{(n+1)!} \max_{t∈I} |(t - t_0) \cdots (t - t_n)| .   (6.1.50)
The estimate (6.1.50) hinges on bounds for (higher) derivatives of the interpoland f , which, essentially,
should belong to Cn+1 ( I ). The same can be said about the estimate of Thm. 6.1.15.
This reflects a general truth about estimates of norms of the interpolation error:
Now we are in a position to give a theoretical explanation for the exponential convergence observed for polynomial interpolation of f(t) = sin(t) on equidistant nodes: by Lemma 6.1.48 and (6.1.50),

\|f^{(k)}\|_{L^∞(I)} \le 1 \;\; ∀ k ∈ \mathbb{N}_0 \quad⇒\quad \|f - p\|_{L^∞(I)} \le \frac{1}{(1+n)!} \max_{t∈I} \big| (t - 0)(t - \tfrac{π}{n})(t - \tfrac{2π}{n}) \cdots (t - π) \big| \le \frac{1}{n+1} \Big( \frac{π}{n} \Big)^{n+1} .

➙ Uniform asymptotic (even more than) exponential convergence of the interpolation polynomials (independently of the set of nodes T. In fact, \|f - p\|_{L^∞(I)} decays even faster than exponentially!)
How can the blow-up of the interpolation error observed in Ex. 6.1.41 be reconciled with Lemma 6.1.48? For f(t) = \frac{1}{1+t^2} one can only conclude |f^{(n)}(t)| = 2^n n! \cdot O(|t|^{-2-n}) for n → ∞.
➙ Possible blow-up of the error bound from Thm. 6.1.44: it may tend to ∞ for n → ∞.
Thm. 6.1.44 gives error estimates for the L^∞-norm. What about other norms? From Lemma 6.1.48, using the Cauchy–Schwarz inequality

\Big| \int_a^b f(t)\, g(t)\, dt \Big|^2 \le \int_a^b |f(t)|^2\, dt \; \int_a^b |g(t)|^2\, dt \quad ∀ f, g ∈ C^0([a, b]) ,   (6.1.55)

we obtain

\|f - L_T(f)\|_{L^2(I)}^2 = \int_I \Big( \int_0^1\!\!\int_0^{\tau_1}\!\!\cdots\!\!\int_0^{\tau_{n-1}}\!\!\int_0^{\tau_n} f^{(n+1)}(\dots)\, d\tau\, d\tau_n \cdots d\tau_1 \cdot \prod_{j=0}^{n} (t - t_j) \Big)^2 dt \qquad [\, |t - t_j| \le |I| \,]

\le |I|^{2n+2} \int_I \underbrace{\operatorname{vol}^{(n+1)}(S_{n+1})}_{= 1/(n+1)!} \int_{S_{n+1}} |f^{(n+1)}(\dots)|^2\, d\tau\; dt

= \frac{|I|^{2n+2}}{(n+1)!} \int_I \int_I \underbrace{\operatorname{vol}^{(n)}(C_{t,\tau})}_{\le 2^{(n-1)/2}/n!} |f^{(n+1)}(\tau)|^2\, d\tau\, dt .

The Lebesgue constant

λ_T := \|I_T\|_{∞→∞} := \sup_{y ∈ \mathbb{R}^{n+1} \setminus \{0\}} \frac{\|I_T(y)\|_{L^∞(I)}}{\|y\|_∞}

establishes an important connection between the norms of the interpolation error and of the best approximation error.
We first observe that the polynomial approximation scheme LT induced by IT preserves polynomials of
degree ≤ n:
LT p = IT [ p(t)] t∈T = p ∀ p ∈ Pn . (6.1.58)
Thus, by the triangle inequality, for a generic norm on C^0(I) and \|L_T\| designating the associated operator norm of the linear mapping L_T, cf. (5.2.70),

\|f - L_T f\| \overset{(6.1.58)}{=} \|(f - p) - L_T(f - p)\| \le (1 + \|L_T\|)\, \|f - p\| \quad ∀ p ∈ \mathcal{P}_n .

Note that for \|\cdot\| = \|\cdot\|_{L^∞(I)}, since \|[f(t)]_{t∈T}\|_∞ ≤ \|f\|_{L^∞(I)}, we can estimate the operator norm by λ_T, cf. (5.2.70), and obtain

\|f - L_T f\|_{L^∞(I)} \le (1 + λ_T) \inf_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)} \quad ∀ f ∈ C^0(I) .   (6.1.61)
Hence, if a bound for λT is available, the best approximation error estimate of Thm. 6.1.15 immediately
yields interpolation error estimates.
Exponential convergence can often be observed for families of Lagrangian approximation schemes, when
they are applied to an analytic interpoland.
The mathematical area of complex analysis (→ course in the BSc program CSE) studies analytic functions. Analyticity gives access to powerful tools provided by complex analysis. One of these tools is the residue theorem.
• Note that the integral \int_γ in Thm. 6.1.64 is a path integral in the complex plane (“contour integral”): If the path of integration γ is described by a parameterization τ ∈ J \mapsto γ(τ) ∈ \mathbb{C}, J ⊂ \mathbb{R}, then

\int_γ f(z)\, dz := \int_J f(γ(τ)) \cdot \dot γ(τ)\, dτ ,   (6.1.65)

where \dot γ designates the derivative of γ with respect to the parameter, and · indicates multiplication in \mathbb{C}. For contour integrals we have the estimate

\Big| \int_γ f(z)\, dz \Big| \le |γ| \max_{z∈γ} |f(z)| .   (6.1.66)
• Π often stands for the set of poles of f , that is, points where “ f attains the value ∞”.
The residue theorem is very useful, because there are simple formulas for \operatorname{res}_p f: let g and h be complex valued functions that are both analytic in a neighborhood of p ∈ \mathbb{C}, and satisfy h(p) = 0, h'(p) ≠ 0. Then

\operatorname{res}_p \frac{g}{h} = \frac{g(p)}{h'(p)} .
Assumption 6.1.68. Analyticity of interpoland
We assume that the interpoland f : I → \mathbb{C} can be extended to a function f : D ⊂ \mathbb{C} → \mathbb{C}, which is analytic (→ Def. 6.1.63) on the open set D ⊂ \mathbb{C} with [a, b] ⊂ D.

[Fig. 215: a domain D in the complex plane containing [a, b] and the nodes t_0, t_1, ..., together with a closed integration contour γ ⊂ D winding around [a, b].]
Key is the following representation of the Lagrange polynomials (5.2.11) for the node set T = \{t_0, ..., t_n\}:

L_j(t) = \prod_{k=0, k≠j}^{n} \frac{t - t_k}{t_j - t_k} = \frac{w(t)}{(t - t_j) \prod_{k=0, k≠j}^{n} (t_j - t_k)} = \frac{w(t)}{(t - t_j)\, w'(t_j)} ,   (6.1.69)

where w(t) = (t - t_0) \cdots (t - t_n) ∈ \mathcal{P}_{n+1}.
Consider the following parameter dependent function g_t, whose set of poles in D is Π = \{t, t_0, ..., t_n\}:

g_t(z) := \frac{f(z)}{(z - t)\, w(z)} , \quad z ∈ \mathbb{C} \setminus Π ,\; t ∈ [a, b] \setminus \{t_0, ..., t_n\} .

Apply the residue theorem Thm. 6.1.64 to g_t and a closed path of integration γ ⊂ D winding once around [a, b], such that its interior is simply connected, see the magenta curve in Fig. 215:

\frac{1}{2πı} \int_γ g_t(z)\, dz \;\overset{\text{Lemma 6.1.67}}{=}\; \operatorname{res}_t g_t + \sum_{j=0}^{n} \operatorname{res}_{t_j} g_t = \frac{f(t)}{w(t)} + \sum_{j=0}^{n} \frac{f(t_j)}{(t_j - t)\, w'(t_j)} ,

so that

f(t) = - \sum_{j=0}^{n} f(t_j)\, \underbrace{\frac{w(t)}{(t_j - t)\, w'(t_j)}}_{=\,-\text{Lagrange polynomial!}} \;+\; \underbrace{\frac{w(t)}{2πı} \int_γ g_t(z)\, dz}_{\text{interpolation error!}} ,   (6.1.70)

where the first sum is the polynomial interpolant of f.
This is another representation formula for the interpolation error, an alternative to that of Thm. 6.1.44 and
Lemma 6.1.48. We conclude that for all t ∈ [ a, b]
In a concrete setting, in order to exploit the estimate (6.1.71) to study the n-dependence of the supremum norm of the interpolation error, we need to know
• an upper bound for |w(t)| for a ≤ t ≤ b,
• a lower bound for |w(z)|, z ∈ γ, for a suitable path of integration γ ⊂ D,
• a lower bound for the distance of the path γ and the interval [a, b] in the complex plane.
The subset of \mathbb{C} on which a function f given by a formula is analytic can often be determined without computing derivatives, using the following consequence of the chain rule: if f is analytic on D ⊂ \mathbb{C} and g is analytic on U ⊂ \mathbb{C}, then the composition f ∘ g is analytic on \{z ∈ U : g(z) ∈ D\}.
As pointed out in § 6.0.6, when we build approximation schemes from interpolation schemes, we have
the extra freedom to choose the sampling points (= interpolation nodes). Now, based on the insight into
the structure of the interpolation error gained from Thm. 6.1.44, we seek to choose “optimal” sampling
points. They will give rise to the so-called Chebychev polynomial approximation schemes, also known as
Chebychev interpolation.
Setting:
✦ Without loss of generality (→ Rem. 6.1.18): I = [−1, 1],
✦ interpoland f : I → R at least continuous, f ∈ C0 ( I ),
✦ set of interpolation nodes T := {−1 ≤ t0 < t1 < · · · < tn−1 < tn ≤ 1}, n ∈ N.
Recall Thm. 6.1.44:

\|f - L_T f\|_{L^∞(I)} \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_{L^∞(I)} \|w\|_{L^∞(I)} , \qquad w(t) := (t - t_0) \cdots (t - t_n) .

This suggests choosing the interpolation nodes such that \|w\|_{L^∞(I)} becomes as small as possible.
Are there polynomials satisfying these requirements? If so, do they allow a simple characterization?
Proof. Just use the trigonometric identity cos(n + 1) x = 2 cos nx cos x − cos(n − 1) x with cos x = t.
✷
The theorem implies:
• T_n ∈ \mathcal{P}_n,
• their leading coefficients are equal to 2^{n-1},
• the T_n are linearly independent,
• \{T_j\}_{j=0}^{n} is a basis of \mathcal{P}_n = \operatorname{Span}\{T_0, ..., T_n\}, n ∈ \mathbb{N}_0.
See Code 6.1.79 for algorithmic use of the 3-term recursion (6.1.78).
[Fig. 217, 218: graphs of the Chebychev polynomials T_n(t) on [-1, 1]; left: small degrees (n = 0, 1, ...), right: n = 5, ..., 9.]
C++11 code 6.1.80: Plotting Chebychev polynomials, see Fig. 217, 218
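The body of this listing is omitted in this excerpt (the original uses the MathGL Figure class for the actual plotting). A minimal sketch that merely tabulates the Chebychev polynomials via the 3-term recursion (6.1.78), T_{k+1}(t) = 2t T_k(t) - T_{k-1}(t), might look as follows; names and the choice of sample points are illustrative only.

#include <Eigen/Dense>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::RowVectorXd;

int main() {
  const int n = 9;                                  // maximal degree
  const int N = 11;                                 // number of sample points (coarse, for printing)
  RowVectorXd t = RowVectorXd::LinSpaced(N, -1.0, 1.0);
  MatrixXd T(n + 1, N);                             // row k <-> values of T_k on the grid
  T.row(0).setOnes();                               // T_0 = 1
  T.row(1) = t;                                     // T_1(t) = t
  for (int k = 1; k < n; ++k)
    T.row(k + 1) = 2.0 * t.cwiseProduct(T.row(k)) - T.row(k - 1);  // 3-term recursion
  std::cout << T << std::endl;                      // each row could now be plotted
}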
From Def. 6.1.76 we conclude that T_n attains the values ±1 in its extrema with alternating signs, thus matching our heuristic demands:

|T_n(t)| = 1 \;⇔\; ∃ k ∈ \{0, ..., n\}: \; t = \cos\frac{kπ}{n} , \qquad \|T_n\|_{L^∞([-1,1])} = 1 .   (6.1.81)
What is still open is the validity of the heuristics guiding the choice of the optimal nodes. The next funda-
mental theorem will demonstrate that, after scaling, the Tn really supply polynomials on [−1, 1] with fixed
leading coefficient and minimal supremum norm.
Theorem 6.1.82. Minimax property of the Chebychev polynomials [?, Section 7.1.4.], [?,
Thm. 32.2]
The polynomials T_n from Def. 6.1.76 minimize the supremum norm in the following sense:

\|T_n\|_{L^∞([-1,1])} = \inf\big\{ \|p\|_{L^∞([-1,1])} : p ∈ \mathcal{P}_n ,\; p(t) = 2^{n-1} t^n + \dots \big\} ,\; n ∈ \mathbb{N} .

The zeros of T_n are

t_k = \cos\Big( \frac{2k+1}{2n} π \Big) ,\; k = 0, ..., n-1 .   (6.1.84)
When we use Chebychev nodes for polynomial interpolation we call the resulting Lagrangian approximation scheme Chebychev interpolation. On the interval [-1, 1] it is characterized by:
• “optimal” interpolation nodes T = \big\{ \cos\big( \frac{2k+1}{2(n+1)} π \big) ,\; k = 0, ..., n \big\},
• w(t) = (t - t_0) \cdots (t - t_n) = 2^{-n} T_{n+1}(t), with leading coefficient 1, \|w\|_{L^∞(I)} = 2^{-n}.
Then, by Thm. 6.1.44, we immediately get an interpolation error estimate for Chebychev interpolation of f ∈ C^{n+1}([-1, 1]):

\|f - I_T(f)\|_{L^∞([-1,1])} \le \frac{2^{-n}}{(n+1)!} \|f^{(n+1)}\|_{L^∞([-1,1])} .   (6.1.85)
Following the recipe of Rem. 6.1.18 Chebychev interpolation on an arbitrary interval [ a, b] can immediately
be defined. The same polynomial Lagrangian approximation scheme is obtained by transforming the
Chebychev nodes (6.1.84) from [−1, 1] to [ a, b] using the unique affine transformation (6.1.19):
The Chebychev nodes in the interval I = [a, b] are

t_k := a + \tfrac{1}{2}(b - a) \Big( \cos\Big( \frac{2k+1}{2(n+1)} π \Big) + 1 \Big) ,\; k = 0, ..., n .   (6.1.87)

With the transformation formula for the integrals and \frac{d^n \hat f}{d\hat t^n}(\hat t) = (\tfrac{1}{2}|I|)^n \frac{d^n f}{dt^n}(t) we obtain

\|f - I_T(f)\|_{L^∞(I)} = \|\hat f - I_{\hat T}(\hat f)\|_{L^∞([-1,1])} \le \frac{2^{-n}}{(n+1)!} \Big\| \frac{d^{n+1}\hat f}{d\hat t^{n+1}} \Big\|_{L^∞([-1,1])} \le \frac{2^{-2n-1}}{(n+1)!} |I|^{n+1} \|f^{(n+1)}\|_{L^∞(I)} .   (6.1.88)
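As a small illustration of (6.1.87), a possible helper generating the Chebychev nodes on an arbitrary interval [a, b] is sketched below; the function name chebnodes is not taken from the lecture codes.

#include <Eigen/Dense>
#include <cmath>

// Returns the n+1 Chebychev nodes (6.1.87) for polynomial degree n on [a,b]
Eigen::VectorXd chebnodes(int n, double a, double b) {
  const double PI = std::acos(-1.0);
  Eigen::VectorXd t(n + 1);
  for (int k = 0; k <= n; ++k)
    t(k) = a + 0.5 * (b - a) * (std::cos((2.0 * k + 1.0) / (2.0 * (n + 1)) * PI) + 1.0);
  return t;
}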
We consider Runge's function f(t) = \frac{1}{1+t^2}, see Ex. 6.1.41, and compare polynomial interpolation based on uniformly spaced nodes and Chebychev nodes in terms of the behavior of the interpolants.

[Fig. 221: f and its interpolating polynomial in equidistant nodes on [-5, 5]; Fig. 222: f and its Chebychev interpolation polynomial on [-5, 5].]
We saw in Rem. 5.2.74 that the Lebesgue constant λT that measures the sensitivity of a polynomial
interpolation scheme, blows up exponentially with increasing number of equispaced interpolation nodes.
In stark contrast λT grows only logarithmically in the number of Chebychev nodes.
One can show the bound

λ_T \le \frac{2}{π} \log(1 + n) + 1 .   (6.1.91)

[Fig.: measured Lebesgue constant λ_T for Chebychev nodes, based on approximate evaluation of (5.2.72) by sampling, plotted against the polynomial degree n.]
Combining (6.1.61),

\|f - L_T f\|_{L^∞(I)} \le (1 + λ_T) \inf_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)} \quad ∀ f ∈ C^0(I) ,   (6.1.61)

and the bound for the polynomial best approximation error from Thm. 6.1.15,

\inf_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞([-1,1])} \le (1 + π^2/2)^r \frac{(n-r)!}{n!} \|f^{(r)}\|_{L^∞([-1,1])} ,

we end up with a bound for the supremum norm of the interpolation error in the case of Chebychev interpolation on [-1, 1]:

\|f - L_T f\|_{L^∞([-1,1])} \le \big( 2/π \log(1+n) + 2 \big) (1 + π^2/2)^r \frac{(n-r)!}{n!} \|f^{(r)}\|_{L^∞([-1,1])} .   (6.1.92)
Now we empirically investigate the behavior of norms of the interpolation error for Chebychev interpolation
and functions with different (smoothness) properties as we increase the number of interpolation nodes.
➀ f(t) = \frac{1}{1+t^2}, I = [-5, 5] (Runge's function, cf. Ex. 6.1.41).

[Fig. 223: f and its Chebychev interpolation polynomial (left); L^∞- and L²-norms of the interpolation error versus the polynomial degree n (right, semi-logarithmic scale).]

Observation: p_n → f ,  \|f - I_n f\|_{L^∞([-5,5])} ≈ 0.8^n .
➁ f (t) = max{1 − |t|, 0}, I = [−2, 2], n = 10 nodes (plot on the left).
Now f ∈ C^0(I) but f ∉ C^1(I).

[Fig. 224: f and its Chebychev interpolation polynomial; Fig. 225, 226: L^∞- and L²-norms of the interpolation error versus the polynomial degree n (semi-logarithmic and doubly logarithmic scale).]

Observations:
• no exponential convergence,
• algebraic convergence (?)
➂ f(t) = \begin{cases} \tfrac{1}{2}(1 + \cos πt) , & |t| < 1 , \\ 0 , & 1 ≤ |t| ≤ 2 , \end{cases} \qquad I = [-2, 2], n = 10 (plot on the left).

[Fig.: f and its Chebychev interpolation polynomial (left); L^∞- and L²-norms of the interpolation error versus the polynomial degree n (right, doubly logarithmic scale).]
✦ for analytic f ∈ C^∞ (→ Def. 6.1.63) the approximation error of the Chebychev interpolant seems to decay to zero exponentially in the polynomial degree n.

Assuming that the interpoland f possesses an analytic extension to a complex neighborhood D of [-1, 1], we now apply the theory of § 6.1.62 to bound the supremum norm of the Chebychev interpolation error of f on [-1, 1].

To convert the estimate (6.1.71), as obtained in § 6.1.62, into a more concrete estimate, we have to study the behavior of

w_n(t) = (t - t_0)(t - t_1) \cdots (t - t_n) , \quad t_k = \cos\Big( \frac{2k+1}{2n+2} π \Big) ,\; k = 0, ..., n ,

where the t_k are the Chebychev nodes according to (6.1.87). They are the zeros of the Chebychev polynomial (→ Def. 6.1.76) of degree n + 1. Since w has leading coefficient 1, we conclude w = 2^{-n} T_{n+1}.
Thus, we see that γ is an ellipse with foci ±1, large axis \tfrac{1}{2}(ρ + ρ^{-1}) > 1 and small axis \tfrac{1}{2}(ρ - ρ^{-1}) > 0.

[Fig. 229: the elliptic contours γ in the complex plane for ρ = 1, 1.2, 1.4, 1.6, 1.8, 2.]

Appealing to geometric evidence, we find dist(γ, [-1, 1]) = \tfrac{1}{2}(ρ + ρ^{-1}) - 1, which gives another term in (6.1.71).

The rationale for choosing this particular integration contour is that the cos in its definition nicely cancels the arccos in the formula for the Chebychev polynomials. This lets us compute (s := n + 1)
for all 0 ≤ θ ≤ 2π, which provides a lower bound for |w_n| on γ. Plugging all these estimates into (6.1.71) we arrive at

\|f - L_T f\|_{L^∞([-1,1])} \le \frac{2|γ|}{π (ρ^{n+1} - 1)(ρ + ρ^{-1} - 2)} \cdot \max_{z∈γ} |f(z)| .   (6.1.98)

Note that instead of the nodal polynomial w we have inserted T_{n+1} into (6.1.71), of which w is a simple multiple. The factor will cancel.
[Fig. 230: L^∞- and L²-norms of the Chebychev interpolation error versus the polynomial degree n.]

(Faster) exponential convergence than on the interval I = ]-5, 5[:  \|f - I_n f\|_{L^2([-1,1])} ≈ 0.42^n .
Explanation, cf. Rem. 6.1.96: for I = [−1, 1] the poles ±i of f are farther away relative to the size of the
interval than for I = [−5, 5].
We recover the point value p(x) as the point value of another polynomial of degree n - 1 with known Chebychev expansion:

p(x) = \sum_{j=0}^{n-1} \tilde α_j T_j(x) \quad\text{with}\quad \tilde α_j = \begin{cases} α_j + 2xα_{j+1} , & \text{if } j = n-1 , \\ α_j - α_{j+2} , & \text{if } j = n-2 , \\ α_j & \text{else.} \end{cases}   (6.1.103)
C++11 code 6.1.105: Clenshaw algorithm for evaluation of Chebychev expansion (6.1.101)

// Clenshaw algorithm for evaluating p = \sum_{j=1}^{n+1} a_j T_{j-1}
// at the points passed in the vector x
// IN : a = (α_j), coefficients for p = \sum_{j=1}^{n+1} α_j T_{j-1}
//      x = (many) evaluation points
// OUT: values p(x_j) for all j
VectorXd clenshaw(const VectorXd& a, const VectorXd& x) {
  const int n = a.size() - 1;    // degree of polynomial
  MatrixXd d(n + 1, x.size());   // temporary storage for intermediate values
  for (int c = 0; c < x.size(); ++c) d.col(c) = a;
  for (int j = n - 1; j > 0; --j) {
    d.row(j) += 2 * x.transpose().cwiseProduct(d.row(j + 1));  // see (6.1.103)
    d.row(j - 1) -= d.row(j + 1);
  }
  return d.row(0) + x.transpose().cwiseProduct(d.row(1));
}
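A possible invocation of clenshaw() could look as follows; the coefficient values are arbitrary and serve only as an example.

#include <Eigen/Dense>
#include <iostream>
// assumes the function clenshaw() from Code 6.1.105 above
int main() {
  Eigen::VectorXd a(3), x(3);
  a << 1.0, 0.5, 0.25;   // arbitrary coefficients alpha_0, alpha_1, alpha_2
  x << -1.0, 0.0, 1.0;   // evaluation points
  std::cout << clenshaw(a, x).transpose() << std::endl;   // values of p at the points
}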
Chebychev interpolation is a linear interpolation scheme, see § 5.1.13. Thus, the expansion α j in (6.1.101)
can be computed by solving a linear system of equations of the form (5.1.15). However, for Chebychev
interpolation this linear system can be cast into a very special form, which paves the way for its fast direct
solution:
Task: Efficiently compute the Chebychev expansion coefficients α_j in (6.1.101) from the interpolation conditions

p(t_k) = f(t_k) ,\; k = 0, ..., n , \quad\text{for the Chebychev nodes}\quad t_k := \cos\Big( \frac{2k+1}{2(n+1)} π \Big) .   (6.1.107)
Trick: transformation of p into a 1-periodic function, which turns out to be a Fourier sum (= finite Fourier series):

q(s) := p(\cos 2πs) \overset{\text{Def. 6.1.76}}{=} \sum_{j=0}^{n} α_j T_j(\cos 2πs) = \sum_{j=0}^{n} α_j \cos(2πjs)
      = \sum_{j=0}^{n} \tfrac{1}{2} α_j \big( \exp(2πıjs) + \exp(-2πıjs) \big) \qquad [\text{ by } \cos z = \tfrac{1}{2}(e^{ız} + e^{-ız}) ]
      = \sum_{j=-n}^{n+1} β_j \exp(-2πıjs) , \quad\text{with}\quad β_j := \begin{cases} 0 , & \text{for } j = n+1 , \\ \tfrac{1}{2} α_j , & \text{for } j = 1, ..., n , \\ α_0 , & \text{for } j = 0 , \\ \tfrac{1}{2} α_{-j} , & \text{for } j = -n, ..., -1 . \end{cases}   (6.1.108)
Transformed interpolation conditions (6.1.107) for q:

t = \cos 2πs \;\overset{(6.1.107)}{\Longrightarrow}\; q\Big( \frac{2k+1}{4(n+1)} \Big) = y_k := f(t_k) ,\; k = 0, ..., n .   (6.1.109)

This is an interpolation problem for equidistant points on the unit circle as we have seen them in Section 5.6.3. Since q(s) = q(1 - s), (6.1.109) also enforces

q\Big( 1 - \frac{2k+1}{4(n+1)} \Big) = y_k ,\; k = 0, ..., n .

[Fig. 231, 232: the resulting 2(n+1) interpolation points for q in [0, 1], located symmetrically with respect to s = 1/2.]
Trigonometric interpolation at equidistant points can be done very efficiently by means of FFT-based al-
gorithms, see Code 5.6.15. We can also apply these for the computation of Chebychev expansion coeffi-
cients.
q\Big( \frac{k}{2(n+1)} + \frac{1}{4(n+1)} \Big) = \sum_{j=-n}^{n+1} β_j \exp\Big( -\frac{2πıj}{4(n+1)} \Big) \exp\Big( -\frac{2πı}{2(n+1)} kj \Big) = z_k

\Updownarrow

\sum_{j=0}^{2n+1} β_{j-n} \exp\Big( -\frac{2πı(j-n)}{4(n+1)} \Big) \underbrace{\exp\Big( -\frac{2πı}{2(n+1)} kj \Big)}_{= ω_{2(n+1)}^{kj} !} = \exp\Big( -πı \frac{nk}{n+1} \Big) z_k ,\; k = 0, ..., 2n+1 .

\Updownarrow

F_{2(n+1)}\, \mathbf{c} = \mathbf{b} \quad\text{with}\quad \mathbf{c} = \Big[ β_{j-n} \exp\Big( -\tfrac{2πı(j-n)}{4(n+1)} \Big) \Big]_{j=0}^{2n+1} ,\; \mathbf{b} = \Big[ \exp\Big( -πı \tfrac{nk}{n+1} \Big) z_k \Big]_{k=0}^{2n+1} .   (6.1.111)
Computers use approximation by sums of Chebychev polynomials in the computation of functions like
log, exp, sin, cos, . . .. The evaluation by means of Clenshaw algorithm according to Code 6.1.105 is more
efficient and stable than the approximation by Taylor polynomials.
There is a particular family of norms for which the best approximant of a function f in a finite dimensional
function space VN , that is, the element of VN that is closest to f with respect to that particular norm can
actually be computed. It turns out that this computation boils down to solving a kind of least squares
problem, similar to the least squares problems in K n discussed in Chapter 3.
Concerning mean square best approximation it is useful to learn an abstract framework first into which the
concrete examples can be fit later.
Mean square norms generalize the Euclidean norm on K n , see [?, Sect. 4.4]. In a sense, they endow a
vector space with a geometry and give a meaning to concepts like “orthogonality”.
Let V be a vector space over the field K. A mapping b : V × V → K is called an inner product on
V , if it satisfies
(i) b is linear in the first argument: b(αv + βw, u) = αb(v, u) + βb(w, u) for all α, β ∈ K,
u, v, w ∈ V ,
(ii) b is (anti-)symmetric: b(v, w) = b(w, v) ( = ˆ complex conjugation),
(iii) b is positive definite: v 6= 0 ⇔ b(v, v) > 0.
b is a semi-inner product, if it still complies with (i) and (ii), but is only positive semi-definite:
b(v, v) ≥ 0 for all v ∈ V .
✎ notation: usually we write (·, ·)V for an inner product on the vector space V .
Let V be a vector space equipped with a (semi-)inner product (·, ·)V . Any two elements v and w of
V are called orthogonal, if (v, w)V = 0. We write v ⊥ w.
If (·, ·)V is a (semi-)inner product (→ Def. 6.2.1) on the vector space V , then
q
kvkV := (v, v)V
defines a (semi-)norm (→ Def. 1.5.70) on V , the mean square (semi-)norm/ inner product
(semi-)norm induced by (·, ·)V .
✦ The Euclidean norm on K n induced by the dot product (Euclidean inner product).
n
(x, y)Kn := ∑ (x ) j (y) j [“Mathematical indexing” !] x, y ∈ K n .
j =1
From § 3.1.8 we know that in Euclidean space K n the best approximation of vector x ∈ K n in a subspace
V ⊂ K n is unique and given by the orthogonal projection of x onto V . Now we generalize this to vector
spaces equipped with inner products.
X =̂ a vector space over \mathbb{K} = \mathbb{R}, equipped with a mean square semi-norm \|\cdot\|_X induced by a semi-inner product (\cdot,\cdot)_X, see Thm. 6.2.3.
It can be an infinite dimensional function space, e.g., X = C0 ([ a, b]).
Assumption 6.2.6.
The semi-inner product (·, ·) X is a genuine inner product (→ Def. 6.2.1) on V , that is, it is positive
definite: (v, v) X > 0 ∀v ∈ V \ {0}.
Now we give a formula for the element q of V , which is nearest to a given element f of X with respect to
the norm k·k X . This is a genuine generalization of Thm. 3.1.10.
Theorem 6.2.7. Mean square norm best approximation through normal equations
k f − qk X = inf k f − p k X .
p ∈V
Proof. (inspired by Rem. 3.1.14) We first show that M is s.p.d. (→ Def. 1.1.8). Symmetry is clear from the
definition and the symmetry of (·, ·) X . That M is even positive definite follows from
x^H M x = \sum_{k=1}^{N} \sum_{j=1}^{N} ξ_k \overline{ξ_j} \big( b_k, b_j \big)_X = \Big\| \sum_{j=1}^{N} ξ_j b_j \Big\|_X^2 > 0 ,   (6.2.9)

if x := [ξ_j]_{j=1}^{N} ≠ 0 ⇔ \sum_{j=1}^{N} ξ_j b_j ≠ 0, since \|\cdot\|_X is a norm on V by Ass. 6.2.6.

Now, writing c := [γ_j]_{j=1}^{N} ∈ \mathbb{K}^N, b := \big[ (f, b_j)_X \big]_{j=1}^{N} ∈ \mathbb{K}^N, and using the basis representation q = \sum_{j=1}^{N} γ_j b_j ,
we find
Since M is s.p.d., the unique solution of grad Φ(c) = Mc − b = 0 yields the unique global minimizer of
Φ; the Hessian 2M is s.p.d. everywhere!
✷
( f − q, p) X = 0 ∀ p ∈ V ⇔ f −q ⊥ V .
The message of Cor. 6.2.12:
In Section 3.1.1 we introduced the concept of least squares solutions of overdetermined linear systems
of equations Ax = b, A ∈ R^{m,n}, m > n, see Def. 3.1.3. Thm. 3.1.10 taught that the normal equations A^⊤A x = A^⊤b give the least squares solution, if rank(A) = n.
In fact, Thm. 3.1.10 and the above Thm. 6.2.7 agree if X = K n (Euclidean space) and V = Span{a1 , . . . , an },
where a j ∈ R m are the columns of A and N = n.
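The following sketch illustrates Thm. 6.2.7 in a concrete situation: mean square best approximation in a polynomial space with respect to the discrete L²-inner product (6.2.22), by assembling and solving the normal equations. The monomial basis and all function names are chosen for illustration only (the monomial basis is convenient here, though not numerically optimal).

#include <Eigen/Dense>
#include <cmath>
#include <functional>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// coefficients c of the best approximant q = sum_j c_j t^(j-1), j = 1,...,N,
// in the sense of the discrete L2-inner product on the points t_0,...,t_n
VectorXd bestApproxCoeffs(const std::function<double(double)>& f,
                          const VectorXd& t, int N) {
  const int n = t.size();
  MatrixXd B(n, N);                     // B(i,j) = b_{j+1}(t_i) = t_i^j
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < N; ++j) B(i, j) = std::pow(t(i), j);
  VectorXd fv = t.unaryExpr(f);         // point values of f
  MatrixXd M = B.transpose() * B;       // Gram matrix M_kj = (b_j, b_k)_X
  VectorXd b = B.transpose() * fv;      // right hand side (f, b_k)_X
  return M.llt().solve(b);              // M is s.p.d. -> Cholesky solve
}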
In the setting of Section 6.2.1.2 we may ask: Which choice of basis B = {b1 , . . . , b N } of
V ⊂ X renders
the normal equations (6.2.8) particularly simple? Answer: A basis B, for which bk , b j X = δkj (δkj the
Kronecker symbol), because this will imply M = I for the coefficient matrix of the normal equations.
q = \sum_{j=1}^{N} \big( f, b_j \big)_X \, b_j .   (6.2.16)
From Section 1.5.1 we already know how to compute orthonormal bases: The algorithm from § 1.5.1 can
be run in the framework of any vector space V endowed with an inner product (·, ·)V and induced mean
square norm k·kV .
Its core steps are orthogonal projection onto the span of the already computed basis functions, followed by normalization b_j ← b_j / \|b_j\|_V, and it guarantees

Span\{b_1, ..., b_ℓ\} = Span\{p_1, ..., p_ℓ\} \quad\text{for all } ℓ ∈ \{1, ..., k\} .
This suggests the following alternative approach to the computation of the mean square best approximant
q in V of f ∈ X :
➊ Orthonormalize a basis {b1 , . . . , b N } of V , N := dim V , using Gram-Schmidt algorithm (6.2.18).
➋ Compute q according to (6.2.16).
Number of inner products to be evaluated: O( N 2 ) for N → ∞.
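A minimal sketch of this procedure for "functions" represented by their values in sample points, with the discrete L²-inner product (6.2.22) playing the role of (·,·)_V; all names are illustrative and no safeguard against (nearly) linearly dependent inputs is included.

#include <Eigen/Dense>
#include <cmath>
#include <vector>

using Eigen::VectorXd;

// inner product (f,g)_T = sum_j f(t_j) g(t_j) for sampled functions
double ip(const VectorXd& f, const VectorXd& g) { return f.dot(g); }

// Orthonormalize the sampled functions in B w.r.t. ip(); O(N^2) inner products
std::vector<VectorXd> gramSchmidt(std::vector<VectorXd> B) {
  for (std::size_t j = 0; j < B.size(); ++j) {
    for (std::size_t l = 0; l < j; ++l)      // subtract projections onto b_0,...,b_{j-1}
      B[j] -= ip(B[j], B[l]) * B[l];
    B[j] /= std::sqrt(ip(B[j], B[j]));       // normalization step b_j <- b_j/||b_j||
  }
  return B;
}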
To match the abstract framework of Section 6.2.1 we need to find (semi-)inner products on C0 ([ a, b]) that
supply positive definite inner products on Pm . The following options are commonly considered:
✦ On any interval [ a, b] we can use the L2 ([ a, b])-inner product (·, ·) L2 ([ a,b)] , defined in (6.2.5).
✦ Given a positive weight function w : [ a, b] → R, w(t) > 0 for all t ∈ [ a, b], we can consider the
weighted L2 -inner product on the interval [ a, b]
Z b
( f , g) w,[ a,b] := w(τ ) f (τ ) g (τ ) dτ . (6.2.21)
a
✦ For n ≥ m and n + 1 distinct points collected in the set T := {t0 , t1 , . . . , tn } ⊂ [ a, b] we can use
the discrete L2 -inner product
n
( f , g)T := ∑ f (t j ) g(t j ) . (6.2.22)
j =0
that is, multiplication with the independent variable can be shifted to the other function inside the inner
product.
✎ notation: Note that we have to plug a function into the slots of the inner products; this is indicated by
the notation {t 7→ . . .}.
The ideas of Section 6.2.1.3 that center around the use of orthonormal bases can also be applied to
polynomials.
The sequence of orthonormal polynomials from Def. 6.2.25 is unique up to signs, supplies an (·, ·) X -
orthonormal basis (→ Def. 6.2.14) of Pm , and satisfies
Proof. Comparing Def. 6.2.14 and (6.2.26) the ONB-property of {r0 , . . . , rm } is immediate. Then (6.2.26)
follows from dimensional considerations.
\mathcal{P}_{k-1} ⊂ \mathcal{P}_k has co-dimension 1, so that there is a unit “vector” in \mathcal{P}_k which is orthogonal to \mathcal{P}_{k-1} and unique up to sign.
✷
Hence s_k(t) := t \cdot r_k(t) is a polynomial of degree k + 1 with leading coefficient ≠ 0, that is s_k ∈ \mathcal{P}_{k+1} \setminus \mathcal{P}_k. Therefore, r_{k+1} can be obtained by orthogonally projecting s_k onto \mathcal{P}_k plus normalization, cf. Lines 4-5 of Algorithm (6.2.18):

r_{k+1} = ± \frac{\tilde r_{k+1}}{\|\tilde r_{k+1}\|_X} , \qquad \tilde r_{k+1} = s_k - \sum_{j=0}^{k} \big( s_k, r_j \big)_X r_j .   (6.2.30)

The sum in (6.2.30) collapses to two terms! In fact, since (r_k, q)_X = 0 for all q ∈ \mathcal{P}_{k-1}, by Ass. 6.2.24

\big( s_k, r_j \big)_X \overset{(6.2.23)}{=} \big( \{t \mapsto t r_k(t)\}, r_j \big)_X = \big( r_k, \{t \mapsto t r_j(t)\} \big)_X = 0 , \quad\text{if } j < k - 1 ,

because in this case \{t \mapsto t r_j(t)\} ∈ \mathcal{P}_{k-1}. As a consequence (6.2.30) reduces to the 3-term recursion

r_{k+1} = ± \frac{\tilde r_{k+1}}{\|\tilde r_{k+1}\|_X} , \qquad \tilde r_{k+1} = s_k - (\{t \mapsto t r_k(t)\}, r_k)_X \, r_k - (\{t \mapsto t r_k(t)\}, r_{k-1})_X \, r_{k-1} , \quad k = 1, ..., m-1 .   (6.2.31)
The 3-term recursion (6.2.31) can be recast in various ways. Forgoing normalization the next theorem
presents one of them.
Proof. (by rather straightforward induction) We first confirm, thanks to the definition of α1 ,
For the induction step we assume that the assertion is true for p0 , . . . , pk and observe that for pk+1
according to (6.2.33) we have
This amounts to the assertion of orthogonality for k + 1. Above, several inner products vanish because of the induction hypothesis!
✷
It is a natural question what is the unique sequence of L2 ([−1, 1])-orthonormal polynomials. Their rather
simple characterization will be discussed in the sequel.
Legendre polynomials

The Legendre polynomials P_n can be defined by the 3-term recursion

P_{n+1}(t) := \frac{2n+1}{n+1} t P_n(t) - \frac{n}{n+1} P_{n-1}(t) , \qquad P_0 := 1 ,\; P_1(t) := t .   (7.3.33)

[Fig. 234: the Legendre polynomials P_n on [-1, 1] for n = 0, ..., 5.]
Since they involve integrals, weighted L²-inner products (6.2.21) are not computationally accessible, unless one resorts to approximation, see Chapter 7 for the corresponding theory and techniques.
Therefore, given a point set T := {t0 , t1 , . . . , tn }, we focus on the associated discrete L2 -inner product
n
( f , g) X := ( f , g) T := ∑ f (t j ) g(t j ) , f , g ∈ C0 ([ a, b]) , (6.2.22)
j =0
The polynomials pk generated by the 3-term recursion (6.2.33) from Thm. 6.2.32 are then called discrete
orthogonal polynomials. The following C++ code computes the recursion coefficients αk and β k , k =
1, . . . , n − 1.
C++11 code 6.2.39: Computation of weights in 3-term recursion for discrete orthogonal poly-
nomials
2 // Computation of coefficients α, β from 6.2.32
3 // IN : t = points in the definition of the discrete L2 -inner product
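Only the header comments of the listing are reproduced above. The following sketch shows one possible implementation, assuming the common recursion convention p_{k+1}(t) = (t - α_{k+1}) p_k(t) - β_{k+1} p_{k-1}(t) with p_0 = 1, p_{-1} = 0, which may differ in indexing from (6.2.33); the function name coeffortho is illustrative.

#include <Eigen/Dense>

using Eigen::VectorXd;

// recursion coefficients for polynomials orthogonal w.r.t. the discrete
// L2-inner product (6.2.22) on the points t
void coeffortho(const VectorXd& t, int n, VectorXd& alpha, VectorXd& beta) {
  const int m = t.size();
  alpha.resize(n); beta.resize(n);
  VectorXd p_old = VectorXd::Zero(m);      // p_{-1} sampled on the points
  VectorXd p = VectorXd::Ones(m);          // p_0
  for (int k = 0; k < n; ++k) {
    const double np2 = p.squaredNorm();    // (p_k, p_k)_T
    alpha(k) = t.cwiseProduct(p).dot(p) / np2;
    beta(k) = (k == 0) ? np2 : np2 / p_old.squaredNorm();
    // for k = 0 the beta-term is irrelevant, since p_{-1} = 0
    VectorXd p_new = (t.array() - alpha(k)).matrix().cwiseProduct(p) - beta(k) * p_old;
    p_old = p; p = p_new;
  }
}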
Given a point set T := {t0 , t1 , . . . , tn } ⊂ [ a, b], and a function f : [ a, b] → K, we may seek to ap-
proximate f by its polynomial best approximant with respect to the discrete L2 -norm k·kT induced by the
discrete L2 -inner product (6.2.22).
q_k := \operatorname*{argmin}_{p ∈ \mathcal{P}_k} \|f - p\|_T ,\; k ∈ \{0, ..., n\} ,
The stable and efficient computation of fitting polynomials can rely on combining Thm. 6.2.32 with Cor. 6.2.15:
We use equidistant points T := \{ t_k = -1 + k \tfrac{2}{m} ,\; k = 0, ..., m \} ⊂ [-1, 1], m ∈ \mathbb{N}, to compute fitting polynomials (→ Def. 6.2.41) for two different functions. We monitor the L²-norm and L^∞-norm of the approximation error, both norms approximated by sampling in ξ_j = -1 + \tfrac{j}{500}, j = 0, ..., 1000.
➀ f (t) = (1 + (5t)2 )−1 , I = [−1, 1] → Ex. 6.1.41, analytic in complex neighborhood of [−1, 1]:
[Fig. 235: f and its fitting polynomials of degrees n = 0, 2, 4, 6, 8, 10 (equidistant points); Fig. 236: L^∞- and L²-norms of the fitting error versus the polynomial degree n for m = 50, 100, 200, 400.]
➣ We observe exponential convergence (→ Def. 6.1.38) in the polynomial degree n.
[Fig. 237: the second test function and its fitting polynomials; Fig. 238: error norms versus the polynomial degree n (doubly logarithmic scale).]

➣ We observe only algebraic convergence (→ Def. 6.1.38) in the polynomial degree n (for n ≪ m!).
[Fig. 239: error norms versus the polynomial degree n (doubly logarithmic scale), confirming algebraic convergence.]
q ∈ \operatorname*{argmin}_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)} .
The results of Section 6.2.1 cannot be applied because the supremum norm is not induced by an inner
product on Pn .
Theory provides us with surprisingly precise necessary and sufficient conditions to be satisfied by the
polynomial L∞ ([ a, b])-best approximant q.
q = \operatorname*{argmin}_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)}

if and only if there exist n + 2 points a ≤ ξ_0 < ξ_1 < \dots < ξ_{n+1} ≤ b at which the error f - q attains its maximum modulus \|f - q\|_{L^∞([a,b])} with alternating signs (such points are called alternants).
The widely used iterative algorithm (Remez algorithm) for finding an L∞ -best approximant is motivated
by the alternation theorem. The idea is to determine successively better approximations of the set of
alternants: A(0) → A(1) → . . ., ♯A(l ) = n + 2.
Key is the observation that, due to the alternation theorem, the polynomial L∞ ([ a, b])-best approximant q
will satisfy (one of the) interpolation conditions
➁ Given approximate alternants A^{(l)} := \{ξ_0^{(l)} < ξ_1^{(l)} < \dots < ξ_n^{(l)} < ξ_{n+1}^{(l)}\} ⊂ [a, b], determine q ∈ \mathcal{P}_n and a deviation δ ∈ \mathbb{R} satisfying the extended interpolation conditions

q(ξ_k^{(l)}) + (-1)^k δ = f(ξ_k^{(l)}) ,\; k = 0, ..., n+1 .   (6.3.6)

After choosing a basis for \mathcal{P}_n, this is an (n+2) × (n+2) linear system of equations, cf. § 5.1.13.

➂ Choose A^{(l+1)} as the set of extremal points of f - q, truncated in case more than n + 2 of these exist. These extrema can be located approximately by sampling on a fine grid covering [a, b]. If the derivative of f ∈ C^1([a, b]) is available, too, then search for zeros of (f - q)' using the secant method from § 8.3.22.

➃ If \|f - q\|_{L^∞([a,b])} ≤ TOL · \|d\|_{L^∞([a,b])}, STOP; else GOTO ➁. (TOL is a prescribed relative tolerance.)
MATLAB code 6.3.7: Remez algorithm for uniform polynomial approximation on an interval
1  function c = remes(f,f1,a,b,d,tol)
2  % f is a handle to the function, f1 to its derivative
3  % d = polynomial degree (positive integer)
4  % a,b = interval boundaries
5  % returns coefficients of polynomial in monomial basis
6  % (MATLAB convention, see Rem. 5.2.4).
7
18 maxit = 10;
19 % Main iteration loop of Remez algorithm
20 for k=1:maxit
21   % Interpolation at d+2 points xe with deviations +-delta
22   % Algorithm uses monomial basis, which is not optimal
23   V=vander(xe); A=[V(:,2:d+2), (-1).^[0:d+1]']; % LSE
24   c=A\fxe; % Solve for coefficients of polynomial q
25   c1=[d:-1:1]'.*c(1:d); % Monomial coefficients of derivative q'
26
27   % Find initial guesses for the inner extremes by sampling; track sign
28   % changes of the derivative of the approximation error
29   deltab = (polyval(c1,xtab) - f1tab);
30   s=[deltab(1:n-1)].*[deltab(2:n)];
31   ind=find(s<0); xx0=xtab(ind); % approximate zeros of e'
32   nx = length(ind); % number of approximate zeros
33   % Too few extrema; bail out
34   if (nx < d), error('Too few extrema'); end
35
We examine the convergence of the Remez algorithm from Code 6.3.7 for two different functions:
[Fig. 241, 242: decay of the approximation error over the steps of the Remez algorithm for polynomial degrees n = 3, 5, 7, 9, 11 (semi-logarithmic scale), for the two test functions.]
Convergence in both cases; faster convergence observed for smooth function, for which machine precision
is reached after a few steps.
f ∈ C 0 (R ) , f ( t + 1) = f ( t ) ∀ t ∈ R .
The natural space for approximating generic periodic functions is a space of trigonometric polyno-
mials with the same period.
Remember from Def. 5.6.3: the space of 1-periodic trigonometric polynomials of degree 2n, n ∈ \mathbb{N}, is

\mathcal{P}_{2n}^T = \operatorname{Span}\big\{ t \mapsto 1, t \mapsto \sin(2πt), t \mapsto \cos(2πt), t \mapsto \sin(4πt), t \mapsto \cos(4πt), ..., t \mapsto \sin(2πnt), t \mapsto \cos(2πnt) \big\}   (6.4.1a)
= \operatorname{Span}\{ t \mapsto \exp(2πıkt) : k = -n, ..., n \} .   (6.4.1b)

Terminology: \mathcal{P}_{2n}^T =̂ space of trigonometric polynomials of degree 2n.
From Section 5.6 remember a few more facts about trigonometric polynomials and trigonometric interpo-
lation:
✦ Cor. 5.6.8: dimension of the space of trigonometric polynomials: \dim \mathcal{P}_{2n}^T = 2n + 1.
✦ Trigonometric interpolation can be reduced to polynomial interpolation on the unit circle S^1 ⊂ \mathbb{C} in the complex plane, see (5.6.7). This gives existence & uniqueness of the trigonometric interpolant q satisfying (6.4.3) and (6.4.4).
✦ Code 5.6.15: efficient FFT-based algorithms for trigonometric interpolation in the equidistant nodes t_k = \frac{k}{2n+1}, k = 0, ..., 2n.
The relationship of trigonometric interpolation and polynomial interpolation on the unit circle suggests a
uniform distribution of nodes for general trigonometric interpolation.
✎ notation: trigonometric interpolation operator in the 2n + 1 equidistant nodes t_k = \frac{k}{2n+1}, k = 0, ..., 2n:

T_n : C^0([0,1[) → \mathcal{P}_{2n}^T , \qquad T_n(f)(t_k) = f(t_k) \;\; ∀ k ∈ \{0, ..., 2n\} .   (6.4.6)

We are interested in the behavior of the interpolation error f - T_n(f) for functions f : [0,1[ → \mathbb{C} with different smoothness properties. To begin with we perform an empiric study.
Now we study the asymptotic behavior of the error of equidistant trigonometric interpolation as n → ∞ in
a numerical experiment for functions with different smoothness properties.
[Fig. 244, 245: L^∞- and L²-norms of the trigonometric interpolation error versus n for the three test functions #1, #2, #3 (doubly logarithmic scale).]
We conclude that in this experiment higher smoothness of f leads to faster convergence of the trigonometric interpolant.

Of course, the smooth trigonometric interpolants of the step function fail to converge in L^∞-norm in Exp. 6.4.7.
Moreover, they will not even converge “visually” to the step function, which becomes manifest by a closer
inspection of the interpolants.
[Fig.: the step function f and its trigonometric interpolants p for n = 16 (left) and n = 128 (right).]
Observation: overshooting in neighborhood of discontinuity: Gibbs phenomenon
(6.4.10) Aliasing
We study the action of the trigonometric interpolation operator Tn from (6.4.6) on individual Fourier modes
µk (t) := exp(−2πkıt), t ∈ R, k ∈ Z. Due to the 1-periodicity of t 7→ exp(2πıt) we find for every node
t_j := \frac{j}{2n+1}, j = 0, ..., 2n:

μ_k(t_j) = \exp\big( -2πık \tfrac{j}{2n+1} \big) = \exp\big( -2πı (k - ℓ(2n+1)) \tfrac{j}{2n+1} \big) = μ_{k - ℓ(2n+1)}(t_j) \quad ∀ ℓ ∈ \mathbb{Z} .

When sampled on the node set T_n := \{t_0, ..., t_{2n}\}, all the Fourier modes μ_{k - ℓ(2n+1)}, ℓ ∈ \mathbb{Z}, yield the same values. Thus trigonometric interpolation cannot distinguish them! This phenomenon is called aliasing.
Aliasing demonstrated for f (t) = sin(2π · 19t) = Im(exp(2πı19t)) for different node sets.
[Fig.: f(t) = sin(2π·19t) and its trigonometric interpolants p for three different numbers of equidistant nodes, illustrating aliasing.]
T_n μ_k = μ_{\tilde k} , \quad \tilde k ∈ \{-n, ..., n\} ,\; k - \tilde k ∈ (2n+1)\mathbb{Z} \qquad [\, \tilde k := k \bmod (2n+1) \,] .   (6.4.11)

(\tilde n = n, \widetilde{n+1} = -n, \widetilde{-n-1} = n, \widetilde{2n} = -1, etc.)
Trigonometric interpolation by Tn folds all Fourier modes (“frequencies”) to the finite range {−n, . . . , n}.
From (6.4.11), by linearity of T_n, we obtain for f : [0,1[ → \mathbb{C} in Fourier series representation

f(t) = \sum_{j=-∞}^{∞} \hat f_j μ_j(t) \quad\Longrightarrow\quad T_n(f)(t) = \sum_{j=-n}^{n} γ_j μ_j(t) , \quad γ_j = \sum_{ℓ=-∞}^{∞} \hat f_{j + ℓ(2n+1)} .   (6.4.12)

We can read the trigonometric polynomial T_n f ∈ \mathcal{P}_{2n}^T as a Fourier series with non-zero coefficients only in the index range \{-n, ..., n\}. Thus, for the Fourier coefficients of the trigonometric interpolation error E(t) := f(t) - T_n f(t) we find from (6.4.12)

\hat E_j = \begin{cases} -\sum_{ℓ ∈ \mathbb{Z} \setminus \{0\}} \hat f_{j + ℓ(2n+1)} , & \text{if } j ∈ \{-n, ..., n\} , \\ \hat f_j , & \text{if } |j| > n , \end{cases} \qquad j ∈ \mathbb{Z} .   (6.4.13)
In order to estimate these norms of the trigonometric interpolation error we need quantitative information
about the decay of the Fourier coefficients fbj as | j| → ∞.
For 1-periodic c ∈ C^0(\mathbb{R}) with integrable derivative \dot c := \frac{d}{dt}c we find by integration by parts (the boundary terms cancel due to periodicity)

\hat c_j = \int_0^1 c(t) e^{2πıjt}\, dt = -\frac{1}{2πıj} \int_0^1 \dot c(t) e^{2πıjt}\, dt = (-2πıj)^{-1} \hat{\dot c}_j , \quad j ≠ 0 .

We can also arrive at this formula by (formal) term-wise differentiation of the Fourier series:

c(t) = \sum_{j=-∞}^{∞} \hat c_j e^{-2πıjt} \quad\Longrightarrow\quad \dot c(t) = \sum_{j=-∞}^{∞} \underbrace{(-2πıj)\,\hat c_j}_{= \hat{\dot c}_j} e^{-2πıjt} .   (6.4.18)
For the Fourier coefficients of the derivatives of a 1-periodic function f ∈ C^{k-1}(\mathbb{R}), k ∈ \mathbb{N}, with integrable k-th derivative f^{(k)}, it holds that

\widehat{\big( f^{(k)} \big)}_j = (-2πıj)^k \hat f_j , \quad j ∈ \mathbb{Z} .
The smoother a periodic function the faster the decay of its Fourier coefficients
The isometry property of Thm. 4.2.89 also yields for f ∈ C^{k-1}(\mathbb{R}) with f^{(k)} ∈ L^2(]0,1[) that

\big\| f^{(k)} \big\|_{L^2(]0,1[)}^2 = (2π)^{2k} \sum_{j=-∞}^{∞} j^{2k} |\hat f_j|^2 .   (6.4.24)
We can now combine the identity (6.4.24) with (6.4.15) and obtain an interpolation error estimate in
L2 (]0, 1[)-norm.
with c_k := 2 \sum_{ℓ=1}^{∞} (2ℓ - 1)^{-2k} < ∞.
Thm. 6.4.25 confirms algebraic convergence of the L2 -norm of the trigonometric interpolation error for
functions with limited smoothness. Higher rates can be expected for smoother functions, which we have
also found in cases #1 and #3 in Exp. 6.4.7.
In § 6.1.62 we saw that we can expect exponential decay of the maximum norm of polynomial interpolation
errors in the case of “very smooth” interpolands. To capture this property of functions we resorted to
the notion of analytic functions, as defined in Def. 6.1.63. Since trigonometric interpolation is closely
connected to polynomial interpolation (on the unit circle S1 , see Section 5.6.2), it is not surprising that
analyticity of interpolands will also involve exponential convergence of trigonometric interpolants. This
result will be established in this section.
In case #2 of Exp. 6.4.7 we already saw an instance of exponential convergence for an analytic interpoland. A more detailed study follows.
f(t) = \frac{1}{\sqrt{1 - α \sin(2πt)}} \quad\text{on } I = [0, 1] .   (6.4.28)

[Fig.: L^∞- and L²-norms of the trigonometric interpolation error versus n for different values of α (e.g. α = 0.5), semi-logarithmic scale.]
Lemma 6.4.22 asserts algebraic decay of the Fourier coefficients of functions with limited smoothness. As analytic 1-periodic functions are “infinitely smooth” — they always belong to C^∞(\mathbb{R}) — we expect a stronger result in this case. In fact, we can conclude exponential decay of the Fourier coefficients.

Proof. [Fig. 247: shifting the path of integration from the real axis to the line \mathbb{R} + ır in the z-plane.] With g_r(t) := f(t + ır) we compute

\widehat{(g_r)}_k = \int_0^1 f(t + ır) e^{-2πıkt}\, dt = \int_0^1 f(t) e^{-2πık(t - ır)}\, dt = e^{-2πrk} \hat f_k ,
Knowing exponential decay of the Fourier coefficients, the geometric sum formula can be used to extract
estimates from (6.4.15) and (6.4.16):
Lemma 6.4.32. Interpolation error estimates for exponentially decaying Fourier coefficients
This estimate can be combined with the result of Thm. 6.4.30 and gives the main result of this section:
The speed of exponential convergence clearly depends on the width η of the “strip of analyticity” S̄.
Similar to Chebychev interpolants, also trigonometric interpolants converge exponentially fast, if the inter-
poland f is 1-periodic analytic (→ Def. 6.1.63) in a strip around the real axis in C, see Thm. 6.4.33 for
details.
1 + α \sin(2πz) ∉ \mathbb{R}_0^- \;⇔\; \sin(2πz) = \sin(2πx)\cosh(2πy) + ı\cos(2πx)\sinh(2πy) ∉ \big]{-∞}, -1 - \tfrac{1}{α}\big] ,

so that the domain of analyticity of f is

\mathbb{C} \setminus \bigcup_{k ∈ \mathbb{Z}} \Big( \tfrac{2k+1}{4} + ı\big( \mathbb{R} \setminus ]{-ζ}, ζ[ \big) \Big) , \quad ζ ∈ \mathbb{R}^+ ,\; \cosh(2πζ) = 1 + \tfrac{1}{α} .
[Fig. 248, 249: the graph of cosh and the strip of analyticity of f in the complex plane.]

➣ f is analytic in the strip S := \{ z ∈ \mathbb{C} : -ζ < \operatorname{Im}(z) < ζ \}.
➣ As α decreases, the strip of analyticity becomes wider, since x \mapsto \cosh(x) is increasing for x > 0.
(6.5.1) Grid/mesh
The attribute “piecewise” refers to a partitioning of the interval on which we aim to approximate. In the case of data interpolation the natural choice was to use the intervals defined by the interpolation nodes. Yet we already saw exceptions in the case of shape-preserving interpolation by means of quadratic splines, see Section 5.5.3.

In the case of function approximation based on an interpolation scheme the additional freedom to choose the interpolation nodes suggests that those be decoupled from the partitioning.

Borrowing from the terminology for splines, cf. Def. 5.5.1, the underlying mesh for piecewise polynomial approximation on [a, b] is a partition M := \{a = x_0 < x_1 < \dots < x_m = b\}.

Terminology:
✦ x_j =̂ nodes of the mesh M,
✦ [x_{j-1}, x_j[ =̂ intervals/cells of the mesh,
✦ h_M := \max_j |x_j - x_{j-1}| =̂ mesh width,
✦ if x_j = a + jh =̂ equidistant (uniform) mesh with meshwidth h > 0.

[Fig.: a mesh with nodes x_0, ..., x_{14} on [a, b].]
We will see that most approximation schemes relying on piecewise polynomials are local in the sense that
finding the approximant on a cell of the mesh relies only on a fixed number of function evaluations in a
neighborhood of the cell.
Recall theory of polynomial interpolation → Section 5.2.2: n + 1 data points needed to fix interpolating
polynomial, see Thm. 5.2.14.
Obviously, IM depends on M, the local degrees n j , and the sets T j of local interpolation points (the latter
two are suppressed in notation).
then the piecewise polynomial Lagrange interpolant according to (6.5.5) is continuous on [ a, b]:
s ∈ C0 ([ a, b]).
\|f - I_M f\| \le C\, T(N) \quad\text{for } N → ∞ , \qquad\text{where } N := \sum_{j=1}^{m} (n_j + 1) .   (6.5.9)
But why do we choose this strange number N as parameter when investigating the approximation error?
Because, by Thm. 5.2.2, it agrees with the dimension of the space of discontinuous, piecewise polynomials
functions
{q : [ a, b] → R: q| Ij ∈ Pn j ∀ j = 1, . . . , m} !
This dimension tells us the number of real parameters we need to describe the interpolant s, that is, the
“information cost” of s. N is also proportional to the number of interpolation conditions, which agrees with
the number of f -evaluations needed to compute s (why only proportional in general?).
Compare Exp. 5.3.7: grid M := \{-5, -\tfrac{5}{2}, 0, \tfrac{5}{2}, 5\}, local interpolation nodes equidistant in I_j, endpoints included; piecewise linear and piecewise quadratic polynomial interpolants of f(t) = \arctan(t).

[Fig. 250: \arctan(t) together with its piecewise linear and piecewise quadratic interpolants on [-5, 5].]

✦ Sequence of (equidistant) meshes: M_i := \{-5 + j \cdot 2^{-i} \cdot 10\}_{j=0}^{2^i}, i = 1, ..., 6.
✦ Equidistant local interpolation nodes (endpoints of grid intervals included).

Monitored: interpolation error in (approximate) L^∞- and L²-norms, see (6.1.95), (6.1.94).
[Fig. 251, 252: L^∞- and L²-norms of the interpolation error versus the mesh width h for local polynomial degrees 1, ..., 6 (doubly logarithmic scale).]
(nearly linear error norm graphs in doubly logarithmic scale, see Rem. 6.1.40)
                  n      1       2       3       4       5       6
rate w.r.t. L²-norm   1.9957  2.9747  4.0256  4.8070  6.0013  5.2012
rate w.r.t. L∞-norm   1.9529  2.8989  3.9712  4.7057  5.9801  4.9228
➣ Higher polynomial degree provides faster algebraic decrease of interpolation error norms. Empiric
evidence for rates α = p + 1
Here: rates estimated by linear regression (→ Ex. 3.1.5) based on MATLAB’s polyfit and the interpo-
lation errors for meshwidth h ≤ 10 · 2−5. This was done in order to avoid erratic “preasymptotic”, that is,
for large meshwidth h, behavior of the error.
The bad rates for n = 6 are probably due to the impact of roundoff, because the norms of the interpolation
error had dropped below machine precision, see Fig. 251, 252.
The observations made in Ex. 6.5.10 are easily explained by applying the polynomial interpolation error
estimates of Section 6.1.2 locally on the mesh intervals [ x j−1, x j ], j = 1, . . . , m: for constant polynomial
degree n = n j , j = 1, . . . , m, we get
(6.1.50) \;⇒\; \|f - s\|_{L^∞([x_0,x_m])} \le \frac{h_M^{n+1}}{(n+1)!} \big\| f^{(n+1)} \big\|_{L^∞([x_0,x_m])} .   (6.5.12)
[Fig. 253, 254: L^∞- and L²-norms of the interpolation error versus the local polynomial degree for mesh widths h = 5, 2.5, 1.25, 0.625, 0.3125 (semi-logarithmic scale).]
In this example we deal with an analytic function, see Rem. 6.1.96. Though equidistant local interpolation nodes are used, cf. Ex. 6.1.41, the mesh intervals seem to be small enough that even in this case exponential convergence prevails.
See Section 5.4 for definition and algorithms for cubic Hermite interpolation of data points, with a focus
on shape preservation, however. If the derivative f ′ of the interpoland f is available (in procedural form),
then it can be used to fix local cubic polynomials by prescribing point values and derivative values in the
endpoints of grid intervals.
Definition 6.5.14. Piecewise cubic Hermite interpolant (with exact slopes) → Def. 5.4.1
Given f ∈ C^1([a, b]) and a mesh M := \{a = x_0 < x_1 < \dots < x_{m-1} < x_m = b\}, the piecewise cubic Hermite interpolant (with exact slopes) s : [a, b] → \mathbb{R} is defined as

s|_{[x_{j-1}, x_j]} ∈ \mathcal{P}_3 ,\; j = 1, ..., m , \qquad s(x_j) = f(x_j) ,\; s'(x_j) = f'(x_j) ,\; j = 0, ..., m .
Clearly, the piecewise cubic Hermite interpolant is continuously differentiable: s ∈ C1 ([ a, b]), cf. Cor. 5.4.2.
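For illustration, a minimal sketch of how such an interpolant can be evaluated at a single point, using the local cubic Hermite basis on each mesh interval; the function name and the simple interval search are illustrative, not a lecture code.

#include <Eigen/Dense>

// Evaluate the piecewise cubic Hermite interpolant of Def. 6.5.14 at x,
// given mesh nodes xs, values y = f(xs) and exact slopes c = f'(xs)
double hermiteEval(const Eigen::VectorXd& xs, const Eigen::VectorXd& y,
                   const Eigen::VectorXd& c, double x) {
  const int m = xs.size() - 1;                 // number of mesh intervals
  int j = 0;                                   // locate interval [x_j, x_{j+1}] containing x
  while (j < m - 1 && x > xs(j + 1)) ++j;
  const double h = xs(j + 1) - xs(j);
  const double s = (x - xs(j)) / h;            // local coordinate in [0,1]
  // cubic Hermite basis polynomials on the reference interval
  const double H1 = 1 - 3*s*s + 2*s*s*s, H2 = 3*s*s - 2*s*s*s;
  const double H3 = s - 2*s*s + s*s*s,   H4 = -s*s + s*s*s;
  return y(j)*H1 + y(j + 1)*H2 + h*(c(j)*H3 + c(j + 1)*H4);
}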
In this experiment we study the h-convergence of cubic Hermite interpolation (with exact slopes) for a smooth function: f(x) = \arctan(x) on the domain I = (-5, 5), with equidistant meshes.

[Fig. 255: sup-norm and L²-norm of the interpolation error versus the meshwidth h (doubly logarithmic scale).]
C++11 code 6.5.16: Hermite approximation and orders of convergence with exact slopes
The observation made in Exp. 6.5.15 matches the theoretical prediction of the rate of algebraic conver-
gence for cubic Hermite interpolation with exact slopes for a smooth function.
Let s be the cubic Hermite interpolant of f ∈ C4 ([ a, b]) on a mesh M := { a = x0 < x1 < . . . <
xm−1 < xm = b} according to Def. 6.5.14. Then
\|f - s\|_{L^∞([a,b])} \le \frac{1}{4!} h_M^4 \big\| f^{(4)} \big\|_{L^∞([a,b])} ,
In Section 5.4.2 we saw variants of cubic Hermite interpolation, for which the slopes c_j = s'(x_j) were computed from the values y_j in a preprocessing step. Now we study the use of such a scheme for approximation.

Piecewise cubic Hermite interpolation of f(x) = \arctan(x) on the domain I = (-5, 5), with an equidistant mesh T in I, see Exp. 6.5.15.

[Fig.: sup-norm and L²-norm of the interpolation error versus the meshwidth h (doubly logarithmic scale).]
We observe a lower rate of algebraic convergence compared to the use of exact slopes, due to the averaging (5.4.8). From the plot we deduce O(h³) asymptotic decay of the L²- and L^∞-norms of the approximation error for meshwidth h → 0.
88   c(n) = delta(n - 1);
89   for (unsigned i = 1; i < n; ++i) {
90     c(i) = (h(i)*delta(i - 1) + h(i - 1)*delta(i)) / (t(i + 1) - t(i - 1));
91   }
92   return c;
93 }
94
95 // Appends an Eigen::VectorXd to another Eigen::VectorXd
96 void append(VectorXd& x, const VectorXd& y) {
97   x.conservativeResize(x.size() + y.size());
98   x.tail(y.size()) = y;
99 }
Recall the concept and algorithms for cubic spline interpolation from Section 5.5.1. As an interpolation scheme it can also serve as the foundation for an approximation scheme according to § 6.0.6: the mesh will double as knot set, see Def. 5.5.1. Cubic spline interpolation is not local, as we saw in § 5.5.19. Nevertheless, cubic spline interpolants can be computed with an effort of O(m) as elaborated in § 5.5.5.
We take I = [-1, 1] and rely on an equidistant mesh (knot set) M := \{-1 + \tfrac{2}{n} j\}_{j=0}^{n}, n ∈ \mathbb{N} ➙ meshwidth h = 2/n.

We study h-convergence of complete (→ § 5.5.11) cubic spline interpolation, where the slopes at the endpoints of the interval are made to agree with the derivatives of the interpoland at these points. As interpolands we consider

f_1(t) = \frac{1}{1 + e^{-2t}} ∈ C^∞(I) , \qquad f_2(t) = \begin{cases} 0 , & \text{if } t < -\tfrac{2}{5} , \\ \tfrac{1}{2}\big( 1 + \cos(π(t - \tfrac{3}{5})) \big) , & \text{if } -\tfrac{2}{5} < t < \tfrac{3}{5} , \\ 1 & \text{otherwise,} \end{cases} \quad f_2 ∈ C^1(I) .
[Fig. 257, 258: L^∞- and L²-norms of the complete cubic spline interpolation error for f_1 and f_2 versus the meshwidth h (doubly logarithmic scale).]
We observe algebraic order of convergence in h with empiric rate approximately given by min{1 +
regularity of f , 4}.
We remark that there is the following theoretical result [?], [?, Rem. 9.2]:
f ∈ C^4([t_0, t_n]) \;⇒\; \|f - s\|_{L^∞([t_0,t_n])} \le \frac{5}{384} h^4 \big\| f^{(4)} \big\|_{L^∞([t_0,t_n])} .
13   // build rhs
14   Eigen::VectorXd rhs(n - 1);
15   for (long i = 0; i < n - 1; ++i) {
16     rhs(i) = 3 * ((y(i + 1) - y(i)) / (h(i)*h(i)) + (y(i + 2) - y(i + 1)) / (h(i + 1)*h(i + 1)));
17   }
18   // modify according to complete cubic spline
19   rhs(0) -= b(0) * c0;
20   rhs(n - 2) -= b(n - 1) * cn;
21
45
46   // plot interpolation
47   mgl::Figure fig;
48   fig.title("Spline interpolation " + plotname);
49   fig.plot(t, feval(f, t), " m*").label("Data points");
50   fig.plot(x, fv, "b").label("f");
51   fig.plot(x, v, "r").label("Cubic spline interpolant");
52   fig.legend(1, 0);
53   fig.save("interp_" + plotname);
54
55   // plot error
56   mgl::Figure err;
57   err.title("Spline approximation error " + plotname);
58   err.setlog(true, true);
59   err.plot(h, errL2, "r;").label("L^2 norm");
60   err.plot(h, errInf, "b;").label("L^\\infty norm");
61   err.legend(1, 0);
62   err.save("approx_" + plotname);
63 }
Numerical Quadrature
Supplementary reading. Numerical quadrature is covered in [?, VII] and [?, Ch. 10].
Z
Numerical quadrature deals with the approximate numerical evaluation of integrals f (x) dx for a given
Ω
(closed) integration domain Ω ⊂ R d . Thus, the underlying problem in the sense of § 1.5.67 is the
mapping
C0 (Ω ) → RR
I: , (7.0.1)
f 7→ Ω f (x) dx
If f is complex-valued or vector-valued, then so is the integral. The methods presented in this chapter can
immediately be generalized to this case by componentwise application.
General methods for numerical quadrature should rely only on finitely many point evaluations of
the integrand.
☞ Numerical quadrature methods are key building blocks for so-called variational methods for the nu-
merical treatment of partial differential equations. A prominent example is the finite element method.
[Fig.: the integral \int_a^b f(t)\, dt visualized as the area under the graph of f.]
In Ex. 2.1.3 we learned about the nodal analysis of electrical circuits. Its application to a non-linear circuit
will be discussed in Ex. 8.0.1, which will reveal that every computation of currents and voltages can be
rather time-consuming. In this example we consider a non-linear circuit in quasi-stationary operation
(capacities and inductances are ignored). Then the computation of branch currents and nodal voltages
entails solving a non-linear system of equations.
The goal is to compute the energy dissipated by the circuit, which is equal to the energy injected by the
voltage source. This energy can be obtained by integrating the power P(t) = U (t) I (t) over period [0, T ]:
Z T
Wtherm = U (t) I (t) dt , where I = I (U ) .
0
double I(double U) involves solving non-linear system of equations, see Ex. 8.0.1!
This is a typical example where “point evaluation” by solving the non-linear circuit equations is the only
way to gather information about the integrand.
Contents
7.1 Quadrature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
7.2 Polynomial Quadrature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
7.3 Gauss Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
7.4 Composite Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
7.5 Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Quadrature formulas realize the approximation of an integral through finitely many point evaluations of the
integrand.
\int_a^b f(t)\, dt ≈ Q_n(f) := \sum_{j=1}^{n} w_j^n f(c_j^n) .   (7.1.2)

Terminology: w_j^n ∈ \mathbb{R}: quadrature weights;  c_j^n ∈ [a, b]: quadrature nodes.
Obviously (7.1.2) is compatible with integrands f given in procedural form as double f(double t),
compare § 7.0.2.
A single invocation costs n point evaluations of the integrand plus n additions and multiplications.
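A minimal sketch of (7.1.2) with the integrand passed in procedural form; the function name quadform is illustrative.

#include <Eigen/Dense>
#include <functional>

// apply the n-point quadrature formula with weights w and nodes c to f
double quadform(const Eigen::VectorXd& w, const Eigen::VectorXd& c,
                const std::function<double(double)>& f) {
  double Q = 0.0;
  for (int j = 0; j < w.size(); ++j) Q += w(j) * f(c(j));   // n point evaluations
  return Q;
}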
In the setting of function approximation by polynomials we learned in Rem. 6.1.18 that an approximation
schemes for any interval could be obtained from an approximation scheme on a single reference interval
([−1, 1] in Rem. 6.1.18) by means of affine pullback, see (6.1.22). A similar affine transformation technique
makes it possible to derive quadrature formula for an arbitrary interval from a single quadrature formula on
a reference interval.
Given: a quadrature formula \big( \hat c_j, \hat w_j \big)_{j=1}^{n} on the reference interval [-1, 1].

The affine mapping Φ : [-1, 1] → [a, b], τ \mapsto t := Φ(τ) := \tfrac{1}{2}(1 - τ)a + \tfrac{1}{2}(τ + 1)b, then yields

\int_a^b f(t)\, dt ≈ \tfrac{1}{2}(b - a) \sum_{j=1}^{n} \hat w_j \hat f(\hat c_j) = \sum_{j=1}^{n} w_j f(c_j) \quad\text{with}\quad c_j = \tfrac{1}{2}(1 - \hat c_j)a + \tfrac{1}{2}(1 + \hat c_j)b ,\; w_j = \tfrac{1}{2}(b - a)\hat w_j .

[Fig. 262: the affine transformation Φ mapping the reference interval [-1, 1] onto [a, b].]

In words, the nodes are just mapped through the affine transformation c_j = Φ(\hat c_j), the weights are scaled by the ratio of lengths of [a, b] and [-1, 1].

A 1D quadrature formula on arbitrary intervals can be specified by providing its weights \hat w_j / nodes \hat c_j for the integration domain [-1, 1] (reference interval). The above transformation is then assumed.
Another common choice for the reference interval: [0, 1], pay attention!
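A minimal sketch of this transformation for a rule given on the reference interval [-1, 1]; the function name transformRule is illustrative.

#include <Eigen/Dense>

// map nodes/weights (chat, what) on [-1,1] to (c, w) on [a,b]:
// c_j = Phi(chat_j), w_j = (b-a)/2 * what_j
void transformRule(const Eigen::VectorXd& chat, const Eigen::VectorXd& what,
                   double a, double b, Eigen::VectorXd& c, Eigen::VectorXd& w) {
  c = (0.5 * (1.0 - chat.array()) * a + 0.5 * (1.0 + chat.array()) * b).matrix();
  w = 0.5 * (b - a) * what;
}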
In many codes families of quadrature rules are used to control the quadrature error. Usually, suitable
sequences of weights wnj and nodes cnj are precomputed and stored in tables up to sufficiently large
values of n. A possible interface could be the following:
struct QuadTab {
  template <typename VecType>
  static void getrule(int n, VecType &c, VecType &w,
                      double a = -1.0, double b = 1.0);
};
Calling the method getrule() fills the vectors c and w with the nodes and the weights for a desired
n-point quadrature on [ a, b] with [−1, 1] being the default reference interval. For VecType we may assume
the basic functionality of Eigen::VectorXd.
Every approximation scheme A : C0 ([ a, b]) → V , V a space of “simple functions” on [ a, b], see § 6.0.5,
gives rise to a method for numerical quadrature according to
\int_a^b f(t)\, dt ≈ Q_{\mathsf{A}}(f) := \int_a^b (\mathsf{A} f)(t)\, dt .   (7.1.8)
As explained in § 6.0.6 every interpolation scheme IT based on the node set T = {t0 , t1 , . . . , tn } ⊂ [ a, b]
(→ § 5.1.4) induces an approximation scheme, and, hence, also a quadrature scheme on [ a, b]:
\int_a^b f(t)\, dt ≈ \int_a^b I_T\big[ f(t_0), ..., f(t_n) \big]^⊤(t)\, dt .   (7.1.9)
Every linear interpolation operator IT according to Def. 5.1.17 spawns a quadrature formula (→
Def. 7.1.1) by (7.1.9).
Hence, we have arrived at an n + 1-point quadrature formula with nodes t j , whose weights are the
integrals of the cardinal interpolants for the interpolation scheme T .
✷
Summing up, we have found:
In general the quadrature formula (7.1.2) will only provide an approximate value for the integral. As in the case of function approximation by interpolation, Section 6.1.2, our focus will be on the asymptotic behavior of the quadrature error as a function of the number n of point evaluations of the integrand, for families of quadrature rules described by
✦ quadrature weights \big\{ w_j^n ,\; j = 1, ..., n \big\}_{n ∈ \mathbb{N}} and
✦ quadrature nodes \big\{ c_j^n ,\; j = 1, ..., n \big\}_{n ∈ \mathbb{N}} .
Bounds for the maximum norm of the approximation error of an approximation scheme directly translate
into estimates of the quadrature error of the induced quadrature scheme (7.1.8):
\Big| \int_a^b f(t)\, dt - Q_{\mathsf{A}}(f) \Big| \le \int_a^b |f(t) - \mathsf{A}(f)(t)|\, dt \le |b - a|\, \|f - \mathsf{A}(f)\|_{L^∞([a,b])} .   (7.1.14)
Hence, the various estimates derived in Section 6.1.2 and Section 6.1.3.2 give us quadrature error esti-
mates “for free”. More details will be given in the next section.
Now we specialize the general recipe of § 7.1.7 for approximation schemes based on global polynomials,
the Lagrange approximation scheme as introduced in Section 6.1, Def. 6.1.32.
The cardinal interpolants for Lagrange interpolation are the Lagrange polynomials (5.2.11)

L_i(t) := \prod_{j=0, j≠i}^{n-1} \frac{t - t_j}{t_i - t_j} ,\; i = 0, ..., n-1 , \qquad\overset{(5.2.13)}{\leadsto}\qquad p_{n-1}(t) = \sum_{i=0}^{n-1} f(t_i) L_i(t) ,

so that

\int_a^b p_{n-1}(t)\, dt = \sum_{i=0}^{n-1} f(t_i) \int_a^b L_i(t)\, dt \qquad\leadsto\qquad \text{nodes } c_i = t_{i-1} ,\;\; \text{weights } w_i := \int_a^b L_{i-1}(t)\, dt .   (7.2.2)
\int_a^b f(t)\, dt ≈ Q_{\mathrm{mp}}(f) = (b - a)\, f\big( \tfrac{1}{2}(a + b) \big) .

✁ The area under the graph of f is approximated by the area of a rectangle of height f(\tfrac{1}{2}(a+b)) (“midpoint rule”).

[Fig. 263: the midpoint rule visualized as the area of a rectangle.]
The n := m + 1-point Newton-Cotes formulas arise from Lagrange interpolation in equidistant nodes
(6.1.34) in the integration interval [ a, b]:
Equidistant quadrature nodes $t_j := a + hj$, $h := \dfrac{b-a}{m}$, $j = 0,\ldots,m$:
The weights for the interval [0, 1] can be found, e.g., by symbolic computation using MAPLE: the following
MAPLE function expects the polynomial degree as input argument.
• n = 2: Trapezoidal rule
> trapez := newtoncotes(1);
$$\widehat{Q}_{\mathrm{trp}}(f) := \tfrac12\big(f(0)+f(1)\big) \tag{7.2.5}$$
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; \frac{b-a}{2}\big(f(a)+f(b)\big)$$
Fig. 264
• n = 3: Simpson rule
> simpson := newtoncotes(2);
$$\frac{h}{6}\Big(f(0)+4f(\tfrac12)+f(1)\Big) \qquad\qquad \int_a^b f(t)\,\mathrm{d}t \;\approx\; \frac{b-a}{6}\Big(f(a)+4f\big(\tfrac{a+b}{2}\big)+f(b)\Big) \tag{7.2.6}$$
• n = 5: Milne rule
> milne := newtoncotes(4);
$$\frac{h}{90}\Big(7f(0)+32f(\tfrac14)+12f(\tfrac12)+32f(\tfrac34)+7f(1)\Big)$$
$$\frac{b-a}{90}\Big(7f(a)+32f\big(a+\tfrac{b-a}{4}\big)+12f\big(a+\tfrac{b-a}{2}\big)+32f\big(a+\tfrac{3(b-a)}{4}\big)+7f(b)\Big)$$
• n = 7: Weddle rule
> weddle := newtoncotes(6);
$$\frac{h}{840}\Big(41f(0)+216f(\tfrac16)+27f(\tfrac13)+272f(\tfrac12)+27f(\tfrac23)+216f(\tfrac56)+41f(1)\Big)$$
• n = 9:
$$\frac{h}{28350}\Big(989f(0)+5888f(\tfrac18)-928f(\tfrac14)+10496f(\tfrac38)-4540f(\tfrac12)+10496f(\tfrac58)-928f(\tfrac34)+5888f(\tfrac78)+989f(1)\Big)$$
! From Ex. 6.1.41 we know that the approximation error incurred by Lagrange interpolation in equidistant nodes can blow up even for analytic functions. This blow-up can also infect the quadrature error of Newton-Cotes formulas for large n, which renders them essentially useless. In addition they will be marred by large (in modulus) and negative weights, which compromises numerical stability (→ Def. 1.5.85).
The considerations of Section 6.1.3 confirmed the superiority of the “optimal” Chebychev nodes (6.1.84)
for globally polynomial Lagrange interpolation. This suggests that we use these nodes also for numerical
quadrature with weights given by (7.2.2). This yields the so-called Clenshaw-Curtis rules with the following
rather desirable property:
The weights wnj , j = 1, . . . , n, for every n-point Clenshaw-Curtis rule are positive.
The weights of any n-point Clenshaw-Curtis rule can be computed with a computational effort of O(n log n)
using FFT.
As a concrete application of § 7.1.13, (7.1.14) we use the L∞ -bound (6.1.50) for Lagrange interpolation
$$\|f-\mathsf{L}_{\mathcal{T}}f\|_{L^\infty(I)} \le \frac{\big\|f^{(n+1)}\big\|_{L^\infty(I)}}{(n+1)!}\,\max_{t\in I}\big|(t-t_0)\cdot\ldots\cdot(t-t_n)\big| \;. \tag{6.1.50}$$
Much sharper estimates for Clenshaw-Curtis rules (→ Rem. 7.2.7) can be inferred from the interpola-
tion error estimate (6.1.88) for Chebychev interpolation. For functions with limited smoothness algebraic
convergence of the quadrature error for Clenshaw-Curtis quadrature follows from (6.1.92). For integrands
that possess an analytic extension to the complex plane in a neighborhood of [ a, b], we can conclude
exponential convergence from (6.1.98).
Supplementary reading. Gauss quadrature is discussed in detail in [?, Ch. 40-41], [?,
Sect.10.3]
How to gauge the “quality” of an n-point quadrature formula Qn without testing it for specific integrands?
The next definition gives an answer.
that is, as the maximal degree +1 of polynomials for which the quadrature rule is guaranteed to be
exact.
First we note a simple consequence of the invariance of the polynomial space Pn under affine pullback,
see Lemma 6.1.21.
An affine transformation of a quadrature rule according to Rem. 7.1.4 does not change its order.
Further, by construction all polynomial n-point quadrature rules possess order at least n.
where Lk , k = 0, . . . , n − 1, is the k-th Lagrange polynomial (5.2.11) associated with the ordered
node set {t1 , t2 , . . . , tn }.
Proof. The conclusion of the theorem is a direct consequence of the facts that
By construction (7.2.2) polynomial n-point quadrature formulas (7.2.1) are exact for f ∈ P_{n−1} ⇒ an n-point polynomial quadrature formula has at least order n.
Thm. 7.3.5 provides a concrete formula for the quadrature weights, which guarantees order n for an n-point quadrature formula. Yet evaluating integrals of Lagrange polynomials may be cumbersome. Here we give a general recipe for finding the weights $w_j$ according to Thm. 7.3.5 without dealing with Lagrange polynomials.
From Def. 7.3.1 we immediately conclude the following procedure: if $p_0,\ldots,p_{n-1}$ is a basis of $\mathcal{P}_{n-1}$, then, thanks to the linearity of the integral and of quadrature formulas,
$$Q_n(p_j) = \int_a^b p_j(t)\,\mathrm{d}t \quad \forall j = 0,\ldots,n-1 \qquad\Longleftrightarrow\qquad Q_n \text{ has order} \ge n\;. \tag{7.3.7}$$
For instance, for the computation of quadrature weights, one may choose the monomial basis p j (t) = t j .
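As an illustration (own sketch, not a lecture code): given the nodes $c_1,\ldots,c_n$ on [a, b], the weights making the rule exact on $\mathcal{P}_{n-1}$ can be computed by solving the linear system obtained from (7.3.7) with the monomial basis.

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Weights w such that sum_k w_k c_k^j = (b^{j+1}-a^{j+1})/(j+1), j = 0,...,n-1
Eigen::VectorXd weightsFromNodes(const Eigen::VectorXd &c, double a, double b) {
  const int n = static_cast<int>(c.size());
  Eigen::MatrixXd V(n, n);  // V(j,k) = c_k^j (transposed Vandermonde matrix)
  Eigen::VectorXd m(n);     // moments of the monomials
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) V(j, k) = std::pow(c(k), j);
    m(j) = (std::pow(b, j + 1) - std::pow(a, j + 1)) / (j + 1);
  }
  return V.fullPivLu().solve(m);  // beware: ill-conditioned for large n
}

int main() {
  Eigen::VectorXd c(3);
  c << 0.0, 0.5, 1.0;  // equidistant nodes on [0,1]
  // prints approximately 1/6, 2/3, 1/6: the Simpson rule (7.2.6)
  std::cout << weightsFromNodes(c, 0.0, 1.0).transpose() << std::endl;
  return 0;
}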
From the order rule for polynomial quadrature rules we immediately conclude the orders of simple representatives.
n   rule                        order
1   midpoint rule               2
2   trapezoidal rule (7.2.5)    2
3   Simpson rule (7.2.6)        4
4   3/8-rule                    4
5   Milne rule                  6
The orders for even n surpass the predictions of Thm. 7.3.5 by 1, which can be verified by straightforward computations; following Def. 7.3.1 check the exactness of the quadrature rule on [0, 1] (this is sufficient → Cor. 7.3.4) for the monomials $\{t\mapsto t^k\}$, $k = 0,\ldots,q-1$, which form a basis of $\mathcal{P}_{q-1}$, where q is the order that is to be confirmed: essentially one has to show
$$Q(\{t\mapsto t^k\}) = \sum_{j=1}^{n} w_j\,c_j^k = \frac{1}{k+1}\,,\qquad k=0,\ldots,q-1\,, \tag{7.3.10}$$
where $Q \;\hat{=}\;$ quadrature rule on $[0,1]$ given by (7.1.2).
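The exactness conditions (7.3.10) can also be checked numerically; a small sketch (own code) applied to the Simpson rule on [0, 1]:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Largest k such that sum_j w_j c_j^l = 1/(l+1) holds (up to roundoff) for all
// l < k, for a rule given by nodes c and weights w on [0,1]; the order of an
// n-point rule cannot exceed 2n, so the loop stops there at the latest.
int quadOrder(const Eigen::VectorXd &c, const Eigen::VectorXd &w) {
  int k = 0;
  while (k <= 2 * c.size()) {
    const double err =
        std::abs((w.array() * c.array().pow(k)).sum() - 1.0 / (k + 1));
    if (err > 1e-12) break;
    ++k;
  }
  return k;
}

int main() {
  Eigen::VectorXd c(3), w(3);
  c << 0.0, 0.5, 1.0;              // Simpson rule (7.2.6) on [0,1]
  w << 1.0 / 6, 4.0 / 6, 1.0 / 6;
  std::cout << "order = " << quadOrder(c, w) << std::endl;  // prints 4
  return 0;
}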
For the Simpson rule (7.2.6) we can also confirm order 4 with symbolic calculations in MAPLE:
$$\mathrm{err} := \frac{1}{90}\,D^{(4)}(f)(0)\,h^5 + O\big(h^6\big)$$
$$q(t) := (t-c_1)^2\cdot\ldots\cdot(t-c_n)^2 \in \mathcal{P}_{2n}\;.$$
Heuristics: A quadrature formula has order m ∈ N already, if it is exact for m polynomials ∈ Pm−1 that
form a basis of Pm−1 (recall Thm. 5.2.2).
An n-point quadrature formula has 2n “degrees of freedom” (n node positions, n weights).
⇓
It might be possible to achieve order $2n = \dim\mathcal{P}_{2n-1}$
(“No. of equations = No. of unknowns”)
Necessary & sufficient conditions for order 4, cf. (7.3.8), integrate the functions of the monomial basis of
P3 exactly:
$$Q_n(p) = \int_a^b p(t)\,\mathrm{d}t \quad \forall p\in\mathcal{P}_3 \qquad\Longleftrightarrow\qquad Q_n(\{t\mapsto t^q\}) = \frac{1}{q+1}\big(b^{q+1}-a^{q+1}\big)\,,\quad q=0,1,2,3\;.$$
4 equations for weights $w_j$ and nodes $c_j$, $j=1,2$ ($a=-1$, $b=1$), cf. Rem. 7.3.6:
$$\int_{-1}^1 1\,\mathrm{d}t = 2 = w_1 + w_2\,,\qquad \int_{-1}^1 t\,\mathrm{d}t = 0 = c_1w_1 + c_2w_2\,,$$
$$\int_{-1}^1 t^2\,\mathrm{d}t = \tfrac23 = c_1^2w_1 + c_2^2w_2\,,\qquad \int_{-1}^1 t^3\,\mathrm{d}t = 0 = c_1^3w_1 + c_2^3w_2\;. \tag{7.3.14}$$
➣ weights & nodes: $\big\{\,w_1 = w_2 = 1\,,\; c_1 = \tfrac13\sqrt3\,,\; c_2 = -\tfrac13\sqrt3\,\big\}$
quadrature formula (order 4):
$$\int_{-1}^1 f(x)\,\mathrm{d}x \;\approx\; f\Big(\tfrac{1}{\sqrt3}\Big) + f\Big(-\tfrac{1}{\sqrt3}\Big) \tag{7.3.15}$$
First we search for necessary conditions that have to be met by the nodes, if an n-point quadrature rule
has order 2n.
$$\int_{-1}^1 \underbrace{q(t)\,\bar P_n(t)}_{\in\mathcal{P}_{2n-1}}\,\mathrm{d}t \;\overset{(7.3.17)}{=}\; \sum_{j=1}^{n} w_j^n\,q(c_j^n)\,\underbrace{\bar P_n(c_j^n)}_{=0} = 0 \;.$$
$$\Rightarrow\quad L^2(]-1,1[)\text{-orthogonality:}\qquad \int_{-1}^1 q(t)\,\bar P_n(t)\,\mathrm{d}t = 0 \quad \forall q\in\mathcal{P}_{n-1}\;. \tag{7.3.18}$$
Hence, A is regular and the coefficients α j are uniquely determined. Thus there is only one n-point
quadrature rule of order n.
The nodes of an n-point quadrature formula of order 2n, if it exists, must coincide with the unique zeros
of the polynomials P̄n ∈ Pn \ {0} satisfying (7.3.18).
Recall: $(f,g)\mapsto \int_a^b f(t)g(t)\,\mathrm{d}t$ is an inner product on $C^0([a,b])$, the $L^2$-inner product, see Rem. 6.2.20, [?, Sect. 4.4, Ex. 2], [?, Ex. 6.5].
➣ As we have seen in Section 6.2.2, abstract techniques for vector spaces with inner product can be
applied to polynomials, for instance Gram-Schmidt orthogonalization, cf. § 6.2.17, [?, Thm. 4.8], [?,
Alg. 6.1].
Now carry out the abstract Gram-Schmidt orthogonalization according to Algorithm (6.2.18) and recall
Thm. 6.2.19: in a vector space V with inner product (·, ·)V orthogonal vectors q0 , q1 , . . . spanning the
same subspaces as the linearly independent vectors v0 , v1 , . . . are constructed recursively via
$$q_{n+1} := v_{n+1} - \sum_{k=0}^{n} \frac{(v_{n+1},q_k)_V}{(q_k,q_k)_V}\,q_k\,,\qquad q_0 := v_0\;. \tag{7.3.20}$$
Note: P̄n has leading coefficient = 1 ⇒ P̄n uniquely defined (up to sign) by (7.3.21).
The considerations so far only reveal necessary conditions on the nodes of an n-point quadrature rule of
order 2n:
They do by no means confirm the existence of such rules, but offer a clear hint on how to construct them:
Proof. Conclude from the orthogonality of the $\bar P_n$ that $\{\bar P_k\}_{k=0}^{n}$ is a basis of $\mathcal{P}_n$ and
$$\int_{-1}^1 h(t)\,\bar P_n(t)\,\mathrm{d}t = 0 \quad \forall h\in\mathcal{P}_{n-1}\;. \tag{7.3.23}$$
Recall division of polynomials with remainder (Euclid’s algorithm → Course “Diskrete Mathematik”): for
any p ∈ P2n−1
$$p(t) = h(t)\,\bar P_n(t) + r(t)\,,\quad \text{for some } h\in\mathcal{P}_{n-1}\,,\; r\in\mathcal{P}_{n-1}\;. \tag{7.3.24}$$
$$\int_{-1}^1 p(t)\,\mathrm{d}t = \underbrace{\int_{-1}^1 h(t)\,\bar P_n(t)\,\mathrm{d}t}_{=0 \text{ by } (7.3.23)} + \int_{-1}^1 r(t)\,\mathrm{d}t \;\overset{(*)}{=}\; \sum_{j=1}^{m} w_j^n\,r(c_j^n)\,, \tag{7.3.25}$$
(∗): by choice of weights according to Rem. 7.3.6 $Q_n$ is exact for polynomials of degree ≤ n − 1!
$$\sum_{j=1}^{m} w_j^n\,p(c_j^n) \;\overset{(7.3.24)}{=}\; \sum_{j=1}^{m} w_j^n\,h(c_j^n)\,\underbrace{\bar P_n(c_j^n)}_{=0} + \sum_{j=1}^{m} w_j^n\,r(c_j^n) \;\overset{(7.3.25)}{=}\; \int_{-1}^1 p(t)\,\mathrm{d}t\;.$$
The family of polynomials { P̄n }n∈N0 are so-called orthogonal polynomials w.r.t. the L2 (] − 1, 1[)-inner
product, see Def. 6.2.25. We have made use of orthogonal polynomials already in Section 6.2.2. L2 ([−1, 1])-
orthogonal polynomials play a key role in analysis.
Legendre polynomials
The $L^2(]-1,1[)$-orthogonal polynomials are those already discussed in Rem. 6.2.34:
Definition 7.3.27. Legendre polynomials
The n-th Legendre polynomial $P_n$ is defined by
• $P_n \in \mathcal{P}_n$,
• $\int_{-1}^{1} P_n(t)\,q(t)\,\mathrm{d}t = 0$ for all $q\in\mathcal{P}_{n-1}$,
• $P_n(1) = 1$.
Fig. 265: Legendre polynomials $P_0,\ldots,P_5$ on $[-1,1]$.
Notice: the polynomials P̄n defined by (7.3.21) and the Legendre polynomials Pn of Def. 7.3.27 (merely)
differ by a constant factor!
Note: the above considerations, recall (7.3.18), show that the nodes of an n-point quadrature formula of
order 2n on [−1, 1] must agree with the zeros of L2 (] − 1, 1[)-orthogonal polynomials.
n-point quadrature formulas of order 2n are unique
We are not done yet: the zeros of P̄n from (7.3.21) may lie outside [−1, 1].
! In principle P̄n could also have less than n real zeros.
Fig. 266: zeros of the Legendre polynomials (= Gauss points) in $[-1,1]$, plotted for increasing number n of quadrature nodes.
Proof. (indirect) Assume that Pn has only m < n zeros ζ 1 , . . . , ζ m in ] − 1, 1[ at which it changes sign.
Define
$$q(t) := \prod_{j=1}^{m}(t-\zeta_j) \quad\Rightarrow\quad q\,P_n \ge 0 \;\text{ or }\; q\,P_n \le 0\;.$$
$$\Rightarrow\quad \int_{-1}^1 q(t)\,P_n(t)\,\mathrm{d}t \neq 0\;.$$
The n-point quadrature formulas whose nodes, the Gauss points, are given by the zeros of the n-th Legendre polynomial (→ Def. 7.3.27), and whose weights are chosen according to Thm. 7.3.5, are called Gauss-Legendre quadrature formulas.
Fig. 267
Proof. Writing $\xi_j^n$, $j=1,\ldots,n$, for the nodes (Gauss points) of the n-point Gauss-Legendre quadrature formula, $n\in\mathbb{N}$, we define
$$q_k(t) = \prod_{\substack{j=1\\ j\neq k}}^{n}\big(t-\xi_j^n\big)^2 \quad\Rightarrow\quad q_k\in\mathcal{P}_{2n-2}\;.$$
$$0 < \int_{-1}^1 q_k(t)\,\mathrm{d}t = w_k^n\,\underbrace{q_k(\xi_k^n)}_{>0}\,,$$
From Thm. 6.2.32 we learn the orthogonal polynomials satisfy the 3-term recursion (6.2.33), see also
(7.3.33). To keep this chapter self-contained we derive it independently for Legendre polynomials.
Note: the polynomials P̄n from (7.3.21) are uniquely characterized by the two properties (try a proof!)
➣ we get the same polynomials P̄n by another Gram-Schmidt orthogonalization procedure, cf. (7.3.20)
and § 6.2.29:
$$\bar P_{n+1}(t) = t\,\bar P_n(t) - \sum_{k=0}^{n} \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_k(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_k(\tau)^2\,\mathrm{d}\tau}\;\bar P_k(t)\;.$$
Since $\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_k(\tau)\,\mathrm{d}\tau = 0$ if $k+1 < n$:
$$\bar P_{n+1}(t) = t\,\bar P_n(t) - \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)^2\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_n(\tau)^2\,\mathrm{d}\tau}\,\bar P_n(t) - \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_{n-1}(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_{n-1}(\tau)^2\,\mathrm{d}\tau}\,\bar P_{n-1}(t)\;. \tag{7.3.32}$$
$$P_{n+1}(t) := \frac{2n+1}{n+1}\,t\,P_n(t) - \frac{n}{n+1}\,P_{n-1}(t)\,,\qquad P_0 := 1\,,\quad P_1(t) := t\;. \tag{7.3.33}$$
Reminder (→ Section 6.1.3.1): we have a similar 3-term recursion (6.1.78) for Chebychev polynomials.
Coincidence? Of course not, nothing in mathematics holds “by accident”. By Thm. 6.2.32 3-term recur-
sions are a distinguishing feature of so-called families of orthogonal polynomials, to which the Chebychev
polynomials belong as well, spawned by Gram-Schmidt orthogonalization with respect to a weighted L2 -
inner product, however, see [?, VI].
➤ Efficient and stable evaluation of Legendre polynomials by means of the 3-term recursion (7.3.33), cf. the analogous algorithm for Chebychev polynomials given in Code 6.1.79.
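A minimal sketch of such an evaluation based on (7.3.33) (own code; the corresponding listing in the lecture repository may differ in details):

#include <Eigen/Dense>

// Values of the Legendre polynomials P_0,...,P_n at the points in x, computed
// by the 3-term recursion (7.3.33); row k of the result holds the values of P_k.
Eigen::MatrixXd legendre(const unsigned n, const Eigen::VectorXd &x) {
  Eigen::MatrixXd L(n + 1, x.size());
  L.row(0).setOnes();                   // P_0 = 1
  if (n > 0) L.row(1) = x.transpose();  // P_1(t) = t
  for (unsigned k = 1; k < n; ++k)      // recursion (7.3.33)
    L.row(k + 1) =
        (2.0 * k + 1.0) / (k + 1.0) * x.transpose().cwiseProduct(L.row(k)) -
        static_cast<double>(k) / (k + 1.0) * L.row(k - 1);
  return L;
}

For instance, legendre(5, Eigen::VectorXd::LinSpaced(100, -1.0, 1.0)) reproduces the curves $P_0,\ldots,P_5$ of Fig. 265.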
There are several efficient ways to find the Gauss points. Here we discuss an intriguing connection with
an eigenvalue problem.
Compute nodes/weights of Gaussian quadrature by solving an eigenvalue problem!
(Golub-Welsch algorithm [?, Sect. 3.5.4], [?, Sect. 1])
In codes Gauss nodes and weights are usually retrieved from tables, cf. Rem. 7.1.6.
Justification: rewrite the 3-term recurrence (7.3.33) for the scaled Legendre polynomials $\widetilde P_n = \frac{1}{\sqrt{n+1/2}}\,P_n$:
$$t\,\widetilde P_n(t) = \underbrace{\frac{n}{\sqrt{4n^2-1}}}_{=:\beta_n}\,\widetilde P_{n-1}(t) + \underbrace{\frac{n+1}{\sqrt{4(n+1)^2-1}}}_{=:\beta_{n+1}}\,\widetilde P_{n+1}(t)\;. \tag{7.3.37}$$
The zeros of Pn can be obtained as the n real eigenvalues of the symmetric tridiagonal matrix
Jn ∈ R n,n !
This matrix Jn is initialized in ??–?? of Code 7.3.36. The computation of the weights in ?? of Code 7.3.36
is explained in [?, Sect. 3.5.4].
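For orientation, a compact sketch of the Golub-Welsch approach using Eigen's SelfAdjointEigenSolver is given below; it assumes the standard construction in which the nodes are the eigenvalues of $J_n$ and the weights are recovered as $w_j = 2\,(v_j)_1^2$ from the first components of the normalized eigenvectors (Code 7.3.36 in the lecture notes may differ in details).

#include <Eigen/Dense>
#include <cmath>
#include <utility>

// Nodes and weights of the n-point Gauss-Legendre rule on [-1,1] via the
// eigen-decomposition of the tridiagonal matrix J_n built from (7.3.37).
std::pair<Eigen::VectorXd, Eigen::VectorXd> gaussrule(const int n) {
  Eigen::MatrixXd J = Eigen::MatrixXd::Zero(n, n);
  for (int k = 1; k < n; ++k) {
    const double beta = k / std::sqrt(4.0 * k * k - 1.0);  // beta_k from (7.3.37)
    J(k, k - 1) = beta;
    J(k - 1, k) = beta;
  }
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(J);
  const Eigen::VectorXd nodes = es.eigenvalues();  // zeros of P_n
  const Eigen::VectorXd weights =
      (2.0 * es.eigenvectors().row(0).transpose().array().square()).matrix();
  return {nodes, weights};
}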
The positivity of the weights wnj for all n-point Gauss-Legendre and Clenshaw-Curtis quadrature rules has
important consequences.
Theorem 7.3.39. Quadrature error estimate for quadrature rules with positive weights
Proof. The proof runs parallel to the derivation of (6.1.61). Writing En ( f ) for the quadrature error, the left
hand side of (7.3.40), we find by the definition Def. 7.3.1 of the order of a quadrature rule
$$E_n(f) = E_n(f-p) \le \left|\int_a^b (f-p)(t)\,\mathrm{d}t\right| + \left|\sum_{j=1}^{n} w_j\,(f-p)(c_j)\right| \tag{7.3.41}$$
$$\le |b-a|\,\|f-p\|_{L^\infty([a,b])} + \sum_{j=1}^{n}|w_j|\,\|f-p\|_{L^\infty([a,b])}\;.$$
Appealing to Thm. 6.1.15 and Rem. 6.1.23, and (6.1.50), the dependence of the constants on the length
of the integration interval can be quantified for integrands with limited smoothness.
Please note the different estimates depending on whether the smoothness of f (as described by r) or the
order of the quadrature rule is the “limiting factor”.
We examine three families of global polynomial (→ Thm. 7.3.5) quadrature rules: Newton-Cotes formulas,
Gauss-Legendre rules, and Clenshaw-Curtis rules. We record the convergence of the quadrature errors
for the interval [0, 1] and two different functions
Fig. 268, Fig. 269: |quadrature error| vs. number of quadrature nodes for equidistant Newton-Cotes quadrature, Chebyshev (Clenshaw-Curtis) quadrature and Gauss quadrature; left (Fig. 268, lin-log scale): quadrature error for $f_1(t) := \frac{1}{1+(15t)^2}$ on $[0,1]$, right (Fig. 269, log-log scale): quadrature error for $f_2(t) := \sqrt{t}$ on $[0,1]$.
Asymptotic behavior of the quadrature error $\epsilon_n := \Big|\int_0^1 f(t)\,\mathrm{d}t - Q_n(f)\Big|$ for “n → ∞”:
➣ exponential convergence $\epsilon_n \approx O(q^n)$, $0 < q < 1$, for the $C^\infty$-integrand $f_1$ ❀ Newton-Cotes quadrature: q ≈ 0.61, Clenshaw-Curtis quadrature: q ≈ 0.40, Gauss-Legendre quadrature: q ≈ 0.27
➣ algebraic convergence $\epsilon_n \approx O(n^{-\alpha})$, $\alpha > 0$, for the integrand $f_2$ with a singularity at t = 0 ❀ Newton-Cotes quadrature: α ≈ 1.8, Clenshaw-Curtis quadrature: α ≈ 2.5, Gauss-Legendre quadrature: α ≈ 2.7
Ex. 7.3.45 teaches us that a lack of smoothness of the integrand can thwart exponential convergence and severely limits the rate of algebraic convergence of a global quadrature rule for n → ∞.
Here is an example:
Z b√
For a general but smooth f ∈ C∞ ([0, b]) compute t f (t) dt via a quadrature rule, e.g., n-point
0
Gauss-Legendre quadrature on [0, b]. Due to the presence of a square-root singularity at t = 0 the direct
application of n-point Gauss-Legendre quadrature will result in a rather slow algebraic convergence of the
quadrature error as n → ∞, see Ex. 7.3.45.
$$\text{substitution } s=\sqrt{t}:\qquad \int_0^b \sqrt{t}\,f(t)\,\mathrm{d}t = \int_0^{\sqrt b} 2s^2\,f(s^2)\,\mathrm{d}s\;. \tag{7.3.47}$$
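A sketch of how this is used in practice (own code; it relies on the gaussrule() helper sketched in the Golub-Welsch paragraph above): instead of the singular integrand $\sqrt{t}\,f(t)$ on [0, b], the smooth transformed integrand $2s^2 f(s^2)$ is fed to Gauss-Legendre quadrature on $[0,\sqrt{b}]$.

#include <cmath>

// n-point Gauss-Legendre approximation of the right-hand side of (7.3.47)
template <class Function>
double sqrtWeightedIntegral(Function &&f, const double b, const int n) {
  const auto rule = gaussrule(n);  // nodes/weights on [-1,1], see sketch above
  const double sb = std::sqrt(b);
  double I = 0.0;
  for (int j = 0; j < n; ++j) {
    // transform node and weight from [-1,1] to [0, sqrt(b)], cf. Rem. 7.1.4
    const double s = 0.5 * sb * (rule.first(j) + 1.0);
    const double w = 0.5 * sb * rule.second(j);
    I += w * 2.0 * s * s * f(s * s);  // transformed integrand 2 s^2 f(s^2)
  }
  return I;
}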
There is one blot on most n-asymptotic estimates obtained from Thm. 7.3.39: the bounds usually in-
volve quantities like norms of higher derivatives of the interpoland that are elusive in general, in particular
for integrands given only in procedural form, see § 7.0.2. Such unknown quantities are often hidden in
“generic constants C”. Can we extract useful information from estimates marred by the presence of such
constants?
For fixed integrand f let us assume sharp algebraic convergence (in n) with rate r ∈ N of the quadrature
error En ( f ) for a family of n-point quadrature rules:
$$E_n(f) = O(n^{-r}) \;\overset{\text{sharp}}{\Longrightarrow}\; E_n(f) \approx C\,n^{-r}\,, \tag{7.3.49}$$
with a “generic constant C > 0” independent of n.
In the case of algebraic convergence with rate $r\in\mathbb{R}$ a reduction of the quadrature error by a factor of ρ is bought by an increase of the number of quadrature points by a factor of $\rho^{1/r}$.
Now assume sharp exponential convergence (in n) of the quadrature error En ( f ) for a family of n-point
quadrature rules, 0 ≤ q < 1:
$$E_n(f) = O(q^n) \;\overset{\text{sharp}}{\Longrightarrow}\; E_n(f) \approx C\,q^n\,, \tag{7.3.51}$$
$$\frac{C\,q^{n_{\mathrm{old}}}}{C\,q^{n_{\mathrm{new}}}} \overset{!}{=} \rho \quad\Longleftrightarrow\quad n_{\mathrm{new}} - n_{\mathrm{old}} = -\frac{\log\rho}{\log q}\;.$$
In the case of exponential convergence (7.3.51) a fixed increase of the number of quadrature points by $-\log\rho/\log q$ results in a reduction of the quadrature error by a factor of ρ > 1.
In Chapter 6, Section 6.5.1 we studied approximation by piecewise polynomial interpolants. A similar idea underlies the so-called composite quadrature rules on an interval $[a,b]$. Analogously to piecewise polynomial techniques they start from a grid/mesh
$$\mathcal{M} := \{a = x_0 < x_1 < \cdots < x_m = b\}$$
and exploit the additivity of the integral:
$$\int_a^b f(t)\,\mathrm{d}t = \sum_{j=1}^{m}\int_{x_{j-1}}^{x_j} f(t)\,\mathrm{d}t\;. \tag{7.4.1}$$
On each mesh interval $[x_{j-1},x_j]$ we then use a local quadrature rule, which may be one of the polynomial quadrature formulas from Section 7.2.
Composite trapezoidal rule (➣ Fig. 270):
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; \tfrac12(x_1-x_0)\,f(a) + \sum_{j=1}^{m-1}\tfrac12(x_{j+1}-x_{j-1})\,f(x_j) + \tfrac12(x_m-x_{m-1})\,f(b)\;. \tag{7.4.4}$$
Composite Simpson rule (➣ Fig. 271):
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; \tfrac16(x_1-x_0)\,f(a) + \sum_{j=1}^{m-1}\tfrac16(x_{j+1}-x_{j-1})\,f(x_j) + \sum_{j=1}^{m}\tfrac23(x_j-x_{j-1})\,f\big(\tfrac12(x_j+x_{j-1})\big) + \tfrac16(x_m-x_{m-1})\,f(b)\;. \tag{7.4.5}$$
Formulas (7.4.4), (7.4.5) directly suggest efficient implementation with minimal number of f -evaluations.
// Composite trapezoidal rule (7.4.4) on N equal subintervals of [a, b]
// (head of the listing reconstructed; only the loop body survives here)
template <class Function>
double trapezoidal(Function &&f, const double a, const double b, const unsigned N) {
  double I = 0;
  const double h = (b - a) / N;  // subinterval length
  for (unsigned i = 0; i < N; ++i) {
    // rule: T = (b - a)/2 * (f(a) + f(b)),
    // apply on N intervals: [a + i*h, a + (i+1)*h], i=0..(N-1)
    I += h / 2 * (f(a + i * h) + f(a + (i + 1) * h));
  }
  return I;
}
// Composite Simpson rule (7.4.5) on N equal subintervals of [a, b]
// (head of the listing reconstructed; only the loop body survives here)
template <class Function>
double simpson(Function &&f, const double a, const double b, const unsigned N) {
  double I = 0;
  const double h = (b - a) / N;  // subinterval length
  for (unsigned i = 0; i < N; ++i) {
    // rule: S = (b - a)/6*( f(a) + 4*f(0.5*(a + b)) + f(b) )
    // apply on [a + i*h, a + (i+1)*h]
    I += h / 6 * (f(a + i * h) + 4 * f(a + (i + 0.5) * h) + f(a + (i + 1) * h));
  }
  return I;
}
In both cases the function object passed in f must provide an evaluation operator double operator()(double) const.
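A possible driver for the two routines above (using the reconstructed signatures) could look as follows; it also illustrates the limited convergence rates for the non-smooth integrand $f_2(t) = \sqrt{t}$.

#include <cmath>
#include <iostream>

int main() {
  auto f = [](double t) { return std::sqrt(t); };
  const double exact = 2.0 / 3.0;  // int_0^1 sqrt(t) dt
  for (unsigned N = 10; N <= 1000; N *= 10) {
    std::cout << "N = " << N << ": trapezoidal error = "
              << std::abs(trapezoidal(f, 0.0, 1.0, N) - exact)
              << ", Simpson error = "
              << std::abs(simpson(f, 0.0, 1.0, N) - exact) << std::endl;
  }
  return 0;
}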
Composite quadrature scheme based on local polynomial quadrature can usually be understood as “quadra-
ture by approximation schemes” as explained in § 7.1.7. The underlying approximation schemes belong
to the class of general local Lagrangian interpolation schemes introduced in Section 6.5.1.
In other words, many composite quadrature schemes arise from replacing the integrand by a piecewise
interpolating polynomial, see Fig. 270 and Fig. 271 and compare with Fig. 250.
To see the main rationale behind the use of composite quadrature rules recall Lemma 7.3.42: for a poly-
nomial quadrature rule (7.2.1) of order q with positive weights and f ∈ Cr ([ a, b]) the quadrature error
shrinks with the min{r, q} + 1-st power of the length |b − a| of the integration domain! Hence, applying
polynomial quadrature rules to small mesh intervals should lead to a small overall quadrature error.
Assume a composite quadrature rule Q on $[x_0,x_m] = [a,b]$, $b > a$, based on $n_j$-point local quadrature rules $Q_{n_j}^j$ with positive weights (e.g. local Gauss-Legendre quadrature rules or local Clenshaw-Curtis quadrature rules) and of fixed orders $q_j\in\mathbb{N}$ on each mesh interval $[x_{j-1},x_j]$. From Lemma 7.3.42 recall the estimate for $f\in C^r([x_{j-1},x_j])$
$$\left|\int_{x_{j-1}}^{x_j} f(t)\,\mathrm{d}t - Q_{n_j}^j(f)\right| \le C\,|x_j-x_{j-1}|^{\min\{r,q_j\}+1}\,\big\|f^{(\min\{r,q_j\})}\big\|_{L^\infty([x_{j-1},x_j])}\;. \tag{7.2.10}$$
For $f\in C^r([a,b])$, summing up these bounds we get for the global quadrature error
$$\left|\int_{x_0}^{x_m} f(t)\,\mathrm{d}t - Q(f)\right| \le C\sum_{j=1}^{m} h_j^{\min\{r,q_j\}+1}\,\big\|f^{(\min\{r,q_j\})}\big\|_{L^\infty([x_{j-1},x_j])}\,,
As with polynomial quadrature rules, we study the asymptotic behavior of the quadrature error for families
of composite quadrature rules as a function on the total number n of function evaluations.
As in the case of M-piecewise polynomial approximation of function (→ Section 6.5.1) families of com-
posite quadrature rules can be generated in two different ways:
(I) use a sequence of successively refined meshes $\big(\mathcal{M}_k = \{x_j^k\}_j\big)_{k\in\mathbb{N}}$ with $\sharp\mathcal{M}_k = m(k)+1$, $m(k)\to\infty$ for $k\to\infty$, combined with the same (transformed, → Rem. 7.1.4) local quadrature rule on all mesh intervals $[x_{j-1}^k,x_j^k]$. Examples are the composite trapezoidal rule and composite Simpson rule from Ex. 7.4.3 on sequences of equidistant meshes.
➣ h-convergence
(II) On a fixed mesh $\mathcal{M} = \{x_j\}_{j=0}^{m}$, on each cell use the same (transformed) local quadrature rule taken from a sequence of polynomial quadrature rules of increasing order.
➣ p-convergence
• trapezoidal rule (7.2.5) ➣ local order 2 (exact for linear functions, see Ex. 7.3.9),
• Simpson rule (7.2.6) ➣ local order 4 (exact for cubic polynomials, see Ex. 7.3.9)
on the equidistant mesh $\mathcal{M} := \{jh\}_{j=0}^{n}$, $h = 1/n$, $n\in\mathbb{N}$.
Fig. 272, Fig. 273: |quadrature error| vs. meshwidth (log-log scale) for the trapezoidal rule and the Simpson rule, with reference slopes $O(h^2)$ and $O(h^4)$; left (Fig. 272): numerical quadrature of $f_1(t) = \frac{1}{1+(5t)^2}$ on $[0,1]$, right (Fig. 273): numerical quadrature of $f_2(t) = \sqrt{t}$ on $[0,1]$.
Asymptotic behavior of the quadrature error $E(n) := \Big|\int_0^1 f(t)\,\mathrm{d}t - Q_n(f)\Big|$ for meshwidth “h → 0”:
• a family of composite quadrature rules based on a single local ℓ-point rule (with positive weights) of order q on a sequence of equidistant meshes $\big(\mathcal{M}_k = \{x_j^k\}_j\big)_{k\in\mathbb{N}}$,
• the family of Gauss-Legendre quadrature rules from Def. 7.3.29.
We study the asymptotic dependence of the quadrature error on the number n of function evaluations.
The quadrature errors EnGL ( f ) of the n-point Gauss-Legendre quadrature rules are given in Lemma 7.3.42,
(7.3.43):
Gauss-Legendre quadrature converges at least as fast as fixed-order composite quadrature on equidistant meshes.
Moreover, Gauss-Legendre quadrature “automatically detects” the smoothness of the integrand, and en-
joys fast exponential convergence for analytic integrands.
Use Gauss-Legendre quadrature instead of fixed-order composite quadrature on equidistant meshes.
Sometimes there are surprises: Now we will witness a convergence behavior of a composite quadrature
rule that is much better than predicted by the order of the local quadrature formula.
We consider the equidistant trapezoidal rule (order 2), see (7.4.4), Code 7.4.6
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; T_m(f) := h\Big(\tfrac12 f(a) + \sum_{k=1}^{m-1} f(kh) + \tfrac12 f(b)\Big)\,,\qquad h := \frac{b-a}{m}\;. \tag{7.4.17}$$
Fig. 274, Fig. 275: |quadrature error| vs. number of quadrature nodes for the parameter values a = 0.5, 0.9, 0.95, 0.99 (left: lin-log scale, right: log-log scale).
In this § we use I := [0, 1[ as a reference interval, cf. Exp. 7.4.16. We rely on similar techniques as in
Section 5.6, Section 5.6.2. Again, a key tool will be the bijective mapping, see Fig. 203,
If $f\in C^r(\mathbb{R})$ and 1-periodic, then $(\Phi_{S^1}^{-1})^* f \in C^r(S^1)$. Further, $\Phi_{S^1}$ maps equidistant nodes on $I := [0,1]$ to equispaced nodes on $S^1$, which are the roots of unity:
$$\Phi_{S^1}\big(\tfrac{j}{n}\big) = \exp\big(2\pi\imath\tfrac{j}{n}\big) \qquad \big[\,\exp\big(2\pi\imath\tfrac{j}{n}\big)^n = 1\,\big]\;. \tag{7.4.19}$$
Now consider an n-point polynomial quadrature rule on $S^1$ based on the set of equidistant nodes $\mathcal{Z} := \big\{z_j := \exp\big(2\pi\imath\tfrac{j-1}{n}\big),\ j=1,\ldots,n\big\}$ and defined as
$$Q_n^{S^1}(g) := \int_{S^1} \mathsf{L}_{\mathcal{Z}}\,g(\tau)\,\mathrm{d}S(\tau) = \sum_{j=1}^{n} w_j^{S^1}\,g(z_j)\,, \tag{7.4.20}$$
where $\mathsf{L}_{\mathcal{Z}}$ is the Lagrange interpolation operator (→ Def. 6.1.32). This means that the weights obey Thm. 7.3.5, where the definition (5.2.11) of Lagrange polynomials remains the same for complex nodes.
By sheer symmetry, all the weights have to be the same, which, since the rule will be at least of order 1, means
$$w_j^{S^1} = \frac{2\pi}{n}\,,\qquad j = 1,\ldots,n\;.$$
Moreover, the quadrature rule $Q_n^{S^1}$ will be of order n, see Def. 7.3.1, that is, it will integrate polynomials of degree ≤ n − 1 exactly.
By transformation (→ Rem. 7.1.4) and pullback (7.4.18), $Q_n^{S^1}$ induces a quadrature rule on $I := [0,1]$ by
$$Q_n^{I}(f) := \frac{1}{2\pi}\,Q_n^{S^1}\big((\Phi_{S^1}^{-1})^* f\big) = \frac{1}{2\pi}\sum_{j=1}^{n} w_j^{S^1}\,f\big(\Phi_{S^1}^{-1}(z_j)\big) = \sum_{j=1}^{n}\tfrac1n\,f\big(\tfrac{j-1}{n}\big)\;. \tag{7.4.21}$$
This is exactly the equidistant trapezoidal rule (7.4.17), if f is 1-periodic, f(0) = f(1): $Q_n^{I} = T_n$. Hence we arrive at the following estimate for the quadrature error
$$E_n(f) := \left|\int_0^1 f(t)\,\mathrm{d}t - T_n(f)\right| \le 2\pi\,\max_{z\in S^1}\Big|(\Phi_{S^1}^{-1})^* f(z) - \mathsf{L}_{\mathcal{Z}}(\Phi_{S^1}^{-1})^* f(z)\Big|\;.$$
Equivalently, one can show that $T_n$ integrates trigonometric polynomials up to degree 2n − 1 exactly: for $f(t) = e^{2\pi\imath kt}$,
$$\int_0^1 f(t)\,\mathrm{d}t = \begin{cases} 0\,, & \text{if } k\neq 0\,,\\ 1\,, & \text{if } k = 0\,,\end{cases}
\qquad
T_n(f) = \frac1n\sum_{l=0}^{n-1} e^{2\pi\imath\frac{lk}{n}} \overset{(4.2.8)}{=} \begin{cases} 0\,, & \text{if } k\notin n\mathbb{Z}\,,\\ 1\,, & \text{if } k\in n\mathbb{Z}\,.\end{cases}$$
Recall from Section 4.2.5: recovery of signal (yk )k∈Z from its Fourier transform c(t)
$$y_j = \int_0^1 c(t)\,\exp(2\pi i j t)\,\mathrm{d}t\;. \tag{4.2.79}$$
We distinguish
(I) a priori adaptive quadrature: the nodes are fixed before the evaluation of the quadrature formula, taking into account external information about f, and
(II) a posteriori adaptive quadrature: the node positions are chosen or improved based on infor-
mation gleaned during the computation inside a loop. It terminates when sufficient accuracy
has been reached.
In this section we will chiefly discuss a posteriori adaptive quadrature for composite quadrature rules (→
Section 7.4) based on a single local quadrature rule (and its transformation).
This example presents an extreme case. We consider the composite trapezoidal rule (7.4.4) on a mesh $\mathcal{M} := \{a = x_0 < x_1 < \cdots < x_m = b\}$ and the spike-like integrand $f(t) = \frac{1}{10^{-4}+t^2}$ on $[-1,1]$, see Fig. 276. ✄
Intuition: quadrature nodes should cluster around 0, whereas hardly any are needed close to the endpoints of the integration interval, where the function varies only slowly.
Fig. 276: graph of $f(t) = \frac{1}{10^{-4}+t^2}$.
A quantitative justification can appeal to (7.2.10) and the resulting bound for the local quadrature error (for $f\in C^2([a,b])$):
$$\left|\int_{x_{k-1}}^{x_k} f(t)\,\mathrm{d}t - \tfrac12 h_k\big(f(x_{k-1})+f(x_k)\big)\right| \le h_k^3\,\big\|f''\big\|_{L^\infty([x_{k-1},x_k])}\,,\qquad h_k := x_k - x_{k-1}\;. \tag{7.5.3}$$
The ultimate but elusive goal is to find a mesh with a minimal number of cells that just delivers a quadrature
error below a prescribed threshold. A more practical goal is to adjust the local meshwidths hk := xk − xk−1
in order to achieve a minimal sum of local error bounds. This leads to the constrained minimization
problem:
$$\sum_{k=1}^{m} h_k^3\,\big\|f''\big\|_{L^\infty([x_{k-1},x_k])} \;\to\; \min \qquad \text{s.t.}\qquad \sum_{k=1}^{m} h_k = b-a\;. \tag{7.5.5}$$
Lemma 7.5.6.
Let $f:\mathbb{R}_0^+\to\mathbb{R}_0^+$ be a convex function with $f(0)=0$ and $x>0$. Then the constrained minimization problem: seek $\zeta_1,\ldots,\zeta_m\in\mathbb{R}_0^+$ such that
$$\sum_{k=1}^{m} f(\zeta_k) \to \min \quad\text{and}\quad \sum_{k=1}^{m}\zeta_k = x\,, \tag{7.5.7}$$
has the solution $\zeta_1 = \zeta_2 = \cdots = \zeta_m = \frac{x}{m}$.
This means that we should strive for equal bounds $h_k^3\,\|f''\|_{L^\infty([x_{k-1},x_k])}$ for all mesh cells.
The mesh for a posteriori adaptive composite numerical quadrature should be chosen to achieve
equal contributions of all mesh intervals to the quadrature error
As indicated above, guided by the equidistribution principle, the improvement of the mesh will be done gradually in an iteration. The change of the mesh in each step is called mesh adaptation and there are two fundamentally different ways to do it:
(I) by moving nodes, keeping their total number, but making them cluster where mesh intervals should
be small, or
(II) by adding nodes, where mesh intervals should be small (mesh refinement).
Algorithms for a posteriori adaptive quadrature based on mesh refinement usually have the following
structure:
(1) ESTIMATE: based on available information compute an approximation for the quadrature error
on every mesh interval.
(2) CHECK TERMINATION: if the total error is sufficiently small → STOP
(3) MARK: single out mesh intervals with the largest or above average error contributions.
(4) REFINE: add node(s) inside the marked mesh intervals. GOTO (1)
We now see a concrete algorithm based on the two composite quadrature rules introduced in Ex. 7.4.3.
Idea: local error estimation by comparing local results of two quadrature formu-
las Q1 , Q2 of different order → local error estimates
❶ (Error estimation)
$$\mathrm{EST}_k := \underbrace{\tfrac{h_k}{6}\big(f(x_{k-1})+4f(p_k)+f(x_k)\big)}_{\text{Simpson rule}} - \underbrace{\tfrac{h_k}{4}\big(f(x_{k-1})+2f(p_k)+f(x_k)\big)}_{\text{trapezoidal rule on split mesh interval}}\;. \tag{7.5.11}$$
❷ (Check termination)
❷ (Check termination)
Simpson rule on $\mathcal{M}$ ⇒ intermediate approximation $I \approx \int_a^b f(t)\,\mathrm{d}t$
$$\text{If}\quad \sum_{k=1}^{m}\mathrm{EST}_k \le \mathrm{RTOL}\cdot I \quad (\mathrm{RTOL} := \text{prescribed relative tolerance}) \quad\Rightarrow\quad \text{STOP} \tag{7.5.12}$$
❸ (Marking)
Marked intervals: $\mathcal{S} := \big\{k\in\{1,\ldots,m\}:\ \mathrm{EST}_k \ge \eta\cdot\tfrac1m\sum_{j=1}^{m}\mathrm{EST}_j\big\}\,,\quad \eta\approx 0.9\;. \tag{7.5.13}$
❹ (Local mesh refinement)
new mesh: $\mathcal{M}^* := \mathcal{M}\cup\big\{p_k := \tfrac12(x_{k-1}+x_k):\ k\in\mathcal{S}\big\}\;. \tag{7.5.14}$
Then continue with step ❶ and mesh M ← M∗ .
• Arguments: f $\hat{=}$ handle to the function f, M $\hat{=}$ initial mesh, rtol $\hat{=}$ relative tolerance for termination, atol $\hat{=}$ absolute tolerance for termination, necessary in case the exact integral value = 0, which renders a relative tolerance meaningless.
• Line 20: the difference of the values obtained from the local composite trapezoidal rule (∼ Q1) and the local Simpson rule (∼ Q2) is used as an estimate for the local quadrature error.
• Line 22: estimate for global error by summing up moduli of local error contributions,
• Line 26: terminate, once the estimated total error is below the relative or absolute error threshold,
• Line 43 otherwise, add midpoints of mesh intervals with large error contributions according to
(7.5.14) to the mesh and continue.
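For orientation, the following is a minimal sketch of an adaptquad routine implementing steps ❶–❹ of § 7.5.10; it is not the lecture's Code 7.5.15 (whose line numbers are quoted above), but it follows the same structure, matches the calling convention of the driver below, and contains no safeguard against excessive refinement.

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

using Eigen::VectorXd;

// Adaptive composite quadrature of f on the mesh M (sorted node vector),
// cf. § 7.5.10; returns the Simpson approximation on the final mesh.
template <class Function>
double adaptquad(Function &&f, VectorXd M, const double rtol, const double atol) {
  while (true) {
    const unsigned m = M.size() - 1;  // number of mesh intervals
    VectorXd est_loc(m);              // local error estimates (7.5.11)
    double I = 0.0;                   // Simpson value on the current mesh
    for (unsigned k = 0; k < m; ++k) {
      const double h = M(k + 1) - M(k), p = 0.5 * (M(k) + M(k + 1));
      const double simp = h / 6.0 * (f(M(k)) + 4.0 * f(p) + f(M(k + 1)));
      const double trap = h / 4.0 * (f(M(k)) + 2.0 * f(p) + f(M(k + 1)));
      I += simp;
      est_loc(k) = std::abs(simp - trap);
    }
    const double est_tot = est_loc.sum();
    if (est_tot <= rtol * std::abs(I) || est_tot <= atol) return I;  // (7.5.12)
    // mark intervals with above-average error contribution, (7.5.13), eta = 0.9
    const double threshold = 0.9 * est_tot / m;
    std::vector<double> nodes(M.data(), M.data() + M.size());
    for (unsigned k = 0; k < m; ++k)
      if (est_loc(k) >= threshold)
        nodes.push_back(0.5 * (M(k) + M(k + 1)));  // add midpoint, (7.5.14)
    std::sort(nodes.begin(), nodes.end());
    M = Eigen::Map<VectorXd>(nodes.data(), nodes.size());
  }
}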
int main() {
  auto f = [](double x) { return std::exp(-x * x); };
  VectorXd M(4);
  M << -100, 0.1, 0.5, 100;
  std::cout << "Sqrt(Pi) - Int_{-100}^{100} exp(-x*x) dx = ";
  std::cout << adaptquad(f, M, 1e-10, 1e-12) - std::sqrt(M_PI) << "\n";
  return 0;
}
In Code 7.5.15 we use the higher order quadrature rule, the Simpson rule of order 4, to compute an ap-
proximate value for the integral. This is reasonable, because it would be foolish not to use this information
after we have collected it for the sake of error estimation.
Yet, according to our heuristics, what est_loc and est_tot give us are estimates for the error of the
second-order trapezoidal rule, which we do not use for the actual computations.
est_loc gives useful (for the sake of mesh refinement) information about the distribution of
the error of the Simpson rule, though it fails to capture its size.
In this numerical test we investigate whether the adaptive technique from § 7.5.10 produces an appropriate
distribution of integration nodes. We do this for different functions.
✦ approximate $\int_0^1 \exp(6\sin(2\pi t))\,\mathrm{d}t$, initial mesh $\mathcal{M}_0 = \{j/10\}_{j=0}^{10}$
Algorithm: adaptive quadrature, Code 7.5.15 with tolerances rtol = 10⁻⁶, abstol = 10⁻¹⁰
We monitor the distribution of quadrature points during the adaptive quadrature and the true and estimated quadrature errors. The “exact” value for the integral is computed by the composite Simpson rule on an equidistant mesh with 10⁷ intervals.
Fig. 277: distribution of the quadrature points (plotted over the quadrature levels), together with the integrand f; Fig. 278: exact and estimated quadrature errors vs. number of quadrature points (lin-log scale).
✦ approximate $\int_0^1 \min\{\exp(6\sin(2\pi t)),\,100\}\,\mathrm{d}t$, initial mesh as above
Fig. 279: distribution of the quadrature points (plotted over the quadrature levels), together with the integrand f; Fig. 280: exact and estimated quadrature errors vs. number of quadrature points (lin-log scale).
Observation:
• Adaptive quadrature locally decreases meshwidth where integrand features variations or kinks.
Learning Outcomes
✦ You should know what a quadrature formula is and the terminology connected with it,
✦ You should be able to transform quadrature formulas to arbitrary intervals.
✦ You should understand how interpolation and approximation schemes spawn quadrature formulas and how quadrature errors are connected to interpolation/approximation errors.
✦ You should remember the maximal and minimal order of polynomial quadrature rules.
✦ You should know the order of the n-point Gauss-Legendre quadrature rule.
✦ You should understand why Gauss-Legendre quadrature converges exponentially for integrands that can be extended analytically, and algebraically for integrands with limited smoothness.
✦ You should be able to apply regularizing transformations to integrals with non-smooth integrands.
✦ You should know about asymptotic convergence of the h-version of composite quadrature.
✦ You should know the principles of adaptive composite quadrature.
Non-linear systems naturally arise in mathematical models of electrical circuits, once non-linear circuit
elements are introduced. This generalizes Ex. 2.1.3, where the current-voltage relationship for all circuit
elements was the simple proportionality (2.1.5) (of the complex amplitudes U and I ).
As an example we consider the
U+
Schmitt trigger circuit ✄
Its key non-linear circuit element is the NPN bipolar R3 R4
R1
junction transistor:
collector ➀ ➃
Rb
➂
➄ ➁
Uout
base Uin
Re R2
Fig. 281
emitter
A transistor has three ports: emitter, collector, and base. Transistor models give the port currents as
functions of the applied voltages, for instance the Ebers-Moll model (large signal approximation):
$$I_C = I_S\Big(e^{\frac{U_{BE}}{U_T}} - e^{\frac{U_{BC}}{U_T}}\Big) - \frac{I_S}{\beta_R}\Big(e^{\frac{U_{BC}}{U_T}} - 1\Big) = I_C(U_{BE},U_{BC})\,,$$
$$I_B = \frac{I_S}{\beta_F}\Big(e^{\frac{U_{BE}}{U_T}} - 1\Big) + \frac{I_S}{\beta_R}\Big(e^{\frac{U_{BC}}{U_T}} - 1\Big) = I_B(U_{BE},U_{BC})\,, \tag{8.0.2}$$
$$I_E = I_S\Big(e^{\frac{U_{BE}}{U_T}} - e^{\frac{U_{BC}}{U_T}}\Big) + \frac{I_S}{\beta_F}\Big(e^{\frac{U_{BE}}{U_T}} - 1\Big) = I_E(U_{BE},U_{BC})\,.$$
IC , IB , IE : current in collector/base/emitter,
UBE , UBC : potential drop between base-emitter, base-collector.
The parameters have the following meanings: β F is the forward common emitter current gain (20 to 500),
β R is the reverse common emitter current gain (0 to 20), IS is the reverse saturation current (on the order
of 10−15 to 10−12 amperes), UT is the thermal voltage (approximately 26 mV at 300 K).
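For later numerical experiments the port currents (8.0.2) are conveniently coded as plain functions of the two voltage drops; the following sketch (own code) uses placeholder parameter values within the ranges quoted above.

#include <cmath>

// Ebers-Moll large-signal model (8.0.2); parameter values are placeholders.
struct Transistor {
  double betaF = 100.0;  // forward common emitter current gain
  double betaR = 5.0;    // reverse common emitter current gain
  double Is = 1e-13;     // reverse saturation current [A]
  double Ut = 0.026;     // thermal voltage [V]

  double Ic(double Ube, double Ubc) const {
    return Is * (std::exp(Ube / Ut) - std::exp(Ubc / Ut)) -
           Is / betaR * (std::exp(Ubc / Ut) - 1.0);
  }
  double Ib(double Ube, double Ubc) const {
    return Is / betaF * (std::exp(Ube / Ut) - 1.0) +
           Is / betaR * (std::exp(Ubc / Ut) - 1.0);
  }
  double Ie(double Ube, double Ubc) const {
    return Is * (std::exp(Ube / Ut) - std::exp(Ubc / Ut)) +
           Is / betaF * (std::exp(Ube / Ut) - 1.0);
  }
};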
The circuit of Fig. 281 has 5 nodes ➀–➄ with unknown nodal potentials. Kirchhoff's law (2.1.4) plus the constitutive relations gives an equation for each of them.
Non-linear system of equations from nodal analysis, static case (→ Ex. 2.1.3):
5 equations ↔ 5 unknowns $U_1, U_2, U_3, U_4, U_5$
A non-linear system of equations is a concept almost too abstract to be useful, because it covers an extremely wide variety of problems. Nevertheless in this chapter we will mainly look at “generic” methods for such systems. This means that every method discussed may take a good deal of fine-tuning before it will really perform satisfactorily for a given non-linear system of equations.
Here, D is the domain of definition of the function F, which cannot be evaluated for x 6∈ D.
In contrast to the situation for linear systems of equations (→ Thm. 2.2.4), the class of non-linear systems
is far too big to allow a general theory:
Contents
8.1 Iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
8.1.1 Speed of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
8.1.2 Termination criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
Gaussian elimination (→ Section 2.3) provides an algorithm that, if carried out in exact arithmetic (no
roundoff errors), computes the solution of a linear system of equations with a finite number of elementary
operations. However, linear systems of equations represent an exceptional case, because it is hardly ever
possible to solve general systems of non-linear equations using only finitely many elementary operations.
All methods for general non-linear systems of equations are iterative in the sense that they will usually
yield only approximate solutions whenever they terminate after finite time.
An iterative method for (approximately) solving the non-linear equation F(x) = 0 is an algorithm generating an arbitrarily long sequence $\big(x^{(k)}\big)_k$ of approximate solutions.
$x^{(k)} \;\hat{=}\;$ k-th iterate, $x^{(0)} \;\hat{=}\;$ initial guess.
Fig. 282: iterates $x^{(0)}, x^{(1)}, x^{(2)},\ldots$ produced by the iteration function Φ inside the domain D, approaching the solution $x^*$.
All the iterative methods discussed below fall in the class of (stationary) m-point, m ∈ N, iterative meth-
ods, for which the iterate x(k) depends on F and the m most recent iterates x(k−1) , . . . , x(k−m) , e.g.,
$$x^{(k)} = \Phi_F\big(x^{(k-1)},\ldots,x^{(k-m)}\big)\,, \tag{8.1.4}$$
where $\Phi_F \;\hat{=}\;$ iteration function for the m-point method.
When applying an iterative method to solve a non-linear system of equations F(x) = 0, the following
issues arise:
✦ Speed of convergence: How “fast” does $\|x^{(k)}-x^*\|$ ($\|\cdot\|$ a suitable norm on $\mathbb{R}^N$) decrease for increasing k?
An iterative method converges (for fixed initial guess(es)) :⇔ $x^{(k)}\xrightarrow{k\to\infty}x^*$ and $F(x^*)=0$.
A stationary m-point iterative method is consistent with the non-linear system of equations F(x) = 0
:⇔ $\Phi_F(x^*,\ldots,x^*) = x^* \;\Leftrightarrow\; F(x^*) = 0$.
For a consistent stationary iterative method we can study the error of the iterates $x^{(k)}$, defined as $e^{(k)} := x^{(k)} - x^*$.
Unfortunately, convergence may critically depend on the choice of initial guesses. The property defined
next weakens this dependence:
Fig. 283
Our goal: Given a non-linear system of equations, find iterative methods that converge (locally) to a
solution of F(x) = 0.
Two general questions: How to measure, describe, and predict the speed of convergence?
When to terminate the iteration?
Here and in the sequel, $\|\cdot\|$ designates a generic vector norm on $\mathbb{R}^n$, see Def. 1.5.70. Any occurring matrix norm is induced by this vector norm, see Def. 1.5.76.
It is important to be aware which statements depend on the choice of norm and which do not!
$$\exists\, 0 < L < 1:\qquad \big\|x^{(k+1)}-x^*\big\| \le L\,\big\|x^{(k)}-x^*\big\| \quad \forall k\in\mathbb{N}_0\;.$$
If dim V < ∞ all norms (→ Def. 1.5.70) on V are equivalent (→ Def. 8.1.11).
Often we will study the behavior of a consistent iterative method for a model problem in numerical experiments and measure the norms of the iteration errors $e^{(k)} := x^{(k)} - x^*$. How can we tell that the method enjoys linear convergence?
Norms of iteration errors ∼ straight line in a lin-log plot (Fig. 284):
$$\big\|e^{(k)}\big\| \le L^k\,\big\|e^{(0)}\big\|\;.$$
Let us abbreviate the error norm in step k by $\epsilon_k := \|x^{(k)}-x^*\|$. In the case of linear convergence (see Def. 8.1.9) assume (with 0 < L < 1)
$$\epsilon_{k+1}\approx L\,\epsilon_k \;\Rightarrow\; \log\epsilon_{k+1}\approx \log L + \log\epsilon_k \;\Rightarrow\; \log\epsilon_k \approx k\log L + \log\epsilon_0\;. \tag{8.1.14}$$
We conclude that log L < 0 determines the slope of the graph in the lin-log error chart.
Related: guessing time complexity O(nα ) of an algorithm from measurements, see § 1.4.9.
Note the green dots • in Fig. 284: Any “faster” convergence also qualifies as linear convergence in the strict
sense of the definition. However, whenever this term is used, we tacitly imply, that no “faster convergence”
prevails.
// Iterates of x^(k+1) = x^(k) + (cos(x^(k)) + 1)/sin(x^(k)); records errors and
// error quotients (head of the listing lost; signature reconstructed)
void fixedPointExperiment(double x, int N, Eigen::VectorXd &err, Eigen::VectorXd &rates) {
  Eigen::VectorXd y(N);  // stores the iterates x^(1),...,x^(N)
  for (int i = 0; i < N; ++i) {
    x = x + (std::cos(x) + 1) / std::sin(x);
    y(i) = x;
  }
  err.resize(N); rates.resize(N);
  err = y - Eigen::VectorXd::Constant(N, x);  // last iterate used in place of x*
  rates = err.bottomRows(N - 1).cwiseQuotient(err.topRows(N - 1));
}
$$x^{(k+1)} = x^{(k)} + \frac{\cos x^{(k)} + 1}{\sin x^{(k)}}\;.$$
In the C++ code (✄) x has to be initialized with the different values for $x_0$.
Fig. 285: iteration errors vs. index of iterate (lin-log scale), → Rem. 8.1.13.
There are notions of convergence that guarantee a much faster (asymptotic) decay of norm of the iteration error
than linear convergence from Def. 8.1.9.
Definition 8.1.17. Order of convergence → [?, Sect. 17.2], [?, Def. 5.14], [?, Def. 6.1]
Of course, the order p of convergence of an iterative method refers to the largest possible p in the def-
inition, that is, the error estimate will in general not hold, if p is replaced with p + ǫ for any ǫ > 0, cf.
Rem. 1.4.6.
✁ Qualitative error graphs for convergence of order p = 1.1, 1.2, 1.4, 1.7, 2: iteration error vs. index k of iterates (lin-log scale).
$$\epsilon_{k+1}\approx C\,\epsilon_k^{\,p} \;\Rightarrow\; \log\epsilon_{k+1} = \log C + p\log\epsilon_k \;\Rightarrow\; \log\epsilon_{k+1} = \log C\sum_{l=0}^{k} p^l + p^{k+1}\log\epsilon_0$$
$$\Rightarrow\quad \log\epsilon_{k+1} = -\frac{\log C}{p-1} + \Big(\frac{\log C}{p-1} + \log\epsilon_0\Big)\,p^{k+1}\;.$$
In this case, the error graph is a concave power curve (for sufficiently small $\epsilon_0$!).
How to guess the order of convergence (→ Def. 8.1.17) from tabulated error norms measured in a numer-
ical experiment?
➣ monitor the quotients $(\log\epsilon_{k+1}-\log\epsilon_k)/(\log\epsilon_k-\log\epsilon_{k-1})$ over several steps of the iteration.
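In a small sketch (own code) this amounts to:

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // error norms eps_k from some experiment (made-up sample values, roughly
  // consistent with quadratic convergence)
  const std::vector<double> eps = {1e-1, 5e-3, 2e-5, 5e-10, 3e-19};
  for (std::size_t k = 1; k + 1 < eps.size(); ++k) {
    const double p = (std::log(eps[k + 1]) - std::log(eps[k])) /
                     (std::log(eps[k]) - std::log(eps[k - 1]));
    std::cout << "estimated order after step " << k << ": " << p << std::endl;
  }
  return 0;
}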
From your analysis course [?, Bsp. 3.3.2(iii)] recall the famous iteration for computing $\sqrt a$, a > 0:
$$x^{(k+1)} = \frac12\Big(x^{(k)} + \frac{a}{x^{(k)}}\Big) \quad\Rightarrow\quad \big|x^{(k+1)}-\sqrt a\big| = \frac{1}{2x^{(k)}}\,\big|x^{(k)}-\sqrt a\big|^2\;. \tag{8.1.21}$$
By the arithmetic-geometric mean inequality (AGM) $\sqrt{ab}\le\frac12(a+b)$ we conclude: $x^{(k)} > \sqrt a$ for $k\ge 1$. Therefore the estimate from (8.1.21) means that the sequence from (8.1.21) converges with order 2 to $\sqrt a$.
Note: x (k+1) < x (k) for all k ≥ 2 ➣ ( x (k) )k∈N0 converges as a decreasing sequence that is bounded
from below (→ analysis course)
Note the doubling of the number of significant digits in each step ! [impact of roundoff !]
The doubling of the number of significant digits for the iterates holds true for any quadratically convergent
iteration:
Recall from Rem. 1.5.25 that the relative error (→ Def. 1.5.24) tells the number of significant digits. Indeed, denoting the relative error in step k by $\delta_k$, we have in the case of quadratic convergence:
$$x^{(k)} = x^*(1+\delta_k) \;\Rightarrow\; x^{(k)}-x^* = \delta_k x^*\,,$$
$$|x^*\delta_{k+1}| = |x^{(k+1)}-x^*| \le C\,|x^{(k)}-x^*|^2 = C\,|x^*\delta_k|^2 \;\Rightarrow\; |\delta_{k+1}| \le C\,|x^*|\,\delta_k^2\;. \tag{8.1.22}$$
As remarked above, usually (even without roundoff errors) an iteration will never arrive at an/the exact
solution x∗ after finitely many steps. Thus, we can only hope to compute an approximate solution by
accepting x(K ) as result for some K ∈ N0 . Termination criteria (stopping rules) are used to determine a
suitable value for K.
For the sake of efficiency ✄ stop iteration when iteration error is just “small enough”
(“small enough” depends on the concrete problem and user demands.)
(8.1.23) Classification of termination criteria (stopping rules) for iterative solvers for non-linear systems of equations
A termination criterion (stopping rule) is an algorithm deciding in each step of an iterative method whether to STOP or to CONTINUE.
A priori termination criterion: decision to stop based on information about F and $x^{(0)}$, made before starting the iteration. A posteriori termination criterion: besides $x^{(0)}$ and F, also current and past iterates are used to decide about termination.
A termination criterion for a convergent iteration is deemed reliable, if it lets the iteration CONTINUE, until
the iteration error e(k) := x(k) − x∗ , x∗ the limit value, satisfies certain conditions (usually imposed before
the start of the iteration).
Termination criteria are usually meant to ensure accuracy of the final iterate $x^{(K)}$ in the following sense:
$$\big\|x^{(K)}-x^*\big\| \le \tau_{\mathrm{abs}} \quad\text{or}\quad \big\|x^{(K)}-x^*\big\| \le \tau_{\mathrm{rel}}\,\|x^*\|\,, \tag{8.1.25}$$
with prescribed absolute and relative tolerances $\tau_{\mathrm{abs}},\tau_{\mathrm{rel}}>0$. It seems that the second criterion, asking that the relative (→ Def. 1.5.24) iteration error be below a prescribed threshold, alone would suffice, but the absolute tolerance should be checked if, by “accident”, $\|x^*\| = 0$ is possible. Otherwise, the iteration might fail to terminate at all.
➀ A priori termination: stop iteration after fixed number of steps (possibly depending on x(0) ).
(A priori =
ˆ without actually taking into account the computed iterates, see § 8.1.23)
Invoking additional properties of either the non-linear system of equations F(x) = 0 or the iteration it is sometimes possible to tell that for sure $\|x^{(k)}-x^*\| \le \tau$ for all k ≥ K, though this K may be (significantly) larger than the optimal termination index from (8.1.25), see Rem. 8.1.28.
➁ Residual based termination: STOP the convergent iteration $\{x^{(k)}\}_{k\in\mathbb{N}_0}$, when
$$\big\|F(x^{(k)})\big\| \le \tau\,,\qquad \tau \;\hat{=}\; \text{prescribed tolerance} > 0\;.$$
no guaranteed accuracy
Also for this criterion, we have no guarantee that (8.1.25) will be even remotely satisfied.
A special variant of correction based termination exploits that M is finite! (→ Section 1.5.3)
Remark 8.1.28 (A posteriori termination criterion for linearly convergent iterations → [?,
Lemma 5.17, 5.19])
Let us assume that we know that an iteration is linearly convergent (→ Def. 8.1.9) with rate of convergence 0 < L < 1:
The following simple manipulations give an a posteriori termination criterion (for linearly convergent itera-
tions with rate of convergence 0 < L < 1):
$$\big\|x^{(k)}-x^*\big\| \;\overset{\triangle\text{-inequ.}}{\le}\; \big\|x^{(k+1)}-x^{(k)}\big\| + \big\|x^{(k+1)}-x^*\big\| \le \big\|x^{(k+1)}-x^{(k)}\big\| + L\,\big\|x^{(k)}-x^*\big\|\;.$$
Iterates satisfy:
$$\big\|x^{(k+1)}-x^*\big\| \le \frac{L}{1-L}\,\big\|x^{(k+1)}-x^{(k)}\big\|\;. \tag{8.1.29}$$
This suggests that we take the right hand side of (8.1.29) as an a posteriori error bound and use it instead of the inaccessible $\|x^{(k+1)}-x^*\|$ for checking absolute and relative accuracy in (8.1.25). The resulting termination criterion will be reliable (→ § 8.1.23), since we will certainly have achieved the desired accuracy when we stop the iteration.
(Using $\widetilde L > L$ in (8.1.29) still yields a valid upper bound for $\|x^{(k)}-x^*\|$.)
$$x^{(k+1)} = x^{(k)} + \frac{\cos x^{(k)}+1}{\sin x^{(k)}} \quad\Rightarrow\quad x^{(k)}\to\pi \;\text{ for } x^{(0)} \text{ close to } \pi\;.$$
Observed rate of convergence: L = 1/2
Error and error bound for $x^{(0)} = 0.4$:
k | $|x^{(k)}-\pi|$ | $\frac{L}{1-L}\,|x^{(k)}-x^{(k-1)}|$ | slack of bound
Supplementary reading. The contents of this section are also treated in [?, Sect. 5.3], [?,
1-point stationary iterative methods, see (8.1.4), for F(x) = 0 are also called fixed point iterations.
iteration function $\Phi: U\subset\mathbb{R}^n \mapsto \mathbb{R}^n$, initial guess $x^{(0)}\in U$ ➣ iterates $(x^{(k)})_{k\in\mathbb{N}_0}$: $x^{(k+1)} := \Phi(x^{(k)})$ (a 1-point method, cf. (8.1.4)).
Note that the sequence of iterates need not be well defined: $x^{(k)}\notin U$ possible!
A fixed point iteration x(k+1) = Φ(x(k) ) is consistent with F(x) = 0, if, for x ∈ U ∩ D,
F (x) = 0 ⇔ Φ(x) = x .
This is an immediate consequence of the fact that, for a continuous function, limits and function evaluations commute [?, Sect. 4.1].
$$x^{(k+1)} := \Phi(x^{(k)})\;. \tag{8.2.2}$$
Note: there are many ways to transform F(x) = 0 into a fixed point form !
In this example we construct three different consistent fixed point iterations for a single scalar (n = 1) non-linear equation F(x) = 0. In numerical experiments we will see that they behave very differently.
$$F(x) = x\,e^x - 1\,,\qquad x\in[0,1]\;.$$
Different fixed point forms:
$$\Phi_1(x) = e^{-x}\,,\qquad \Phi_2(x) = \frac{1+x}{1+e^x}\,,\qquad \Phi_3(x) = x + 1 - x\,e^x\;.$$
(Plots: graph of F on [0,1]; graphs of $\Phi_1$, $\Phi_2$, $\Phi_3$ on [0,1].)
With the same intial guess x (0) = 0.5 for all three fixed point iterations we obtain the following iterates:
k    $x^{(k+1)} := \Phi_1(x^{(k)})$    $x^{(k+1)} := \Phi_2(x^{(k)})$    $x^{(k+1)} := \Phi_3(x^{(k)})$
0 0.500000000000000 0.500000000000000 0.500000000000000
1 0.606530659712633 0.566311003197218 0.675639364649936
2 0.545239211892605 0.567143165034862 0.347812678511202
3 0.579703094878068 0.567143290409781 0.855321409174107
4 0.560064627938902 0.567143290409784 -0.156505955383169
5 0.571172148977215 0.567143290409784 0.977326422747719
6 0.564862946980323 0.567143290409784 -0.619764251895580
7 0.568438047570066 0.567143290409784 0.713713087416146
8 0.566409452746921 0.567143290409784 0.256626649129847
9 0.567559634262242 0.567143290409784 0.924920676910549
10 0.566907212935471 0.567143290409784 -0.407422405542253
We can also tabulate the modulus of the iteration error and mark correct digits with red:
k    $|x_1^{(k+1)}-x^*|$    $|x_2^{(k+1)}-x^*|$    $|x_3^{(k+1)}-x^*|$
0 0.067143290409784 0.067143290409784 0.067143290409784
1 0.039387369302849 0.000832287212566 0.108496074240152
2 0.021904078517179 0.000000125374922 0.219330611898582
3 0.012559804468284 0.000000000000003 0.288178118764323
4 0.007078662470882 0.000000000000000 0.723649245792953
5 0.004028858567431 0.000000000000000 0.410183132337935
6 0.002280343429460 0.000000000000000 1.186907542305364
7 0.001294757160282 0.000000000000000 0.146569797006362
8 0.000733837662863 0.000000000000000 0.310516641279937
9 0.000416343852458 0.000000000000000 0.357777386500765
10 0.000236077474313 0.000000000000000 0.974565695952037
Observed: linear convergence of $x_1^{(k)}$, quadratic convergence of $x_2^{(k)}$, no convergence (erratic behavior of $x_3^{(k)}$); $x_i^{(0)} = 0.5$ in all cases.
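The tables above can be reproduced with a few lines of C++ (own sketch, not a lecture code):

#include <cmath>
#include <cstdio>

int main() {
  double x1 = 0.5, x2 = 0.5, x3 = 0.5;  // common initial guess x^(0) = 0.5
  for (int k = 0; k <= 10; ++k) {
    std::printf("%2d %18.15f %18.15f %18.15f\n", k, x1, x2, x3);
    x1 = std::exp(-x1);                      // Phi_1
    x2 = (1.0 + x2) / (1.0 + std::exp(x2));  // Phi_2
    x3 = x3 + 1.0 - x3 * std::exp(x3);       // Phi_3
  }
  return 0;
}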
In this section we will try to find easily verifiable conditions that ensure convergence (of a certain order) of
fixed point iterations. It will turn out that these conditions are surprisingly simple and general.
In Exp. 8.2.3 we observed vastly different behavior of different fixed point iterations for n = 1. Is it possible
to predict this from the shape of the graph of the iteration functions?
(Graphs of $\Phi_1$, $\Phi_2$, $\Phi_3$ on [0,1], cf. Exp. 8.2.3.)
angular bisector of the first/third quadrant, that is, to the point $(x^{(k+1)}, x^{(k+1)})$. Returning vertically to the abscissa gives $x^{(k+1)}$.
It seems that the slope of the iteration function Φ in the fixed point, that is, in the point where it intersects
the bisector of the first/third quadrant, is crucial.
Now we investigate rigorously, when a fixed point iteration will lead to a convergent iteration with a partic-
ular qualitative kind of convergence according to Def. 8.1.17.
A simple consideration: if $\Phi(x^*) = x^*$ (fixed point), then a fixed point iteration induced by a contractive mapping Φ satisfies
$$\big\|x^{(k+1)}-x^*\big\| = \big\|\Phi(x^{(k)})-\Phi(x^*)\big\| \le L\,\big\|x^{(k)}-x^*\big\|\,, \tag{8.2.7}$$
that is, the iteration converges (at least) linearly (→ Def. 8.1.9).
Remark 8.2.8 (Banach’s fixed point theorem → [?, Satz 6.5.2],[?, Satz 5.8])
then there is a unique fixed point x∗ ∈ D, Φ(x∗ ) = x∗ , which is the limit of the sequence of iterates
x(k+1) := Φ( x (k) ) for any x(0) ∈ D.
$$\big\|x^{(k)}-x^*\big\| \le \frac{L^k}{1-L}\,\big\|x^{(1)}-x^{(0)}\big\| \;\xrightarrow{k\to\infty}\; 0\;.$$
Lemma 8.2.10. Sufficient condition for local linear convergence of fixed point iteration →
[?, Thm. 17.2], [?, Cor. 5.12]
$$x^{(k+1)} := \Phi(x^{(k)})\,, \tag{8.2.2}$$
$$D\Phi(x) = \left[\frac{\partial\Phi_i}{\partial x_j}(x)\right]_{i,j=1}^{n} =
\begin{bmatrix}
\frac{\partial\Phi_1}{\partial x_1}(x) & \frac{\partial\Phi_1}{\partial x_2}(x) & \cdots & \cdots & \frac{\partial\Phi_1}{\partial x_n}(x)\\
\frac{\partial\Phi_2}{\partial x_1}(x) & & & & \frac{\partial\Phi_2}{\partial x_n}(x)\\
\vdots & & & & \vdots\\
\frac{\partial\Phi_n}{\partial x_1}(x) & \frac{\partial\Phi_n}{\partial x_2}(x) & \cdots & \cdots & \frac{\partial\Phi_n}{\partial x_n}(x)
\end{bmatrix}\;. \tag{8.2.11}$$
“Visualization” of the statement of Lemma 8.2.10 in Rem. 8.2.5: the iteration converges locally if Φ is flat in a neighborhood of $x^*$; it will diverge if Φ is steep there.
if $\|x^{(k)}-x^*\| < \delta$.
✷
Lemma 8.2.12. Sufficient condition for linear convergence of fixed point iteration
If Φ(x∗ ) = x∗ for some interior point x∗ ∈ U , then the fixed point iteration x(k+1) = Φ(x(k) )
converges to x∗ at least linearly with rate L.
We find that Φ is contractive on U with unique fixed point x∗ , to which x(k) converges linearly for k → ∞.
✷
By asymptotic rate of a linearly converging iteration we mean the contraction factor for the norm of the iteration error that we can expect, when we are already very close to the limit $x^*$.
If $0 < \|D\Phi(x^*)\| < 1$ and $x^{(k)}\approx x^*$, then the (worst) asymptotic rate of linear convergence is $L = \|D\Phi(x^*)\|$.
In this example we encounter the first genuine system of non-linear equations and apply Lemma 8.2.12 to
it.
What about higher order convergence (→ Def. 8.1.17, cf. Φ2 in Ex. 8.2.3)? Also in this case we should
study the derivatives of the iteration functions in the fixed point (limit point).
Here we used the Landau symbol O(·) to describe the local behavior of a remainder term in the vicinity of
x∗
Lemma 8.2.18. Higher order local convergence of fixed point iterations
Now, Lemma 8.2.12 and Lemma 8.2.18 permit us a precise prediction of the (asymptotic) convergence
we can expect from the different fixed point iterations studied in Exp. 8.2.3.
(Graphs of $\Phi_1$, $\Phi_2$, $\Phi_3$ on [0,1], cf. Exp. 8.2.3.)
$$\Phi_2'(x) = \frac{1-x\,e^x}{(1+e^x)^2} = 0\,,\quad \text{if } x\,e^x - 1 = 0 \quad\Rightarrow\quad \text{quadratic convergence!}$$
Since $x^*e^{x^*} - 1 = 0$, simple computations yield
We recall the considerations of Rem. 8.1.28 about a termination criterion for contractive fixed point iterations (= linearly convergent fixed point iterations → Def. 8.1.9), cf. (8.2.7), with contraction factor (= rate of convergence) 0 ≤ L < 1:
$$\big\|x^*-x^{(k)}\big\| \le \frac{L^{k-l}}{1-L}\,\big\|x^{(l+1)}-x^{(l)}\big\|\;. \tag{8.2.21}$$
$$\big\|x^*-x^{(k)}\big\| \le \frac{L^{k}}{1-L}\,\big\|x^{(1)}-x^{(0)}\big\| \quad (8.2.22) \qquad\qquad \big\|x^*-x^{(k)}\big\| \le \frac{L}{1-L}\,\big\|x^{(k)}-x^{(k-1)}\big\| \quad (8.2.23)$$
With the same arguments as in Rem. 8.1.28 we see that overestimating L, that is, using a value for L that
is larger than the true value, still gives reliable termination criteria.
However, whereas overestimating L in (8.2.23) will not lead to a severe deterioration of the bound, unless
L ≈ 1, using a pessimistic value for L in (8.2.22) will result in a bound way bigger than the true bound, if
k ≫ 1. Then the a priori termination criterion (8.2.22) will recommend termination many iterations after
the accuracy requirements have already been met. This will thwart the efficiency of the method.
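A sketch of how the a posteriori criterion (8.2.23) can be used inside a generic scalar fixed point loop (own code; L must be supplied as a — possibly conservative — upper bound of the contraction factor):

#include <cmath>

// Fixed point iteration x^(k+1) = Phi(x^(k)) terminated via (8.2.23);
// L < 1 is an upper bound for the contraction factor of Phi.
template <class Func>
double fixedPoint(Func &&Phi, double x, const double L, const double tol) {
  double x_new = Phi(x);
  // stop as soon as the bound L/(1-L)*|x^(k)-x^(k-1)| drops below tol
  while (L / (1.0 - L) * std::abs(x_new - x) > tol) {
    x = x_new;
    x_new = Phi(x);
  }
  return x_new;
}

// example: Phi_2 from Exp. 8.2.3 with L = 0.5,
// double x = fixedPoint([](double t) { return (1.0 + t) / (1.0 + std::exp(t)); },
//                       0.5, 0.5, 1e-12);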
Supplementary reading. [?, Ch. 3] is also devoted to this topic. The algorithm of “bisection”
discussed in the next subsection, is treated in [?, Sect. 5.5.1] and [?, Sect. 3.2].
Sought: x∗ ∈ I : F( x∗ ) = 0
8.3.1 Bisection
Idea: use ordering of real numbers & intermediate value theorem [?, Sect. 4.6]
[Fig. 288: graph of F with a sign change on the interval]
Find a sequence of intervals with geometrically decreasing lengths, in each of which F will change
sign.
Such a sequence can easily be found by testing the sign of F at the midpoint of the current interval, see
Code 8.3.2.
The following C++ code implements the bisection method for finding the zeros of a function passed through
the function handle F in the interval [ a, b] with absolute tolerance tol.
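A minimal sketch of such a bisection routine (the function name, the safeguard placement, and the exact line numbering are assumptions and need not match Code 8.3.2):

#include <cmath>
#include <stdexcept>

// Minimal bisection sketch: returns an approximate zero of F in [a,b],
// assuming a < b and a sign change of F on [a,b].
template <typename Func>
double bisect(Func &&F, double a, double b, double tol) {
  double fa = F(a);
  if (fa * F(b) > 0) throw std::runtime_error("no sign change in [a,b]");
  while (b - a > tol) {
    const double x = 0.5 * (a + b);
    if (!((a < x) && (x < b))) break;       // safeguard, cf. the remark on Line 13 below
    const double fx = F(x);
    if (fa * fx <= 0) { b = x; }            // zero lies in [a,x]
    else              { a = x; fa = fx; }   // zero lies in [x,b]
  }
  return 0.5 * (a + b);
}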
Line 13: the test ((a<x) && (x<b)) offers a safeguard against an infinite loop in case tol is below the resolution of the set M of machine numbers at the zero x^∗ (cf. the “M-based termination criterion”).
This is also an example for an algorithm that (in the case of tol = 0) uses the properties of machine arithmetic to define an a posteriori termination criterion, see Section 8.1.2. The iteration will terminate when, e.g., a +̃ ½(b − a) = a (+̃ is the floating point realization of addition), which, by Ass. 1.5.32, can only happen when

|½(b − a)| ≤ EPS · |a| .
Since the exact zero is located between a and b, this condition implies a relative error ≤ EPS of the
computed zero.
Advantages:
• “foolproof”, robust: will always terminate with a zero of requested accuracy,
• requires only point evaluations of F,
• works with any continuous function F, no derivatives needed.

Drawbacks:
• merely “linear-type” convergence(∗): |x^{(k)} − x^∗| ≤ 2^{−k}|b − a|,
• log₂(|b − a| / tol) steps necessary.
(∗): the convergence of a bisection algorithm is not linear in the sense of Def. 8.1.9, because the condition |x^{(k+1)} − x^∗| ≤ L|x^{(k)} − x^∗| might be violated at any step of the iteration.
It is straightforward to combine the bisection idea with more elaborate “model function methods” as they
will be discussed in the next section: Instead of stubbornly choosing the midpoint of the probing interval
[ a, b] (→ Code 8.3.2) as next iterate, one may use a refined guess for the location of a zero of F in [ a, b].
A method of this type is used by M ATLAB’s fzero function for root finding in 1D [?, Sect. 6.2.3].
≙ class of iterative methods for finding zeros of F: the iterate in step k + 1 is computed according to the following idea:
one-point methods: x^{(k+1)} = Φ_F(x^{(k)}), k ∈ N (e.g., fixed point iteration → Section 8.2),
multi-point methods: x^{(k+1)} = Φ_F(x^{(k)}, x^{(k−1)}, …, x^{(k−m)}), k ∈ N, m = 2, 3, ….
Supplementary reading. Newton’s method in 1D is discussed in [?, Sect. 18.1], [?, Sect. 5.5.2],
x^{(k+1)} := x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})} ,   (8.3.4)

which requires F′(x^{(k)}) ≠ 0.
In Ex. 8.1.20 we learned about the quadratically convergent fixed point iteration (8.1.21) for the approxi-
mate computation of the square root of a positive number. It can be derived as a Newton iteration (8.3.4)!
For F( x ) = x2 − a, a > 0, we find F′ ( x ) = 2x, and, thus, the Newton iteration for finding zeros of F
reads:
x^{(k+1)} = x^{(k)} − \frac{(x^{(k)})^2 − a}{2x^{(k)}} = \frac{1}{2}\left(x^{(k)} + \frac{a}{x^{(k)}}\right) ,
which is exactly (8.1.21). Thus, for this F Newton’s method converges globally with order p = 2.
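A minimal C++ sketch of this square-root iteration; the correction-based stopping rule is an assumption added for completeness:

#include <cmath>
#include <iostream>

// Newton iteration (8.3.4) for F(x) = x^2 - a:  x_{k+1} = (x_k + a/x_k)/2
double sqrt_newton(double a, double x0, double rtol = 1.0e-15) {
  double x = x0;
  double s;                       // Newton correction
  do {
    s = 0.5 * (x + a / x) - x;
    x += s;
  } while (std::abs(s) > rtol * std::abs(x));
  return x;
}

int main() {
  std::cout.precision(16);
  std::cout << sqrt_newton(2.0, 1.0) << " vs. " << std::sqrt(2.0) << std::endl;
}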
Newton iterations for two different scalar non-linear equations F(x) = 0 with the same solution sets:

F(x) = x e^x − 1 ⇒ F′(x) = e^x(1 + x) ⇒ x^{(k+1)} = x^{(k)} − \frac{x^{(k)} e^{x^{(k)}} − 1}{e^{x^{(k)}}(1 + x^{(k)})} = \frac{(x^{(k)})^2 + e^{−x^{(k)}}}{1 + x^{(k)}} ,
F(x) = x − e^{−x} ⇒ F′(x) = 1 + e^{−x} ⇒ x^{(k+1)} = x^{(k)} − \frac{x^{(k)} − e^{−x^{(k)}}}{1 + e^{−x^{(k)}}} = \frac{1 + x^{(k)}}{1 + e^{x^{(k)}}} .
Exp. 8.2.3 confirms quadratic convergence in both cases! (→ Def. 8.1.17)
Note that for the computation of its zeros, the function F in this example can be recast in different forms!
In fact, based on Lemma 8.2.18 it is straightforward to show local quadratic convergence of Newton’s method to a zero x^∗ of F, provided that F′(x^∗) ≠ 0:

Newton iteration (8.3.4) ≙ fixed point iteration (→ Section 8.2) with iteration function

Φ(x) = x − \frac{F(x)}{F′(x)} ⇒ Φ′(x) = \frac{F(x)\, F″(x)}{(F′(x))^2} ⇒ Φ′(x^∗) = 0 , if F(x^∗) = 0, F′(x^∗) ≠ 0 .
Thus from Lemma 8.2.18 we conclude the following result:
[Fig. 289: ladder circuit with resistors R_1, …, R_n, leak resistances R, and voltage source U]
How do we have to choose the leak resistance R in the linear circuit displayed in Fig. 289 in order to
achieve a prescribed potential at one of the nodes?
Using nodal analysis of the circuit introduced in Ex. 2.1.3, this problem can be formulated as: find x ∈ R ,
x := R−1 , such that
F(x) = 0 with F : R → R , x ↦ w^⊤(A + xI)^{−1} b − 1 ,   (8.3.10)
where A ∈ R n,n is a symmetric, tridiagonal, diagonally dominant matrix, w ∈ R n is a unit vector singling
out the node of interest, and b takes into account the exciting voltage U .
In order to apply Newton’s method to (8.3.10), we have to determine the derivative F′(x), and we do so by implicit differentiation [?, Sect. 7.8], first rewriting (u(x) ≙ vector of nodal potentials as a function of x = R^{−1})

F(x) = w^⊤ u(x) − 1 ,   (A + xI)u(x) = b .
Then we differentiate the linear system of equations defining u(x) on both sides with respect to x using the product rule (8.4.10):

\frac{d}{dx}\big[(A + xI)u(x)\big] = \frac{d}{dx} b  ⟹  (A + xI)u′(x) + u(x) = 0 .
x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})} = x^{(k)} + \frac{w^⊤ u(x^{(k)}) − 1}{w^⊤ (A + x^{(k)} I)^{−1} u(x^{(k)})} ,   (A + x^{(k)} I)\, u(x^{(k)}) = b .   (8.3.13)
In each step of the iteration we have to solve two linear systems of equations, which can be done with
asymptotic effort O(n) in this case, because A + x (k) I is tridiagonal.
Note that in a practical application one must demand x > 0 in addition, because the solution must provide a meaningful conductance (= inverse resistance).
Also note that bisection (→ Section 8.3.1) is a viable alternative to using Newton’s method in this case.
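A sketch of one step of the iteration (8.3.13) with Eigen; the sparse solver, the function name, and the storage of A as a sparse (tridiagonal) matrix are assumptions, not part of the original text:

#include <Eigen/Sparse>
#include <Eigen/SparseLU>

// One Newton step (8.3.13) for F(x) = w^T (A + xI)^{-1} b - 1.
// Both linear systems share the factorization of A + x*I.
double newton_step_leak(const Eigen::SparseMatrix<double> &A,
                        const Eigen::VectorXd &w, const Eigen::VectorXd &b,
                        double x) {
  const int n = A.rows();
  Eigen::SparseMatrix<double> I(n, n);
  I.setIdentity();
  Eigen::SparseMatrix<double> M = A + x * I;         // M = A + x*I
  M.makeCompressed();
  Eigen::SparseLU<Eigen::SparseMatrix<double>> lu(M);
  const Eigen::VectorXd u = lu.solve(b);             // (A + xI) u = b
  const Eigen::VectorXd v = lu.solve(u);             // (A + xI) v = u
  return x + (w.dot(u) - 1.0) / w.dot(v);            // update from (8.3.13)
}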
Useful, if a priori knowledge about the structure of F (e.g. about F being a rational function, see below) is
available. This is often the case, because many problems of 1D zero finding are posed for functions given
in analytic form with a few parameters.
This example demonstrates that non-polynomial model functions can offer excellent approximations of F. In this example the model function is chosen as a quotient of two linear functions, that is, from the simplest class of true rational functions.
Of course, that this function provides a good model function is merely “a matter of luck”, unless you have
some more information about F. Such information might be available from the application context.
\frac{a}{x^{(k)} + b} + c = F(x^{(k)}) ,   −\frac{a}{(x^{(k)} + b)^2} = F′(x^{(k)}) ,   \frac{2a}{(x^{(k)} + b)^3} = F″(x^{(k)}) .
x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})} · \frac{1}{1 − \frac{1}{2}\,\frac{F(x^{(k)})\, F″(x^{(k)})}{F′(x^{(k)})^2}} .

Halley’s iteration for F(x) = \frac{1}{(x + 1)^2} + \frac{1}{(x + 0.1)^2} − 1 , x > 0 , and x^{(0)} = 0:
k    x^{(k)}    F(x^{(k)})    x^{(k)} − x^{(k−1)}    x^{(k)} − x^∗
1 0.19865959351191 10.90706835180178 -0.19865959351191 -0.84754290138257
2 0.69096314049024 0.94813655914799 -0.49230354697833 -0.35523935440424
3 1.02335017694603 0.03670912956750 -0.33238703645579 -0.02285231794846
4 1.04604398836483 0.00024757037430 -0.02269381141880 -0.00015850652965
5 1.04620248685303 0.00000001255745 -0.00015849848821 -0.00000000804145
Compare with Newton method (8.3.4) for the same problem:
! Newton method converges more slowly, but also needs less effort per step (→ Section 8.3.3)
In the previous example Newton’s method performed rather poorly. Often its convergence can be boosted
by converting the non-linear equation to an equivalent one (that is, one with the same solutions) for another
function g, which is “closer to a linear function”:
Assume F ≈ F̂, where F̂ is invertible with an inverse F̂^{−1} that can be evaluated with little effort. Then set

g(x) := F̂^{−1}(F(x)) ≈ x .
Then apply Newton’s method to g( x ), using the formula for the derivative of the inverse of a function
\frac{d}{dy}\big(F̂^{−1}\big)(y) = \frac{1}{F̂′(F̂^{−1}(y))}  ⟹  g′(x) = \frac{1}{F̂′(g(x))} · F′(x) .
As in Ex. 8.3.14: F(x) = \frac{1}{(x + 1)^2} + \frac{1}{(x + 0.1)^2} − 1 , x > 0 :
[Figure: graphs of F(x) and g(x) on [0, 4]]

Observation: F(x) + 1 ≈ 2x^{−2} for x ≫ 1, and so g(x) := \frac{1}{\sqrt{F(x) + 1}} is “almost” linear for x ≫ 1.
Idea: instead of F(x) = 0 tackle g(x) = 1 with Newton’s method (8.3.4).
x^{(k+1)} = x^{(k)} − \frac{g(x^{(k)}) − 1}{g′(x^{(k)})} = x^{(k)} + \left(\frac{1}{\sqrt{F(x^{(k)}) + 1}} − 1\right)\frac{2\big(F(x^{(k)}) + 1\big)^{3/2}}{F′(x^{(k)})}
 = x^{(k)} + \frac{2\big(F(x^{(k)}) + 1\big)\big(1 − \sqrt{F(x^{(k)}) + 1}\big)}{F′(x^{(k)})} .
Convergence recorded for x (0) = 0:
For zero finding there is a wealth of iterative methods that offer a higher order of convergence. One class is discussed next.
Taking the cue from the iteration function of Newton’s method (8.3.4), we extend it by introducing an extra
function H :
new fixed point iteration:  Φ(x) = x − H(x)\,\frac{F(x)}{F′(x)}  with “proper” H : I → R .
Still, every zero of F is a fixed point of this Φ, that is, the fixed point iteration is still consistent (→ Def. 8.2.1).
Aim: find H such that the method is of p-th order. The main tool is Lemma 8.2.18, which tells us that ensuring Φ^{(ℓ)}(x^∗) = 0 for 1 ≤ ℓ ≤ p − 1 guarantees local convergence of order p.
Φ′(x^∗) = 1 − H(x^∗) ,   Φ″(x^∗) = H(x^∗)\,\frac{F″(x^∗)}{F′(x^∗)} − 2H′(x^∗) .   (8.3.17)

Lemma 8.2.18 ➢ necessary conditions for local convergence of order p:
p = 2 (quadratic convergence):  H(x^∗) = 1 ,
p = 3 (cubic convergence):  H(x^∗) = 1 ∧ H′(x^∗) = \frac{1}{2}\,\frac{F″(x^∗)}{F′(x^∗)} .
Trial expression: H ( x ) = G (1 − u′ ( x )) with “appropriate” G
fixed point iteration   x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})}\, G\!\left(\frac{F(x^{(k)})\, F″(x^{(k)})}{(F′(x^{(k)}))^2}\right) .   (8.3.18)
If F ∈ C²(I), F(x^∗) = 0, F′(x^∗) ≠ 0, and G ∈ C²(U) in a neighbourhood U of 0 with G(0) = 1, G′(0) = ½, then the fixed point iteration (8.3.18) converges locally cubically to x^∗.

Proof. We apply Lemma 8.2.18, which tells us that both derivatives from (8.3.17) have to vanish. Using the definition of H we find

H(x^∗) = G(0) ,   H′(x^∗) = −G′(0)\, u″(x^∗) = G′(0)\,\frac{F″(x^∗)}{F′(x^∗)} .
Plugging these expressions into (8.3.17) finishes the proof.
✷
• G(t) = \frac{1}{1 − \frac{1}{2}t}  ➡ Halley’s iteration (→ Ex. 8.3.14)
• G(t) = \frac{2}{1 + \sqrt{1 − 2t}}  ➡ Euler’s iteration
• G(t) = 1 + \frac{1}{2}t  ➡ quadratic inverse interpolation
Numerical experiment: F(x) = x e^x − 1, x^{(0)} = 5; errors e^{(k)} := x^{(k)} − x^∗:

k    Halley              Euler               Quad. Inv.
1    2.81548211105635    3.57571385244736    2.03843730027891
2    1.37597082614957    2.76924150041340    1.02137913293045
3    0.34002908011728    1.95675490333756    0.28835890388161
4    0.00951600547085    1.25252187565405    0.01497518178983
5    0.00000024995484    0.51609312477451    0.00000315361454
6                        0.14709716035310
7                        0.00109463314926
8                        0.00000000107549
Supplementary reading. The secant method is presented in [?, Sect. 18.2], [?, Sect. 5.5.3],
[Fig. 290: graph of F with the secant through (x^{(k−1)}, F(x^{(k−1)})) and (x^{(k)}, F(x^{(k)}))]

s(x) = F(x^{(k)}) + \frac{F(x^{(k)}) − F(x^{(k−1)})}{x^{(k)} − x^{(k−1)}}\,(x − x^{(k)}) ,   (8.3.23)

x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})\,(x^{(k)} − x^{(k−1)})}{F(x^{(k)}) − F(x^{(k−1)})} .   (8.3.24)
17      fo = fn;
18    }
19    return x1;
20  }
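A complete sketch in the spirit of Code 8.3.25 (the function name, the maximum iteration count, and the exact termination criterion are assumptions):

#include <algorithm>
#include <cmath>

// Secant method (8.3.24): derivative-free 2-point iteration for F(x) = 0.
template <typename Func>
double secant(Func &&F, double x0, double x1, double rtol, double atol,
              unsigned int maxit = 50) {
  double fo = F(x0);
  for (unsigned int k = 0; k < maxit; ++k) {
    const double fn = F(x1);
    const double s = fn * (x1 - x0) / (fn - fo); // secant correction
    x0 = x1;
    x1 = x1 - s;
    // correction-based termination criterion
    if (std::abs(s) < std::max(atol, rtol * std::abs(x1)))
      return x1;
    fo = fn;
  }
  return x1;
}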
Remember: F( x ) may only be available as output of a (complicated) procedure. In this case it is difficult
to find a procedure that evaluates F′ ( x ). Thus the significance of methods that do not involve evaluations
of derivatives.
Model problem: find zero of F( x ) = xe x − 1, using secant method of Code 8.3.25 with initial guesses
x ( 0 ) = 0, x ( 1 ) = 5.
k    x^{(k)}             F(x^{(k)})           e^{(k)} := x^{(k)} − x^∗    \frac{\log|e^{(k+1)}| − \log|e^{(k)}|}{\log|e^{(k)}| − \log|e^{(k−1)}|}
2 0.00673794699909 -0.99321649977589 -0.56040534341070
3 0.01342122983571 -0.98639742654892 -0.55372206057408 24.43308649757745
4 0.98017620833821 1.61209684919288 0.41303291792843 2.70802321457994
5 0.38040476787948 -0.44351476841567 -0.18673852253030 1.48753625853887
6 0.50981028847430 -0.15117846201565 -0.05733300193548 1.51452723840131
7 0.57673091089295 0.02670169957932 0.00958762048317 1.70075240166256
8 0.56668541543431 -0.00126473620459 -0.00045787497547 1.59458505614449
9 0.56713970649585 -0.00000990312376 -0.00000358391394 1.62641838319117
10 0.56714329175406 0.00000000371452 0.00000000134427
11 0.56714329040978 -0.00000000000001 -0.00000000000000
The rightmost column of the table provides an estimate for the order of convergence (→ Def. 8.1.17); for further explanations see Rem. 8.1.19.
A startling observation: the method seems to have a fractional (!) order of convergence, see Def. 8.1.17.
Indeed, a fractional order of convergence can be proved for the secant method, see [?, Sect. 18.2]. Here
we give an asymptotic argument that holds, if the iterates are already very close to the zero x ∗ of F.
Thanks to the asymptotic perspective we may assume that |e^{(k)}|, |e^{(k−1)}| ≪ 1, so that we can rely on a two-dimensional Taylor expansion around (x^∗, x^∗), cf. [?, Satz 7.5.2]:

Φ(x^∗ + h, x^∗ + k) = Φ(x^∗, x^∗) + \frac{∂Φ}{∂x}(x^∗, x^∗)h + \frac{∂Φ}{∂y}(x^∗, x^∗)k
 + \frac{1}{2}\frac{∂^2Φ}{∂x^2}(x^∗, x^∗)h^2 + \frac{∂^2Φ}{∂x∂y}(x^∗, x^∗)hk + \frac{1}{2}\frac{∂^2Φ}{∂y^2}(x^∗, x^∗)k^2 + R(x^∗, h, k) ,   (8.3.30)

with |R| ≤ C(h^3 + h^2 k + hk^2 + k^3) .
Computations invoking the quotient rule and product rule and using F(x^∗) = 0 show

Φ(x^∗, x^∗) = x^∗ ,   \frac{∂Φ}{∂x}(x^∗, x^∗) = \frac{∂Φ}{∂y}(x^∗, x^∗) = \frac{1}{2}\frac{∂^2Φ}{∂x^2}(x^∗, x^∗) = \frac{1}{2}\frac{∂^2Φ}{∂y^2}(x^∗, x^∗) = 0 .
We may also use MAPLE to find the Taylor expansion (assuming F sufficiently smooth):
> Phi := (x,y) -> x-F(x)*(x-y)/(F(x)-F(y));
> F(s) := 0;
> e2 = normal(mtaylor(Phi(s+e1,s+e0)-s,[e0,e1],4));
➣ truncated error propagation formula (products of three or more error terms ignored)
e^{(k+1)} ≐ \frac{1}{2}\frac{F″(x^∗)}{F′(x^∗)}\, e^{(k)} e^{(k−1)} = C\, e^{(k)} e^{(k−1)} .   (8.3.31)
How can we deduce the order of convergence from this recursion formula? We try e^{(k)} = K(e^{(k−1)})^p, inspired by the estimate in Def. 8.1.17:

⇒ e^{(k+1)} = K^{p+1}(e^{(k−1)})^{p^2}
⇒ (e^{(k−1)})^{p^2 − p − 1} = K^{−p}C  ⇒  p^2 − p − 1 = 0  ⇒  p = ½(1 ± √5) .
As e^{(k)} → 0 for k → ∞ we get the order of convergence p = ½(1 + √5) ≈ 1.62 (see Exp. 8.3.26!).
[Fig. 291: F(x) = arctan(x) — pairs (x^{(0)}, x^{(1)}) ∈ R²₊ of initial guesses]
F(x^∗) = 0 ⇔ F^{−1}(0) = x^∗ .

The interpolating polynomial p is required to satisfy

p(F(x^{(k−j)})) = x^{(k−j)} ,  j = 0, …, m − 1 .

[Fig. 292: graphs of F and F^{−1} near x^∗]

Case m = 2 (2-point method) ➢ secant method (Fig. 293)
Case m = 3: quadratic inverse interpolation, a 3-point method, see [?, Sect. 4.5].
We interpolate the points (F(x^{(k)}), x^{(k)}), (F(x^{(k−1)}), x^{(k−1)}), (F(x^{(k−2)}), x^{(k−2)}) with a parabola (polynomial of degree 2). Note the importance of monotonicity of F, which ensures that F(x^{(k)}), F(x^{(k−1)}), F(x^{(k−2)}) are mutually different.
We test the method for the model problem/initial guesses F(x) = x e^x − 1 , x^{(0)} = 0 , x^{(1)} = 2.5 , x^{(2)} = 5 .
k    x^{(k)}             F(x^{(k)})           e^{(k)} := x^{(k)} − x^∗    \frac{\log|e^{(k+1)}| − \log|e^{(k)}|}{\log|e^{(k)}| − \log|e^{(k−1)}|}
3 0.08520390058175 -0.90721814294134 -0.48193938982803
4 0.16009252622586 -0.81211229637354 -0.40705076418392 3.33791154378839
5 0.79879381816390 0.77560534067946 0.23165052775411 2.28740488912208
6 0.63094636752843 0.18579323999999 0.06380307711864 1.82494667289715
7 0.56107750991028 -0.01667806436181 -0.00606578049951 1.87323264214217
8 0.56706941033107 -0.00020413476766 -0.00007388007872 1.79832936980454
9 0.56714331707092 0.00000007367067 0.00000002666114 1.84841261527097
10 0.56714329040980 0.00000000000003 0.00000000000001
Also in this case the numerical experiment hints at a fractional rate of convergence p ≈ 1.8, as in the case
of the secant method, see Rem. 8.3.27.
Efficiency is measured by forming the ratio of gain and the effort required to achieve it. For iterative
methods for solving F(x) = 0, F : D ⊂ R n → R n , this means the following:
Ingredient ➊: W ≙ computational effort per step
(e.g., W ≈ \frac{\#\{\text{evaluations of } F\}}{\text{step}} + n · \frac{\#\{\text{evaluations of } F′\}}{\text{step}} + ⋯)
Let us consider an iterative method of order p ≥ 1 (→ Def. 8.1.17). Its error recursion can be converted into the expressions (8.3.36) and (8.3.37), which relate the error norm ‖e^{(k)}‖ to ‖e^{(0)}‖ and lead to quantitative bounds for the number of steps needed to achieve (8.3.35):
∃C > 0:  ‖e^{(k)}‖ ≤ C‖e^{(k−1)}‖^p  ∀k ≥ 1  (C < 1 for p = 1) .

Assuming C‖e^{(0)}‖^{p−1} < 1 (guarantees convergence!), we find the following minimum number of steps to achieve (8.3.35) for sure:

p = 1:  ‖e^{(k)}‖ ≤ C^k‖e^{(0)}‖  requires  k ≥ \frac{\log ρ}{\log C} ,   (8.3.36)

p > 1:  ‖e^{(k)}‖ ≤ C^{\frac{p^k − 1}{p − 1}}‖e^{(0)}‖^{p^k}  requires  p^k ≥ 1 + \frac{\log ρ}{\frac{\log C}{p − 1} + \log ‖e^{(0)}‖}

  ⇒  k ≥ \log\Big(1 + \frac{\log ρ}{\log L_0}\Big)\Big/\log p ,   (8.3.37)

L_0 := C^{1/(p−1)}‖e^{(0)}‖ < 1 .
Now we adopt an asymptotic perspective and ask for a large reduction of the error, that is ρ ≪ 1.
If ρ ≪ 1, then \log\big(1 + \frac{\log ρ}{\log L_0}\big) ≈ \log|\log ρ| − \log|\log L_0| ≈ \log|\log ρ|. This simplification will be made in the context of the asymptotic considerations ρ → 0 below.
We conclude that
• when requiring high accuracy, linearly convergent iterations should not be used, because their effi-
ciency does not increase for ρ → 0,
• for a method of order p > 1, the quotient \frac{\log p}{W} offers a gauge for efficiency.
[Fig. 294: number of iteration steps according to (8.3.37) as a function of the order p, for C = 0.5, 1.0, 1.5]
The plot displays the number of iteration steps according to (8.3.37).
We compare Newton’s method ↔ secant method.

[Fig. 295: number of iterations vs. −log₁₀(ρ) for Newton’s method and the secant method]

Newton’s method requires only marginally fewer steps than the secant method.
We set the effort for a step of Newton’s method to twice that for a step of the secant method from Code 8.3.25, because we need an additional evaluation of F′ in Newton’s method.
The multi-dimensional Newton method is also presented in [?, Sect. 19], [?, Sect. 5.6], [?, Sect. 9.1].
For F : D ⊂ R^n → R^n find x^∗ ∈ D: F(x^∗) = 0.
We assume: F : D ⊂ R^n → R^n is continuously differentiable.
[Fig. 296]
Here a correction-based a posteriori termination criterion for the Newton iteration is used; it stops the iteration if the relative size of the Newton correction drops below the prescribed relative tolerance rtol.
If x^∗ ≈ 0, also the absolute size of the Newton correction has to be tested against an absolute tolerance atol in order to avoid non-termination despite convergence of the iteration.
10    do {
11      s = DFinv(x, F(x));       // compute Newton correction
12      x -= s;                   // compute next iterate
13
14      if (callback != nullptr)
15        callback(x, s);
16    }
17    // correction based termination (relative and absolute)
18    while ((s.norm() > rtol*x.norm()) && (s.norm() > atol));
19
20    return x;
21  }
that computes the Newton correction, that is, it returns the solution of a linear system with system matrix D F(x) (x ↔ x) and right hand side f ↔ f.
☞ The argument x will be overwritten with the computed solution of the non-linear system.
The next code demonstrates the invocation of newton for a 2 × 2 non-linear system from a code relying on EIGEN. It also demonstrates the use of fixed-size Eigen matrices and vectors.
An important property of the Newton iteration (8.4.1): affine invariance → [?, Sect. 1.2.2]

Affine invariance: Newton iterations for G_A(x) := A F(x) = 0 are the same for all regular A ∈ R^{n,n}!

This is a simple computation:
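A sketch of that computation (with G_A(x) := A F(x), A regular, as above): since A is constant, D G_A(x) = A\, D F(x), hence

x − D G_A(x)^{−1} G_A(x) = x − (A\, D F(x))^{−1} A F(x) = x − D F(x)^{−1} A^{−1} A F(x) = x − D F(x)^{−1} F(x) ,

which is exactly the Newton update for F.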
• convergence theory for Newton’s method: assumptions and results should be affine invariant, too.
• modifying and extending Newton’s method: resulting schemes should preserve affine invariance.
In particular, termination criteria for Newton’s method should also be affine invariant in the sense that, when applied to G_A, they stop the iteration at exactly the same step for any choice of A.
The function F : R n → R n defining the non-linear system of equations may be given in various formats,
as explicit expression or rather implicitly. In most cases, D F has to be computed symbolically in order to
obtain concrete formulas for the Newton iteration. We now learn how these symbolic computations can be
carried out harnessing advanced techniques of multi-variate calculus.
The reader will probably agree that the derivative of a function F : I ⊂ R → R at x ∈ I is a number F′(x) ∈ R, and that the derivative of a function F : D ⊂ R^n → R^m at x ∈ D is a matrix D F(x) ∈ R^{m,n}. However, the nature of the derivative at a point is that of a linear mapping that approximates F locally up to second order:
☞ Note that D F(x)h ∈ W is the vector returned by the linear mapping D F(x) when applied to h ∈ V .
☞ In Def. 8.4.6 k·k can be any norm on V (→ Def. 1.5.70).
☞ A common shorthand notation for (8.4.7) relies on the “little-o” Landau symbol:
In the context of the Newton iteration (8.4.1) the computation of the Newton correction s in the k + 1-th
step amounts to solving a linear system of equations:
Matching this with Def. 8.4.6 we see that we need only determine expressions for D F(x(k) )h, h ∈ V ,
in order to state the LSE yielding the Newton correction. This will become important when applying the
“compact” differentiation rules discussed next.
Statement of the Newton iteration (8.4.1) for F : R n 7→ R n given as analytic expression entails computing
the Jacobian D F. The safe, but tedious way is to use the definition (8.2.11) directly and compute the
partial derivatives.
To avoid cumbersome component-oriented considerations, it is sometimes useful to know the rules of
multidimensional differentiation:
T (x) = b( F(x), G (x)) ⇒ D T (x)h = b(D F(x)h, G (x)) + b( F(x), D G (x)h) , (8.4.10)
h ∈ V, x ∈ D .
The first and second derivatives of real-valued functions occur frequently and have special names, see [?,
Def. 7.3.2] and [?, Satz 7.5.3].
“High level differentiation”: We apply the product rule (8.4.10) with F, G = Id, which means D F(x) =
D G (x) = I, and the bilinear form b(x, y) := x T Ay:
D Ψ(x)h = h^⊤ A x + x^⊤ A h = \underbrace{\big(x^⊤ A^⊤ + x^⊤ A\big)}_{=(\operatorname{grad} Ψ(x))^⊤}\, h ,
“Low level differentiation”: Using the rules of matrix×vector multiplication, Ψ can be written in terms of the vector components x_i, i = 1, …, n:

Ψ(x) = \sum_{k=1}^{n}\sum_{j=1}^{n} (A)_{k,j}\, x_k x_j = (A)_{i,i}\, x_i^2 + \sum_{j≠i} (A)_{i,j}\, x_i x_j + \sum_{k≠i} (A)_{k,i}\, x_k x_i + \sum_{k≠i}\sum_{j≠i} (A)_{k,j}\, x_k x_j ,

which yields \frac{∂Ψ}{∂x_i}(x) = 2(A)_{i,i} x_i + \sum_{j≠i} (A)_{i,j} x_j + \sum_{k≠i} (A)_{k,i} x_k = (Ax + A^⊤x)_i .
We seek the derivative of the Euclidean norm, that is, of the function F(x) := ‖x‖₂, x ∈ R^n \ {0} (F is defined but not differentiable in x = 0, just look at the case n = 1!).
“High level differentiation”: We can write F as the composition of two functions F = G ∘ H with

G : R₊ → R₊ , G(ξ) := \sqrt{ξ} ,   H : R^n → R , H(x) := x^⊤ x .

Using the rule for the differentiation of bilinear forms from Ex. 8.4.12 for the case A = I and basic calculus, we find

D H(x)h = 2x^⊤ h , x, h ∈ R^n ,   D G(ξ)ζ = \frac{ζ}{2\sqrt{ξ}} , ξ > 0, ζ ∈ R .
Finally, the chain rule (8.4.9) gives

D F(x)h = D G(H(x))\,(D H(x)h) = \frac{2x^⊤ h}{2\sqrt{x^⊤ x}} = \frac{x^⊤}{‖x‖_2}·h .   (8.4.14)

Def. 8.4.11 ⇒ grad F(x) = \frac{x}{‖x‖_2} .
This paragraph explains the use of the general product rule (8.4.10) to derive the linear system solved by
the Newton correction. It implements the insights from § 8.4.5.
We seek solutions of F(x) = 0 with F(x) := b(G (x), H (x)), where
✦ V, W are some vector spaces (finite- or even infinite-dimensional),
✦ G : D → V , H : D → W , D ⊂ R n , are continuously differentiable in the sense of Def. 8.4.6,
✦ b : V × W 7→ R n is bilinear (linear in each argument).
According to the general product rule (8.4.10) we have
This already defines the linear system of equations to be solved to compute the Newton correction s
b(D G (x(k) )s, H (x(k) )) + b(G (x(k) ), D H (x(k) )s) = −b(G (x(k) ), H (x(k) )) . (8.4.17)
Since the left-hand side is linear in s, this really represents a square linear system of n equations. The
next example will present a concrete case.
For many quasi-linear systems, for which there exist solutions, the fixed point iteration (→ Section 8.2)
x ( k + 1) = A ( x ( k ) ) − 1 b ⇔ A ( x ( k ) ) x ( k + 1) = b , (8.4.20)
D F ( x ) h = (D A ( x ) h ) x + A ( x ) h , h ∈ R n . (8.4.21)
Note that D A(x(k) ) is a mapping from R n into R n,n , which gets h as an argument. Then the Newton
iteration reads
x^{(k+1)} = x^{(k)} − s ,  D F(x^{(k)})s = (D A(x^{(k)})s)\,x^{(k)} + A(x^{(k)})s = A(x^{(k)})x^{(k)} − b .   (8.4.22)
where γ(x) := 3 + ‖x‖₂ (Euclidean vector norm), the right hand side vector b ∈ R^n is given, and x ∈ R^n is unknown.
The derivative of the first term is straightforward, because it is linear in x, see the discussion following
Def. 8.4.6.
The “pedestrian” approach to the second term starts with writing it explicitly in components as

(x‖x‖)_i = x_i\sqrt{x_1^2 + ⋯ + x_n^2} , i = 1, …, n .

Then we can compute the Jacobian according to (8.2.11) by taking partial derivatives:

\frac{∂}{∂x_i}(x‖x‖)_i = \sqrt{x_1^2 + ⋯ + x_n^2} + \frac{x_i^2}{\sqrt{x_1^2 + ⋯ + x_n^2}} ,
\frac{∂}{∂x_j}(x‖x‖)_i = \frac{x_i x_j}{\sqrt{x_1^2 + ⋯ + x_n^2}} , j ≠ i .
For the “high level” treatment of the second term x ↦ x‖x‖₂ we apply the product rule (8.4.10), together with (8.4.14):

D F(x)h = Th + ‖x‖₂ h + x\,\frac{x^⊤ h}{‖x‖₂} = \Big(A(x) + \frac{x x^⊤}{‖x‖₂}\Big) h .

Thus, in concrete terms the Newton iteration (8.4.22) becomes

x^{(k+1)} = x^{(k)} − \Big(A(x^{(k)}) + \frac{x^{(k)} (x^{(k)})^⊤}{‖x^{(k)}‖_2}\Big)^{−1}\big(A(x^{(k)})x^{(k)} − b\big) .
Note that the matrix of the linear system to be solved in each step is a rank-1-modification (2.6.17) of the symmetric positive definite tridiagonal matrix A(x^{(k)}), cf. Lemma 2.8.12. Thus the Sherman-Morrison-Woodbury formula from Lemma 2.6.22 can be used to solve it efficiently.
Given are
This relationship will provide a valid definition of F in a neighborhood of x0 ∈ W , if we assume that there
is x0 , z0 ∈ W such that b(G (x0 ), z0 ) = b, and that the linear mapping z 7→ b(G (x0 ), z) is invertible.
Then, for x close to x0 , F(x) can be computed by solving a square linear system of equations in W . In
Ex. 8.3.9 we already saw an example of an implicitly defined F for W = R .
We want to solve F(x) = 0 for this implicitly defined F by means of Newton’s method. In order to
determine the derivative of F we resort to implicit differentiation [?, Sect. 7.8] of the defining equation
(8.4.26) by means of the general product rule (8.4.10). We formally differentiate both sides of (8.4.26):
and find that the Newton correction s in the (k + 1)-th Newton step can be computed as follows:
which constitutes a dim W × dim W linear system of equations. The next example discusses a concrete application of implicit differentiation with W = R^{n,n}.
We consider matrix inversion as a mapping and (formally) compute its derivative, that is, the derivative of the function

inv : R^{n,n}_∗ → R^{n,n} ,  X ↦ X^{−1} ,

where R^{n,n}_∗ denotes the (open) set of invertible n × n-matrices, n ∈ N.

inv(X) · X = I ,  X ∈ R^{n,n}_∗ .   (8.4.29)
Differentiation on both sides of (8.4.29) by means of the product rule (8.4.10) yields (D inv(X)H)\,X + inv(X)\,H = 0, that is, D inv(X)H = −X^{−1} H X^{−1}.   (8.4.30)
For n = 1 we get D inv(x)h = −\frac{h}{x^2}, which recovers the well-known derivative of the function x ↦ x^{−1}.
Surprisingly, it is possible to obtain the inverse of a matrix as the solution of a non-linear system of equations. Thus it can be computed using Newton’s method.
Given a regular matrix A ∈ R^{n,n}, its inverse can be defined as the unique zero of a function:

X = A^{−1} ⟺ F(X) = 0 for F : R^{n,n}_∗ → R^{n,n} ,  X ↦ A − X^{−1} .
Using (8.4.30) we find for the derivative of F in X ∈ R^{n,n}_∗:  D F(X)H = X^{−1} H X^{−1} .   (8.4.32)
X^{(k+1)} = X^{(k)} − S ,  S := D F(X^{(k)})^{−1} F(X^{(k)}) .   (8.4.33)

The Newton correction S in the k-th step solves the linear system of equations

D F(X^{(k)})S \overset{(8.4.32)}{=} (X^{(k)})^{−1} S (X^{(k)})^{−1} = F(X^{(k)}) = A − (X^{(k)})^{−1} ,

hence

S = X^{(k)}\big(A − (X^{(k)})^{−1}\big)X^{(k)} = X^{(k)} A X^{(k)} − X^{(k)} ,   (8.4.34)

and, inserted in (8.4.33),

X^{(k+1)} = X^{(k)} − \big(X^{(k)} A X^{(k)} − X^{(k)}\big) = X^{(k)}\big(2I − A X^{(k)}\big) .   (8.4.35)
To study the convergence of this iteration we derive a recursion for the iteration errors E^{(k)} := X^{(k)} − A^{−1}:

E^{(k+1)} = X^{(k+1)} − A^{−1} \overset{(8.4.35)}{=} X^{(k)}\big(2I − A X^{(k)}\big) − A^{−1}
 = (E^{(k)} + A^{−1})\big(2I − A(E^{(k)} + A^{−1})\big) − A^{−1}
 = (E^{(k)} + A^{−1})(I − A E^{(k)}) − A^{−1} = −E^{(k)} A E^{(k)} .
For the norm of the iteration error (a matrix norm → Def. 1.5.76) we conclude from submultiplicativity
(1.5.77) a recursive estimate
‖E^{(k+1)}‖ ≤ ‖E^{(k)}‖^2 ‖A‖ .   (8.4.36)
This holds for any matrix norm according to Def. 1.5.76, which is induced by a vector norm. For the relative iteration error we obtain

\underbrace{\frac{‖E^{(k+1)}‖}{‖A^{−1}‖}}_{\text{relative error}} ≤ \Big(\underbrace{\frac{‖E^{(k)}‖}{‖A^{−1}‖}}_{\text{relative error}}\Big)^2 \underbrace{‖A‖\,‖A^{−1}‖}_{=\operatorname{cond}(A)} ,   (8.4.37)
From (8.4.36) we conclude that the iteration will converge (lim_{k→∞} E^{(k)} = 0) if the initial error is small enough, which gives a condition on the initial guess X^{(0)}. Now let us consider the Euclidean matrix norm ‖·‖₂, which can be expressed in terms of eigenvalues, see Cor. 1.5.82. Motivated by this relationship, we use the initial guess X^{(0)} = αA^⊤ with α > 0 still to be determined.
‖X^{(0)} A − I‖₂ = ‖αA^⊤A − I‖₂ \overset{!}{<} 1 ⇔ α‖A‖₂^2 − 1 < 1 ⇔ α < \frac{2}{‖A‖₂^2} ,
which is a sufficient condition for the initial guess X(0) = αA⊤ , in order to make (8.4.35) converge. In this
case we infer quadratic convergence from both (8.4.36) and (8.4.37).
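A small Eigen sketch of the iteration (8.4.35) with this kind of initial guess; the function name, the tolerance, and the use of the Frobenius norm to bound ‖A‖₂ are assumptions:

#include <Eigen/Dense>
#include <iostream>

// Newton iteration (8.4.35) for the inverse: X_{k+1} = X_k (2I - A X_k),
// started with X_0 = alpha*A^T. Since ||A||_F >= ||A||_2, the choice
// alpha = 1/||A||_F^2 satisfies alpha < 2/||A||_2^2.
Eigen::MatrixXd inv_newton(const Eigen::MatrixXd &A, double rtol = 1e-12,
                           unsigned int maxit = 100) {
  const int n = A.rows();
  const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n, n);
  Eigen::MatrixXd X = (1.0 / A.squaredNorm()) * A.transpose();
  for (unsigned int k = 0; k < maxit; ++k) {
    const Eigen::MatrixXd Xn = X * (2.0 * I - A * X);
    if ((Xn - X).norm() <= rtol * Xn.norm()) return Xn;
    X = Xn;
  }
  return X;
}

int main() {
  Eigen::MatrixXd A(2, 2);
  A << 4, 1, 1, 3;
  std::cout << inv_newton(A) * A << std::endl; // approximately the identity
}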
Simplified Newton Method: ☞ use the same Jacobian D F(x(k) ) for all/several steps
If D F(x) is not available (e.g. when F(x) is given only as a procedure) we may resort to approximation
by difference quotients:
Numerical Differentiation:   \frac{∂F_i}{∂x_j}(x) ≈ \frac{F_i(x + h\vec{e}_j) − F_i(x)}{h} .
Caution: Roundoff errors wreak havoc for small h → Ex. 1.5.45! Therefore use h ≈ √EPS.
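A sketch of such a difference-quotient Jacobian in Eigen (the function name and the forward-difference variant are assumptions):

#include <Eigen/Dense>
#include <cmath>
#include <limits>

// Approximate Jacobian of F : R^n -> R^n by forward difference quotients
// with h ~ sqrt(EPS), as recommended above.
template <typename Func>
Eigen::MatrixXd numjac(Func &&F, const Eigen::VectorXd &x) {
  const double h = std::sqrt(std::numeric_limits<double>::epsilon());
  const Eigen::VectorXd fx = F(x);
  Eigen::MatrixXd J(fx.size(), x.size());
  for (Eigen::Index j = 0; j < x.size(); ++j) {
    Eigen::VectorXd xp = x;
    xp(j) += h;                   // perturb the j-th coordinate
    J.col(j) = (F(xp) - fx) / h;  // j-th column of the approximate Jacobian
  }
  return J;
}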
Newton iteration (8.4.1) ≙ fixed point iteration (→ Section 8.2) with iteration function

Φ(x) = x − D F(x)^{−1} F(x) .

F(x^∗) = 0 ⇒ D Φ(x^∗) = 0 ,
that is, the derivative (Jacobian) of the iteration function of the Newton fixed point iteration vanishes in
the limit point. Thus from Lemma 8.2.18 we draw the same conclusion as in the scalar case n = 1, cf.
Section 8.3.2.1.
Jacobian (analytic computation):  D F(x) = \begin{bmatrix} ∂_{x_1}F_1(x) & ∂_{x_2}F_1(x) \\ ∂_{x_1}F_2(x) & ∂_{x_2}F_2(x) \end{bmatrix} = \begin{bmatrix} 2x_1 & −4x_2^3 \\ 1 & −3x_2^2 \end{bmatrix} ,
where x^{(k)} = [x_1, x_2]^T.
2. Set x(k+1) = x(k) + ∆x(k) .
11   k = k+1;
12 end
13
14 ld = diff(log(res(:,4))); %
15 rates = ld(2:end)./ld(1:end-1); %
Line 14, Line 15: estimation of order of convergence, see Rem. 8.1.19.
k    x^{(k)}    ε_k := ‖x^∗ − x^{(k)}‖₂    \frac{\log ε_{k+1} − \log ε_k}{\log ε_k − \log ε_{k−1}}
0 [0.7, 0.7] T 4.24e-01
1 [0.87850000000000, 1.064285714285714] T 1.37e-01 1.69
2 [1.01815943274188, 1.00914882463936] T 2.03e-02 2.23
3 [1.00023355916300, 1.00015913936075] T 2.83e-04 2.15
4 [1.00000000583852, 1.00000002726552] T 2.79e-08 1.77
5 [0.999999999999998, 1.000000000000000] T 2.11e-15
6 [ 1, 1] T
☞ (Some) evidence of quadratic convergence, see Rem. 8.1.19.
There is a sophisticated theory about the convergence of Newton’s method. For example one can find the
following theorem in [?, Thm. 4.10], [?, Sect. 2.1]):
If:
(A) D ⊂ R^n open and convex,
(B) F : D → R^n continuously differentiable,
(C) D F(x) regular ∀x ∈ D,
(D) ∃L ≥ 0:  ‖D F(x)^{−1}(D F(x + v) − D F(x))‖₂ ≤ L‖v‖₂  ∀v ∈ R^n, v + x ∈ D, ∀x ∈ D,
(E) ∃x^∗:  F(x^∗) = 0  (existence of a solution in D),
(F) the initial guess x^{(0)} ∈ D satisfies ρ := ‖x^∗ − x^{(0)}‖₂ < \frac{2}{L} ∧ B_ρ(x^∗) ⊂ D ,
then the Newton iteration (8.4.1) satisfies:
(i) x^{(k)} ∈ B_ρ(x^∗) := {y ∈ R^n : ‖y − x^∗‖ < ρ} for all k ∈ N,
(ii) lim_{k→∞} x^{(k)} = x^∗,
(iii) ‖x^{(k+1)} − x^∗‖₂ ≤ \frac{L}{2}‖x^{(k)} − x^∗‖₂^2  (local quadratic convergence).
Usually, it is hardly possible to verify the assumptions of the theorem for a concrete non-linear
system of equations, because neither L nor x ∗ are known.
An abstract discussion of ways to stop iterations for solving F(x) = 0 was presented in Section 8.1.2, with
“ideal termination” (→ § 8.1.24) as ultimate, but unfeasible, goal.
Yet, in 8.4.2 we saw that Newton’s method enjoys (asymptotic) quadratic convergence, which means rapid
decrease of the relative error of the iterates, once we are close to the solution, which is exactly the point,
when we want to STOP. As a consequence, asymptotically, the Newton correction (difference of two
consecutive iterates) yields rather precise information about the size of the error:
‖x^{(k+1)} − x^∗‖ ≪ ‖x^{(k)} − x^∗‖  ⇒  ‖x^{(k)} − x^∗‖ ≈ ‖x^{(k+1)} − x^{(k)}‖ .   (8.4.46)
→ uneconomical: one needless update, because x(k) would already be accurate enough.
Some facts about the Newton method for solving large (n ≫ 1) non-linear systems of equations:
☛ Solving the linear system to compute the Newton correction may be expensive (asymptotic compu-
tational effort O(n3 ) for direct elimination → § 2.3.5) and accounts for the bulk of numerical cost of
a single step of the iteration.
☛ In applications only very few steps of the iteration will be needed to achieve the desired accuracy
due to fast quadratic convergence.
✄ The termination criterion (8.4.47) computes the last Newton correction ∆x^{(k)} needlessly, because x^{(k)} is already accurate enough!
Therefore we would like to use an a-posteriori termination criterion that dispenses with computing (and
“inverting”) another Jacobian D F(x(k) ) just to tell us that x(k) is already accurate enough.
Due to fast asymptotic quadratic convergence, we can expect D F(x(k−1) ) ≈ D F(x(k) ) during the final
steps of the iteration.
Effort: reuse of the LU-factorization (→ Rem. 2.5.10) of D F(x^{(k−1)}) ➤ ∆x̄^{(k)} available with O(n²) operations
C++11 code 8.4.51: Generic Newton iteration with termination criterion (8.4.50)
2  template <typename FuncType, typename JacType, typename VecType>
3  void newton_stc(const FuncType &F, const JacType &DF,
4                  VecType &x, double rtol, double atol)
5  {
6    using scalar_t = typename VecType::Scalar;
7    scalar_t sn;
8    do {
9      auto jacfac = DF(x).lu();        // LU-factorize Jacobian
10     x -= jacfac.solve(F(x));         // Compute next iterate
11     // Compute norm of simplified Newton correction
12     sn = jacfac.solve(F(x)).norm();
13   }
14   // Termination based on simplified Newton correction
15   while ((sn > rtol*x.norm()) && (sn > atol));
16 }
‖F(x^{(k)})‖ ≤ τ ,
then the resulting algorithm would not be affine invariant, because for F(x) = 0 and AF(x) = 0,
A ∈ R n,n regular, the Newton iteration might terminate with different iterates.
converges asymptotically very fast: doubling of number of significant digits in each step
Potentially big problem: Newton method converges quadratically, but only locally , which may render it use-
less, if convergence is guaranteed only for initial guesses very close to exact solution, see also Ex. 8.3.32.
In this section we study a method to enlarge the region of convergence, at the expense of quadratic
convergence, of course.
The dark side of local convergence (→ Def. 8.1.8): for many initial guesses x(0) Newton’s method will not
converge!
F(x) = x e^x − 1 ⇒ F′(−1) = 0:
x^{(0)} < −1 ⇒ x^{(k)} → −∞ ,
x^{(0)} > −1 ⇒ x^{(k)} → x^∗ .

[Figure: graph of x ↦ x e^x − 1]

F(x) = arctan(ax):
[Fig. 299: diverging Newton iteration for F(x) = arctan x, with iterates x^{(k−1)}, x^{(k)}, x^{(k+1)}]
[Fig. 300: initial guesses x^{(0)} vs. parameter a]

In Fig. 300 the red zone = {x^{(0)} ∈ R : x^{(k)} → 0} is the domain of initial guesses for which Newton’s method converges.
If the Newton correction points in the wrong direction (Item ➊), no general remedy is available. If the
Newton correction is too large (Item ➋), there is an effective cure:
With λ(k) > 0: x(k+1) := x(k) − λ(k) D F(x(k) )−1 F(x(k) ) . (8.4.55)
Choice of damping factor: affine invariant natural monotonicity test [?, Ch. 3]:

choose “maximal” 0 < λ^{(k)} ≤ 1 such that  ‖∆x(λ^{(k)})‖₂ ≤ \Big(1 − \frac{λ^{(k)}}{2}\Big)‖∆x^{(k)}‖₂ .   (8.4.57)

✦ When the method converges ⇔ size of Newton correction decreases ⇔ (8.4.57) satisfied.
✦ In the case of strong damping (λ^{(k)} ≪ 1) the size of the Newton correction cannot be expected to shrink significantly, since iterates do not change much ➣ factor (1 − ½λ^{(k)}) in (8.4.57).
Note: LU-factorization of Jacobi matrix D F(x(k) ) is done once per successful iteration step (Line 12 of
the above code) and reused for the computation of the simplified Newton correction in Line 10, Line 14 of
the above M ATLAB code.
Policy: Reduce the damping factor by a factor q ∈ ]0, 1[ (usually q = ½) until the affine invariant natural monotonicity test (8.4.57) is passed, see Line 13 in the above MATLAB code.
C++11 code 8.4.58: Generic damped Newton method based on natural monotonicity test
1  template <typename FuncType, typename JacType, typename VecType>
2  void dampnewton(const FuncType &F, const JacType &DF,
3                  VecType &x, double rtol, double atol)
4  {
5    using index_t = typename VecType::Index;
6    using scalar_t = typename VecType::Scalar;
7    const index_t n = x.size();
8    const scalar_t lmin = 1E-3;         // Minimal damping factor
9    scalar_t lambda = 1.0;              // Initial and actual damping factor
10   VecType s(n), st(n);                // Newton corrections
11   VecType xn(n);                      // Tentative new iterate
12   scalar_t sn, stn;                   // Norms of Newton corrections
13
14   do {
15     auto jacfac = DF(x).lu();         // LU-factorize Jacobian
16     s = jacfac.solve(F(x));           // Newton correction
17     sn = s.norm();                    // Norm of Newton correction
18     lambda *= 2.0;
19     do {
20       lambda /= 2;
21       if (lambda < lmin) throw "No convergence: lambda -> 0";
22       xn = x - lambda*s;              // Tentative next iterate
23       st = jacfac.solve(F(xn));       // Simplified Newton correction
24       stn = st.norm();
25     }
26     while (stn > (1 - lambda/2)*sn);           // Natural monotonicity test
27     x = xn;                                    // Now: xn accepted as new iterate
28     lambda = std::min(2.0*lambda, 1.0);        // Try to mitigate damping
29   }
30   // Termination based on simplified Newton correction
31   while ((stn > rtol*x.norm()) && (stn > atol));
32 }
The arguments for Code 8.4.58 are the same as for Code 8.4.51. As termination criterion it uses (8.4.50). Note that all calls to solve boil down to forward/backward elimination for triangular matrices and incur cost of O(n²) only.
We test the damped Newton method for Item ➋ of Ex. 8.4.54, where excessive Newton corrections made
Newton’s method fail.
F(x) = arctan(x), x^{(0)} = 20, q = ½, LMIN = 0.001.

k    λ^{(k)}    x^{(k)}               F(x^{(k)})
1    0.03125    0.94199967624205      0.75554074974604
2    0.06250    0.85287592931991      0.70616132170387
3    0.12500    0.70039827977515      0.61099321623952
4    0.25000    0.47271811131169      0.44158487422833
5    0.50000    0.20258686348037      0.19988168667351
6    1.00000    -0.00549825489514     -0.00549819949059
7    1.00000    0.00000011081045      0.00000011081045
8    1.00000    -0.00000000000001     -0.00000000000001

We observe that damping is effective and asymptotic quadratic convergence is recovered.
✦ As in Ex. 8.4.54: F(x) = x e^x − 1.

[Figure: graph of x ↦ x e^x − 1]
Supplementary reading. For related expositions refer to [?, Sect. 7.1.4], [?, 2.3.2].
How can we solve F(x) = 0 iteratively, in case D F(x) is not available and numerical differentiation (see
Rem. 8.4.41) is too expensive?
In 1D (n = 1) we can choose among many derivative-free methods that rely on F-evaluations alone, for instance the secant method (8.3.24) from Section 8.3.2.3:
Recall that the secant method converges locally with order p ≈ 1.6 and beats Newton’s method in terms
of efficiency (→ Section 8.3.3).
F′(x^{(k)}) ≈ \frac{F(x^{(k)}) − F(x^{(k−1)})}{x^{(k)} − x^{(k−1)}}  “difference quotient”   (8.4.61)

(already computed! → cheap)
J_k(x^{(k)} − x^{(k−1)}) = F(x^{(k)}) − F(x^{(k−1)}) .   (8.4.62)

Iteration:  x^{(k+1)} := x^{(k)} − J_k^{−1} F(x^{(k)}) .   (8.4.63)
Reasoning: If we assume that J_k is a good approximation of D F(x^{(k)}), then it would be foolish not to use the information contained in J_k for the construction of J_{k+1}.
What can “small modification” mean: Demand that Jk acts like Jk−1 on a complement of the span of
x ( k ) − x ( k − 1) !
To start the iteration we have to initialize J0 , e.g. with the exact Jacobi matrix D F(x(0) ).
in another sense, J_k is closest to J_{k−1} under the constraint of the secant condition (8.4.62):
Let x(k) and Jk be the iterates and matrices, respectively, from Broyden’s method (8.4.66), and let J ∈ R n,n
satisfy the same secant condition (8.4.62) as Jk+1 :
J ( x ( k + 1) − x ( k ) ) = F ( x ( k + 1) ) − F ( x ( k ) ) . (8.4.68)
(I − J_k^{−1} J)(x^{(k+1)} − x^{(k)}) = −J_k^{−1} F(x^{(k)}) − J_k^{−1}\big(F(x^{(k+1)}) − F(x^{(k)})\big) = −J_k^{−1} F(x^{(k+1)}) .   (8.4.69)
Using the submultiplicative property (1.5.77) of the Euclidean matrix norm, we conclude
which we saw in Ex. 1.5.86. This estimate holds for all matrices J satisfying (8.4.68).
We may read this as follows: (8.4.65) gives the ‖·‖₂-minimal relative correction of J_{k−1}, such that the secant condition (8.4.62) holds.
We revisit the 2 × 2 non-linear system of the Exp. 8.4.42 and take x(0) = [0.7, 0.7] T . As starting value for
the matrix iteration we use J0 = D F(x(0) ).
[Fig. 302: Euclidean norms of the iteration errors vs. step of the iteration]
In general, the convergence of any iterative method for non-linear systems of equations can fail, that is, it may stall or even diverge.
Demand on good numerical software: Algorithms should warn users of impending failure. For iterative
methods this is the task of convergence monitors, that is, conditions, cheaply verifiable a posteriori during
the iteration, that indicate stalled convergence or divergence.
For the damped Newton’s method this role can be played by the natural monotonicity test, see Code 8.4.58;
if it fails repeatedly, then the iteration should terminate with an error status.
For Broyden’s quasi-Newton method, a similar strategy can rely on the relative size of the “simplified Broyden correction” J_k^{−1} F(x^{(k+1)}):

Convergence monitor for (8.4.66):  µ := \frac{‖J_{k−1}^{−1} F(x^{(k)})‖}{‖∆x^{(k−1)}‖} < 1 ?   (8.4.72)
We rely on the setting of Exp. 8.4.70. We track
1. the Euclidean norm of the iteration error,
2. and the value of the convergence monitor from (8.4.72).

[Fig. 303: error norm and convergence monitor vs. step of the iteration]

✁ Decay of the (norm of the) iteration error and µ are well correlated.
damped Broyden method (cf. the same idea for Newton’s method, Section 8.4.4):

‖J_k^{−1} F(x^{(k+1)})‖₂ < ‖∆x^{(k)}‖₂ .   (8.4.77)
Iterated application of (8.4.76) pays off if the iteration terminates after only a few steps. For large n ≫ 1 it is not advisable to form the matrices J_k^{−1} (which will usually be dense, in contrast to J_k); instead we employ fast successive multiplications with rank-1-matrices (→ Ex. 1.4.11) to apply J_k^{−1} to a vector. This is implemented in the following code.
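A sketch of one possible realization of this idea: J_0 is LU-factorized once, and the rank-1 Broyden updates (8.4.66) are inverted on the fly with the Sherman-Morrison formula, so that J_k^{−1} is only ever applied to vectors. Function names, the storage layout, and the termination criterion are assumptions.

#include <Eigen/Dense>
#include <vector>

// Broyden quasi-Newton method: J_0 is LU-factorized once; the rank-1 updates
// are inverted via the Sherman-Morrison formula, so J_k^{-1} is never formed.
template <typename FuncType>
Eigen::VectorXd broyden(FuncType &&F, const Eigen::MatrixXd &J0,
                        Eigen::VectorXd x, double rtol, double atol,
                        unsigned int maxit = 30) {
  const Eigen::PartialPivLU<Eigen::MatrixXd> lu(J0);   // factorize J_0 once
  std::vector<Eigen::VectorXd> dx, u;                  // stored update vectors
  std::vector<double> denom;                           // dx^T dx + dx^T u
  // apply the current J_k^{-1} to a vector by successive rank-1 corrections
  auto applyJinv = [&](const Eigen::VectorXd &y) -> Eigen::VectorXd {
    Eigen::VectorXd z = lu.solve(y);
    for (std::size_t j = 0; j < dx.size(); ++j)
      z -= u[j] * (dx[j].dot(z) / denom[j]);
    return z;
  };
  Eigen::VectorXd s = applyJinv(F(x));                 // first correction
  for (unsigned int k = 0; k < maxit; ++k) {
    x -= s;                                            // quasi-Newton step (8.4.63)
    const Eigen::VectorXd w = applyJinv(F(x));         // J_k^{-1} F(x^{(k+1)})
    if (w.norm() <= rtol * x.norm() || w.norm() <= atol) break;
    dx.push_back(-s);                                  // Delta x^{(k)}
    u.push_back(w);
    denom.push_back(dx.back().squaredNorm() + dx.back().dot(w));
    // next correction J_{k+1}^{-1} F(x^{(k+1)}) in closed form
    s = w * (dx.back().squaredNorm() / denom.back());
  }
  return x;
}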
Memory cost for N steps:
✦ LU-factors of J + auxiliary vectors ∈ R^n,
✦ N vectors x^{(k)} ∈ R^n.

F : R^n → R^n ,  F(x) = diag(x)Ax − b ,
b = [1, 2, …, n] ∈ R^n ,  A = I + aa^T ∈ R^{n,n} ,  a = \frac{1}{\sqrt{1·b − 1}}(b − 1) .

Initial guess: h = 2/n; x0 = (2:h:4-h)’;

The results resemble those of Exp. 8.4.70 ✄
[Fig. 304: norms ‖F(x^{(k)})‖ and error norms vs. iteration step, for Broyden’s method, Newton’s method, and the simplified Newton method]
[Fig. 305: number of steps vs. n;  Fig. 306: runtime [s] vs. n — Broyden’s method vs. Newton’s method]
☞ In conclusion,
the Broyden method is worthwhile for dimensions n ≫ 1 and low accuracy requirements.
Learning Outcomes
• Knowledge about concepts related to the speed of convergence of an iteration for solving a non-
linear system of equations.
• Ability to estimate type and orders of convergence from empiric data.
• Ability to predict asymptotic linear, quadratic and cubic convergence by inspection of the iteration
function.
• Familiarity with (damped) Newton’s method for general non-linear systems of equations and with the
secant method in 1D.
• Ability to derive the Newton iteration for an (implicitly) given non-linear system of equations.
• Knowledge about quasi-Newton methods as multi-dimensional generalizations of the secant method.
(x_1^∗, …, x_n^∗) = \operatorname*{argmin}_{x ∈ R^n} \sum_{i=1}^{m} |f(x_1, …, x_n; t_i) − y_i|^2 .   (8.6.2)
Example 8.6.3 (Non-linear data fitting (parametric statistics) → Ex. 8.6.1 revisited)
Given:  F : D ⊂ R^n → R^m , m, n ∈ N, m > n.
Find:  x^∗ ∈ D:  x^∗ = \operatorname*{argmin}_{x ∈ D} Φ(x) ,  Φ(x) := ½‖F(x)‖₂^2 .   (8.6.5)

Terminology: D ≙ parameter space, x_1, …, x_n ≙ parameters.
As in the case of linear least squares problems (→ Section 3.1.1): a non-linear least squares problem is
related to an overdetermined non-linear system of equations F(x) = 0.
As for non-linear systems of equations (→ Chapter 8): existence and uniqueness of x∗ in (8.6.5) has to
be established in each concrete case!
We require “independence for each parameter” (→ Rem. 3.1.27):

∃ neighbourhood U(x^∗) such that D F(x) has full rank n  ∀x ∈ U(x^∗) .   (8.6.6)

(It means: the columns of the Jacobi matrix D F(x) are linearly independent.)
If (8.6.6) is not satisfied, then the parameters are redundant in the sense that fewer parameters would be
enough to model the same dependence (locally at x∗ ), cf. Rem. 3.1.27.
Simple idea: use Newton’s method (→ Section 8.4) to determine a zero of grad Φ : D ⊂ R^n → R^n:

x^{(k+1)} = x^{(k)} − HΦ(x^{(k)})^{−1} grad Φ(x^{(k)}) ,  (HΦ(x) ≙ Hessian matrix) .   (8.6.7)
Recommendation, cf. § 8.4.8: when in doubt, differentiate components of matrices and vectors!
Newton’s method (8.6.7) for (8.6.5) can be read as successive minimization of a local quadratic approximation of Φ:

Φ(x) ≈ Q(s) := Φ(x^{(k)}) + grad Φ(x^{(k)})^T s + \frac{1}{2} s^T HΦ(x^{(k)}) s ,   (8.6.10)

grad Q(s) = 0 ⇔ HΦ(x^{(k)}) s + grad Φ(x^{(k)}) = 0 ⇔ (8.6.8) .
➣ So we deal with yet another model function method (→ Section 8.3.2), with the quadratic model function Q.
Note: This approach is different from local quadratic approximation of Φ underlying Newton’s method for
(8.6.5), see Section 8.6.1, Rem. 8.6.9.
Gauss-Newton iteration (under assumption (8.6.6))
For A ∈ R^{m,n}:  x = A\b  ⟺  x is the minimizer of ‖Ax − b‖₂ with minimal 2-norm.

16    return x;
17  }
Note: Code 8.6.12 also implements Newton’s method (→ Section 8.4.1) in the case m = n!
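A sketch of a Gauss-Newton loop in Eigen, in which the correction is obtained from the linearized least squares problem via a QR decomposition; names, solver choice, and termination criterion are assumptions:

#include <Eigen/Dense>

// Gauss-Newton iteration for min_x 0.5*||F(x)||_2^2 with F: R^n -> R^m, m > n.
template <typename FuncType, typename JacType>
Eigen::VectorXd gauss_newton(FuncType &&F, JacType &&DF, Eigen::VectorXd x,
                             double rtol = 1e-10, double atol = 1e-14,
                             unsigned int maxit = 100) {
  for (unsigned int k = 0; k < maxit; ++k) {
    // correction s = argmin_s || DF(x) s - F(x) ||_2 ; update x <- x - s
    const Eigen::VectorXd s = DF(x).colPivHouseholderQr().solve(F(x));
    x -= s;
    if (s.norm() <= rtol * x.norm() || s.norm() <= atol) break;
  }
  return x;
}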
Summary:
C++-code 8.6.14:
1 #include <Eigen/Dense>
2 using Eigen::VectorXd;
3
[Fig. 307, Fig. 308: value of ‖F(x^{(k)})‖₂ and norm of grad Φ(x^{(k)}) vs. number of steps of the undamped Newton method]
initial value (1.8, 1.8, 0.1)T (red curve) ➤ Newton method caught in local minimum,
initial value (1.5, 1.5, 0.1)T (cyan curve) ➤ fast (locally quadratic) convergence.
[Fig. 309, Fig. 310: value of ‖F(x^{(k)})‖₂ vs. number of steps of the damped Newton method]
initial value (1.8, 1.8, 0.1)T (red curve) ➤ fast (locally quadratic) convergence,
initial value (1.5, 1.5, 0.1)T (cyan curve) ➤ Newton method caught in local minimum.
Second experiment: iterative solution of non-linear least squares data fitting problem by means of the
Gauss-Newton method (8.6.11), see Code 8.6.12.
[Fig. 311, Fig. 312: norm of the corrector and value of ‖F(x^{(k)})‖₂ vs. number of steps of the Gauss-Newton method]
We observe: linear convergence for all initial values, cf. Def. 8.1.9, Rem. 8.1.13.
As in the case of Newton’s method for non-linear systems of equations, see Section 8.4.4: often over-
shooting of Gauss-Newton corrections occurs.
λ = γ‖F(x^{(k)})‖₂ ,   γ := \begin{cases} 10 , & \text{if } ‖F(x^{(k)})‖₂ ≥ 10 , \\ 1 , & \text{if } 1 < ‖F(x^{(k)})‖₂ < 10 , \\ 0.01 , & \text{if } ‖F(x^{(k)})‖₂ ≤ 1 . \end{cases}
Chapter 9
Eigenvalues
Supplementary reading. [?] offers a comprehensive presentation of numerical methods for the solution of eigenvalue problems.
Simple electric circuit, cf. Ex. 2.1.3 ✄ [Figure: circuit with nodes ➀, ➁, ➂ and elements R, L, C]
Ex. 2.1.3: nodal analysis of linear (↔ composed of resistors, inductors, capacitors) electric circuit in fre-
quency domain (at angular frequency ω > 0) , see (2.1.6)
➣ linear system of equations for nodal potentials with complex system matrix A
For circuit of Code 9.0.3: three unknown nodal potentials
[Fig. 314: maximum nodal potentials |u₁|, |u₂|, |u₃| vs. angular frequency ω of the source voltage U, for R = 1, C = 1, L = 1]

Blow-up of some nodal potentials for certain ω!
5  Z = 1/R; K = 1/L;
6
20 figure('name','resonant circuit');
21 plot(res(:,1),res(:,2),'r-',res(:,1),res(:,3),'m-',res(:,1),res(:,4),'b-');
22 xlabel('{\bf angular frequency \omega of source voltage U}','fontsize',14);
23 ylabel('{\bf maximum nodal potential}','fontsize',14);
24 title(sprintf('R = %d, C= %d, L= %d',R,L,C));
25 legend('|u_1|','|u_2|','|u_3|');
26
27 print -depsc2 '../PICTURES/rescircpot.eps'
28
37 figure('name','resonances');
38 plot(real(omega),imag(omega),'r*'); hold on;
39 ax = axis;
40 plot([ax(1) ax(2)],[0 0],'k-');
41 plot([0 0],[ax(3) ax(4)],'k-');
42 grid on;
43 xlabel('{\bf Re(\omega)}','fontsize',14);
44 ylabel('{\bf Im(\omega)}','fontsize',14);
45 title(sprintf('R = %d, C= %d, L= %d',R,L,C));
46 legend('\omega');
47
48 print -depsc2 '../PICTURES/rescircomega.eps'
resonant frequencies ≙ ω ∈ {ω ∈ R : A(ω) singular}
If the circuit is operated at a real resonant frequency, the circuit equations will not possess a solution. Of
course, the real circuit will always behave in a well-defined way, but the linear model will break down due
to extremely large currents and voltages. In an experiment this breakdown manifests itself as a rather
explosive meltdown of circuit components. Hence, it is vital to determine the resonant frequencies of circuits in order to avoid their destruction.
A(ω)x = \Big(W + ıωC + \frac{1}{ıω}S\Big)x = 0 .   (9.0.4)
Substitution: y = \frac{1}{ıω}x ↔ x = ıωy [?, Sect. 3.4]:

(9.0.4) ⇔ \underbrace{\begin{bmatrix} W & S \\ I & 0 \end{bmatrix}}_{:=M} \underbrace{\begin{bmatrix} x \\ y \end{bmatrix}}_{:=z} = ω \underbrace{\begin{bmatrix} −ıC & 0 \\ 0 & −ıI \end{bmatrix}}_{:=B} \begin{bmatrix} x \\ y \end{bmatrix}   (9.0.5)
➣ generalized linear eigenvalue problem of the form: find ω ∈ C, z ∈ C2n \ {0} such that
Mz = ωBz . (9.0.6)
In this example one is mainly interested in the eigenvalues ω , whereas the eigenvectors z usually need
not be computed.
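Numerically, a generalized eigenvalue problem of the form (9.0.6) can be handed to MATLAB's two-argument eig. The following minimal sketch only illustrates the call; the block matrices are filled with made-up placeholder data, not with the circuit matrices of Code 9.0.3.

n = 3;                                          % number of nodal potentials (toy size)
W = gallery('moler',n); S = gallery('minij',n); % hypothetical stand-ins for the circuit data
C = eye(n);
M = [W, S; eye(n), zeros(n)];                   % block matrix M from (9.0.5)
B = [-1i*C, zeros(n); zeros(n), -1i*eye(n)];    % block matrix B from (9.0.5)
omega = eig(M,B);                               % generalized eigenvalues M z = omega B z, cf. (9.0.6)
disp(omega)                                     % candidate resonant frequencies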
[Fig. 315: resonant frequencies ω in the complex plane (Re(ω) versus Im(ω)) for the circuit from Code 9.0.3, R = 1, C = 1, L = 1]
ẏ = Ay ,  A ∈ C^{n,n} .   (9.0.8)
A = S \begin{pmatrix} λ_1 & & \\ & \ddots & \\ & & λ_n \end{pmatrix} S^{-1} =: S D S^{-1} ,  S ∈ C^{n,n} regular   =⇒   ẏ = Ay  ←→ (z = S^{-1} y)  ż = Dz .
In order to find the transformation matrix S all non-zero solution vectors (= eigenvectors) x ∈ C^n of the linear eigenvalue problem
Ax = λx
have to be found.
Supplementary reading. [?, Ch. 7], [?, Ch. 9], [?, Sect. 1.7]
Definition 9.1.1. Eigenvalues and eigenvectors → [?, Sects. 7.1,7.2], [?, Sect. 9.1]
• λ ∈ C eigenvalue (ger.: Eigenwert) of A ∈ K^{n,n} :⇔ det(λI − A) = 0  (characteristic polynomial χ(λ))
• spectrum of A ∈ K^{n,n}:  σ(A) := {λ ∈ C : λ eigenvalue of A}
• eigenspace (ger.: Eigenraum) associated with eigenvalue λ ∈ σ(A):  EigAλ := N(λI − A)
• x ∈ EigAλ \ {0} ⇒ x is an eigenvector
• geometric multiplicity (ger.: Vielfachheit) of an eigenvalue λ ∈ σ(A):  m(λ) := dim EigAλ
For any matrix norm ‖·‖ induced by a vector norm (→ Def. 1.5.76)
ρ(A) ≤ ‖A‖ .
Proof. Let z ∈ C^n \ {0} be an eigenvector to the largest (in modulus) eigenvalue λ of A ∈ C^{n,n}. Then
‖A‖ := sup_{x ∈ C^n \ {0}} ‖Ax‖/‖x‖ ≥ ‖Az‖/‖z‖ = |λ| = ρ(A) .
✷
Lemma 9.1.5. Gershgorin circle theorem → [?, Thm. 7.13], [?, Thm. 32.1], [?, Sect. 5.1]
For any A ∈ K^{n,n} holds true
σ(A) ⊂ ⋃_{j=1}^{n} { z ∈ C : |z − a_jj| ≤ ∑_{i≠j} |a_ji| } .
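The Gershgorin discs can be checked with a few lines of MATLAB; the test matrix below is an arbitrary example chosen purely for illustration.

A = [4 1 0; -1 6 2; 0 1 -3];                % arbitrary (hypothetical) test matrix
c = diag(A);                                 % disc centers a_jj
r = sum(abs(A - diag(c)), 2);                % disc radii: off-diagonal row sums, cf. Lemma 9.1.5
lambda = eig(A);
D = abs(bsxfun(@minus, lambda.', c));        % D(j,k) = |lambda_k - a_jj|
inDisc = any(bsxfun(@le, D, r), 1);          % is eigenvalue k contained in some disc?
disp(inDisc)                                 % must be all ones by the Gershgorin circle theorem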
Lemma 9.1.6. Similarity and spectrum → [?, Thm. 9.7], [?, Lemma 7.6], [?, Thm. 7.2]
The spectrum of a matrix is invariant with respect to similarity transformations: σ(S^{-1}AS) = σ(A) for every regular S ∈ K^{n,n}.
Lemma 9.1.7.
Existence of a one-dimensional invariant subspace
Examples of normal matrices are
• Hermitian matrices: A^H = A ➤ σ(A) ⊂ R
• unitary matrices: A^H = A^{−1} ➤ |σ(A)| = 1
• skew-Hermitian matrices: A = −A^H ➤ σ(A) ⊂ iR
➤ Normal matrices can be diagonalized by unitary similarity transformations
Eigenvalue problems (EVPs):
➊ Given A ∈ K^{n,n}, find all eigenvalues (= spectrum of A).
➋ Given A ∈ K^{n,n}, find σ(A) plus all eigenvectors.
➌ Given A ∈ K^{n,n}, find a few eigenvalues and associated eigenvectors.
x =̂ generalized eigenvector, λ =̂ generalized eigenvalue
Ax = λBx ⇔ B^{−1}Ax = λx .
However, usually it is not advisable to use this equivalence for numerical purposes!
Purpose: solution of eigenvalue problems ➊, ➋ for dense matrices “up to machine precision”
MATLAB-function: eig
Remark 9.2.1 (QR-Algorithm → [?, Sect. 7.5], [?, Sect. 10.3], [?, Ch. 26], [?, Sect. 5.5-5.7]): the QR-algorithm is the basis of numerically stable (→ Def. 1.5.85) eigensolvers for dense matrices.
(➞ =̂ affected rows/columns,  =̂ targeted vector)
[Diagram: sparsity patterns of the matrix during successive similarity transformations]
➣ transformation to tridiagonal form! (for general matrices a similar strategy can achieve a similarity transformation to upper Hessenberg form)
if D = diag(d1 , . . . , dn ).
MATLAB-code 9.2.6:
1 A = rand(500,500); B = A'*A; C = gallery('tridiag',500,1,3,1);
10 t3 = 1000; for k=1:3, tic; d = eig(Bn); t3 = min(t3, toc); end
11 t4 = 1000; for k=1:3, tic; [V,D] = eig(Bn); t4 = min(t4, toc); end
12 t5 = 1000; for k=1:3, tic; d = eig(Cn); t5 = min(t5, toc); end
13 times = [times; n t1 t2 t3 t4 t5];
14 end
15
16 figure;
17 loglog(times(:,1),times(:,2),'r+', times(:,1),times(:,3),'m*',...
18   times(:,1),times(:,4),'cp', times(:,1),times(:,5),'b^',...
19   times(:,1),times(:,6),'k.');
20 xlabel('{\bf matrix size n}','fontsize',14);
21 ylabel('{\bf time [s]}','fontsize',14);
22 title('eig runtimes');
23 legend('d = eig(A)','[V,D] = eig(A)','d = eig(B)','[V,D] = eig(B)','d = eig(C)',...
24   'location','northwest');
25
26 print -depsc2 '../PICTURES/eigtimingall.eps'
27
28 figure;
29 loglog(times(:,1),times(:,2),'r+', times(:,1),times(:,3),'m*',...
30   times(:,1),(times(:,1).^3)/(times(1,1)^3)*times(1,2),'k-');
31 xlabel('{\bf matrix size n}','fontsize',14);
32 ylabel('{\bf time [s]}','fontsize',14);
33 title('nxn random matrix');
34 legend('d = eig(A)','[V,D] = eig(A)','O(n^3)','location','northwest');
35
36 print -depsc2 '../PICTURES/eigtimingA.eps'
37
38 figure;
39 loglog(times(:,1),times(:,4),'r+', times(:,1),times(:,5),'m*',...
40   times(:,1),(times(:,1).^3)/(times(1,1)^3)*times(1,2),'k-');
41 xlabel('{\bf matrix size n}','fontsize',14);
42 ylabel('{\bf time [s]}','fontsize',14);
43 title('nxn random Hermitian matrix');
44 legend('d = eig(A)','[V,D] = eig(A)','O(n^3)','location','northwest');
45
46 print -depsc2 '../PICTURES/eigtimingB.eps'
47
48 figure;
49 loglog(times(:,1),times(:,6),'r*',...
50   times(:,1),(times(:,1).^2)/(times(1,1)^2)*times(1,2),'k-');
51 xlabel('{\bf matrix size n}','fontsize',14);
52 ylabel('{\bf time [s]}','fontsize',14);
53 title('nxn tridiagonal Hermitian matrix');
54 legend('d = eig(A)','O(n^2)','location','northwest');
55
56 print -depsc2 '../PICTURES/eigtimingC.eps'
[Fig. 316–319: measured runtimes (time [s] versus matrix size n, doubly logarithmic scale) for the different eig calls from Code 9.2.6]
☛ For the sake of efficiency: think which information you really need when computing eigenvalues/eigen-
vectors of dense matrices
Potentially more efficient methods for sparse matrices will be introduced below in Section 9.3, 9.4.
Supplementary reading. [?, Sect. 7.5], [?, Sect. 5.3.1], [?, Sect. 5.3]
Model: Random surfer visits a web page, stays there for fixed time ∆t, and then
➊ either follows each of ℓ links on a page with probabilty 1/ℓ.
➋ or resumes surfing at a randomly (with equal probability) selected page
Option ➋ is chosen with probability d, 0 ≤ d ≤ 1, option ➊ with probability 1 − d.
This number ∈ ]0, 1[ can be used to gauge the “importance” of a web page, which, in turn, offers a way to sort the hits resulting from a keyword query: the GOOGLE idea.
(G)ij = 1 ⇒ link j → i ,
[Fig. 320, Fig. 321: page rank versus page number for the harvard500 data set]
Observation: relative visit times stabilize as the number of hops in the stochastic simulation → ∞.
The limit distribution is called stationary distribution/invariant measure of the Markov chain. This is what
we seek.
✦ Numbering of pages 1, . . . , N,  ℓ_i =̂ number of links from page i
✦ N × N-matrix of transition probabilities page j → page i:  A = (a_ij)_{i,j=1}^{N} ∈ R^{N,N},
  a_ij ∈ [0, 1] =̂ probability to jump from page j to page i.
⇒  ∑_{i=1}^{N} a_ij = 1 .   (9.3.3)
A matrix A ∈ [0, 1]^{N,N} with the property (9.3.3) is called a (column) stochastic matrix.
“Meaning” of A: given x ∈ [0, 1]^N, ‖x‖_1 = 1, where x_i is the probability of the surfer to visit page i, i = 1, . . . , N, at an instance t in time, y = Ax satisfies
y_j ≥ 0 ,   ∑_{j=1}^{N} y_j = ∑_{j=1}^{N} ∑_{i=1}^{N} a_ji x_i = ∑_{i=1}^{N} x_i ∑_{j=1}^{N} a_ji = ∑_{i=1}^{N} x_i = 1 .
y_j =̂ probability for visiting page j at time t + ∆t.
Thought experiment: Instead of a single random surfer we may consider m ∈ N, m ≫ 1, of them who
visit pages independently. The fraction of time m · T they all together spend on page i will obviously be
the same for T → ∞ as that for a single random surfer.
Instead of counting the surfers we watch the proportions of them visiting particular web pages at an instance of time. Thus, after the k-th hop we can assign a number x_i^(k) ∈ [0, 1] to web page i, which gives the proportion of surfers currently on that page: x_i^(k) := n_i^(k)/m, where n_i^(k) ∈ N_0 designates the number of surfers on page i after the k-th hop.
Now consider m → ∞. The law of large numbers suggests that the (“infinitely many”) surfers visiting page j will move on to other pages proportionally to the transition probabilities a_ij: in terms of proportions, for m → ∞ the stochastic evolution becomes a deterministic discrete dynamical system and we find
x_i^(k+1) = ∑_{j=1}^{N} a_ij x_j^(k) ,   (9.3.6)
that is, the proportion of surfers ending up on page i equals the sum of the proportions on the “source pages” weighted with the transition probabilities.
Notice that (9.3.6) amounts to a matrix×vector product. Thus, writing x^(0) ∈ [0, 1]^N, ‖x^(0)‖_1 = 1, for the initial distribution of the surfers on the net, we find that
x^(k) = A^k x^(0)
will be their mass distribution after k hops. If the limit exists, the i-th component of x* := lim_{k→∞} x^(k) tells us which fraction of the (infinitely many) surfers will be visiting page i most of the time. Thus, x* yields the stationary distribution of the Markov chain.
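The iteration (9.3.6) is easy to try out; the sketch below uses a small random column-stochastic matrix instead of the harvard500 data set and the helper prbuildA used in the lecture codes.

N = 50;
A = rand(N,N);                      % non-negative random entries
A = A*diag(1./sum(A,1));            % scale columns: A is column-stochastic, cf. (9.3.3)
x = ones(N,1)/N;                    % uniform initial distribution, ||x||_1 = 1
for k = 1:100
    x = A*x;                        % one hop: x^(k) = A x^(k-1), see (9.3.6)
end
% x now approximates the stationary distribution (eigenvector of A for eigenvalue 1)
fprintf('||A*x - x||_1 = %g, sum(x) = %g\n', norm(A*x-x,1), sum(x));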
6 load harvard500.mat; A = prbuildA(G,d);
7 N = size(A,1); x = ones(N,1)/N;
8
[Fig. 322, Fig. 323: page rank vectors after step 5 and step 15 of the iteration (9.3.6) for harvard500]
Comparison:
[Fig. 324, Fig. 325: page rank from 1 000 000 hops of the stochastic simulation versus the page rank vector after step 5 of (9.3.6), harvard500]
For r ∈ EigA1, that is, Ar = r, denote by |r| the vector (|r_i|)_{i=1}^{N}. Since all entries of A are non-negative, we conclude by the triangle inequality that ‖Ar‖_1 ≤ ‖A|r|‖_1,
⇒ 1 = ‖A‖_1 = sup_{x ∈ R^N \ {0}} ‖Ax‖_1/‖x‖_1 ≥ ‖A|r|‖_1/‖|r|‖_1 ≥ ‖Ar‖_1/‖r‖_1 = 1 .
⇒ ‖A|r|‖_1 = ‖Ar‖_1  ⇒ (if a_ij > 0)  |r| = ±r .
Hence, different components of r cannot have opposite sign, which means that r can be chosen to have non-negative entries, if the entries of A are strictly positive, which is the case for A from (9.3.4). After normalization ‖r‖_1 = 1 the eigenvector can be regarded as a probability distribution on {1, . . . , N}.
[Fig. 326, Fig. 327: page rank from the stochastic simulation versus the entries of the eigenvector r of A for the eigenvalue 1, harvard500]
The possibility to compute the stationary probability distribution of a Markov chain through an eigenvector
of the transition probability matrix is due to a property of stationary Markov chains called ergodicity.
Errors: [Fig. 328: ‖A^k x_0 − r‖_1 versus iteration step k (semi-logarithmic plot)]
The computation of page rank amounts to finding the eigenvector of the matrix A of transition probabilities
that belongs to its largest eigenvalue 1. This is addressed by an important class of practical eigenvalue
problems:
Try the above iteration for a general 10 × 10-matrix with largest eigenvalue 10 of algebraic multiplicity 1.
MATLAB-code 9.3.11:
d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0) + ones(n,n));
A = S*diag(d,0)*inv(S);
[Fig.: error norm ‖ z^(k)/‖z^(k)‖ − (S)_{:,10} ‖ versus iteration step (semi-logarithmic plot)]
(Note: (S)_{:,10} =̂ eigenvector for eigenvalue 10)
Suggests direct power method (ger.: Potenzmethode): iterative method (→ Section 8.1)
Note: the “normalization” of the iterates in (9.3.12) does not change anything (in exact arithmetic) and
helps avoid overflow in floating point arithmetic.
Due to (9.3.13), for large k ≫ 1 (⇒ |λ_n^k| ≫ |λ_j^k| for j ≠ n) the contribution of v_n (size ζ_n λ_n^k) in the eigenvector expansion (9.3.15) will be much larger than the contribution (size ζ_j λ_j^k) of any other eigenvector (if ζ_n ≠ 0): the eigenvector for λ_n will swamp all others for k → ∞.
Further, (9.3.15) nurtures the expectation: v_n will become dominant in z^(k) the faster, the better |λ_n| is separated from |λ_{n−1}|, see Thm. 9.3.21 for a rigorous statement.
When (9.3.12) has converged, there are two common ways to recover λ_max → [?, Alg. 7.20]:
➊  Az^(k) ≈ λ_max z^(k)  ➣  |λ_n| ≈ ‖Az^(k)‖ / ‖z^(k)‖  (modulus only!)
➋  λ_max ≈ argmin_{θ∈R} ‖Az^(k) − θz^(k)‖_2^2  ➤  λ_max ≈ ( (z^(k))^H A z^(k) ) / ‖z^(k)‖_2^2 .
This latter formula is extremely useful, which has earned it a special name:
Definition 9.3.16.
For A ∈ K^{n,n}, u ∈ K^n, the Rayleigh quotient is defined by
ρ_A(u) := (u^H A u) / (u^H u) .
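A minimal sketch of the direct power iteration (9.3.12) combined with the Rayleigh quotient as eigenvalue estimate; the synthetic test matrix mimics the construction of Code 9.3.11, everything else is chosen for illustration only.

d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0) + ones(n,n));
A = S*diag(d,0)/S;                    % synthetic matrix with spectrum {1,...,10}
z = rand(n,1); z = z/norm(z);         % random initial guess
for k = 1:50
    w = A*z;                          % power step
    z = w/norm(w);                    % normalization avoids overflow, cf. (9.3.12)
    lmax = (z'*A*z)/(z'*z);           % Rayleigh quotient, Def. 9.3.16
end
fprintf('approximate largest eigenvalue: %f\n', lmax);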
MATLAB-code 9.3.19:
d = (1:10)';
n = length(d); S = triu(diag(n:-1:1,0)+ones(n,n)); A = S*diag(d,0)*inv(S);
[Fig. 330: errors versus iteration step k, z^(0) = random vector;
 o : error |λ_n − ρ_A(z^(k))|,  ∗ : error norm ‖z^(k) − s_{·,n}‖,  + : |λ_n − ‖Az^(k−1)‖_2/‖z^(k−1)‖_2| ]
Test matrices:
① d=(1:10)’; ➣ |λn−1 | : |λn | = 0.9
② d = [ones(9,1); 2]; ➣ |λn−1 | : |λn | = 0.5
③ d = 1-2.^(-(1:0.5:5)’); ➣ |λn−1 | : |λn | = 0.9866
17 res = [];
18
ρ_EV^(k) := ‖z^(k) − s_{·,n}‖ / ‖z^(k−1) − s_{·,n}‖ ,   ρ_EW^(k) := |ρ_A(z^(k)) − λ_n| / |ρ_A(z^(k−1)) − λ_n| .

          ①                      ②                      ③
 k    ρ_EV^(k)  ρ_EW^(k)    ρ_EV^(k)  ρ_EW^(k)    ρ_EV^(k)  ρ_EW^(k)
 22   0.9102    0.9007      0.5000    0.5000      0.9900    0.9781
 23   0.9092    0.9004      0.5000    0.5000      0.9900    0.9791
 24   0.9083    0.9001      0.5000    0.5000      0.9901    0.9800
 25   0.9075    0.9000      0.5000    0.5000      0.9901    0.9809
 26   0.9068    0.8998      0.5000    0.5000      0.9901    0.9817
 27   0.9061    0.8997      0.5000    0.5000      0.9901    0.9825
 28   0.9055    0.8997      0.5000    0.5000      0.9901    0.9832
 29   0.9049    0.8996      0.5000    0.5000      0.9901    0.9839
 30   0.9045    0.8996      0.5000    0.5000      0.9901    0.9844
Observation: linear convergence (→ Def. 8.1.9):
‖Az^(k)‖_2 → λ_n ,  z^(k) → ±v  linearly with rate |λ_{n−1}| / |λ_n| ,
where z^(k) are the iterates of the direct power iteration and y^H z^(0) ≠ 0 is assumed. (→ Section 8.1.2)
More general segmentation problem (non-local): identify parts of the image, not necessarily connected,
with the same texture.
Local similarity matrix:
W ∈ R^{N,N} ,  N := mn ,   (9.3.26)
(W)_ij = { 0 , if pixels i, j not adjacent ;  0 , if i = j ;  σ(p_i, p_j) , if pixels i, j adjacent } .
Similarity function, e.g., with α > 0:  σ(x, y) := exp(−α(x − y)^2) ,  x, y ∈ R .
[Fig. 331: m × n pixel grid with lexicographic numbering 1, 2, 3, . . . , n; n+1, n+2, . . . , 2n; . . . ; (m−1)n+1, . . . , mn; ↔ =̂ adjacent pixels]
The entries of the matrix W measure the “similarity” of neighboring pixels: if (W)_ij is large, they encode (almost) the same intensity; if (W)_ij is close to zero, then they belong to parts of the picture with very different brightness. In the latter case, the boundary of the segment may separate the two pixels.
Ncut(X) := cut(X)/weight(X) + cut(X)/weight(V \ X) ,
with cut(X) := ∑_{i∈X, j∉X} w_ij ,  weight(X) := ∑_{i∈X, j∈X} w_ij .
[Fig. 332–335: pixel plots of the test image and of Ncut values for the pixel subsets considered]
△ Ncut(X ) for pixel subsets X defined by sliding rectangles, see Fig. 333.
Equivalent reformulation:
indicator function:  z : {1, . . . , N} → {−1, 1} ,  z_i := z(i) = { 1 , if i ∈ X ;  −1 , if i ∉ X } .   (9.3.29)
Ncut(X) = ( ∑_{z_i>0, z_j<0} −w_ij z_i z_j ) / ( ∑_{z_i>0} d_i ) + ( ∑_{z_i>0, z_j<0} −w_ij z_i z_j ) / ( ∑_{z_i<0} d_i ) ,   (9.3.30)
d_i = ∑_{j∈V} w_ij = weight({i}) .   (9.3.31)
Sparse matrices:
Ncut(X) = (y^⊤ A y) / (y^⊤ D y) ,   y := (1 + z) − β(1 − z) ,   β := ( ∑_{z_i>0} d_i ) / ( ∑_{z_i<0} d_i ) ,
(1 + z)^⊤ D (1 − z) = 0 ,
4 Ncut(X) = (1 + z)^⊤ A (1 + z) ( 1/(κ 1^⊤ D 1) + 1/((1 − κ) 1^⊤ D 1) ) = (y^⊤ A y) / (β 1^⊤ D 1) ,
where κ := ( ∑_{z_i>0} d_i ) / ( ∑_i d_i ) = β/(1 + β). Also observe
✦ (9.3.33) ⇒ 1 ∈ EigA0
✦ Lemma 2.8.12: A diagonally dominant =⇒ A is positive semidefinite (→ Def. 1.1.8)
Ncut(X ) ≥ 0 and 0 is the smallest eigenvalue of A.
However, we are by no means interested in a minimizer y ∈ Span{1} (with constant entries) that does
not provide a meaningful segmentation.
y ⊥ D1 ⇔ 1⊤ Dy = 0 . (9.3.36)
➣ Minimizing Ncut(X) amounts to minimizing a (generalized) Rayleigh quotient (→ Def. 9.3.16) over a discrete set of vectors, which is still an NP-hard problem.
Idea: Relaxation
✎ ☞
Task: (9.3.38) ⇔ find minimizer of a (generalized) Rayleigh quotient under a linear constraint
✍ ✌
Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of all (real!) eigenvalues of A = A^H ∈ C^{n,n}. Then
min_{y≠0} ρ_A(y) = λ_1  and  max_{y≠0} ρ_A(y) = λ_m .
Thm. 9.3.39 is a an immediate consequence of the following more general and fundamentally important
result.
Theorem 9.3.41. Courant-Fischer min-max theorem → [?, Thm. 8.1.2]
Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of the (real!) eigenvalues of A = A^H ∈ C^{n,n}. Write
U_0 := {0} ,  U_ℓ := ∑_{j=1}^{ℓ} EigAλ_j ,  ℓ = 1, . . . , m ,  and  U_ℓ^⊥ := { x ∈ C^n : u^H x = 0 ∀u ∈ U_ℓ } .
Then
λ_ℓ = min { ρ_A(x) : x ∈ U_{ℓ−1}^⊥ \ {0} } ,  ℓ = 1, . . . , m .
Proof. For diagonal A ∈ R n,n the assertion of the theorem is obvious. Thus, Cor. 9.1.9 settles everything.
Well, in Lemma 9.3.35 we encounter a generalized Rayleigh quotient ρ_{A,D}(y)! How can Thm. 9.3.39 be applied to it?
Transformation idea:  ρ_{A,D}(D^{−1/2} z) = ρ_{D^{−1/2} A D^{−1/2}}(z) ,  z ∈ R^N .   (9.3.43)
Apply Thm. 9.3.41 to the transformed matrix Ã := D^{−1/2} A D^{−1/2}. Elementary manipulations show
(9.3.38) ⇔ argmin_{1^⊤ D y = 0} ρ_{A,D}(y)  = (z = D^{1/2} y) =  argmin_{1^⊤ D^{1/2} z = 0} ρ_{A,D}(D^{−1/2} z) = argmin_{1^⊤ D^{1/2} z = 0} ρ_{Ã}(z)  with  Ã := D^{−1/2} A D^{−1/2} .   (9.3.44)
Related: transformation of a generalized eigenvalue problem into a standard eigenvalue problem according to
Ax = λBx  = (z = B^{1/2} x) ⇒  B^{−1/2} A B^{−1/2} z = λz .   (9.3.45)
B^{1/2} =̂ square root of the s.p.d. matrix B → Rem. 10.3.2.
For the segmentation problem: B = D is diagonal with positive diagonal entries, see (9.3.32),
➥ D^{−1/2} = diag(d_1^{−1/2}, . . . , d_N^{−1/2}) and Ã := D^{−1/2} A D^{−1/2} can easily be computed.
z* = argmin_{1^⊤ D^{1/2} z = 0} ρ_{Ã}(z)   ←→   Ãz = λz .   (9.3.46)
How to deal with the constraint 1^⊤ D^{1/2} z = 0 ?
Idea: Penalization — add a term P(z) to ρ_{Ã}(z) that becomes “sufficiently large” in case the constraint is violated.
How to choose the penalty function P(z) for the segmentation problem?
z* = argmin_{z∈R^N\{0}} ( ρ_{Ã}(z) + P(z) ) = argmin_{z∈R^N\{0}} ( ρ_{Ã}(z) + (z^⊤ (D^{1/2} 1 1^⊤ D^{1/2}) z)/(z^⊤ z) )   (9.3.47)
   = argmin_{z∈R^N\{0}} ρ_{Â}(z)  with  Â := Ã + D^{1/2} 1 1^⊤ D^{1/2} .
Cor. 9.1.9 ➤ The orthogonal complement of an eigenvector of a symmetric matrix is spanned by the other eigenvectors (orthonormalization of eigenvectors belonging to the same eigenvalue is assumed).
Note: this eigenvector z* will be orthogonal to D^{1/2}1, it satisfies the constraint, and, thus, P(z*) = 0!
Note: the eigenspaces of Ã and Â agree.
Note: Lemma 2.8.12 =⇒ Ã is positive semidefinite (→ Def. 1.1.8) with smallest eigenvalue 0 and D^{1/2}1 ∈ EigÃ0. Scaling the penalty term with µ = ‖Ã‖_∞ = 2 (9.3.49), cf. (1.5.79), lifts this eigenvalue, so that D^{1/2}1 is guaranteed not to be an eigenvector belonging to the smallest eigenvalue of Â.
z* = argmin_{1^⊤ D^{1/2} z = 0} ρ_{Ã}(z) = argmin_{z≠0} ρ_{Â}(z) .   (9.3.50)
By Thm. 9.3.39:
z* = eigenvector belonging to the minimal eigenvalue of Â
⇕
z* = eigenvector ⊥ D^{1/2}1 belonging to the minimal eigenvalue of Ã
⇕
D^{−1/2} z* = minimizer for (9.3.38).
➊ Given the similarity function σ compute the (sparse!) matrices W, D, A ∈ R^{N,N}, see (9.3.26), (9.3.32).
➋ Compute y*, ‖y*‖_2 = 1, as eigenvector belonging to the smallest eigenvalue of Â := D^{−1/2} A D^{−1/2} + 2 (D^{1/2}1)(D^{1/2}1)^⊤; set x* := D^{−1/2} y* and define
X := { i ∈ {1, . . . , N} : x_i* > (1/N) ∑_{i=1}^{N} x_i* } .   (9.3.52)
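A compact sketch of steps ➊–➋ for a tiny synthetic gray-scale image. The image, the value of α, and the choice A := D − W (suggested by (9.3.33) and Lemma 2.8.12, but not spelled out in this excerpt) are assumptions made for illustration; a dense eig call is used, which is exactly the inefficiency criticized for Code 9.3.53 below.

m = 8; n = 8; N = m*n; alpha = 10;
P = [zeros(m,n/2), ones(m,n/2)] + 0.05*rand(m,n);   % toy image with two regions (hypothetical)
idx = reshape(1:N, m, n);                           % pixel numbering on the grid
sigma = @(x,y) exp(-alpha*(x-y).^2);                % similarity function
W = sparse(N,N);
for j = 1:n, for i = 1:m                            % connect each pixel to its right/lower neighbour
    if i < m, W(idx(i,j),idx(i+1,j)) = sigma(P(i,j),P(i+1,j)); end
    if j < n, W(idx(i,j),idx(i,j+1)) = sigma(P(i,j),P(i,j+1)); end
end, end
W = W + W';                                         % symmetric similarity matrix (9.3.26)
d = full(sum(W,2)); A = diag(d) - W;                % degrees and (assumed) matrix A = D - W
Ah = diag(1./sqrt(d))*A*diag(1./sqrt(d)) + 2*(sqrt(d)*sqrt(d)');  % matrix from step (2)
[V,E] = eig(full(Ah)); [~,k] = min(diag(E));        % eigenvector for the smallest eigenvalue
x = V(:,k)./sqrt(d);                                % undo the D^{1/2} scaling
X = find(x > mean(x));                              % segment according to (9.3.52) (up to the sign of x)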
[Fig. 336, Fig. 337: segmented test images on the pixel grid]
[Fig. 338: the eigenvector x* for the image from Fig. 336 plotted on the pixel grid]
To identify more segments, the same algorithm is recursively applied to segment parts of the image
already determined.
Practical segmentation algorithms rely on many more steps, of which the above algorithm is only one, preceded by substantial preprocessing. Moreover, they dispense with the strictly local perspective adopted above and take into account more distant connections between image parts, often in a randomized fashion [?].
The image segmentation problem falls into the wider class of graph partitioning problems. Methods based on (a few of) the eigenvectors of the connectivity matrix belonging to the smallest eigenvalues are known as spectral partitioning methods. The eigenvector belonging to the smallest non-zero eigenvalue that we computed above is usually called the Fiedler vector of the graph, see [?, ?].
The solution of the image segmentation problem by means of eig in Code 9.3.53 amounts to a tremendous waste of computational resources: we compute all eigenvalues/eigenvectors of dense matrices, though only a single eigenvector associated with the smallest eigenvalue is of interest.
This motivates the quest to find efficient numerical methods for the following task.
Task: given A ∈ K n,n , find smallest (in modulus) eigenvalue of regular A ∈ K n,n
and (an) associated eigenvector.
If A ∈ K^{n,n} is regular:  smallest (in modulus) EV of A = ( largest (in modulus) EV of A^{−1} )^{−1}
MATLAB-code 9.3.54: inverse iteration for computing λ_min(A) and associated eigenvector
1 function [lmin,y] = invit(A,tol)
2 [L,U] = lu(A); % single initial LU-factorization, see Rem. 2.5.10
3 n = size(A,1); x = rand(n,1); x = x/norm(x); % random initial guess
4 y = U\(L\x); lmin = 1/norm(y); y = y*lmin; lold = 0;
5 while (abs(lmin-lold) > tol*lmin) % termination, if small relative change
6   lold = lmin; x = y;
7   y = U\(L\x); % core iteration: y = A^{-1}x
8   lmin = 1/norm(y); % new approximation of λ_min(A)
9   y = y*lmin; % normalization y := y/‖y‖_2
10 end
where: (A − αI)−1 z(k−1) = ˆ solve (A − αI)w = z(k−1) based on Gaussian elimination (↔ a single
LU-factorization of A − αI as in Code 9.3.54).
Stability of Gaussian elimination/LU-factorization (→ ??) will ensure that “w from (9.3.56) points in
the right direction”
In other words, roundoff errors may badly affect the length of the solution w, but not its direction.
Practice [?]: If, in the course of Gaussian elimination/LU-factorization a zero pivot element is really en-
countered, then we just replace it with eps, in order to avoid inf values!
|λ_j − α| / min{ |λ_i − α| : i ≠ j }   with  λ_j ∈ σ(A) ,  |α − λ_j| ≤ |α − λ| ∀λ ∈ σ(A) .
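A hedged sketch of the shifted inverse iteration (9.3.56), reusing a single LU-factorization of A − αI in the spirit of Code 9.3.54; the test matrix and the shift α are arbitrary choices for illustration.

n = 50; A = gallery('tridiag',n,-1,2,-1);      % arbitrary symmetric test matrix
alpha = 0.01;                                   % shift, assumed close to the wanted eigenvalue
[L,U] = lu(A - alpha*speye(n));                 % single LU-factorization of A - alpha*I
z = rand(n,1); z = z/norm(z);
for k = 1:30
    w = U\(L\z);                                % solve (A - alpha*I) w = z^(k-1), cf. (9.3.56)
    z = w/norm(w);                              % normalization
end
lambda = z'*A*z;                                % Rayleigh quotient: eigenvalue of A closest to alpha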
MATLAB-code 9.3.61:
d = (1:10)';
n = length(d);
Z = diag(sqrt(1:n),0) + ones(n,n);
[Q,R] = qr(Z);
A = Q*diag(d,0)*Q';
[Fig.: errors versus step k (semi-logarithmic plot);  o : |λ_min − ρ_A(z^(k))|,  ∗ : ‖z^(k) − x_j‖, λ_min = λ_j, x_j ∈ EigAλ_j, ‖x_j‖_2 = 1]
Task: given A ∈ K n,n , find smallest (in modulus) eigenvalue of regular A ∈ K n,n
and (an) associated eigenvector.
Options: inverse iteration (→ Code 9.3.54) and Rayleigh quotient iteration (9.3.59).
• for large sparse A the amount of fill-in exhausts memory, despite sparse elimination techniques (→
Section 2.7.5),
We expect that an approximate solution of the linear systems of equations encountered during inverse iteration
should be sufficient, because we are dealing with approximate eigenvectors anyway.
Thus, iterative solvers for solving Aw = z(k−1) may be considered, see Chapter 10. However, the required
accuracy is not clear a priori. Here we examine an approach that completely dispenses with an iterative
solver and uses a preconditioner (→ Notion 10.3.3) instead.
➣ B=
ˆ Preconditioner for A, see Notion 10.3.3
MATLAB-code 9.3.66:
1 A = spdiags(repmat([1/n,-1,2*(1+1/n),-1,1/n],n,1), [-n/2,-1,0,1,n/2],n,n);
2 evalA = @(x) A*x;
3 % inverse iteration
4 invB = @(x) A\x;
5 % tridiagonal preconditioning
6 B = spdiags(spdiags(A,[-1,0,1]),[-1,0,1],n,n); invB = @(x) B\x;
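The preconditioned inverse iteration PINVIT (9.3.63) itself is not reproduced in this excerpt; the sketch below assumes the common residual-correction form of the update and combines it with the tridiagonal preconditioner of Code 9.3.66.

n = 200;
A = spdiags(repmat([1/n,-1,2*(1+1/n),-1,1/n],n,1), [-n/2,-1,0,1,n/2], n, n);
B = spdiags(spdiags(A,[-1,0,1]), [-1,0,1], n, n);   % tridiagonal part as preconditioner
invB = @(x) B\x;
z = rand(n,1); z = z/norm(z);
for k = 1:50
    rho = z'*A*z;                    % Rayleigh quotient (z is normalized)
    r = A*z - rho*z;                 % residual of the current eigenpair approximation
    z = z - invB(r);                 % preconditioned correction (assumed PINVIT-type update)
    z = z/norm(z);                   % normalization
end
lmin_approx = z'*A*z;                % approximation of the smallest eigenvalue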
Monitored: error decay during iteration of Code 9.3.64: |ρA (z(k) ) − λmin (A)|
[Fig. 339: error |ρ_A(z^(k)) − λ_min(A)| versus iteration step; Fig. 340: number of iteration steps versus matrix size n]
For small residual Az(k−1) − ρA (z(k−1) )z(k−1) PINVIT almost agrees with the regular inverse iteration.
ÿ + λ^2 y = cos(ωt) ,   (9.3.69)
ÿ + Ay = b cos(ωt) ,   (9.3.71)
with symmetric, positive (semi)definite matrix A ∈ R^{n,n}, b ∈ R^n. By Cor. 9.1.9 there is an orthogonal matrix Q ∈ R^{n,n} such that
Q^⊤ A Q = D := diag(λ_1, . . . , λ_n) .
(9.3.71) ⇒ z̈ + Dz = Q^⊤ b cos(ωt)  for z := Q^⊤ y .
☛ We have obtained decoupled linear 2nd-order scalar ODEs of the type (9.3.69). (9.3.71) can have growing (with time) solutions, if ω = √λ_i for some i = 1, . . . , n.
If ω = √λ_j for one j ∈ {1, . . . , n}, then the solution of the initial value problem for (9.3.71) with y(0) = ẏ(0) = 0 (↔ z(0) = ż(0) = 0) behaves like
z(t) ∼ (t/(2ω)) sin(ωt) e_j + bounded oscillations
⇕
y(t) ∼ (t/(2ω)) sin(ωt) (Q)_{:,j} + bounded oscillations ,
where (Q)_{:,j} is the j-th eigenvector of A.
Eigenvectors of A ↔ excitable states
Example 9.3.72 (Vibrations of a truss structure, cf. [?, Sect. 3], MATLAB's truss demo)
[Fig. 341: planar truss structure]
Assumptions: ✦ Truss in static equilibrium (perfect balance of forces at each point mass).
✦ Rods are perfectly elastic (i.e., frictionless).
✦ Hooke's law holds for the force in the direction of a rod:
F = α Δl/l ,   (9.3.74)
✁ deformed truss:
l_ij := ‖Δp_ji‖_2 ,  Δp_ji := p_j − p_i ,   (9.3.75)
Δl_ij(t) := ‖Δp_ji + Δu_ji(t)‖_2 − l_ij ,  Δu_ji(t) := u_j(t) − u_i(t) .   (9.3.76)
F_ij(t) = −α_ij (Δl_ij / l_ij) · (Δp_ji + Δu_ji(t)) / ‖Δp_ji + Δu_ji(t)‖_2 .   (9.3.77)
✞ ☎
Assumption: Small displacements
✝ ✆
➣ Possibility of linearization by neglecting terms of order ‖u_i‖_2^2
F_ij(t) = ((9.3.75), (9.3.76)) = α_ij ( 1/‖Δp_ji + Δu_ji(t)‖_2 − 1/‖Δp_ji‖_2 ) · (Δp_ji + Δu_ji(t)) .   (9.3.78)
1/‖x + y‖_2 = 1/‖x‖_2 − (x · y)/‖x‖_2^3 + O(‖y‖_2^2) .
Proof. Simple Taylor expansion up to the linear term for f(x) = (x_1^2 + · · · + x_d^2)^{−1/2}:  f(x + y) = f(x) + grad f(x) · y + O(‖y‖_2^2).
✷
Linearization of the force: apply Lemma 9.3.79 to (9.3.78) and drop terms O(‖Δu_ji‖_2^2):
F_ij(t) ≈ −α_ij (Δp_ji · Δu_ji(t)) / l_ij^3 · (Δp_ji + Δu_ji(t)) ≈ −α_ij (Δp_ji · Δu_ji(t)) / l_ij^3 · Δp_ji .   (9.3.80)
m_i (d^2/dt^2) u_i(t) = F_i = ∑_{j=1, j≠i}^{n} −F_ij(t) ,   (9.3.81)
m_i =̂ mass of point mass i.
m_i (d^2/dt^2) u_i(t) = ∑_{j=1, j≠i}^{n} α_ij (1/l_ij^3) Δp_ji (Δp_ji)^⊤ ( u_j(t) − u_i(t) ) .   (9.3.82)
Compact notation: collect all displacements into one vector u(t) = ( u_i(t) )_{i=1}^{n} ∈ R^{2n}:
(9.3.82) ⇒ M (d^2 u/dt^2)(t) + A u(t) = f(t) .   (9.3.83)
✛ ✘
Rem. 9.3.68: if periodic external forces f(t) = cos(ωt) f, f ∈ R^{2n} (wind, earthquake), act on the truss, they can excite vibrations of (linearly in time) growing amplitude, if ω coincides with √λ_j for an eigenvalue λ_j of A.
✚ ✙
Excited vibrations can lead to the collapse of a truss structure, cf. the notorious Tacoma-Narrows bridge disaster.
It is essential to know whether eigenvalues of a truss structure fall into a range that can be excited
by external forces.
These will typically(∗) be the low modes ↔ a few of the smallest eigenvalues.
((∗) Reason: fast oscillations will quickly be damped due to friction, which was neglected in our model.)
[Fig. 343: eigenvalues of the truss stiffness matrix plotted versus their number]
The stiffness matrix will always possess three zero eigenvalues corresponding to rigid body modes (= displacements without change of length of the rods).
[Fig. 344–347: plots of truss eigenmodes]
To compute a few of a truss's lowest resonant frequencies and excitable modes, we need efficient numerical methods for the following tasks. Obviously, Code 9.3.85 cannot be used for large trusses, because eig invariably operates on dense matrices and will be prohibitively slow and gobble up huge amounts of memory, also recall the discussion of Code 9.3.53.
Of course, we aim to tackle this task by iterative methods generalizing power iteration (→ Section 9.3.1)
and inverse iteration (→ Section 9.3.2).
9.3.4.1 Orthogonalization
According to Cor. 9.1.9: For A = A⊤ ∈ R n,n there is a factorization A = UDU⊤ with D = diag(λ1 , . . . , λn ),
λ j ∈ R, λ1 ≤ λ2 ≤ · · · ≤ λn , and U orthogonal. Thus, u j := (U):,j , j = 1, . . . , n, are (mutually orthog-
onal) eigenvectors of A.
If we just carry out the direct power iteration (9.3.12) for two vectors, both sequences will converge to an eigenvector for the largest (in modulus) eigenvalue. However, we recall that all eigenvectors are mutually orthogonal. This suggests that we orthogonalize the iterates of the second power iteration (that is to yield the eigenvector for the second largest eigenvalue) with respect to those of the first. This idea spawns the following iteration, cf. Gram-Schmidt orthogonalization in (10.2.11):
✁ Orthogonalization of two vectors: w ↦ w − (w · v/‖v‖_2) v/‖v‖_2 (see Line 4 of Code 9.3.86). [Fig. 348]
Analysis through eigenvector expansions (v, w ∈ R^n, ‖v‖_2 = ‖w‖_2 = 1):
v = ∑_{j=1}^{n} α_j u_j ,  w = ∑_{j=1}^{n} β_j u_j ,
⇒ Av = ∑_{j=1}^{n} λ_j α_j u_j ,  Aw = ∑_{j=1}^{n} λ_j β_j u_j ,
v_0 := Av/‖Av‖_2 = ( ∑_{j=1}^{n} λ_j^2 α_j^2 )^{−1/2} ∑_{j=1}^{n} λ_j α_j u_j ,
Aw − (v_0^⊤ Aw) v_0 = ∑_{j=1}^{n} ( β_j − ( ∑_{i=1}^{n} λ_i^2 α_i β_i / ∑_{i=1}^{n} λ_i^2 α_i^2 ) α_j ) λ_j u_j .
We notice that v is just mapped to the next iterate in the regular direct power iteration (9.3.12). After many
steps, it will be very close to un , and, therefore, we may now assume v = un ⇔ α j = δj,n (Kronecker
symbol).
z := Aw − (v_0^⊤ Aw) v_0 = 0 · u_n + ∑_{j=1}^{n−1} λ_j β_j u_j ,
w^(new) := z/‖z‖_2 = ( ∑_{j=1}^{n−1} λ_j^2 β_j^2 )^{−1/2} ∑_{j=1}^{n−1} λ_j β_j u_j .
The sequence w^(k) produced by repeated application of the mapping given by Code 9.3.86 asymptotically (that is, when v^(k) has already converged to u_n) agrees with the sequence produced by the direct power method for Ã := U diag(λ_1, . . . , λ_{n−1}, 0) U^⊤. Its convergence will be governed by the relative gap λ_{n−2}/λ_{n−1}, see Thm. 9.3.21.
However: if v(k) itself converges slowly, this reasoning does not apply.
MATLAB-code 9.3.88: power iteration with orthogonal projection for two vectors
1 function sppowitdriver(d,maxit)
2 % monitor power iteration with orthogonal projection for finding
3 % the two largest (in modulus) eigenvalues and associated eigenvectors
4 % of a symmetric matrix with prescribed eigenvalues passed in d
5 if (nargin < 10), maxit = 20; end
6 if (nargin < 1), d = (1:10)'; end
7 % Generate matrix
8 n = length(d);
9 Z = diag(sqrt(1:n),0) + ones(n,n);
10 [Q,R] = qr(Z); % generate orthogonal matrix
11 A = Q*diag(d,0)*Q'; % “synthetic” A = A^T with spectrum σ(A) = {d_1, . . . , d_n}
12 % Compute “exact” eigenvectors and eigenvalues
13 [V,D] = eig(A); [d,idx] = sort(diag(D)),
14 v_ex = V(:,idx(n)); w_ex = V(:,idx(n-1));
15 lv_ex = d(n); lw_ex = d(n-1);
16
29 min(norm(v-v_ex),norm(v+v_ex)), min(norm(w-w_ex),norm(w+w_ex))];
30 end
31
32 figure('name','sspowit');
33 semilogy(result(:,1),result(:,2),'m-+',...
34   result(:,1),result(:,3),'r-*',...
35   result(:,1),result(:,4),'k-^',...
36   result(:,1),result(:,5),'b-p');
37 title('d = [0.5*(1:8),9.5,10]');
38 xlabel('{\bf power iteration step}','fontsize',14);
39 ylabel('{\bf error}','fontsize',14);
40 legend('error in \lambda_n','error in \lambda_n-1','error in v','error in w','location','northeast');
41 print -depsc2 '../PICTURES/sspowitcvg1.eps';
42
[Fig. 349–352: errors in λ_n, λ_{n−1}, v, w and the corresponding error quotients versus power iteration step, for d = [0.5*(1:8),9.5,10]]
Nothing new:
Gram-Schmidt orthonormalization
(→ [?, Thm. 4.8], [?, Alg. 6.1], [?, Sect. 3.4.3])
➊ q_l^⊤ q_k = δ_{lk}  (orthonormality) ,   (9.3.89)
➋ Span{q_1, . . . , q_k} = Span{v_1, . . . , v_k}  for all k = 1, . . . , m .   (9.3.90)
z_1 = v_1 ,
z_2 = v_2 − (v_2^⊤ z_1)/(z_1^⊤ z_1) z_1 ,
z_3 = v_3 − (v_3^⊤ z_1)/(z_1^⊤ z_1) z_1 − (v_3^⊤ z_2)/(z_2^⊤ z_2) z_2 ,   (9.3.91)
  ⋮
+ normalization  q_k = z_k / ‖z_k‖_2 ,  k = 1, . . . , m .   (9.3.92)
Easy computation: the vectors q_1, . . . , q_m produced by (9.3.91) satisfy (9.3.89) and (9.3.90).
11 q = q - dot (Q(:,k),V(:,l))*Q(:,k);
12 end
13 Q = [Q,q/norm(q)]; % normalization
14 end
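For completeness, a self-contained version of the Gram-Schmidt loop (9.3.91)/(9.3.92), of which the listing above shows only the inner orthogonalization; the function name and interface are chosen here and not taken from the lecture codes.

function Q = gsorthonormalize(V)
% Classical Gram-Schmidt orthonormalization (9.3.91), (9.3.92):
% the columns of Q form an ONB of the column space of V.
[n,m] = size(V);
Q = zeros(n,0);
for l = 1:m
    q = V(:,l);
    for k = 1:size(Q,2)
        q = q - dot(Q(:,k),V(:,l))*Q(:,k);   % subtract projections onto previous q_k
    end
    Q = [Q, q/norm(q)];                      % normalization (9.3.92)
end
end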
The following two MATLAB code snippets perform the same function, cf. Code 9.3.86:
MATLAB-code 9.3.97: general subspace power iteration step with qr-based orthonormalization
1 f u n c t i o n V = sspowitstep(A,V)
2 % power iteration with orthonormalization for A = A T .
3 % columns of matrix V span subspace for power iteration.
4 V = A*V; % actual power iteration on individual columns
5 [V,R] = qr(V,0); % Gram-Schmidt orthonormalization (9.3.91)
✦ the first column of V, (V):,1 , is a sequence of vectors created by the standard direct power method
(9.3.12).
✦ reasoning: the other columns of V, after each multiplication with A can be expected to contain a
significant component in the direction of the eigenvector associated with the eigenvalue of largest
modulus.
Since the columns of V span a subspace of R n , this idea can be recast as the following task:
⇔ ∃w ∈ V \ {0}: Aw = λw
⇔ ∃u ∈ K^m \ {0}: AVu = λVu
⇒ ∃u ∈ K^m \ {0}: V^H A V u = λ V^H V u ,   (9.3.98)
If our initial assumption holds true, u solves (9.3.99), and λ is a simple eigenvalue, then a corresponding x ∈ EigAλ can be recovered as x = Vu.
Note: If V is unitary (→ Def. 6.2.2), then the generalized eigenvalue problem (9.3.99) will become a
standard linear eigenvalue problem.
We revisit m = 2, see Code 9.3.86. Recall that by the min-max theorem Thm. 9.3.41
Idea: maximize the Rayleigh quotient over Span{v, w}, where v, w are output by Code 9.3.86. This leads to the optimization problem
(α*, β*) := argmax_{α,β∈R, α^2+β^2=1} ρ_A(αv + βw) = argmax_{α,β∈R, α^2+β^2=1} ρ_{(v,w)^⊤ A (v,w)}( (α, β)^⊤ ) ,   (9.3.102)
v* := α* v + β* w .
Note that ‖v*‖_2 = 1, if both v and w are normalized, which is guaranteed in Code 9.3.86.
Again the min-max theorem Thm. 9.3.41 tells us that we can find (α*, β*)^⊤ as eigenvector to the largest eigenvalue of
(v, w)^⊤ A (v, w) (α, β)^⊤ = λ (α, β)^⊤ .   (9.3.103)
MATLAB-code 9.3.104: one step of subspace power iteration with Ritz projection, matrix version
1 f u n c t i o n V = sspowitsteprp(A,V)
2 V = A*V; % power iteration applied to columns of V
3 [Q,R] = qr(V,0); % orthonormalization, see Section 9.3.4.1
4 [U,D] = eig(Q’*A*Q); % Solve Ritz projected m × m eigenvalue problem
5 V = Q*U; % recover approximate eigenvectors
Note that the orthogonalization step in Code 9.3.104 is actually redundant, if exact arithmetic could be employed, because the Ritz projection could also be realized by solving the generalized eigenvalue problem. However, prior orthogonalization is essential for numerical stability (→ Def. 1.5.85), cf. the discussion in Section 3.3.3.
Listing 9.1: Main loop: power iteration with Ritz projection for two eigenvectors
1 % See Code 9.3.88 for generation of matrix A and output
2 for k=1:maxit
3   v_new = A*v; w_new = A*w; % “power iteration”, cf. (9.3.12)
4   [Q,R] = qr([v_new,w_new],0); % orthogonalization, see Sect. 9.3.4.1
5   [U,D] = eig(Q'*A*Q); % Solve Ritz projected eigenvalue problem
6   [ev,idx] = sort(abs(diag(D))), % Sort eigenvalues
7   w = Q*U(:,idx(1)); v = Q*U(:,idx(2)); % Recover approximate eigenvectors
8
[Fig. 353, Fig. 354: error decay versus power iteration step for the power iteration with Ritz projection]
[Fig. 355, Fig. 356: error quotients for λ_n, λ_{n−1}, v, w versus power iteration step, d = [0.5*(1:8),9.5,10]]
In Code 9.3.104: diagonal entries of D provide approximations of eigenvalues. Their (relative) changes
can be used as a termination criterion.
S.p.d. test matrix: a_ij := min{i, j}/max{i, j} (MATLAB: n=200; A = gallery('lehmer',n);)
“Initial eigenvector guesses”: V = eye(n,m);
[Fig. 357: errors in the eigenvalues λ_1, λ_2, λ_3 versus iteration step, for subspace dimensions m = 3 and m = 6]
• Observation: linear convergence of eigenvalues
• choice m > k boosts convergence of eigenvalues
Analogous to § 9.3.106: construction of subspace variants of inverse iteration (→ Code 9.3.54), PINVIT (9.3.63), and Rayleigh quotient iteration (9.3.59).
All power methods (→ Section 9.3) for the eigenvalue problem (EVP) Ax = λx only rely on the last iterate
to determine the next one (1-point methods, cf. (8.1.4))
“Memory for power iterations”: pursue same idea that led from the gradient method, § 10.1.11, to the con-
jugate gradient method, § 10.2.17: use information from previous iterates to achieve efficient minimization
over larger and larger subspaces.
u_1, . . . , u_n =̂ corresponding orthonormal eigenvectors, cf. Cor. 9.1.9:
AU = UD ,  U = (u_1, . . . , u_n) ∈ R^{n,n} ,  D = diag(λ_1, . . . , λ_n) .
We recall
V = Span{z^(0), Az^(0), . . . , A^k z^(0)} = K_{k+1}(A, z^(0)) , a Krylov space, → Def. 10.2.6 .   (9.4.2)
MATLAB-code 9.4.5:
1 n=100;
2 M=gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
3 [Q,R]=qr(M); A=Q'*diag(1:n)*Q; % synthetic matrix, σ(A) = {1, 2, 3, . . . , 100}
[Fig. 358: the three largest Ritz values µ_m, µ_{m−1}, µ_{m−2} versus dimension m of the Krylov space; Fig. 359: errors |λ_m − µ_m|, |λ_{m−1} − µ_{m−1}|, |λ_{m−2} − µ_{m−2}|]
Observation: “vaguely linear” convergence of largest Ritz values (notation µi ) to largest eigenvalues.
Fastest convergence of largest Ritz value → largest eigenvalue of A
[Fig. 360: the three smallest Ritz values µ_1, µ_2, µ_3 versus dimension m of the Krylov space; Fig. 361: errors |λ_1 − µ_1|, |λ_2 − µ_2|, |λ_3 − µ_3|]
Observation: Also the smallest Ritz values converge “vaguely linearly” to the smallest eigenvalues of A.
Fastest convergence of smallest Ritz value → smallest eigenvalue of A.
➣ u_1 can also be expected to be “well captured” by K_k(A, x) and the smallest Ritz value should provide a good approximation for λ_min(A).
Recall from Section 10.2.2, Lemma 10.2.12:
Proof. By Lemma 10.2.12, {r_0, . . . , r_{ℓ−1}} is an orthogonal basis of K_ℓ(A, r_0), if all the residuals are non-zero. As A K_{ℓ−1}(A, r_0) ⊂ K_ℓ(A, r_0), we conclude the orthogonality r_m^⊤ A r_j = 0 for all j = 0, . . . , m − 2. Since
( V_m^⊤ A V_m )_{ij} = r_{i−1}^⊤ A r_{j−1} ,  1 ≤ i, j ≤ m ,
V_l^H A V_l = \begin{pmatrix} α_1 & β_1 & & & \\ β_1 & α_2 & β_2 & & \\ & β_2 & α_3 & \ddots & \\ & & \ddots & \ddots & β_{l−1} \\ & & & β_{l−1} & α_l \end{pmatrix} =: T_l ∈ K^{l,l}   [tridiagonal matrix]   (9.4.11)
Total computational effort for l steps of Lanczos process, if A has at most k non-zero entries per row:
O(nkl )
Note: Code 9.4.12 assumes that no residual vanishes. This could happen, if z0 exactly belonged to
the span of a few eigenvectors. However, in practical computations inevitable round-off errors will always
ensure that the iterates do not stay in an invariant subspace of A, cf. Rem. 9.3.22.
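Code 9.4.12 is not reproduced in this excerpt; the following is a minimal sketch of the standard three-term Lanczos recursion that produces the basis V_l and the entries of T_l from (9.4.11). Function name and interface are chosen for illustration, and no safeguard against breakdown is included.

function [V,alpha,beta] = lanczossketch(A,z0,l)
% Lanczos process: builds an (in exact arithmetic) orthonormal basis V of
% K_l(A,z0) and the entries alpha, beta of the tridiagonal matrix T_l, cf. (9.4.11).
n = size(A,1);
V = zeros(n,l); alpha = zeros(l,1); beta = zeros(l-1,1);
v = z0/norm(z0); vold = zeros(n,1); b = 0;
for k = 1:l
    V(:,k) = v;
    w = A*v - b*vold;              % three-term recursion
    alpha(k) = v'*w;
    w = w - alpha(k)*v;
    if k < l
        beta(k) = norm(w);         % assumed non-zero (no premature breakdown)
        vold = v; v = w/beta(k); b = beta(k);
    end
end
end
% Ritz values: eig(diag(alpha) + diag(beta,1) + diag(beta,-1))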
Convergence (what we expect from the above considerations) → [?, Sect. 8.5]:
in the l-th step:  λ_n ≈ µ_l^(l) ,  λ_{n−1} ≈ µ_{l−1}^(l) ,  . . . ,  λ_1 ≈ µ_1^(l) ,
σ(T_l) = { µ_1^(l), . . . , µ_l^(l) } ,  µ_1^(l) ≤ µ_2^(l) ≤ · · · ≤ µ_l^(l) .
[Fig. 362, Fig. 363: errors of the Ritz values approximating λ_n, λ_{n−1}, λ_{n−2}, λ_{n−3} versus step of the Lanczos process]
However for A ∈ R 10,10 , aij = min{i, j} good initial convergence, but sudden “jump” of Ritz values off
eigenvalues!
σ(A) = {0.255680,0.273787,0.307979,0.366209,0.465233,0.643104,1.000000,1.873023,5.048917,44.766069}
σ(T) = {0.263867,0.303001,0.365376,0.465199,0.643104,1.000000,1.873023,5.048917,44.765976,44.766069}
l σ (Tl )
1 38.500000
2 3.392123 44.750734
10 0.263867 0.303001 0.365376 0.465199 0.643104 1.000000 1.873023 5.048917 44.765976 44.766069
Idea: ✦ do not rely on the orthogonality relations of Lemma 10.2.12
     ✦ use explicit Gram-Schmidt orthogonalization [?, Thm. 4.8], [?, Alg. 6.1]
ṽ_{l+1} := A v_l − ∑_{j=1}^{l} (v_j^H A v_l) v_j ,   v_{l+1} := ṽ_{l+1} / ‖ṽ_{l+1}‖_2   ⇒  v_{l+1} ⊥ K_l(A, z) .   (9.4.15)
➣ Computational cost for l steps, if at most k non-zero entries in each row of A: O(nkl^2)
16 end
✎ ☞
If it does not stop prematurely, the Arnoldi process of Code 9.4.16 will yield an orthonormal basis (ONB) of K_{k+1}(A, v_0) for a general A ∈ C^{n,n}.
✍ ✌
[Diagram: matrix view of the Arnoldi process, relating A, the orthonormal columns v_1, . . . , v_{l+1}, and the matrix of orthogonalization coefficients]
function [dn,V,Ht] = arnoldieig(A,v0,k,tol)
n = size(A,1); V = [v0/norm(v0)];
Ht = zeros(1,0); dn = zeros(k,1);
for l = 1:n
  d = dn;
  Ht = [Ht, zeros(l,1); zeros(1,l)];
  vt = A*V(:,l);
  for j = 1:l
    Ht(j,l) = dot(V(:,j),vt);
    vt = vt - Ht(j,l)*V(:,j);
  end
  ev = sort(eig(Ht(1:l,1:l)));
[Fig. 364, Fig. 365: approximation errors of the Ritz values for λ_n, λ_{n−1}, λ_{n−2}, λ_{n−3}: Lanczos process versus Arnoldi process]
l σ (Hl )
1 38.500000
2 3.392123 44.750734
10 0.255680 0.273787 0.307979 0.366209 0.465233 0.643104 1.000000 1.873023 5.048917 44.766069
For the above examples both the Arnoldi process and the Lanczos process are algebraically equivalent,
because they are applied to a symmetric matrix A = A T . However, they behave strikingly differently,
which indicates that they are not numerically equivalent.
The Arnoldi process is much less affected by roundoff than the Lanczos process, because it does not take
for granted orthogonality of the “residual vector sequence”. Hence, the Arnoldi process enjoys superior
numerical stability (→ ??, Def. 1.5.85) compared to the Lanczos process.
Eigenvalue approximation from Arnoldi process for non-symmetric A, initial vector ones(100,1);
MATLAB-code 9.4.23:
1 n=100;
2 M=full(gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1)));
3 A=M*diag(1:n)*inv(M);
[Fig. 366: largest Ritz values versus step of the Arnoldi process; Fig. 367: approximation errors of the Ritz values for λ_n, λ_{n−1}, λ_{n−2}]
[Fig. 368: smallest Ritz values versus step of the Arnoldi process; Fig. 369: approximation errors of the Ritz values for λ_1, λ_2, λ_3]
Observation: “vaguely linear” convergence of largest and smallest eigenvalues, cf. Ex. 9.4.4.
Krylov subspace iteration methods (= Arnoldi process, Lanczos process) attractive for computing a
few of the largest/smallest eigenvalues and associated eigenvectors of large sparse matrices.
Adaptation of Krylov subspace iterative eigensolvers to generalized EVP: Ax = λBx, B s.p.d.: replace
Euclidean inner product with “B-inner product” (x, y) 7→ x H By.
MATLAB-function: eigs (Krylov subspace eigensolver for large sparse matrices)
Chapter 10
Krylov Methods for Linear Systems of Equations
Supplementary reading. There is a wealth of literature on iterative methods for the solution of
linear systems of equations: The two books [?] and [?] offer a comprehensive treatment of the topic
(the latter is available online for ETH students and staff).
Concise presentations can be found in [?, Ch. 4] and [?, Ch. 13].
Learning outcomes:
• Understanding when and why iterative solution of linear systems of equations may be preferred to
direct solvers based on Gaussian elimination.
=̂ a class of iterative methods (→ Section 8.1) for the approximate solution of large linear systems of equations Ax = b, A ∈ K^{n,n}.
BUT, we have reliable direct methods (Gauss elimination → Section 2.3, LU-factorization → § 2.3.30,
QR-factorization → ??) that provide an (apart from roundoff errors) exact solution with a finite number
of elementary operations!
Alas, direct elimination may not be feasible, or may be grossly inefficient, because
• it may be too expensive (e.g. for A too large, sparse), → (2.3.25),
• inevitable fill-in may exhaust main memory,
• the system matrix may be available only as procedure y=evalA(x) ↔ y = Ax
Contents
10.1 Descent Methods [?, Sect. 4.3.3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
10.1.1 Quadratic minimization context . . . . . . . . . . . . . . . . . . . . . . . . . 671
10.1.2 Abstract steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
10.1.3 Gradient method for s.p.d. linear system of equations . . . . . . . . . . . . . 673
10.1.4 Convergence of the gradient method . . . . . . . . . . . . . . . . . . . . . . . 674
10.2 Conjugate gradient method (CG) [?, Ch. 9], [?, Sect. 13.4], [?, Sect. 4.3.4] . . . . . 678
10.2.1 Krylov spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
10.2.2 Implementation of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
10.2.3 Convergence of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
10.3 Preconditioning [?, Sect. 13.5], [?, Ch. 10], [?, Sect. 4.3.5] . . . . . . . . . . . . . . . 688
Focus:
Linear system of equations Ax = b, A ∈ R n,n , b ∈ R n , n ∈ N given,
with symmetric positive definite (s.p.d., → Def. 1.1.8) system matrix A
➨
A-inner product (x, y) 7→ x⊤ Ay ⇒ “A-geometry”
Definition 10.1.1. Energy norm → [?, Def. 9.1]
A s.p.d. matrix A ∈ R^{n,n} induces an energy norm ‖x‖_A := (x^⊤ A x)^{1/2}, x ∈ R^n.
However, the (conjugate) gradient methods introduced below also work for LSE Ax = b with A ∈ C n,n ,
A = A H s.p.d. when ⊤ is replaced with H (Hermitian transposed). Then, all theoretical statements remain
valid unaltered for K = C.
Lemma 10.1.3. S.p.d. LSE and quadratic minimization problem [?, (13.37)]
The quadratic functional J(x) := ½ x^⊤ A x − b^⊤ x (10.1.4) has the unique global minimizer x* = A^{−1}b.
Proof (sketch). A direct computation shows J(x) − J(x*) = ½ ‖x − x*‖_A^2. Then the assertion follows from the properties of the energy norm.
[Fig. 370: surface plot of a quadratic functional J(x_1, x_2); Fig. 371: its level lines]
✞ ☎
Level lines of quadratic functionals with s.p.d. A are (hyper)ellipses
✝ ✆
Fig. 372
However, for the quadratic minimization problem (10.1.4) § 10.1.7 will converge:
(“Geometric intuition”, see Fig. 370: quadratic functional J with s.p.d. A has unique global minimum,
grad J 6= 0 away from minimum, pointing towards it.)
Adaptation: steepest descent algorithm § 10.1.7 for quadratic minimization problem (10.1.4), see [?,
Sect. 7.2.4]:
➣ For the descent direction in § 10.1.7 applied to the minimization of J from (10.1.4) holds
d_k = b − Ax^(k) =: r_k , the residual (→ Def. 2.4.1) for x^(k) .
§ 10.1.7 for F = J from (10.1.4): the function to be minimized in the line search step is
φ(t) := J(x^(k) + t d_k) = J(x^(k)) + t d_k^⊤ (Ax^(k) − b) + ½ t^2 d_k^⊤ A d_k   ➙ a parabola!
dφ/dt (t*) = 0  ⇔  t* = (d_k^⊤ d_k)/(d_k^⊤ A d_k)   (unique minimizer) .   (10.1.10)
✬ ✩
One step of gradient method involves
✦ A single matrix×vector product with A ,
✦ 2 AXPY-operations (→ Section 1.3.2) on vectors of length n,
✦ 2 dot products in R n .
✫ ✪
Computational cost (per step) = cost(matrix×vector) + O(n)
➣ If A ∈ R n,n is a sparse matrix (→ ??) with “O(n) nonzero entries”, and the data structures allow
to perform the matrix×vector product with a computational effort O(n), then a single step of the
gradient method costs O(n) elementary operations.
➣ Gradient method of § 10.1.11 only needs A×vector in procedural form y = evalA(x).
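Code 10.1.12 is not part of this excerpt; a minimal sketch of the gradient method with the optimal step size (10.1.10), accessing A only through a handle y = evalA(x) as required above (names chosen for illustration):

function x = graditsketch(evalA, b, x, tol, maxit)
% Gradient method for Ax = b, A s.p.d., with exact line search (10.1.10).
% evalA: handle realizing x |-> A*x;  x: initial guess.
r = b - evalA(x);                          % initial residual = descent direction
for k = 1:maxit
    Ar = evalA(r);                         % the single matrix-vector product per step
    t = (r'*r)/(r'*Ar);                    % optimal step size t*, see (10.1.10)
    x = x + t*r;                           % AXPY update of the iterate
    r = r - t*Ar;                          % AXPY update of the residual
    if norm(r) <= tol*norm(b), break; end  % simple residual-based stopping rule
end
end

For a sparse matrix one would pass, e.g., evalA = @(x) A*x, so that each step indeed costs O(n) operations as stated above.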
[Fig. 373, Fig. 374: iterates x^(0), x^(1), x^(2), x^(3), . . . of the gradient method plotted on the level lines of J in the (x_1, x_2)-plane]
J(Qŷ) = ½ ŷ^⊤ D ŷ − (Q^⊤ b)^⊤ ŷ = ∑_{i=1}^{n} ( ½ d_i ŷ_i^2 − b̂_i ŷ_i ) ,   b̂ := Q^⊤ b .
Hence, a rigid transformation (rotation, reflection) maps the level surfaces of J from (10.1.4) to ellipses with principal axes d_i. As A is s.p.d., d_i > 0 is guaranteed.
Observations:
• Larger spread of spectrum leads to more elongated ellipses as level lines ➣ slower convergence
of gradient method, see Fig. 374.
r_k^⊤ r_{k+1} = r_k^⊤ r_k − ( (r_k^⊤ r_k)/(r_k^⊤ A r_k) ) r_k^⊤ A r_k = 0 .   (10.1.16)
[Fig. 375, Fig. 376: energy norm of the error and 2-norm of the residual versus iteration step k, for A = diag(1:0.01:2), A = diag(1:0.1:11), A = diag(1:1:101)]
1. (10.1.15): for every A ∈ R^{n,n} with A^⊤ = A there is an orthogonal matrix Q ∈ R^{n,n} such that A = Q^⊤ D Q with a diagonal matrix D (principal axis transformation), → Cor. 9.1.9, [?, Thm. 7.8], [?, Satz 9.15],
2. when applying the gradient method § 10.1.11 to both Ax = b and Dx̃ = b̃, b̃ := Qb, the iterates x^(k) and x̃^(k) are related by Qx^(k) = x̃^(k).
Observation:
✦ linear convergence (→ Def. 8.1.9), see also Rem. 8.1.13
✦ rate of convergence increases (↔ speed of convergence decreases) with the spread of the spectrum of A
Impact of distribution of diagonal entries (↔ eigenvalues) of (diagonal matrix) A
(b = x∗ = 0, x0 = cos((1:n)’);)
Test matrix #1: A=diag(d); d = (1:100);
Test matrix #2: A=diag(d); d = [1+(0:97)/97 , 50 , 100];
Test matrix #3: A=diag(d); d = [1+(0:49)*0.05, 100-(0:49)*0.05];
Test matrix #4: eigenvalues exponentially dense at 1
[Fig. 377: distribution of the diagonal entries (↔ eigenvalues) of the four test matrices, and 2-norms of error and residual versus iteration step k for test matrices #1–#4]
Observation: Matrices #1, #2 & #4 ➣ little impact of distribution of eigenvalues on asymptotic con-
vergence (exception: matrix #2)
‖x^(k+1) − x*‖_A ≤ L ‖x^(k) − x*‖_A ,   L := ( cond_2(A) − 1 ) / ( cond_2(A) + 1 ) ,
that is, the iteration converges at least linearly (→ Def. 8.1.9) w.r.t. the energy norm (→ Def. 10.1.1).
Remark 10.1.19 (2-norm from eigenvalues → [?, Sect. 10.6], [?, Sect. 7.4])
‖A^{−1}‖_2 = min(|σ(A)|)^{−1} , if A regular.
✎ other notation: κ(A) := λ_max(A)/λ_min(A) =̂ spectral condition number of A
(for general A: λ_max(A)/λ_min(A) = largest/smallest eigenvalue in modulus)
These results are an immediate consequence of the fact that
∀A ∈ R n,n , A⊤ = A ∃U ∈ R n,n , U−1 = U⊤ : U⊤ AU is diagonal,
see (10.1.15), Cor. 9.1.9, [?, Thm. 7.8], [?, Satz 9.15].
Please note that for general regular M ∈ R n,n we cannot expect cond2 (M) = κ (M).
10.2 Conjugate gradient method (CG) [?, Ch. 9], [?, Sect. 13.4], [?,
Sect. 4.3.4]
Again we consider a linear system of equations Ax = b with s.p.d. (→ Def. 1.1.8) system matrix A ∈
R n,n and given b ∈ R n .
1D line search in § 10.1.11 is oblivious of former line searches, which rules out reuse of information gained
in previous steps of the iteration. This is a typical drawback of 1-point iterative methods.
Idea: Replace the 1D line search with a subspace correction
Given:
✦ initial guess x(0)
✦ nested subspaces U1 ⊂ U2 ⊂ U3 ⊂ · · · ⊂ Un = R n , dim Uk = k
Lemma 10.2.3. rk ⊥ Uk
With x(k) according to (10.2.1), Uk from (10.2.2) the residual rk := b − Ax(k) satisfies
r_k^⊤ u = 0  ∀u ∈ U_k   (“r_k ⊥ U_k”).
Geometric consideration: since x(k) is the minimizer of J over the affine space Uk + x(0) , the projection of
the steepest descent direction grad J (x(k) ) onto Uk has to vanish:
Proof. Consider
Corollary 10.2.5.
Lemma 10.2.3 also implies that, if U0 = {0}, then dim Uk = k as long as x(k) 6= x∗ , that is, before we
have converged to the exact solution.
(10.2.1) and (10.2.2) define the conjugate gradient method (CG) for the iterative solution of Ax = b
(hailed as a “top ten algorithm” of the 20th century, SIAM News, 33(4))
Lemma 10.2.7.
The subspaces U_k ⊂ R^n, k ≥ 1, defined by (10.2.1) and (10.2.2) satisfy U_k = K_k(A, r_0) = Span{r_0, Ar_0, . . . , A^{k−1} r_0} .
Since Uk+1 = Span{Uk , rk }, we obtain Uk+1 ⊂ Kk+1 (A, r0 ). Dimensional considerations based on
Lemma 10.2.3 finish the proof.
✷
10.2.2 Implementation of CG
(10.2.1) ⇔ ∂ψ/∂γ_j = 0 ,  j = 1, . . . , l .
This leads to a linear system of equations by which the coefficients γ_j can be computed:
\begin{pmatrix} p_1^⊤ A p_1 & \cdots & p_1^⊤ A p_l \\ \vdots & & \vdots \\ p_l^⊤ A p_1 & \cdots & p_l^⊤ A p_l \end{pmatrix} \begin{pmatrix} γ_1 \\ \vdots \\ γ_l \end{pmatrix} = \begin{pmatrix} p_1^⊤ r \\ \vdots \\ p_l^⊤ r \end{pmatrix} ,   r := b − Ax^(0) .   (10.2.8)
Recall: s.p.d. A induces an inner product ➣ concept of orthogonality [?, Sect. 4.4], [?, Sect. 6.2].
“A-geometry” like standard Euclidean space.
Span{p1 , . . . , pl } = Kl (A, r) .
(Efficient) successive computation of x(l ) becomes possible, see [?, Lemma 13.24]
(LSE (10.2.8) becomes diagonal !)
r_0 := b − Ax^(0) ;
for j = 1 to l do { x^(j) := x^(j−1) + ( (p_j^⊤ r_0)/(p_j^⊤ A p_j) ) p_j }   (10.2.9)
From linear algebra we already know a way to construct A-orthogonal basis vectors:
(10.2.10) ⇒ Idea: Gram-Schmidt orthogonalization [?, Thm. 4.8], [?, Alg. 6.1] of the residuals r_j := b − Ax^(j) w.r.t. the A-inner product:
p_1 := r_0 ,   p_{j+1} := r_j − ∑_{k=1}^{j} ( (p_k^⊤ A r_j)/(p_k^⊤ A p_k) ) p_k ,  j = 1, . . . , l − 1 ,  with r_j := b − Ax^(j) .   (10.2.11)
Geometric interpretation of (10.2.11): the subtracted sum =̂ orthogonal projection (w.r.t. the A-inner product) of r_j onto the subspace Span{p_1, . . . , p_j}.   [Fig. 378]
(10.2.9) & (10.2.11) ⇒ p_{j+1} = r_0 − ∑_{k=1}^{j} ( (p_k^⊤ r_0)/(p_k^⊤ A p_k) ) A p_k − ∑_{k=1}^{j} ( (p_k^⊤ A r_j)/(p_k^⊤ A p_k) ) p_k
⇒ p_{j+1} ∈ Span{ r_0, p_1, . . . , p_j, A p_1, . . . , A p_j } .
✷
Orthogonalities from Lemma 10.2.12 ➤ short recursions for pk , rk , x(k) !
(10.2.10) ⇒ (10.2.11) collapses to  p_{j+1} := r_j − ( (p_j^⊤ A r_j)/(p_j^⊤ A p_j) ) p_j ,  j = 1, . . . , l .
(10.2.9) ⇒ r_j = r_{j−1} − ( (p_j^⊤ r_0)/(p_j^⊤ A p_j) ) A p_j .
Lemma 10.2.12, (i) ⇒  r_{j−1}^H p_j = ( r_0 − ∑_{k=1}^{j−1} ( (r_0^⊤ p_k)/(p_k^⊤ A p_k) ) A p_k )^⊤ p_j = r_0^⊤ p_j .   (10.2.16)
The orthogonality (10.2.16) together with (10.2.15) permits us to replace r0 with r j−1 in the actual imple-
mentation.
In the CG algorithm r_j = b − Ax^(j) agrees with the residual associated with the current iterate (in exact arithmetic, cf. Ex. 10.2.21), but computation through the short recursion is more efficient.
➣ We find that the CG method possesses all the algorithmic advantages of the gradient method, cf. the
discussion in Section 10.1.3.
✎ ☞
1 matrix×vector product, 3 dot products, 3 AXPY-operations per step:
✍ ✌
If A sparse, nnz(A) ∼ n ➤ computational effort O(n) per step
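Code 10.2.18 is not contained in this excerpt; the following sketch of the standard CG recursion (names chosen here) realizes one matrix×vector product plus a few AXPY operations and dot products per step, in line with the cost statement above.

function x = cgsketch(evalA, b, x, tol, maxit)
% Conjugate gradient method for Ax = b, A s.p.d. (short recursions of Section 10.2.2).
r = b - evalA(x); p = r; rho = r'*r; nb = norm(b);
for k = 1:maxit
    Ap = evalA(p);                     % the single matrix-vector product
    t = rho/(p'*Ap);                   % step length along the A-orthogonal direction p
    x = x + t*p;                       % update of the iterate
    r = r - t*Ap;                      % short recursion for the residual
    rho_new = r'*r;
    if sqrt(rho_new) <= tol*nb, break; end
    p = r + (rho_new/rho)*p;           % new search direction
    rho = rho_new;
end
end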
MATLAB-function: pcg (preconditioned conjugate gradient method)
For any vector norm and associated matrix norm (→ Def. 1.5.76) holds (with residual r_l := b − Ax^(l))
(1/cond(A)) ‖r_l‖/‖r_0‖ ≤ ‖x^(l) − x*‖ / ‖x^(0) − x*‖ ≤ cond(A) ‖r_l‖/‖r_0‖ .   (10.2.20)
(10.2.20) can easily be deduced from the error equation A(x(k) − x∗ ) = rk , see Def. 2.4.1 and (2.4.12).
10.2.3 Convergence of CG
Note: CG is a direct solver, because (in exact arithmetic) x(k) = x∗ for some k ≤ n
Residual norms during the CG iteration ✄: [Fig. 379: 2-norm of the residual versus iteration step k]
R := [ r_0, . . . , r_10 ]
R⊤ R =
  1.000000 −0.000000  0.000000 −0.000000  0.000000 −0.000000  0.016019 −0.795816 −0.430569  0.348133
 −0.000000  1.000000 −0.000000  0.000000 −0.000000  0.000000 −0.012075  0.600068 −0.520610  0.420903
  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000000  0.001582 −0.078664  0.384453 −0.310577
 −0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000024  0.001218 −0.024115  0.019394
  0.000000 −0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000002  0.000151 −0.000118
 −0.000000  0.000000 −0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000000  0.000000
  0.016019 −0.012075  0.001582 −0.000024  0.000000 −0.000000  1.000000 −0.000000 −0.000000  0.000000
 −0.795816  0.600068 −0.078664  0.001218 −0.000002  0.000000 −0.000000  1.000000 −0.000000  0.000000
 −0.430569 −0.520610  0.384453 −0.024115  0.000151 −0.000000 −0.000000  0.000000  1.000000  0.000000
  0.348133  0.420903 −0.310577  0.019394 −0.000118  0.000000  0.000000 −0.000000  0.000000  1.000000
➣ Roundoff
✦ destroys orthogonality of residuals
✦ prevents computation of exact solution after n steps.
Numerical instability (→ Def. 1.5.85) ➣ pointless to (try to) use CG as direct solver!
Practice: CG is used for large n as an iterative solver: x(k) for some k ≪ n is expected to provide a good
approximation of x∗ .
CG (Code 10.2.18) & gradient method (Code 10.1.12) for LSE with sparse s.p.d. “Poisson matrix”
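A possible setup for such an experiment, sketched with Eigen's sparse matrices and its built-in
ConjugateGradient solver (the m×m 5-point-stencil "Poisson matrix" is assembled by hand here; this is
our own illustration, not Code 10.2.18 or Code 10.1.12).

#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>
#include <iostream>
#include <vector>

int main() {
  const int m = 10, n = m * m;               // "Poisson matrix" of size n = m^2
  std::vector<Eigen::Triplet<double>> trip;
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < m; ++j) {
      const int k = i * m + j;               // lexicographic numbering of grid points
      trip.emplace_back(k, k, 4.0);          // diagonal of the 5-point stencil
      if (j > 0)     trip.emplace_back(k, k - 1, -1.0);
      if (j < m - 1) trip.emplace_back(k, k + 1, -1.0);
      if (i > 0)     trip.emplace_back(k, k - m, -1.0);
      if (i < m - 1) trip.emplace_back(k, k + m, -1.0);
    }
  Eigen::SparseMatrix<double> A(n, n);
  A.setFromTriplets(trip.begin(), trip.end());

  Eigen::VectorXd b = Eigen::VectorXd::Ones(n);
  Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
                           Eigen::Lower | Eigen::Upper> cg;
  cg.setTolerance(1e-10);
  cg.compute(A);
  Eigen::VectorXd x = cg.solve(b);
  std::cout << "#iterations = " << cg.iterations()
            << ", estimated error = " << cg.error() << std::endl;
}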
(Fig. 380: spy plot of the sparse s.p.d. Poisson matrix for m = 10, n = 100, nz = 460;
Fig. 381: its eigenvalues on a logarithmic scale.
Further plots: normalized (!) 2-norms recorded during the CG and gradient-method iterations.)
• CG is much faster than the gradient method (as expected, because it has "memory").
• Both CG and the gradient method converge more slowly for larger Poisson matrices.
The minimum in (10.2.24) can be bounded for λ ∈ [λmin(A), λmax(A)] by inserting suitable "polynomial
candidates". Tool: Chebychev polynomials (→ Section 6.1.3.1) ➣ they lead to the following estimate
[?, Satz 9.4.2], [?, Satz 13.29]:
Theorem 10.2.25. Convergence of the CG method

The iterates of the CG method for solving Ax = b (see Code 10.2.18) with A = A⊤ s.p.d. satisfy

   ‖x − x(l)‖A  ≤  2 (1 − 1/√κ(A))^l (1 + 1/√κ(A))^l / [ (1 + 1/√κ(A))^{2l} + (1 − 1/√κ(A))^{2l} ] · ‖x − x(0)‖A

                ≤  2 ( (√κ(A) − 1) / (√κ(A) + 1) )^l · ‖x − x(0)‖A .
The estimate of this theorem confirms asymptotic linear convergence of the CG method (→ Def. 8.1.9)
with a rate of (√κ(A) − 1)/(√κ(A) + 1).
Plots of bounds for error reduction (in energy norm) during CG iteration from Thm. 10.2.25:
(Surface and contour plots: bound for the error reduction factor in the energy norm as a function of
κ(A)^{1/2} and the CG step l, for l = 1, . . . , 10 and κ(A)^{1/2} up to 100.)
(Plot: measured convergence rate of CG vs. cond2(A).) The rate L is read off from the reduction of the
residual norms:

   ‖rk+1‖2 ≈ L ‖rk‖2   ⇒   ‖rk+m‖2 ≈ L^m ‖rk‖2 .
(Fig.: left, diagonal entries (eigenvalue distributions) of the test matrices #1–#4 vs. entry index;
right, 2-norms of error and residual vs. number of CG steps for each of the matrices #1–#4.)
✞ ☎
(in stark contrast to the behavior of the gradient method, see Ex. 10.1.17)
✝ ✆
CG convergence boosted by clustering of eigenvalues
10.3 Preconditioning [?, Sect. 13.5], [?, Ch. 10], [?, Sect. 4.3.5]
Idea:  Preconditioning

Apply the CG method to the transformed linear system

   Ã x̃ = b̃ ,   Ã := B^{−1/2} A B^{−1/2} ,   x̃ := B^{1/2} x ,   b̃ := B^{−1/2} b ,      (10.3.1)

with "small" κ(Ã);  B = B⊤ ∈ R^{n,n} s.p.d. =ˆ preconditioner.
Recall (10.1.15): for every B ∈ R^{n,n} with B⊤ = B there is an orthogonal matrix Q ∈ R^{n,n} such that
B = Q⊤DQ with a diagonal matrix D (→ Cor. 9.1.9, [?, Thm. 7.8], [?, Satz 9.15]). If B is s.p.d. the
(diagonal) entries of D are strictly positive and we can define

   D = diag(λ1 , . . . , λn ) ,  λi > 0   ⇒   D^{1/2} := diag(√λ1 , . . . , √λn ) .

This is generalized to

   B^{1/2} := Q⊤ D^{1/2} Q ,

and one easily verifies, using Q⊤ = Q^{−1}, that (B^{1/2})² = B and that B^{1/2} is s.p.d. In fact, these two
properties already determine B^{1/2} uniquely.
2. the evaluation of B−1 x is about as expensive (in terms of elementary operations) as the
matrix×vector multiplication Ax, x ∈ R n .
Recall: spectral condition number  κ(A) := λmax(A)/λmin(A) ,  see (10.1.21).

There are several equivalent ways to express that κ(B^{−1/2} A B^{−1/2}) is "small":
• κ(B^{−1}A) is "small",
  because the spectra agree, σ(B^{−1}A) = σ(B^{−1/2} A B^{−1/2}), due to similarity (→ Lemma 9.1.6).
☛ ✟
“Reader’s digest” version of Notion 10.3.3:
✡ ✠
S.p.d. B preconditioner :⇔ B−1 = cheap approximate inverse of A
Problem: B^{1/2}, which occurs prominently in (10.3.1), is usually not available with acceptable computational
costs.

However, when the CG method is formally applied to the transformed system from (10.3.1), it becomes
apparent that, after suitable transformation of the iteration variables pj and rj , B^{1/2} and B^{−1/2} invariably
occur in the products B^{−1/2}B^{−1/2} = B^{−1} and B^{1/2}B^{−1/2} = I. Thus, thanks to this intrinsic
transformation, square roots of B are not required for the implementation!
CG for Ã x̃ = b̃ :
   Input : initial guess x̃(0) ∈ R^n ;   Output : approximate solution x̃(l) ∈ R^n
   p̃1 := r̃0 := b̃ − B^{−1/2} A B^{−1/2} x̃(0) ;
   for j = 1 to l do {
      α := (p̃j⊤ r̃j−1) / (p̃j⊤ B^{−1/2} A B^{−1/2} p̃j) ;
      x̃(j) := x̃(j−1) + α p̃j ;
      r̃j := r̃j−1 − α B^{−1/2} A B^{−1/2} p̃j ;
      p̃j+1 := r̃j − ((B^{−1/2} A B^{−1/2} p̃j)⊤ r̃j) / (p̃j⊤ B^{−1/2} A B^{−1/2} p̃j) · p̃j ;
   }

Equivalent CG with transformed variables:
   Input : initial guess x(0) ∈ R^n ;   Output : approximate solution x(l) ∈ R^n
   B^{1/2} r̃0 := B^{1/2} b̃ − A B^{−1/2} x̃(0) ;   B^{−1/2} p̃1 := B^{−1} (B^{1/2} r̃0) ;
   for j = 1 to l do {
      α := ((B^{−1/2} p̃j)⊤ B^{1/2} r̃j−1) / ((B^{−1/2} p̃j)⊤ A B^{−1/2} p̃j) ;
      B^{−1/2} x̃(j) := B^{−1/2} x̃(j−1) + α B^{−1/2} p̃j ;
      B^{1/2} r̃j := B^{1/2} r̃j−1 − α A B^{−1/2} p̃j ;
      B^{−1/2} p̃j+1 := B^{−1} (B^{1/2} r̃j)
                       − ((B^{−1/2} p̃j)⊤ A B^{−1} (B^{1/2} r̃j)) / ((B^{−1/2} p̃j)⊤ A B^{−1/2} p̃j) · B^{−1/2} p̃j ;
   }
(10.3.5) Preconditioned CG method (PCG) [?, Alg. 13.32], [?, Alg. 10.1]
r := b − Ax ;  p := B^{−1} r ;  q := p ;  τ0 := p⊤ r ;
for l = 1 to lmax do {
   β := r⊤ q ;  h := Ap ;  α := β / (p⊤ h) ;
   x := x + α p ;
   r := r − α h ;                                                             (10.3.6)
   q := B^{−1} r ;  β := (r⊤ q) / β ;
   if |q⊤ r| ≤ τ · τ0 then stop ;
   p := q + β p ;
}
✛                                                                             ✘
  Computational effort per step:  1 evaluation A×vector,  1 evaluation B^{−1}×vector,
                                  3 dot products,  3 AXPY-operations
✚                                                                             ✙
The assertions of Thm. 10.2.25 remain valid with κ(A) replaced by κ(B^{−1}A) and the energy norm based
on Ã instead of A.
   x = x̃ + triu(A)^{−1} (b − A x̃) .

   x = (LA^{−1} + UA^{−1} − UA^{−1} A LA^{−1}) b   ➤   B^{−1} = LA^{−1} + UA^{−1} − UA^{−1} A LA^{−1} .      (10.3.10)
For all these approaches the evaluation of B^{−1}r can be done with effort O(n) in the case of a sparse
matrix A (e.g. with O(1) non-zero entries per row). However, there is absolutely no guarantee that
κ(B^{−1}A) will be reasonably small; whether this can be expected depends crucially on A.
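For illustration, a sketch of how the application of B^{−1} from (10.3.10) could be realized in C++/Eigen
with two sparse triangular solves and one matrix×vector product (O(nnz(A)) work for sparse A); we
assume LA = tril(A) and UA = triu(A), both including the diagonal. This is our own illustration, not one
of the lecture codes.

#include <Eigen/Sparse>

// z = B^{-1} r for the preconditioner of (10.3.10):
// B^{-1} = L^{-1} + U^{-1} - U^{-1} A L^{-1}, with L = tril(A), U = triu(A).
Eigen::VectorXd applyInvB(const Eigen::SparseMatrix<double> &A,
                          const Eigen::VectorXd &r) {
  // y = L^{-1} r  (sparse forward substitution)
  Eigen::VectorXd y = A.triangularView<Eigen::Lower>().solve(r);
  // z = y + U^{-1} (r - A y)  realizes  L^{-1} r + U^{-1} r - U^{-1} A L^{-1} r
  return y + A.triangularView<Eigen::Upper>().solve(r - A * y);
}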
The Code 10.3.12 highlights the use of a preconditioner in the context of the PCG method; it only takes a
function that realizes the application of B−1 to a vector. In Line 10 of the code this function is passed as
function handle invB.
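Code 10.3.12 itself is not contained in this excerpt; the following C++/Eigen sketch conveys the same
idea for the loop (10.3.6): both A and the preconditioner are passed in procedural form, the latter as a
functor invB realizing r ↦ B^{−1}r. Names and interface are our own choices.

#include <Eigen/Dense>
#include <cmath>
#include <functional>

// Sketch of the preconditioned CG method (10.3.6). evalA and invB provide
// x -> A*x and r -> B^{-1}*r; tau is the tolerance in |q^T r| <= tau * tau_0.
Eigen::VectorXd pcg(const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &evalA,
                    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &invB,
                    const Eigen::VectorXd &b, Eigen::VectorXd x,
                    double tau, unsigned int lmax) {
  Eigen::VectorXd r = b - evalA(x);
  Eigen::VectorXd q = invB(r);
  Eigen::VectorXd p = q;
  const double tau0 = q.dot(r);
  for (unsigned int l = 1; l <= lmax; ++l) {
    double beta = r.dot(q);
    const Eigen::VectorXd h = evalA(p);
    const double alpha = beta / p.dot(h);
    x += alpha * p;                 // update iterate
    r -= alpha * h;                 // update residual
    q = invB(r);                    // one application of the preconditioner
    beta = r.dot(q) / beta;
    if (std::abs(q.dot(r)) <= tau * tau0) break;   // termination criterion
    p = q + beta * p;
  }
  return x;
}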
(Fig. 384, Fig. 385: B-norm of the residuals and A-norm of the error vs. number of (P)CG steps, for plain
CG and PCG with n = 50, 100, 200.)
       n    #CG steps   #PCG steps
      16        8            3
      32       16            3
      64       25            4
     128       38            4
     256       66            4
     512      106            4
    1024      149            4
    2048      211            4
    4096      298            3
    8192      421            3
   16384      595            3
   32768      841            3

(Fig. 386: number of (P)CG steps vs. n, doubly logarithmic scale.)
Clearly, in this example the tridiagonal part of the matrix is dominant for large n. In addition, its condition
number grows ∼ n², as is revealed by a closer inspection of the spectrum.
Preconditioning with the tridiagonal part suppresses this growth of the condition number of B^{−1}A and
ensures fast convergence of the preconditioned CG method.
   1/cond(A) · ‖rl‖/‖r0‖  ≤  ‖x(l) − x∗‖ / ‖x(0) − x∗‖  ≤  cond(A) · ‖rl‖/‖r0‖ .      (10.2.20)
(10.2.20) ➣ estimate for the 2-norm of the transformed iteration errors:  ‖ẽ(l)‖2² = (e(l))⊤ B e(l) .

Analogous to (10.2.20), estimates for the energy norm (→ Def. 10.1.1) of the error e(l) := x − x(l), x∗ := A^{−1}b:

   1/κ(B^{−1}A) · ‖e(l)‖A² / ‖e(0)‖A²  ≤  ((B^{−1}rl)⊤ rl) / ((B^{−1}r0)⊤ r0)  ≤  κ(B^{−1}A) · ‖e(l)‖A² / ‖e(0)‖A² .      (10.3.14)
Theorem 10.4.1.

Note: similar formula for the (linear) rate of convergence as for CG, see Thm. 10.2.25, but with √κ(A)
replaced by κ(A) !
➤ GMRES method for general matrices A ∈ R n,n → [?, Ch. 16], [?, Sect. 4.4.2]
M ATLAB-function: • [x,flag,relr,it,rv] = gmres(A,b,rs,tol,maxit,B,[],x0);
• [. . .] = gmres(Afun,b,rs,tol,maxit,Binvfun,[],x0);
After many steps of GMRES we face considerable computational costs and memory requirements for
every further step. Thus, the iteration may be restarted with the current iterate x(l ) as initial guess →
rs-parameter triggers restart after every rs steps (Danger: failure to converge).
Zoo of methods with short recursions (i.e. constant effort per step)
Computational costs : 2 A×vector, 2 B−1 ×vector, 4 dot products, 6 SAXPYs per step
Memory requirements: 8 vectors ∈ R n
Computational costs : 2 A×vector, 2 B−1 ×vector, 2 dot products, 12 SAXPYs per step
Memory requirements: 10 vectors ∈ R n
        ( 0  1  0  ···  ···  0 )
        ( 0  0  1   0        ⋮ )
   A =  ( ⋮      ⋱   ⋱   ⋱   ⋮ ) ∈ R^{n,n} ,   b = (0, . . . , 0, 1)⊤ = en   ➤   x = e1 .
        ( ⋮           ⋱  ⋱   0 )
        ( 0               0  1 )
        ( 1  0  ···  ···     0 )
☛ ✟
✡ ✠
TRY & PRAY
Example 10.4.4 (Convergence of Krylov subspace methods for non-symmetric system ma-
trix)
A = gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
B = gallery('tridiag',0.5*ones(n-1,1),2*ones(n,1),1.5*ones(n-1,1));

Plotted:  ‖rl‖2 / ‖r0‖2 :
(Two plots: relative 2-norm of the residual ‖rl‖2/‖r0‖2 vs. iteration step for bicgstab and qmr, for the two
test matrices defined above.)
Summary:
Advantages of Krylov methods vs. direct elimination (IF they converge at all/sufficiently fast).
• They require system matrix A in procedural form y=evalA(x) ↔ y = Ax only.
• They can perfectly exploit sparsity of system matrix.
• They can cash in on low accuracy requirements (IF viable termination criterion available).
• They can benefit from a good initial guess.
Chapter 11
Numerical Integration – Single Step Methods

Contents
11.1 Initial value problems (IVP) for ODEs . . . . . . . . . . . . . . . . . . . . . . . . . 698
11.1.1 Modeling with ordinary differential equations: Examples . . . . . . . . . . . 699
11.1.2 Theory of initial value problems . . . . . . . . . . . . . . . . . . . . . . . . . 703
11.1.3 Evolution operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
11.2 Introduction: Polygonal Approximation Methods . . . . . . . . . . . . . . . . . . 709
11.2.1 Explicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
11.2.2 Implicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
11.2.3 Implicit midpoint method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
11.3 General single step methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
11.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
11.3.2 Convergence of single step methods . . . . . . . . . . . . . . . . . . . . . . . 717
11.4 Explicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
11.5 Adaptive Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
In our parlance, a (first-order) ordinary differential equation (ODE) is an equation of the form

   ẏ = f(t, y) ,      (11.1.2)

with a right hand side function f : I × D → R^d defined on the extended state space I × D, where I ⊂ R is
a (time) interval and D ⊂ R^d is the state space.
In the context of mathematical modeling the state vector y ∈ R d is supposed to provide a complete (in the
sense of the model) description of a system. Then (11.1.2) models a finite-dimensional dynamical system.
A solution of the ODE ẏ = f(t, y) with continuous right hand side function f is a continuously
differentiable function “of time t” y : J ⊂ I → D, defined on an open interval J , for which ẏ(t) =
f(t, y(t)) holds for all t ∈ J .
A solution describes a continuous trajectory in state space, a one-parameter family of states, parameter-
ized by time.
It goes without saying that smoothness of the right hand side function f is inherited by solutions of the
ODE:
Supplementary reading. Some grasp of the meaning and theory of ordinary differential equa-
tions (ODEs) is indispensable for understanding the construction and properties of numerical meth-
ods. Relevant information can be found in [?, Sect. 5.6, 5.7, 6.5].
Example 11.1.5 (Growth with limited resources [?, Sect. 1.1], [?, Ch. 60])
   y(t) = α y0 / ( β y0 + (α − β y0) exp(−αt) )   for all t ∈ R .      (11.1.7)
Note that by fixing the initial value y(0) we can single out a unique representative from the family of
solutions. This will turn out to be a general principle, see Section 11.1.2.
An ODE of the from ẏ = f(y), that is, with a right hand side function that does not depend on time,
but only on state, is called autonomous.
For an autonomous ODE the right hand side function defines a vector field (“velocity field”) y 7→ f(y) on
state space.
Example 11.1.9 (Predator-prey model [?, Sect. 1.1],[?, Sect. 1.1.1],[?, Ch. 60], [?, Ex. 11.3])
Predators and prey coexist in an ecosystem. Without predators the population of prey would be gov-
erned by a simple exponential growth law. However, the growth rate of prey will decrease with increasing
numbers of predators and, eventually, become negative. Similar considerations apply to the predator
population and lead to an ODE model.
population densities:   u(t) → density of prey at time t,   v(t) → density of predators at time t.

Solution curves are trajectories of particles carried along by the velocity field f.

(Fig. 389: solution t ↦ (u(t), v(t)) for y0 := (u(0), v(0)) = (4, 2);  Fig. 390: solution curves of (11.1.10)
in the (u, v)-phase plane, with the stationary point marked.  Parameter values for Fig. 389, 390:
α = 2, β = 1, δ = 1, γ = 1.)
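A sketch of the corresponding right hand side as C++ code, assuming the standard Lotka–Volterra form
u̇ = (α − βv)u, v̇ = (δu − γ)v; this particular form is our assumption, chosen to match the parameters
α, β, γ, δ and the qualitative description above.

#include <Eigen/Dense>

int main() {
  const double alpha = 2.0, beta = 1.0, gamma = 1.0, delta = 1.0;
  // autonomous right hand side f(y) of the predator-prey model,
  // state vector y = (u, v) = (prey density, predator density)
  auto f = [=](const Eigen::Vector2d &y) -> Eigen::Vector2d {
    return Eigen::Vector2d((alpha - beta * y(1)) * y(0),
                           (delta * y(0) - gamma) * y(1));
  };
  Eigen::Vector2d y0(4.0, 2.0);   // initial state (u(0), v(0)) as in Fig. 389
  Eigen::Vector2d dy = f(y0);     // one evaluation of the vector field
  (void)dy;
}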
Phenomenological model:    l̇ = −(l³ − αl + p) ,
                           ṗ = β l ,                                          (11.1.12)

with parameters:  α =ˆ pre-tension of the muscle fiber,  β =ˆ (phenomenological) feedback parameter.
This is the so-called Zeeman model: it is a phenomenological model entirely based on macroscopic
observations without relying on knowledge about the underlying molecular mechanisms.
(Fig. 391/392: phase flow in the (l, p)-plane and solutions l(t), p(t) of the Zeeman model for one set of
parameters; Fig. 393/394: phase flow and "heartbeat" solutions l(t), p(t) for α = 0.5, β = 0.1.)
In Chapter 1 and Chapter 8 we discussed circuit analysis as a source of linear and non-linear systems of
equations, see Ex. 2.1.3 and Ex. 8.0.1. In the former example we admitted time-dependent currents and
potentials, but dependence on time was confined to be “sinusoidal”. This enabled us to switch to frequency
domain, see (2.1.6), which gave us a complex linear system of equations for the complex nodal potentials.
Yet, this trick is only possible for linear circuits. In the general case, circuits have to be modelled by ODEs
connecting time-dependent potentials and currents. This will be briefly explained now.
   diR/dt (t) − diL/dt (t) − diC/dt (t) = 0 ,

and plug in the above constitutive relations for the circuit elements:

   R^{−1} duR/dt (t) − L^{−1} uL(t) − C d²uC/dt² (t) = 0 .
We continue following the policy of nodal analysis and express all voltages by potential differences between
nodes of the circuit.
u R ( t ) = Us ( t ) − u ( t ) , u C ( t ) = u ( t ) − 0 , u L ( t ) = u ( t ) − 0 .
For this simple circuit there is only one node with unknown potential, see Fig. 395. Its time-dependent
potential will be denoted by u(t) and this is the unknown of the model, a function of time obeying the
ordinary differential equation
   R^{−1} (U̇s(t) − u̇(t)) − L^{−1} u(t) − C d²u/dt² (t) = 0 .
This is a 2nd-order ordinary differential equation (non-autonomous whenever the source voltage Us varies in time):
The attribute “2nd-order” refers to the occurrence of a second derivative with respect to time.
A generic initial value problem (IVP) for a first-order ordinary differential equation (ODE) (→ [?,
Sect. 5.6], [?, Sect. 11.1]) can be stated as: find a function y : I → D that satisfies, cf. Def. 11.1.3,

   ẏ(t) = f(t, y(t)) ,   y(t0) = y0 .      (11.1.20)
Recall Def. 11.1.8: an ODE ẏ = f(y) is autonomous if its right hand side f does not depend on
time t.
Hence, for autonomous ODEs we have I = R and the right hand side function y 7→ f(y) can be regarded
as a stationary vector field (velocity field), see Fig. 388 or Fig. 391.
An important observation: If t 7→ y(t) is a solution of an autonomous ODE, then, for any τ ∈ R , also the
shifted function t 7→ y(t − τ ) is a solution.
➣ For initial value problems for autonomous ODEs the initial time is irrelevant and therefore we can
always make the canonical choice t0 = 0.
Autonomous ODEs naturally arise when modeling time-invariant systems or phenomena. All examples for
Section 11.1.1 belong to this class.
In fact, autonomous ODEs already represent the general case, because every ODE can be converted into
an autonomous one:
Remark 11.1.23 (From higher order ODEs to first order systems [?, Sect. 11.2])
✎ Notation: superscript (n) =ˆ n-th temporal derivative d^n/dt^n .

No special treatment of higher order ODEs is necessary, because (11.1.24) can be turned into a 1st-order
ODE (a system of size nd) by adding all derivatives up to order n − 1 as additional components to the
state vector. This extended state vector z(t) ∈ R^{nd} is defined as

   z(t) := ( y(t), y(1)(t), . . . , y(n−1)(t) )⊤ = ( z1 , z2 , . . . , zn )⊤ ∈ R^{dn} :

   (11.1.24)  ↔  ż = g(z) ,   g(z) := ( z2 , z3 , . . . , zn , f(t, z1 , . . . , zn) )⊤ .      (11.1.25)
Note that the extended system requires initial values y(t0 ), ẏ(t0 ), . . . , y(n−1) (t0 ): for ODEs of order n ∈
N well-posed initial value problems need to specify initial values for the first n − 1 derivatives.
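A minimal C++ sketch of this reduction for a scalar 2nd-order ODE ÿ = f(t, y, ẏ) (d = 1, n = 2): the
extended state is z = (y, ẏ) and the first-order right hand side g is assembled as a lambda. Function
names are our own.

#include <Eigen/Dense>
#include <functional>

// Turn a scalar second-order ODE  y'' = f(t, y, y')  into a first-order
// system  z' = g(t, z)  with extended state z = (z1, z2) = (y, y').
std::function<Eigen::Vector2d(double, const Eigen::Vector2d &)>
toFirstOrder(const std::function<double(double, double, double)> &f) {
  return [f](double t, const Eigen::Vector2d &z) -> Eigen::Vector2d {
    return Eigen::Vector2d(z(1),               // z1' = z2
                           f(t, z(0), z(1)));  // z2' = f(t, y, y')
  };
}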
Now we review results about existence and uniqueness of solutions of initial value problems for first-order
ODEs. These are surprisingly general and do not impose severe constraints on right hand side functions.
The property of local Lipschitz continuity means that the function (t, y) 7→ f(t, y) has “locally finite slope”
in y.
The meaning of local Lipschitz continuity is best explained by giving an example of a function that fails to
possess this property.
Consider the square root function t ↦ √t on the closed interval [0, 1]. Its slope at t = 0 is infinite and so
it is not locally Lipschitz continuous on [0, 1].
However, if we consider the square root on the open interval ]0, 1[, then it is locally Lipschitz continuous
there.
The next lemma gives a simple criterion for local Lipschitz continuity, which can be proved by the mean
value theorem, cf. the proof of Lemma 8.2.12.
If f and Dy f are continuous on the extended state space Ω, then f is locally Lipschitz continuous
(→ Def. 11.1.28).
Theorem 11.1.32. Theorem of Peano & Picard-Lindelöf [?, Satz II(7.6)], [?, Satz 6.5.1], [?,
Thm. 11.10], [?, Thm. 73.1]
If the right hand side function f : Ω̂ 7→ R d is locally Lipschitz continuous (→ Def. 11.1.28) then
for all initial conditions (t0 , y0 ) ∈ Ω̂ the IVP (11.1.20) has a solution y ∈ C1 ( J (t0 , y0 ), R d ) with
maximal (temporal) domain of definition J (t0 , y0 ) ⊂ R .
Notation: for autonomous ODE we always have t0 = 0, and therefore we write J (y0 ) := J (0, y0 ).
Let us explain the still mysterious "maximal domain of definition" in the statement of Thm. 11.1.32. It is
related to the fact that every solution of an initial value problem (11.1.33) has its own largest possible time
interval J(y0) ⊂ R on which it is defined naturally.
As an example we consider the autonomous scalar (d = 1) initial value problem, modeling “explosive
growth” with a growth rate increasing linearly with the density:
ẏ = y2 , y(0) = y0 ∈ R . (11.1.36)
The solution is

   y(t) = 1/(y0^{−1} − t)   if y0 ≠ 0 ,      y(t) = 0   if y0 = 0 ,      (11.1.37)

with domains of definition

   J(y0) = ]−∞, y0^{−1}[   if y0 > 0 ,      J(y0) = R   if y0 = 0 ,      J(y0) = ]y0^{−1}, ∞[   if y0 < 0 .

(Fig. 396: solutions y(t) for y0 = −0.5 and y0 = 0.5.)
In this example, for y0 > 0 the solution experiences a blow-up in finite time and ceases to exists afterwards.
For the sake of simplicity we restrict the discussion to autonomous IVPs (11.1.33) with locally Lipschitz
continuous right hand side and make the following assumption. A more general treatment is given in [?].
Now we return to the study of a generic ODE (11.1.2) instead of an IVP (11.1.20). We do this by temporarily
changing the perspective: we fix a “time of interest” t ∈ R \ {0} and follow all trajectories for the duration
t. This induces a mapping of points in state space:
   ➣ mapping  Φ^t : D → D ,  y0 ↦ y(t) ,   where t ↦ y(t) is the solution of the IVP (11.1.33) .
This is a well-defined mapping of the state space into itself, by Thm. 11.1.32 and Ass. 11.1.38.
Now, we may also let t vary, which spawns a family of mappings Φ t of the state space into itself.
However, it can also be viewed as a mapping with two arguments, a duration t and an initial state value
y0 !
The mapping Φ : R × D → D, (t, y0) ↦ Φ^t y0 := y(t), where t ↦ y(t) ∈ C¹(R, R^d) is the unique (global)
solution of the IVP ẏ = f(y), y(0) = y0 , is the evolution operator/mapping for the autonomous ODE
ẏ = f(y).

Note that t ↦ Φ^t y0 describes the solution of ẏ = f(y) for y(0) = y0 (a trajectory). Therefore, by virtue
of the definition, we have

   ∂Φ/∂t (t, y) = f(Φ^t y) .
For d = 2 the action of an evolution operator can be visualized by tracking the movement of point sets in
state space. Here this is done for the Lotka-Volterra ODE (11.1.10):
(Fig. 397: trajectories t ↦ Φ^t y0 in the (u, v)-plane;  Fig. 398: flow map for the Lotka-Volterra system
(α = 2, β = γ = δ = 1): images of a point set X under the state mappings y ↦ Φ^t y for
t = 0, 0.5, 1, 1.5, 2, 3.)
Under Ass. 11.1.38 the evolution operator gives rise to a group of mappings D → D:  Φ^s ∘ Φ^t = Φ^{s+t} ,  Φ^0 = Id .
This is a consequence of the uniqueness theorem Thm. 11.1.32. It is also intuitive: following an evolution
up to time t and then for some more time s leads us to the same final state as observing it for the whole
time s + t.
We target an initial value problem (11.1.20) for a first-order ordinary differential equation
As usual, the right hand side function f may be given only in procedural form, in M ATLAB as
function v = f(t,y),
or in a C++ code as an object providing an evaluation operator, see Rem. 5.1.6. An evaluation of f may
involve costly computations.
Two basic tasks can be identified in the field of numerical integration = approximate solution of initial value
problems for ODEs (Please distinguish from “numerical quadrature”, see Chapter 7.):
(I) Given initial time t0 , final time T , and initial state y0 compute an approximation of y(T ), where
t 7→ y(t) is the solution of (11.1.20). A corresponding function in C++ could look like
State solveivp( double t0, double T,State y0);
Here statedim is the dimension d of the state space that has to be known at compile time.
(II) Output an approximate solution t → yh (t) of (11.1.20) on [t0 , T ] up to final time T 6= t0 for “all
times” t ∈ [t0 , T ] (actually for many times t0 = τ0 < τ1 < τ2 < · · · < τm−1 < τm = T
consecutively): “plot solution”!
s t d :: v e c t o r <State>
solveivp(State y0, const s t d :: v e c t o r < double > &tauvec);
This section presents three methods that provide a piecewise linear, that is, “polygonal” approximation of
solution trajectories t 7→ y(t), cf. Ex. 5.1.10 for d = 1.
As in Section 6.5.1 the polygonal approximation in this section will be based on a (temporal) mesh (→
§ 6.5.1)
covering the time interval of interest between initial time t0 and final time T > t0 . We assume that the
interval of interest is contained in the domain of definition of the solution of the IVP: [t0 , T ] ⊂ J (t0 , y0 ).
For d = 1 polygonal methods can be constructed by geometric considerations in the t-y plane, a model
for the extended state space. We explain this for the Riccati differential equation, a scalar ODE:

   ẏ = y² + t²   ➤   d = 1 ,  I, D = R⁺ .      (11.2.5)

(Fig. 399, Fig. 400: tangent field and solution curves of the Riccati ODE in the t-y plane.)
Temporal mesh:  M := { tj := j/5 : j = 0, . . . , 5 } ,  applied to the Riccati ODE

   ẏ = y² + t² .      (11.2.5)

Here: y0 = 1/2, t0 = 0, T = 1.  ✄ Fig. 402:  — =ˆ "Euler polygon" for uniform timestep h = 0.2,
↦ =ˆ tangent field of the Riccati ODE, together with the exact solution.
Formula: When applied to a general IVP of the form (11.1.20) the explicit Euler method generates a
sequence (yk)_{k=0}^{N} by the recursion

   yk+1 = yk + hk f(tk , yk) ,   k = 0, . . . , N − 1 ,   hk := tk+1 − tk .      (11.2.7)

One can obtain (11.2.7) by approximating the derivative d/dt by a forward difference quotient on the
(temporal) mesh M := {t0 , t1 , . . . , tN}:

   ẏ = f(t, y)   ←→   (yk+1 − yk)/hk = f(tk , yh(tk)) ,   k = 0, . . . , N − 1 .      (11.2.9)
Difference schemes follow a simple policy for the discretization of differential equations: replace all deriva-
tives by difference quotients connecting solution values on a set of discrete points (the mesh).
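A minimal C++/Eigen realization of the explicit Euler recursion (11.2.7) on a prescribed temporal mesh
(a sketch with our own function name, not one of the lecture codes):

#include <Eigen/Dense>
#include <functional>
#include <vector>

// Explicit Euler method (11.2.7) on the temporal mesh t[0] < ... < t[N];
// returns the sequence y_0, ..., y_N of approximate states.
std::vector<Eigen::VectorXd>
explicitEuler(const std::function<Eigen::VectorXd(double, const Eigen::VectorXd &)> &f,
              const std::vector<double> &t, const Eigen::VectorXd &y0) {
  std::vector<Eigen::VectorXd> y{y0};
  for (std::size_t k = 0; k + 1 < t.size(); ++k) {
    const double h = t[k + 1] - t[k];                // local stepsize h_k
    y.push_back(y.back() + h * f(t[k], y.back()));   // y_{k+1} = y_k + h_k f(t_k, y_k)
  }
  return y;
}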
To begin with, the explicit Euler recursion (11.2.7) produces a sequence y0 , . . . , y N of states. How does it
deliver on the task (I) and (II) stated in § 11.2.1? By “geometric insight” we expect
yk ≈ y(tk ) .
(As usual, we use the notation t 7→ y(t) for the exact solution of an IVP.)
Task (II): The trajectory t ↦ y(t) is approximated by the piecewise linear function ("Euler polygon")

   yh : [t0 , tN] → R^d ,   yh(t) := yk (tk+1 − t)/(tk+1 − tk) + yk+1 (t − tk)/(tk+1 − tk)   for t ∈ [tk , tk+1] ,      (11.2.11)
see Fig. 402. This function can easily be sampled on any grid of [t0 , t N ]. In fact, it is the M-piecewise
linear interpolant of the data points (tk , yk ), k = 0, . . . , N , see Section 5.3.2).
The same considerations apply to the methods discussed in the next two sections and will not be repeated
there.
Why a forward difference quotient and not a backward difference quotient? Let's try! Using the backward
difference quotient we obtain

   ẏ = f(t, y)   ←→   (yk+1 − yk)/hk = f(tk+1 , yh(tk+1)) ,   k = 0, . . . , N − 1 ,      (11.2.12)

which yields the recursion of the implicit Euler method

   yk+1 = yk + hk f(tk+1 , yk+1) ,   k = 0, . . . , N − 1 .      (11.2.13)

Note: (11.2.13) requires solving a (possibly non-linear) system of equations to obtain yk+1 !
(➤ Terminology "implicit")
Geometry of the implicit Euler method (✁ Fig. 403): approximate the solution through (t0 , y0) on [t0 , t1] by
• a straight line through (t0 , y0)
• with slope f(t1 , y1).
(In Fig. 403: — =ˆ trajectory through (t0 , y0),  — =ˆ trajectory through (t1 , y1),  — =ˆ tangent to the latter in (t1 , y1).)
Issue: Is (11.2.13) well defined, that is, can we solve it for yk+1 and is this solution unique?
Intuition: for small timesteps h > 0 the right hand side of (11.2.13) is a “small perturbation of the identity”.
Formal: Consider an autonomous ODE ẏ = f(y), assume a continuously differentiable right hand side
function f, f ∈ C¹(D, R^d), and regard (11.2.13) as an h-dependent non-linear system of equations:

   G(h, z) := z − h f(tk+1 , z) − yk = 0 .

To investigate the solvability of this non-linear equation we start with an observation about a partial
derivative of G:

   dG/dz (h, z) = I − h Dy f(tk+1 , z)   ⇒   dG/dz (0, z) = I .
In addition, G (0, yk ) = 0. Next, recall the implicit function theorem [?, Thm. 7.8.1]:
If the Jacobian ∂G/∂y (p0) ∈ R^{ℓ,ℓ} is invertible, then there is an open neighborhood U of x0 ∈ R^k and
a continuously differentiable function g : U → R^ℓ with g(x0) = y0 such that G(x, g(x)) = 0 for all x ∈ U.
For sufficiently small |h| it permits us to conclude that the equation G (h, z) = 0 defines a continuous
function g = g(h) with g(0) = yk .
➣ for sufficiently small h > 0 the equation (11.2.13) has a unique solution yk+1 .
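For illustration, a sketch of one implicit Euler step (11.2.13) for an autonomous ODE: the non-linear
equation z − h f(z) − yk = 0 is solved by a plain Newton iteration started from the explicit Euler
predictor; the Jacobian Df of f must be supplied. Interface, iteration count and tolerance are our own
choices, not part of the lecture codes.

#include <Eigen/Dense>
#include <functional>

// One step of the implicit Euler method (11.2.13) for an autonomous ODE:
// solve z - h*f(z) - y = 0 for z = y_{k+1} by Newton's method.
Eigen::VectorXd implicitEulerStep(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd &)> &Df,
    const Eigen::VectorXd &y, double h, double tol = 1e-10) {
  Eigen::VectorXd z = y + h * f(y);   // initial guess: explicit Euler predictor
  for (int i = 0; i < 20; ++i) {
    Eigen::VectorXd G = z - h * f(z) - y;                       // G(h, z)
    Eigen::MatrixXd DG = Eigen::MatrixXd::Identity(y.size(), y.size())
                         - h * Df(z);                           // dG/dz = I - h Df(z)
    Eigen::VectorXd s = DG.lu().solve(G);                       // Newton correction
    z -= s;
    if (s.norm() <= tol * z.norm()) break;                      // correction-based stop
  }
  return z;
}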
Besides using forward or backward difference quotients, the derivative ẏ can also be approximated by the
symmetric difference quotient, see also (5.2.44),

   ẏ(t) ≈ ( y(t + h) − y(t − h) ) / (2h) .      (11.2.16)

The idea is to apply this formula at t = ½(tk + tk+1) with h = hk/2, which transforms the ODE into

   ẏ = f(t, y)   ←→   (yk+1 − yk)/hk = f( ½(tk + tk+1), yh(½(tk + tk+1)) ) ,   k = 0, . . . , N − 1 .      (11.2.17)

The trouble is that the value yh(½(tk + tk+1)) does not seem to be available, unless we recall that the
approximate trajectory t ↦ yh(t) is supposed to be piecewise linear, which implies yh(½(tk + tk+1)) =
½(yh(tk) + yh(tk+1)). This gives the recursion formula for the implicit midpoint method in analogy to
(11.2.7) and (11.2.13):

   yk+1 = yk + hk f( ½(tk + tk+1), ½(yk + yk+1) ) ,   k = 0, . . . , N − 1 ,      (11.2.18)
Now we fit the numerical schemes introduced in the previous section into a more general class of methods
for the solution of (autonomous) initial value problems (11.1.33) for ODEs. Throughout we assume that all
times considered belong to the domain of definition of the unique solution t → y(t) of (11.1.33), that is,
for T > 0 we take for granted [0, T ] ⊂ J (y0 ) (temporal domain of definition of the solution of an IVP is
explained in § 11.1.34).
11.3.1 Definition
If y0 is the initial value, then y1 := Ψ(h, y0) can be regarded as an approximation of y(h), the value
returned by the evolution operator (→ Def. 11.1.39) for ẏ = f(y) applied to y0 over the period h.
In a sense, the polygonal approximation methods are based on approximations of the evolution operator
associated with the ODE.
This is what every single step method does: it tries to approximate the evolution operator Φ for an ODE
by a mapping of the type (11.3.2).
➙ mapping Ψ from (11.3.2) is called discrete evolution.
The adjective “discrete” used above designates (components of) methods that attempt to approximate the
solution of an IVP by a sequence of finitely many states. “Discretization” is the process of converting an
ODE into a discrete model. This parlance is adopted for all procedures that reduce a “continuous model”
involving ordinary or partial differential equations to a form with a finite number of unknowns.
Above we identified the discrete evolutions underlying the polygonal approximation methods. Vice versa,
a mapping Ψ as given in (11.3.2) defines a single step method.
Definition 11.3.5. Single step method (for autonomous ODE) → [?, Def. 11.2]

Given a discrete evolution Ψ : Ω ⊂ R × D → R^d, the recursion

   yk+1 := Ψ(hk , yk) ,   hk := tk+1 − tk ,   k = 0, . . . , N − 1 ,

defines a single step method (SSM) for the autonomous IVP ẏ = f(y), y(0) = y0 on the interval
[0, T].
☞ In a sense, a single step method defined through its associated discrete evolution does not ap-
proximate a concrete initial value problem, but tries to approximate an ODE in the form of its
evolution operator.
In M ATLAB syntax a discrete evolutions can be incarnated by a function of the following form:
Ψh y ←→ function y1 = discevl(h,y0) .
( function y1 = discevl(@(y) rhs(y),h,y0) )
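In C++ a discrete evolution can analogously be incarnated as a functor (h, y) ↦ Ψ^h y, and a generic
single step method is then just a loop applying it over the mesh. A sketch with our own type and function
names:

#include <Eigen/Dense>
#include <functional>
#include <vector>

using State = Eigen::VectorXd;
using DiscEvl = std::function<State(double, const State &)>;  // y1 = Psi(h, y0)

// Generic single step method: y_{k+1} = Psi^{h_k} y_k on the mesh t[0..N]
std::vector<State> singleStepMethod(const DiscEvl &Psi,
                                    const std::vector<double> &t,
                                    const State &y0) {
  std::vector<State> y{y0};
  for (std::size_t k = 0; k + 1 < t.size(); ++k)
    y.push_back(Psi(t[k + 1] - t[k], y.back()));
  return y;
}

// Example: discrete evolution of the explicit Euler method for y' = f(y)
DiscEvl eulerEvolution(const std::function<State(const State &)> &f) {
  return [f](double h, const State &y) { return State(y + h * f(y)); };
}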
The concept of single step method according to Def. 11.3.5 can be generalized to non-autonomous ODEs,
which leads to recursions of the form:
yk+1 := Ψ(tk , tk+1 , yk ) , k = 0, . . . , N − 1 ,
for a discrete evolution operator Ψ defined on I × I × D.
All meaningful single step methods turn out to be modifications of the explicit Euler method (11.2.7).
   Ψ^h y = y + h ψ(h, y)   with   ψ : I × D → R^d continuous ,   ψ(0, y) = f(y) .      (11.3.9)
A single step method according to Def. 11.3.5 based on a discrete evolution of the form (11.3.9) is
called consistent with the ODE ẏ = f(y).
The discrete evolution Ψ and, hence, the function ψ = ψ(h, y) for the implicit midpoint method are defined
only implicitly, of course. Thus, consistency cannot immediately be seen from a formula for ψ.
   yk+1 = yk + h f( ½(tk + tk+1), ½(yk + yk+1) ) ,   k = 0, . . . , N − 1 .      (11.2.18)

Assume that the timestep h is sufficiently small. Then, for an autonomous ODE,

   yk+1  =(11.2.18)  yk + h f( ½(yk + yk+1) )  =  yk + h f( yk + ½ h f( ½(yk + yk+1) ) )  =:  yk + h ψ(h, yk) .
Since, by the implicit function theorem, yk+1 continuously depends on h and yk , ψ(h, yk ) has the desired
properties, in particular ψ(0, y) = f(y) is clear.
Many authors specify a single step method by writing down the first step for a general stepsize h
y1 = expression in y0 , h and f .
Actually, this fixes the underlying discrete evolution. Also this course will sometimes adopt this practice.
Here we resume and continue the discussion of Rem. 11.2.10 for general single step methods according
to Def. 11.3.5. Assuming unique solvability of the systems of equations faced in each step of an implicit
method, every single step method based on a mesh M = {0 = t0 < t1 < · · · < t N := T } produces a
finite sequence (y0 , y1 , . . . , y N ) of states, where the first agrees with the initial state y0 .
We expect that the states provide a pointwise approximation of the solution trajectory t → y(t):
yk ≈ y(tk ) , k = 1, . . . , N .
Thus task (I) from § 11.2.1, computing an approximation for y(T ), is again easy: output y N as an
approximation of y(T ).
Task (II) from § 11.2.1, computing the solution trajectory, requires interpolation of the data points (tk , yk )
using some of the techniques presented in Chapter 5. The natural option is M-piecewise polynomial
interpolation, generalizing the polygonal approximation (11.2.11) used in Section 11.2.
Note that from the ODE ẏ = f(y) the derivatives ẏh (tk ) = f(yk ) are available without any further
approximation. This facilitates cubic Hermite interpolation (→ Def. 5.4.1), which yields
   yh ∈ C¹([0, T]) :   yh|[tk−1 , tk] ∈ P3 ,   yh(tk) = yk ,   dyh/dt (tk) = f(yk) .
Summing up, an approximate trajectory t ↦ yh(t) is built in two stages: first compute the sequence (yk)
with the single step method, then post-process by interpolation of the data points (tk , yk).
Supplementary reading. See [?, Sect. 11.5] and [?, Sect. 11.3] for related presentations.
Errors in numerical integration are called discretization errors, cf. Rem. 11.3.4.
Depending on the objective of numerical integration as stated in § 11.2.1, different notions of discretization
error are appropriate.
(I) If only the solution at final time is sought, the discretization error is
ǫ N : = k y( T ) − y N k ,
(II) If we want to approximate the solution trajectory for (11.1.33), the discretization error is the function
     t ↦ e(t) := y(t) − yh(t) .
(III) Between (I) and (II) is the pointwise discretization error, which is the sequence (grid function)
     e : M → D ,   ek := y(tk) − yk ,   k = 0, . . . , N .      (11.3.15)
In this case we may consider the maximum error in the mesh points,  max_k ‖ek‖ ,
where ‖·‖ is a suitable vector norm on R^d, usually the Euclidean vector norm.
Once the discrete evolution Ψ associated with the ODE ẏ = f(y) is specified, the single step method
according to Def. 11.3.5 is fixed. The only way to control the accuracy of the solution y N or t 7→ yh (t) is
through the selection of the mesh M = {0 = t0 < t1 < · · · < t N = T }.
Hence we study convergence of single step methods for families of meshes {Mℓ } and track the decay of
(a norm) of the discretization error (→ § 11.3.14) as a function of the number N := ♯M of mesh points.
In other words, we examine h-convergence. We already did this in the case of piecewise polynomial
interpolation in Section 6.5.1 and composite numerical quadrature in Section 7.4.
When investigating asymptotic convergence of single step methods we often resort to families of equidis-
tant meshes of [0, T ]:
   M_N := { tk := (k/N)·T : k = 0, . . . , N } .      (11.3.17)

We also call this the use of uniform timesteps of size h := T/N.
✦ We consider the following IVP for the logistic ODE, see Ex. 11.1.5
✦ We apply explicit and implicit Euler methods (11.2.7)/(11.2.13) with uniform timestep h = 1/N ,
N ∈ {5, 10, 20, 40, 80, 160, 320, 640}.
✦ Monitored: Error at final time E(h) := |y(1) − y N |
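A sketch of how such a convergence study can be coded for the explicit Euler method (the exact logistic
solution serves as reference; the concrete values λ = 3, y(0) = 0.01 and the range of N are our choices
modelled on the experiment above, and the rate estimation via log2 of consecutive error quotients is our
own addition):

#include <cmath>
#include <iostream>

// Empirical convergence study for the scalar IVP y' = lambda*y*(1-y), y(0)=y0:
// refine the uniform timestep and monitor the error at final time T = 1.
int main() {
  const double lambda = 3.0, y0 = 0.01, T = 1.0;
  auto yexact = [&](double t) {
    return y0 / (y0 + (1.0 - y0) * std::exp(-lambda * t));  // exact logistic solution
  };
  double errOld = 0.0;
  for (int N = 5; N <= 640; N *= 2) {
    const double h = T / N;
    double y = y0;
    for (int k = 0; k < N; ++k) y += h * lambda * y * (1.0 - y);  // explicit Euler
    const double err = std::abs(y - yexact(T));                   // E(h)
    if (errOld > 0.0)
      std::cout << "h = " << h << ", E(h) = " << err
                << ", estimated rate = " << std::log2(errOld / err) << std::endl;
    errOld = err;
  }
}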
(Fig. 405, Fig. 406: error |y(1) − yN| vs. timestep h, doubly logarithmic, for the explicit (left) and implicit
(right) Euler method with λ = 1, 3, 6, 9; both error curves behave like O(h). A corresponding experiment
for the implicit midpoint method does better: its error decays like O(h²).)
Parlance: based on the observed rate of algebraic convergence, the two Euler methods are said to
“converge with first order”, whereas the implicit midpoint method is called “second-order con-
vergent”.
The observations made for polygonal timestepping methods reflect a general pattern: assume that the
right hand side f is sufficiently smooth. Then customary single step methods (→ Def. 11.3.5) will enjoy
algebraic convergence in the mesh-width; more precisely, see [?, Thm. 11.25],

there is a p ∈ N such that the sequence (yk)k generated by the single step method
for ẏ = f(t, y) on a mesh M := {t0 < t1 < · · · < tN = T} satisfies

   max_k ‖yk − y(tk)‖ ≤ C h^p   for sufficiently small mesh width h := max_k (tk − tk−1) ,      (11.3.20)

with a constant C > 0 that does not depend on the mesh.
The minimal integer p ∈ N for which (11.3.20) holds for a single step method when applied to an
ODE with (sufficiently) smooth right hand side, is called the order of the method.
As in the case of quadrature rules (→ Def. 7.3.1) their order is the principal intrinsic indicator for the
“quality” of a single step method.
(11.3.22) Convergence analysis for the explicit Euler method [?, Ch. 74]
We consider the explicit Euler method (11.2.7) on a mesh M := {0 = t0 < t1 < · · · < tN = T} for a
generic autonomous IVP (11.1.20) with sufficiently smooth and (globally) Lipschitz continuous f, that is,

   ∃ L > 0 :   ‖f(y) − f(z)‖ ≤ L ‖y − z‖   for all y, z ∈ D ,

and exact solution t ↦ y(t). Throughout we assume that solutions of ẏ = f(y) are defined on [0, T] for
all initial states y0 ∈ D.
We argue that in this context the abstraction pays off, because it helps elucidate a general technique for
the convergence analysis of single step methods.
Fundamental error splitting (cf. Fig. 408):

   ek+1 = Ψ^{hk} yk − Φ^{hk} y(tk)
        = [ Ψ^{hk} yk − Ψ^{hk} y(tk) ]     (propagated error)
        + [ Ψ^{hk} y(tk) − Φ^{hk} y(tk) ]  (one-step error) .      (11.3.25)
A generic one-step error expressed through continuous and discrete evolutions:

   τ(h, y) := Ψ^h y − Φ^h y .      (11.3.26)

Geometric considerations (Fig. 410, Fig. 411): the distance of a smooth curve and its tangent shrinks as
the square of the distance to the intersection point (locally the curve looks like a parabola in the ξ-η
coordinate system).
The geometric considerations can be made rigorous by analysis: recall Taylor's formula for a function
y ∈ C^{K+1} [?, Satz 5.5.1]:

   y(t + h) − y(t) = Σ_{j=1}^{K} (h^j / j!) y^{(j)}(t) + ∫_t^{t+h} ((t + h − τ)^K / K!) y^{(K+1)}(τ) dτ ,      (11.3.27)

where the remainder equals (y^{(K+1)}(ξ) / (K+1)!) h^{K+1} for some ξ ∈ [t, t + h]. We conclude that, if
y ∈ C²([0, T]), which is ensured for smooth f, see Lemma 11.1.4, then

   y(tk+1) − y(tk) = ẏ(tk) hk + ½ ÿ(ξk) hk² = f(y(tk)) hk + ½ ÿ(ξk) hk² ,

for some tk ≤ ξk ≤ tk+1. This leads to an expression for the one-step error from (11.3.26):
   ǫk ≤ Σ_{l=1}^{k} ( Π_{j=1}^{l−1} (1 + L hj) ) ρl ,   k = 1, . . . , N .      (11.3.31)

(11.3.31)  ⇒  ǫk ≤ Σ_{l=1}^{k} ( Π_{j=1}^{l−1} exp(L hj) ) ρl = Σ_{l=1}^{k} exp( L Σ_{j=1}^{l−1} hj ) ρl .

Note:  Σ_{j=1}^{l−1} hj ≤ T for final time T, and conclude

   ǫk ≤ exp(LT) Σ_{l=1}^{k} ρl ≤ exp(LT) max_l (ρl / hl) Σ_{l=1}^{k} hl ≤ T exp(LT) max_{l=1,...,k} hl · max_{t0 ≤ τ ≤ tk} ‖ÿ(τ)‖ .
We can summarize the insight gleaned through this theoretical analysis as follows:
✦ Error bound grows exponentially with the length T of the integration interval.
In the analysis of the global discretization error of the explicit Euler method in § 11.3.22 a one-step error
of size O(h2k ) led to a total error of O(h) through the effect of error accumulation over N ≈ h−1 steps.
This relationship remains valid for almost all single step methods:
Consider an IVP (11.1.20) with solution t ↦ y(t) and a single step method defined by the
discrete evolution Ψ (→ Def. 11.3.5). If the one-step error along the solution trajectory satisfies (Φ
is the evolution map associated with the ODE, see Def. 11.1.39)

   ‖Ψ^h y(t) − Φ^h y(t)‖ ≤ C h^{p+1}   uniformly for sufficiently small h ,

then the discretization error in the mesh points behaves like max_k ‖yk − y(tk)‖ = O(h^p).
A rigorous statement as a theorem would involve some particular assumptions on Ψ, which we do not
want to give here. These assumptions are satisfied, for instance, for all the methods presented in the
sequel.
Supplementary reading. [?, Sect. 11.6], [?, Ch. 76], [?, Sect. 11.8]
So far we only know first and second order methods from 11.2: the explicit and implicit Euler method
(11.2.7) and (11.2.13), respectively, are of first order, the implicit midpoint rule of second order. We
observed this in Ex. 11.3.18 and it can be proved rigorously for all three methods adapting the arguments
of § 11.3.22.
Thus, barring the impact of roundoff, the low-order polygonal approximation methods are guaranteed to
achieve any prescribed accuracy provided that the mesh is fine enough. Why should we need any other
timestepping schemes?
Remark 11.4.1 (Rationale for high-order single step methods cf. [?, Sect. 11.5.3])
We argue that the use of higher-order timestepping methods is highly advisable for the sake of efficiency.
The reasoning is very similar to that of Rem. 7.3.48, when we considered numerical quadrature. The
reader is advised to study that remark again.
As we saw in § 11.3.16 error bounds for single step methods for the solution of IVPs will inevitably feature
unknown constants “C > 0”. Thus they do not give useful information about the discretization error for
a concrete IVP and mesh. Hence, it is too ambitious to ask how many timesteps are needed so that
ky(T ) − y N k stays below a prescribed bound, cf. the discussion in the context of numerical quadrature.
The usual concept of "computational effort" for single step methods (→ Def. 11.3.5) is as follows:

Computational effort  ∼  total number of f-evaluations for approximately solving the IVP,
                      ∼  number of timesteps, if the evaluation of the discrete evolution Ψ^h (→ Def. 11.3.5)
                         requires a fixed number of f-evaluations,
                      ∼  h^{−1}, in the case of uniform timestep size h > 0 (equidistant mesh (11.3.17)).
Now, let us consider a single step method of order p ∈ N, employed with a uniform timestep hold . We
focus on the maximal discretization error in the mesh points, see § 11.3.14. As in (7.3.49) we assume that
the asymptotic error bounds are sharp:
Goal:   err(hnew) / err(hold)  =!  1/ρ   for a reduction factor ρ > 1 .

(11.3.20)  ⇒  (hnew / hold)^p  =!  1/ρ   ⇔   hnew = ρ^{−1/p} hold .
☞ the larger the order p, the less effort for a prescribed reduction of the error!
We remark that another (minor) rationale for using higher-order methods [?, Sect. 11.5.3]: curb impact of
roundoff errors (→ Section 1.5.3) accumulating during timestepping.
Now we will build a class of methods that are explicit and achieve orders p > 2. The starting point is
a simple integral equation satisfied by any solution t 7→ y(t) of an initial value problems for the ODE
ẏ = f(y):
   IVP:   ẏ(t) = f(t, y(t)) ,  y(t0) = y0   ⇒   y(t1) = y0 + ∫_{t0}^{t1} f(τ, y(τ)) dτ .
What error can we afford in the approximation of y(t0 + ci h) (under the assumption that f is Lipschitz
continuous)? We take the cue from the considerations in § 11.3.22.
Note that there is a factor h in front of the quadrature sum in (11.4.3). Thus, our goal can already be
achieved, if only
This is accomplished by a less accurate discrete evolution than the one we are about to build. Thus,
we can construct discrete evolutions of higher and higher order, in turns, starting with the explicit Euler
method. All these methods will be explicit, that is, y1 can be computed directly from point values of f.
Now we apply the boostrapping idea outlined above. We write kℓ ∈ R d for the approximations of y(t0 +
c i h ).
• Quadrature formula = trapezoidal rule (7.2.5):

   Q(f) = ½ (f(0) + f(1))   ↔   s = 2 :  c1 = 0, c2 = 1 ,  b1 = b2 = ½ ,      (11.4.5)

  and y(t1) approximated by an explicit Euler step (11.2.7):

   k1 = f(t0 , y0) ,   k2 = f(t0 + h, y0 + h k1) ,   y1 = y0 + (h/2)(k1 + k2) .      (11.4.6)

  (11.4.6) = explicit trapezoidal rule (for numerical integration of ODEs).

• Quadrature formula → simplest Gauss quadrature formula = midpoint rule (→ Ex. 7.2.3) & y(½(t0 + t1))
  approximated by an explicit Euler step (11.2.7):

   k1 = f(t0 , y0) ,   k2 = f(t0 + h/2 , y0 + (h/2) k1) ,   y1 = y0 + h k2 .      (11.4.7)

  (11.4.7) = explicit midpoint method (for numerical integration of ODEs) [?, Alg. 11.18].
We perform an empiric study of the order of the explicit single step methods constructed in Ex. 11.4.4.
✦ IVP: ẏ = 10y(1 − y) (logistic ODE (11.1.6)), y(0) = 0.01, T = 1,
✦ Explicit single step methods, uniform timestep h.
(Fig. 412: exact solution y(t) and approximate solutions for the explicit Euler, explicit trapezoidal and
explicit midpoint rules;  Fig. 413: errors |yh(1) − y(1)| vs. stepsize h, doubly logarithmic: O(h) for the
s = 1 explicit Euler method, O(h²) for the two s = 2 methods.)
Definition 11.4.9. Explicit Runge-Kutta method

For coefficients bi , aij , ci ∈ R, i, j = 1, . . . , s, the recursion

   ki := f( t0 + ci h , y0 + h Σ_{j=1}^{i−1} aij kj ) ,  i = 1, . . . , s ,      y1 := y0 + h Σ_{i=1}^{s} bi ki ,

defines an s-stage explicit Runge-Kutta single step method (RK-SSM) for the ODE ẏ = f(t, y).
The vectors ki ∈ R^d, i = 1, . . . , s, are called increments, h > 0 is the size of the timestep.
Recall Rem. 11.3.12 to understand how the discrete evolution for an explicit Runge-Kutta method is spec-
ified in this definition by giving the formulas for the first step. This is a convention widely adopted in the
literature about numerical methods for ODEs. Of course, the increments ki have to be computed anew in
each timestep.
The implementation of an s-stage explicit Runge-Kutta single step method according to Def. 11.4.9 is
straightforward: The increments ki ∈ R d are computed successively, starting from k1 = f(t0 + c1 h, y0 ).
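A generic single step coded directly from Def. 11.4.9, with the Butcher data (A, b, c) passed as dense
Eigen objects (a sketch with our own interface, not the ode45 class mentioned below):

#include <Eigen/Dense>
#include <functional>
#include <vector>

// One step of an explicit s-stage Runge-Kutta method given by its Butcher
// data (A strictly lower triangular, b, c): the increments k_i are computed
// successively, see Def. 11.4.9.
Eigen::VectorXd rkStep(
    const std::function<Eigen::VectorXd(double, const Eigen::VectorXd &)> &f,
    const Eigen::MatrixXd &A, const Eigen::VectorXd &b, const Eigen::VectorXd &c,
    double t0, const Eigen::VectorXd &y0, double h) {
  const int s = static_cast<int>(b.size());
  std::vector<Eigen::VectorXd> k(s);
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd tmp = y0;
    for (int j = 0; j < i; ++j) tmp += h * A(i, j) * k[j];  // only a_ij with j < i are used
    k[i] = f(t0 + c(i) * h, tmp);                           // increment k_i
  }
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) y1 += h * b(i) * k[i];        // y1 = y0 + h * sum b_i k_i
  return y1;
}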
Shorthand notation for (explicit) Runge-Kutta methods [?, (11.75)]: the Butcher scheme ✄

      c | A        c1 |  0
      --+---  :=   c2 | a21  0                                                 (11.4.11)
        | b⊤        ⋮ |  ⋮    ⋱   ⋱
                   cs | as1  ···  as,s−1  0
                   ---+-------------------
                      | b1   ···  bs−1    bs

(Note: A is a strictly lower triangular s × s-matrix.)
Note that in Def. 11.4.9 the coefficients bi can be regarded as weights of a quadrature formula on [0, 1]:
apply explicit Runge-Kutta single step method to “ODE” ẏ = f (t). The quadrature rule with these weights
and nodes c j will have order ≥ 1, if the weights add up to 1!
A Runge-Kutta single step method according to Def. 11.4.9 is consistent (→ Def. 11.3.10) with the
ODE ẏ = f(t, y) if and only if

   Σ_{i=1}^{s} bi = 1 .
Example 11.4.13 (Butcher schemes for some explicit RK-SSM [?, Sect. 11.6.1])
The following explicit Runge-Kutta single step methods are often mentioned in literature.
• Explicit Euler method (11.2.7):        0 | 0
                                         --+--            ➣ order = 1
                                           | 1

• Explicit trapezoidal rule (11.4.6):    0 | 0   0
                                         1 | 1   0
                                         --+--------      ➣ order = 2
                                           | 1/2 1/2

• Explicit midpoint rule (11.4.7):       0   | 0   0
                                         1/2 | 1/2 0
                                         ----+------      ➣ order = 2
                                             | 0   1

• Classical 4th-order RK-SSM:            0   | 0   0   0   0
                                         1/2 | 1/2 0   0   0
                                         1/2 | 0   1/2 0   0
                                         1   | 0   0   1   0
                                         ----+--------------      ➣ order = 4
                                             | 1/6 2/6 2/6 1/6

• Kutta's 3/8-rule:                      0   | 0    0   0   0
                                         1/3 | 1/3  0   0   0
                                         2/3 | −1/3 1   0   0
                                         1   | 1   −1   1   0
                                         ----+----------------    ➣ order = 4
                                             | 1/8  3/8 3/8 1/8
Hosts of (explicit) Runge-Kutta methods can be found in the literature, see for example the Wikipedia page.
They are stated in the form of Butcher schemes (11.4.11) most of the time.
Runge-Kutta single step methods of order p > 2 are not found by bootstrapping as in Ex. 11.4.4, because
the resulting methods would have quite a lot of stages compared to their order.
Rather one derives order conditions yielding large non-linear systems of equations for the coefficients aij
and bi in Def. 11.4.9, see [?, Sect .4.2.3] and [?, Ch. III]. This approach is similar to the construction of a
Gauss quadrature rule in Ex. 7.3.13. Unfortunately, the systems of equations are very difficult to solve and
no universal recipe is available. Nevertheless, through massive use of symbolic computation, Runge-Kutta
methods of order up to 19 have been constructed in this way.
The following table gives lower bounds for the number of stages needed to achieve order p for an explicit
Runge-Kutta method.
order p 1 2 3 4 5 6 7 8 ≥9
minimal no. s of stages 1 2 3 4 6 7 9 11 ≥ p+3
No general formula has been discovered. What is known is that for explicit Runge-Kutta single step
methods according to Def. 11.4.9

   order p ≤ number s of stages of the RK-SSM.
An implementation of an explicit embedded Runge-Kutta single-step method with adaptive stepsize con-
trol for solving an autonomous IVP is provided by the utility class ode45. The terms “embedded” and
“adaptive” will be explained in Section 11.5.
(i) StateType: type for vectors in state space V , e.g. a fixed size vector type of E IGEN:
Eigen::Matrix<double,N,1>, where N is an integer constant § 11.2.1.
(ii) RhsType: a functor type, see Section 0.2.3, for the right hand side function f; must match State-
Type, default type provided.
The functor for the right hand side f : D ⊂ V → V of the ODE ẏ = f(y) is specified as an argument of
the constructor.
2. T: the final time T , initial time t0 = 0 is assumed, because the class can deal with autonomous
ODEs only, recall § 11.1.21.
3. norm: a functor returning a suitable norm for a state vector. Defaults to E IGEN’s maximum vector
norm.
The method returns a vector of 2-tuples (yk , tk ) (note the order!), k = 0, . . . , N , of temporal mesh points
tk , t0 = 0, t N = T , see § 11.2.2, and approximate states yk ≈ y(tk ), where t 7→ y(t) stands for the
exact solution of the initial value problem.
The next self-explanatory code snippet uses the numerical integrator class ode45 for solving a scalar
autonomous ODE.
  for (auto state : states)
    std::cout << "t = " << state.second << ", y = " << state.first
              << ", |err| = " << fabs(state.first - y(state.second)) << std::endl;
}
M ATLAB provides a built-in numerical integrator based on explicit RK-SSM, see [?] and [?, Sect. 7.2]. Its
calling syntax is
[t,y] = ode45(odefun,tspan,y0);
tnew = t + hA(6);
if done, tnew = tfinal; end   % Hit end point exactly.
h = tnew - t;                 % Purify h.
ynew = y + f*hB(:,6);
% ... (stepsize control, see Sect. 11.5, dropped)
Chemical reaction kinetics is a field where ODE based models are very common. This example presents
a famous reaction with extremely abrupt dynamics. Refer to [?, Ch. 62] for more information about the
ODE-based modelling of kinetics of chemical reactions.
y1 := c(BrO3−):   ẏ1 = −k1 y1 y2 − k3 y1 y3 ,
y2 := c(Br−):     ẏ2 = −k1 y1 y2 − k2 y2 y3 + k5 y5 ,
y3 := c(HBrO2):   ẏ3 =  k1 y1 y2 − k2 y2 y3 + k3 y1 y3 − 2 k4 y3² ,        (11.5.3)
y4 := c(Org):     ẏ4 =  k2 y2 y3 + k4 y3² ,
y5 := c(Ce(IV)):  ẏ5 =  k3 y1 y3 − k5 y5 ,
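The right hand side f of (11.5.3) can be implemented directly as a function for a numerical integrator. The following EIGEN-based sketch keeps the rate constants k1, . . . , k5 as parameters, since their values are not listed here; the function name is illustrative.

#include <Eigen/Dense>

using Vector5d = Eigen::Matrix<double, 5, 1>;

// Right hand side f(y) of (11.5.3); y = (c(BrO3-), c(Br-), c(HBrO2), c(Org), c(Ce(IV)))
Vector5d oregonator_rhs(const Vector5d &y,
                        double k1, double k2, double k3, double k4, double k5) {
  Vector5d dy;
  dy(0) = -k1 * y(0) * y(1) - k3 * y(0) * y(2);
  dy(1) = -k1 * y(0) * y(1) - k2 * y(1) * y(2) + k5 * y(4);
  dy(2) =  k1 * y(0) * y(1) - k2 * y(1) * y(2) + k3 * y(0) * y(2) - 2.0 * k4 * y(2) * y(2);
  dy(3) =  k2 * y(1) * y(2) + k4 * y(2) * y(2);
  dy(4) =  k3 * y(0) * y(2) - k5 * y(4);
  return dy;
}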
[Fig. 414, Fig. 415: concentrations c(t) plotted over t ∈ [0, 200] on a logarithmic scale; the curves exhibit abrupt transients.]
This is very common for evolutions arising from practical models (circuit models, chemical reaction models, mechanical systems).
Consider the scalar autonomous IVP

ẏ = y² ,   y(0) = y0 > 0 ,

with solution y(t) = y0 / (1 − y0 t), which blows up as t approaches 1/y0.

[Fig. 416: solutions y(t) for y0 = 1 and y0 = 0.5 over t ∈ [−1, 2.5].]
How should one choose the temporal mesh {t0 < t1 < · · · < tN−1 < tN} for a single step method when J(y0) is not
known, or, even worse, when it is not clear a priori that a blow-up will happen at all?
Just imagine what will result from equidistant explicit Euler integration (11.2.7) applied to the above IVP.
Simulation with MATLAB's ode45:

MATLAB-code 11.5.5:
fun = @(t,y) y.^2;
[t1,y1] = ode45(fun,[0 2],1);
[t2,y2] = ode45(fun,[0 2],0.5);
[t3,y3] = ode45(fun,[0 2],2);

[Fig. 417: solutions returned by ode45 for y0 = 1, y0 = 0.5, y0 = 2 over t ∈ [−1, 2.5].]
We observe: ode45 manages to reduce the stepsize more and more as it approaches the singularity of the
solution! How can it accomplish this feat?
Be efficient! Be accurate!
Why local-in-time timestep control (based on estimating only the one-step error)?
Consideration: if a small time-local error in a single timestep leads to a large error ‖yk − y(tk)‖ at later
times, then local-in-time timestep control is powerless to prevent it and will not even notice!
We “recycle” heuristics already employed for adaptive quadrature, see Section 7.5, § 7.5.10. There we
tried to get an idea of the local quadrature error by comparing two approximations of different order. Now
we pursue a similar idea over a single timestep.
Φh y(tk) − Ψh y(tk)   ≈   ESTk := Ψ̃h y(tk) − Ψh y(tk) ,        (11.5.8)

where the left-hand side is the one-step error of the lower-order discrete evolution Ψ and Ψ̃ denotes the higher-order discrete evolution.
Compare   ESTk ↔ ATOL (absolute tolerance)   and   ESTk ↔ RTOL·‖yk‖ (relative tolerance)
➣ reject/accept current step.        (11.5.10)
For a similar use of absolute and relative tolerances see Section 8.1.2: termination criteria for iterations,
in particular (8.1.25).
☞ Simple algorithm:
ESTk < max{ATOL, ‖yk‖·RTOL}:  carry out the next timestep (stepsize h);
                              use a larger stepsize (e.g., αh with some α > 1) for the following step (∗)
ESTk > max{ATOL, ‖yk‖·RTOL}:  repeat the current step with a smaller stepsize < h, e.g., ½h
Rationale for (∗): if the current stepsize guarantees sufficiently small one-step error, then it might be
possible to obtain a still acceptable one-step error with a larger timestep, which would enhance efficiency
(fewer timesteps for total numerical integration). This should be tried, since timestep control will usually
provide a safeguard against undue loss of accuracy.
C++11 code 11.5.11: Simple local stepsize control for single step methods
// Auxiliary function: default norm for an Eigen vector type
template <class State>
double _norm(const State &y) { return y.norm(); }
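The loop body of Code 11.5.11 is not reproduced in this excerpt. The following sketch implements the accept/reject policy stated above for a generic pair of discrete evolutions of orders p and p + 1 acting on EIGEN vector types; the functor signatures and the growth factor 1.1 are assumptions, not taken from the lecture code.

#include <algorithm>
#include <utility>
#include <vector>

// Simple local stepsize control (sketch): halve the stepsize on rejection,
// enlarge it by a fixed factor after an accepted step.
template <class DiscEvol, class DiscEvolHigh, class State>
std::vector<std::pair<double, State>>
odeintssctrl_sketch(DiscEvol Psi, DiscEvolHigh PsiHigh, State y0, double T,
                    double h0, double reltol, double abstol, double hmin) {
  std::vector<std::pair<double, State>> states{{0.0, y0}};
  double t = 0.0, h = h0;
  State y = y0;
  while (t < T && h > hmin) {
    h = std::min(h, T - t);            // never step beyond the final time T
    State yl = Psi(h, y);              // low-order step  (order p)
    State yh = PsiHigh(h, y);          // high-order step (order p+1)
    double est = (yh - yl).norm();     // EST_k, cf. (11.5.8)
    if (est < std::max(abstol, reltol * y.norm())) {
      t += h; y = yh;                  // accept; use the better (higher-order) value
      states.push_back({t, y});
      h *= 1.1;                        // try a somewhat larger step next time
    } else {
      h /= 2.0;                        // reject; repeat the step with half the stepsize
    }
  }
  return states;
}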
We face the same conundrum as in the case of adaptive numerical quadrature, see Rem. 7.5.17:

!  By the heuristic considerations, see (11.5.8), it seems that ESTk measures the one-step error for
   the low-order method Ψ and that we should use yk+1 = Ψhk yk, if the timestep is accepted.

However, it would be foolish not to use the better value yk+1 = Ψ̃hk yk, since it is available for free. This is
what is done in every implementation of adaptive methods, also in Code 11.5.11, and this choice can be
justified by control theoretic arguments [?, Sect. 5.2].
We test the adaptive timestepping routine from Code 11.5.11 for a scalar IVP and compare the estimated local
error with the true local error.
✦ IVP for the ODE ẏ = cos²(αy), α > 0, with solution y(t) = arctan(α(t − c))/α for y(0) ∈ ]−π/(2α), π/(2α)[
✦ Simple adaptive timestepping based on the explicit Euler method (11.2.7) and the explicit trapezoidal rule (11.4.6)
[Fig. 418: adaptive timestepping with rtol = 0.01, atol = 0.0001, α = 20: exact solution y(t), approximations yk and rejected steps over t ∈ [0, 2].
Fig. 419: true error |y(tk) − yk| and estimated error ESTk over t ∈ [0, 2].]
Observations:
☞ Adaptive timestepping resolves the local features of the solution y(t) at t = 1 well.
☞ Estimated error (an estimate for the one-step error) and true error are not related! To understand
this recall Rem. 11.5.12.
In this experiment we want to explore whether adaptive timestepping is worthwhile as regards reducing
the computational effort without sacrificing accuracy.
We retain the simple adaptive timestepping from the previous experiment, Ex. 11.5.13, and also study the same
IVP.
New: initial state y(0) = 0!
Now we examine the dependence of the maximal discretization error in mesh points on the computational
effort. The latter is proportional to the number of timesteps.
[Fig. 420: solutions (yk)k produced by simple adaptive timestepping for
rtol ∈ {0.4, 0.2, 0.1, 0.05, 0.025, 0.0125, 0.00625} (α = 40).
Fig. 421: maximal error vs. number N of timesteps, for uniform and adaptive timesteps.]
Observations:
☞ Adaptive timestepping achieves much better accuracy for a fixed computational effort.
Same ODE and simple adaptive timestepping as in the previous experiment, Ex. 11.5.14:

ẏ = cos²(αy)   ⇒   y(t) = arctan(α(t − c))/α ,   y(0) ∈ ]−π/(2α), π/(2α)[ ,

for α = 40.

Now: initial state y(0) = −0.0386 ≈ −π/(2α), as in Ex. 11.5.13.
[Fig. 422: solutions (yk)k produced by simple adaptive timestepping for
rtol ∈ {0.4, 0.2, 0.1, 0.05, 0.025, 0.0125, 0.00625} (α = 40).
Fig. 423: maximal error max_k |y(tk) − yk| vs. number N of timesteps, for uniform and adaptive timesteps.]
Observations:
☞ Adaptive timestepping leads to larger errors at the same computational cost as uniform timestepping!
Explanation: the position of the steep step of the solution depends sensitively on the initial value,
if y(0) ≈ −π/(2α):

y(t) = (1/α) arctan(α t + tan(α y0)) ,   with the steep step located at t ≈ −tan(α y0)/α .

Hence, small local errors in the initial timesteps will lead to large errors at around time t ≈ 1. The stepsize
control is mistaken in condoning these small one-step errors in the first few steps and, therefore, incurs
huge errors later.
However, the perspective of backward error analysis (→ § 1.5.84) rehabilitates adaptive stepsize control
in this case: it gives us a numerical solution that is very close to the exact solution of the ODE with slightly
perturbed initial state y0 .
The above algorithm (Code 11.5.11) is simple, but the rule for increasing/shrinking the timestep "squanders"
the information contained in the ratio ESTk : TOL.

More ambitious goal!   When ESTk > TOL: find a better stepsize hk = ? for repeating the step.
                       When ESTk < TOL: predict a good stepsize hk+1 = ? for the next step.
Heuristics: the timestep hk is small ➥ "higher order terms" O(h_k^{p+2}) can be ignored.

Ψ^{hk} y(tk) − Φ^{hk} y(tk)  ≐  c h_k^{p+1} + O(h_k^{p+2}) ,
Ψ̃^{hk} y(tk) − Φ^{hk} y(tk)  ≐  O(h_k^{p+2}) ,
   ⇒   ESTk ≐ c h_k^{p+1} .        (11.5.18)

✎ notation: ≐ means equality up to higher order terms in hk

ESTk ≐ c h_k^{p+1}   ⇒   c ≐ ESTk / h_k^{p+1} .        (11.5.19)
For the sake of accuracy (which stipulates "ESTk < TOL") and efficiency (which favours "ESTk > TOL") we aim for

ESTk ≐ TOL := max{ATOL, ‖yk‖·RTOL} .        (11.5.20)

What timestep h∗ can actually achieve (11.5.20), if we "believe" in (11.5.18) (and, therefore, in (11.5.19))?

(11.5.19) & (11.5.20)   ⇒   TOL = (ESTk / h_k^{p+1}) · h∗^{p+1} .

"Optimal timestep" (stepsize prediction):   h∗ = h · (TOL/ESTk)^{1/(p+1)} .        (11.5.21)
C++11 code 11.5.22: Refined local stepsize control for single step methods
2  // Auxiliary function: default norm for an Eigen vector type
3  template <class State>
4  double _norm(const State &y) { return y.norm(); }
5
29  }
30  if (h < hmin) {
31    cerr << "Warning: Failure at t = "
32         << states.back().first
33         << ". Unable to meet integration tolerances without reducing the step"
34         << " size below the smallest value allowed (" << hmin << ") at time t." << endl;
35  }
36  return states;
37  }
Comments on Code 11.5.22 (see the comments on Code 11.5.11 for more explanations):
• Input arguments as for Code 11.5.11, except for p ≙ order of the lower-order discrete evolution.
• line 26: compute the presumably better local stepsize according to (11.5.21),
• line 27: decide whether to repeat the step or to advance,
• line 27: extend the output arrays if the current step has not been rejected.
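The core loop of Code 11.5.22 is not reproduced in this excerpt; its essential difference from Code 11.5.11 is the stepsize update (11.5.21), which boils down to a one-liner. The safety factor and the cap on the increase in the following sketch are common practice, not taken from the lecture code.

#include <algorithm>
#include <cmath>

// Stepsize suggestion according to (11.5.21): p = order of the lower-order discrete
// evolution, est = EST_k, tol = max(ATOL, RTOL*norm(y_k)).
double suggest_stepsize(double h, double est, double tol, unsigned p) {
  double h_star = h * std::pow(tol / est, 1.0 / (p + 1));
  return std::min(0.9 * h_star, 2.0 * h);   // damp the prediction and limit the growth
}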
The name of MATLAB's standard integrator ode45 already indicates the orders of the pair of single step
methods used for adaptive stepsize control:

ode45:   Ψ ≙ RK-method of order 4 ,   Ψ̃ ≙ RK-method of order 5 .

Specifying tolerances for MATLAB's integrators is done as follows:

options = odeset('abstol',atol,'reltol',rtol,'stats','on');
[t,y] = ode45(@(t,x) f(t,x),tspan,y0,options);

(f ≙ function handle, tspan ≙ [t0, T], y0 ≙ y0, t ≙ temporal mesh points tk, y ≙ approximate states yk)
The possibility to pass tolerances to numerical integrators based on adaptive timestepping may tempt
one into believing that they allow one to control the accuracy of the solutions. However, as is clear from
Rem. 11.5.16, these tolerances are applied solely to local error estimates and, inherently, have nothing to
do with global discretization errors, see Ex. 11.5.13.

The absolute/relative tolerances imposed for local-in-time adaptive timestepping do not permit a
prediction of the accuracy of the solution!
For higher order RK-SSMs with a considerable number of stages, computing different sets of increments
(→ Def. 11.4.9) for two methods of different order just for the sake of local-in-time stepsize control would
mean a disproportionate effort.

Embedding idea: use two RK-SSMs based on the same increments, that is, built with the same coefficients
aij, but with different weights bi, see Def. 11.4.9 for the formulas, and with different orders p and p + 1.

The following two embedded RK-SSMs, presented in the form of their extended Butcher schemes, provide
single step methods of orders 4 & 5.
(The two extended Butcher schemes are not reproduced here: each lists the nodes c and the coefficient
matrix A, followed by two rows of weights, one producing the order-4 approximation y1 and one producing
the order-5 approximation ŷ1.)
We test the effect of adaptive stepsize control in MATLAB for the equations of motion describing the planar
movement of a point mass in a conservative force field x ∈ R² ↦ F(x) ∈ R². Let t ↦ y(t) ∈ R² be
the trajectory of the point mass (in the plane).

From Newton's law:   ÿ = F(y) := −2y / ‖y‖₂²   (acceleration = force) .        (11.5.28)
As in Rem. 11.1.23 we can convert the second-order ODE (11.5.28) into an equivalent 1st-order ODE by
introducing the velocity v := ẏ as an extra solution component:
" v #
ẏ
(11.5.28) ⇒ = − 2y . (11.5.29)
v̇ k y k2 2
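A sketch of the right hand side of (11.5.29) as an EIGEN function, with the state u = [y; v] ∈ R⁴ (the function name is illustrative):

#include <Eigen/Dense>

// Right hand side of (11.5.29); u = (y1, y2, v1, v2)^T
Eigen::Vector4d rhs(const Eigen::Vector4d &u) {
  Eigen::Vector2d y = u.head<2>();             // position
  Eigen::Vector2d v = u.tail<2>();             // velocity
  Eigen::Vector4d du;
  du.head<2>() = v;                            // dy/dt = v
  du.tail<2>() = -2.0 * y / y.squaredNorm();   // dv/dt = -2y/||y||_2^2
  return du;
}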
[Two rows of plots over t ∈ [0, 4]: solution components yi(t) (left) and the stepsize ("Zeitschrittweite")
chosen by the adaptive integrator (right), for two different tolerance settings.]
[Two plots in the (y1, y2)-plane: exact trajectory ("Exakte Bahn") and numerical approximation
("Naeherung") for two different tolerance settings.]
Observations:
☞ Fast changes in solution components captured by adaptive approach through very small timesteps.
☞ Completely wrong solution, if tolerance reduced slightly.
In this example we face a rather sensitive dependence of the trajectories on initial states or intermediate
states. Small perturbations at one instance in time can have a massive impact on the solution at later
times. Local stepsize control is powerless to prevent this.
• know the concept of evolution operator for an ODE and its relationship with solutions of associated
initial value problems.
• be able to convert higher-order and non-autonomous ODEs into the form ẏ = f(y).
• know about discrete evolutions and how they induce single step methods (SSMs).
• remember that single step methods converge asymptotically algebraically for stepsize h → 0 and
that the rate of convergence is called the order of the SSM.
• know the general form of explicit Runge-Kutta methods and Butcher schemes.
• understand when adaptive timestep control is essential for meaningful numerical integration of initial
value problems.
• be able to describe the policy of time-local adaptive timestep control for embedded Runge-Kutta
methods.
Chapter 12
Single Step Methods for Stiff Initial Value Problems
Explicit Runge-Kutta methods with stepsize control (→ Section 11.5) seem to be able to provide approxi-
mate solutions for any IVP with good accuracy provided that tolerances are set appropriately.
Everything settled about numerical integration?
In this example we will witness the near failure of a high-order adaptive explicit Runge-Kutta method for a
simple scalar autonomous ODE.
This is a logistic ODE as introduced in Ex. 11.1.5. We try to solve it by means of an explicit adaptive
embedded Runge-Kutta-Fehlberg method (→ Rem. 11.5.25) using the class ode45 from § 11.4.16 (Pre-
processor switch MATLABCOEFF activated).
The following plots have been generated with MATLAB using its built-in adaptive explicit RK-SSM:

[Fig. 424: solution y(t) and approximations yk over t ∈ [0, 1]. Fig. 425: stepsize used by the integrator over t ∈ [0, 1].]

Stepsize control of ode45 running amok!

?   The solution is virtually constant for t > 0.2 and, nevertheless, the integrator uses tiny timesteps
    until the end of the integration interval.
Contents
12.1 Model problem analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
12.2 Stiff Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
12.3 Implicit Runge-Kutta Single Step Methods . . . . . . . . . . . . . . . . . . . . . . 766
12.3.1 The implicit Euler method for stiff IVPs . . . . . . . . . . . . . . . . . . . . . 767
12.3.2 Collocation single step methods . . . . . . . . . . . . . . . . . . . . . . . . . 768
12.3.3 General implicit RK-SSMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
12.3.4 Model problem analysis for implicit RK-SSMs . . . . . . . . . . . . . . . . . 774
12.4 Semi-implicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 780
Supplementary reading. See also [?, Ch. 77], [?, Sect. 11.3.3].
To rule out that what we observed in Ex. 12.0.1 might have been a quirk of the IVP (12.0.2) we conduct
the same investigations for the simple linear, scalar, autonomous IVP (12.1.2).

We use the ode45 class to solve (12.1.2) with the same parameters as in Code 12.0.3. Statistics of the run:
  number of steps: 33
  number of rejected steps: 32
  function calls: 231

[Fig. 426: solution y(t) and approximations yk over t ∈ [0, 1]. Fig. 427: stepsize over t ∈ [0, 1].]
Observation: Though y(t) ≈ 0 for t > 0.1, the integrator keeps on using “unreasonably small” timesteps
even then.
In this section we will discover a simple explanation for the startling behavior of ode45 in Ex. 12.0.1.
The simplest explicit RK-SSM is the explicit Euler method, see Section 11.2.1. We know that it should
converge like O(h) for meshwidth h → 0. In this example we will see that this may be true only for
sufficiently small h, which may be extremely small.
ẏ = f (y) := λy , y(0) = 1 .
✦ We apply the explicit Euler method (11.2.7) with uniform timestep h = 1/N , N ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
[Fig. 428: error at final time T = 1 (Euclidean norm) vs. timestep h for the explicit Euler method applied
to the scalar model problem, for λ ∈ {−10, −30, −60, −90}, with an O(h) reference line.
Fig. 429: exact solution and explicit Euler approximation for a single (large) value of the timestep h.]
✦ Now we look at an IVP for the logistic ODE, see Ex. 11.1.5:
✦ As before, we apply the explicit Euler method (11.2.7) with uniform timestep h = 1/N , N ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
[Fig. 430: error (Euclidean norm) vs. timestep h for the explicit Euler method applied to the logistic ODE,
for λ ∈ {10, 30, 60, 90}. Fig. 431: exact solution and explicit Euler approximation over t ∈ [0, 1].]
For large timesteps h we also observe oscillatory blow-up of the sequence (yk )k .
Deeper analysis:
For y ≈ 1: f (y) ≈ λ(1 − y) ➣ If y(t0 ) ≈ 1, then the solution of the IVP will behave like the solution
of ẏ = λ(1 − y), which is a linear ODE. Similarly, z(t) := 1 − y(t) will behave like the solution of the
“decay equation” ż = −λz. Thus, around the stationary point y = 1 the explicit Euler method behaves
like it did for ẏ = λy in the vicinity of the stationary point y = 0; it grossly overshoots.
The phenomenon observed in the two previous examples is accessible to a remarkably simple rigorous
analysis: motivated by the considerations in Ex. 12.1.3 we study the explicit Euler method (11.2.7) for the

linear model problem:   ẏ = λy ,   y(0) = y0 ,   with λ ≪ 0 ,        (12.1.5)

which has the exponentially decaying exact solution

y(t) = y0 exp(λt) → 0   for t → ∞ .

Recall the recursion of the explicit Euler method with uniform timestep h > 0 for (12.1.5):

(11.2.7) for f(y) = λy:   yk+1 = yk (1 + λh) .        (12.1.6)

We easily get a closed-form expression for the approximations yk:

yk = y0 (1 + λh)^k   ⇒   |yk| →  0 , if λh > −2 (qualitatively correct) ,
                                  ∞ , if λh < −2 (qualitatively wrong) .

Only if |λ|h < 2 do we obtain a decaying solution by the explicit Euler method!
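The threshold λh = −2 is easy to observe numerically; a minimal sketch (the values of λ and h are chosen for illustration only):

#include <cmath>
#include <cstdio>

// Explicit Euler for y' = lambda*y, y(0) = 1: return y_N at T = 1 for stepsize h.
// The modulus of y_N blows up exactly when |1 + lambda*h| > 1.
double explicit_euler_final(double lambda, double h) {
  double y = 1.0;
  const int N = static_cast<int>(std::round(1.0 / h));
  for (int k = 0; k < N; ++k) y *= (1.0 + lambda * h);
  return y;
}

int main() {
  const double lambda = -60.0;
  const double hs[] = {0.2, 0.05, 0.02, 0.005};   // lambda*h = -12, -3, -1.2, -0.3
  for (double h : hs)
    std::printf("h = %6.3f :  y_N = %e\n", h, explicit_euler_final(lambda, h));
  return 0;
}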
Could it be that the timestep control is desperately trying to enforce the qualitatively correct behavior of the
numerical solution in Ex. 12.1.3? Let us examine how the simple stepsize control of Code 11.5.11 fares
for model problem (12.1.5):
[Fig. 432, Fig. 433: behaviour of the simple stepsize control of Code 11.5.11 for the model problem (12.1.5):
approximations and errors (left) and chosen timesteps (right) over t ∈ [0, 2].]
Observation: in fact, stepsize control enforces small timesteps even if y(t) ≈ 0 and persistently triggers
rejections of timesteps. This is necessary to prevent overshooting in the Euler method, which contributes
to the estimate of the one-step error.
We see the purpose of stepsize control thwarted, because after only a very short time the solution is
almost zero and then, in fact, large timesteps should be chosen.
Are these observations a particular “flaw” of the explicit Euler method? Let us study the behavior of another
simple explicit Runge-Kutta method applied to the linear model problem.
Example 12.1.9 (Explicit trapezoidal method for decay equation → [?, Ex. 11.29])
Recall the recursion for the explicit trapezoidal method derived in Ex. 11.4.4. Apply it to the model problem
(12.1.5), that is, the scalar autonomous ODE with right hand side function f(y) = λy, λ < 0:
the sequence of approximations generated by the explicit trapezoidal rule can be expressed in
closed form as

yk = S(hλ)^k y0 ,   k = 0, . . . , N ,   with   S(z) := 1 + z + ½z² .        (12.1.11)

Since |S(hλ)| < 1  ⇔  −2 < hλ < 0, qualitatively correct decay behaviour of (yk)k is obtained only under
the timestep constraint

h ≤ 2/|λ| .        (12.1.12)
(12.1.13) Model problem analysis for general explicit Runge-Kutta single step methods

Apply the explicit Runge-Kutta method (→ Def. 11.4.9) encoded by the Butcher scheme with coefficients
A, b, c to the autonomous scalar linear ODE (12.1.5) (ẏ = λy). We write down the equations for the increments and y1
from Def. 11.4.9 for f(y) := λy and then convert the resulting system of equations into matrix form:

ki = λ(y0 + h ∑_{j=1}^{i−1} aij kj) , i = 1, . . . , s ,
y1 = y0 + h ∑_{i=1}^{s} bi ki ,
   ⇒   [ I − zA , 0 ; −z b⊤ , 1 ] [ k ; y1 ] = y0 [ 1 ; 1 ] ,        (12.1.14)

where 1 = [1, . . . , 1]⊤ ∈ R^s in the first block row, k ∈ R^s denotes the vector [k1, . . . , ks]⊤/λ of increments,
and z := λh. Next we apply block Gaussian elimination (→ Rem. 2.3.11) to solve for y1 and obtain
Theorem 12.1.17. Stability function of explicit Runge-Kutta methods → [?, Thm. 77.2], [?, Sect. 11.8.4]
The discrete evolution Ψhλ of an explicit s-stage Runge-Kutta single step method (→ Def. 11.4.9)
with Butcher scheme (c, A, b) (see (11.4.11)) for the ODE ẏ = λy amounts to a multiplication with
the number

S(z) := 1 + z b⊤(I − zA)^{−1} 1 = det(I − zA + z 1 b⊤) ,   z := λh ,   1 = [1, . . . , 1]⊤ ∈ R^s ,

that is, y1 = S(z) y0.
From Thm. 12.1.17 and their Butcher schemes we can instantly compute the stability functions of explicit
RK-SSMs. We do this for a few methods whose Butcher schemes were listed in Ex. 11.4.13; a small
numerical cross-check is sketched below.

• Explicit Euler method (11.2.7):
    0 | 0
      | 1
  ➣  S(z) = 1 + z .

• Explicit trapezoidal method (11.4.6):
    0 | 0    0
    1 | 1    0
      | 1/2  1/2
  ➣  S(z) = 1 + z + ½z² .

• Classical RK4 method:
    0   | 0    0    0    0
    1/2 | 1/2  0    0    0
    1/2 | 0    1/2  0    0
    1   | 0    0    1    0
        | 1/6  2/6  2/6  1/6
  ➣  S(z) = 1 + z + ½z² + (1/6)z³ + (1/24)z⁴ .
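The listed polynomials can be cross-checked numerically from the Butcher data via Thm. 12.1.17; a small EIGEN sketch (for explicit methods det(I − zA) = 1, so the quotient reduces to the determinant in the numerator):

#include <Eigen/Dense>
#include <complex>

// Stability function S(z) = det(I - z*A + z*1*b^T) / det(I - z*A) of an RK-SSM
// with Butcher coefficients A and b, evaluated at a complex argument z.
std::complex<double> stabfn(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                            std::complex<double> z) {
  const int s = A.rows();
  Eigen::MatrixXcd I = Eigen::MatrixXcd::Identity(s, s);
  Eigen::MatrixXcd Ac = A.cast<std::complex<double>>();
  Eigen::VectorXcd bc = b.cast<std::complex<double>>();
  Eigen::VectorXcd one = Eigen::VectorXcd::Ones(s);
  return (I - z * Ac + z * one * bc.transpose()).determinant() /
         (I - z * Ac).determinant();
}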
These examples confirm an immediate consequence of the determinant formula for the stability function
S ( z ).
For a consistent (→ Def. 11.3.10) s-stage explicit Runge-Kutta single step method according to
Def. 11.4.9 the stability function S defined by (12.1.18) is a non-constant polynomial of degree ≤ s:
S ∈ Ps .
Φh y = e^{λh} y   ←→   Ψh y = S(λh) y .

In light of Ψ ≈ Φ, see (11.3.3), we expect that S(z) approximates exp(z):

Let S denote the stability function of an s-stage explicit Runge-Kutta single step method of order
q ∈ N. Then S(z) − exp(z) = O(z^{q+1}) for z → 0.

This means that the lowest q + 1 coefficients of S(z) must be equal to the first coefficients of the exponential
series:

S(z) = ∑_{j=0}^{q} (1/j!) z^j + z^{q+1} p(z)   with some p ∈ P_{s−q−1} .
In § 12.1.13 we established that for the sequence (yk)_{k=0}^∞ produced by an explicit Runge-Kutta single step
method applied to the linear scalar model ODE ẏ = λy, λ ∈ R, with uniform timestep h > 0, it holds that

(yk)_{k=0}^∞ non-increasing              ⇔   |S(λh)| ≤ 1 ,
(yk)_{k=0}^∞ exponentially increasing    ⇔   |S(λh)| > 1 .        (12.1.27)

So, for any λ ≠ 0 there will be a threshold hmax > 0 so that |yk| → ∞ whenever h > hmax.
Reversing the argument we arrive at a timestep constraint, as already observed for the explicit Euler
method in § 12.1.4.
Only if one ensures that |λh| is sufficiently small can one avoid exponentially increasing approximations
yk (qualitatively wrong for λ < 0) when applying an explicit RK-SSM to the model problem
(12.1.5) with uniform timestep h > 0.
For λ ≪ 0 this stability induced timestep constraint may force h to be much smaller than required by
demands on accuracy : in this case timestepping becomes inefficient.
Ex. 12.0.1, Ex. 12.1.8 send the message that local-in-time stepsize control as discussed in Section 11.5
selects timesteps that avoid blow-up, with a hefty price tag however in terms of computational cost and
poor accuracy.
Objection: simple linear scalar IVP (12.1.5) may be an oddity rather than a model problem: the weakness
of explicit Runge-Kutta methods discussed above may be just a peculiar response to an unusual situation.
Let us extend our investigations to systems of linear ODEs, d > 1.
A generic linear ordinary differential equation on state space R^d has the form

ẏ = My ,   M ∈ C^{d,d} .        (12.1.31)
As explained in [?, Sect. 8.1], (12.1.31) can be solved by diagonalization: If we can find a regular matrix
V ∈ C^{d,d} such that

MV = VD   with diagonal matrix   D = diag(λ1, . . . , λd) ∈ C^{d,d} .        (12.1.32)

The columns of V are a basis of eigenvectors of M, the λj ∈ C, j = 1, . . . , d, are the associated eigenvalues
of M, see Def. 9.1.1.

The idea behind diagonalization is the transformation of (12.1.31) into d decoupled scalar linear ODEs:

ẏ = My   ⟶ (z(t) := V^{−1}y(t)) ⟶   ż = Dz   ↔   żi = λi zi , i = 1, . . . , d ,   since M = VDV^{−1} .
ü + αu̇ + βu = g(t) ,   with coefficients α := (RC)^{−1} ,  β := (LC)^{−1} ,  g(t) := α U̇s(t) .

[Fig. 435: circuit diagram of the RCL circuit driven by the voltage source Us(t).]
We integrate IVPs for this ODE by means of M ATLAB’s adaptive integrator ode45.
[Figure: u(t) and v(t)/100 computed by ode45.]

R = 100 Ω, L = 1 H, C = 1 µF, Us(t) = 1 V · sin(t), u(0) = v(0) = 0 ("switch on")
ode45 statistics: 17897 successful steps, 1090 failed attempts, 113923 function evaluations
Maybe the time-dependent right hand side due to the time-harmonic excitation severely affects ode45?
Let us try a constant exciting voltage:
[Fig. 437: u(t) and v(t)/100 over t ∈ [0, 6] for the RCL circuit with R = 100 Ω, L = 1 H, C = 1 µF,
constant excitation Us(t) = 1 V, u(0) = v(0) = 0 ("switch on").]

ode45 statistics: tiny timesteps despite a virtually constant solution!
We make the same observation as in Ex. 12.0.1 and Ex. 12.1.8: the local-in-time stepsize control of ode45
(→ Section 11.5) enforces extremely small timesteps although the solution is almost constant except at t = 0.
To understand the structure of the solutions for this transient circuit example, let us apply the diagonaliza-
tion technique from § 12.1.30 to the linear ODE
ẏ = [ 0  1 ; −β  −α ] y =: My ,   y(0) = y0 ∈ R² .        (12.1.37)

We can obtain the general solution of ẏ = My, M ∈ R^{2,2}, by diagonalization of M (if possible):

MV = M(v1, v2) = (v1, v2) diag(λ1, λ2) ,        (12.1.38)

where v1, v2 ∈ R²\{0} are the eigenvectors of M, and λ1, λ2 are the eigenvalues of M, see Def. 9.1.1.
For the latter we find

λ_{1/2} = ½(−α ± D) ,   D := √(α² − 4β)  if α² ≥ 4β ,   D := ı√(4β − α²)  if α² < 4β .

Note that the eigenvalues have a non-vanishing imaginary part in the setting of the experiment.
Recall the discrete evolution of the explicit Euler method (11.2.7) for the ODE ẏ = My, M ∈ R^{d,d}:
yk+1 = yk + hMyk.

As in § 12.1.30 we assume that M can be diagonalized, that is, (12.1.32) holds: V^{−1}MV = D with a
diagonal matrix D ∈ C^{d,d} containing the eigenvalues of M on its diagonal. Next, apply the decoupling by
diagonalization idea to the recursion of the explicit Euler method. With zk := V^{−1}yk:

V^{−1}yk+1 = V^{−1}yk + h (V^{−1}MV)(V^{−1}yk)   ⇔   (zk+1)i = (zk)i + hλi (zk)i        (12.1.42)

(≙ explicit Euler step for żi = λi zi).
Crucial insight:

The explicit Euler method generates uniformly bounded solution sequences (yk)_{k=0}^∞ for ẏ = My
with diagonalizable matrix M ∈ R^{d,d} with eigenvalues λ1, . . . , λd, if and only if it generates uniformly
bounded sequences for all the scalar ODEs ż = λi z, i = 1, . . . , d.
So far we conducted the model problem analysis under the premise λ < 0.
However, in Ex. 12.1.35 we face λ_{1/2} = −½α ± ½ı√(4β − α²) (complex eigenvalues!). Let us now
examine how the explicit Euler method and even general explicit RK-methods respond to them.
The model problem analysis from Ex. 12.1.3 and Ex. 12.1.9 can be extended verbatim to the case of λ ∈ C.
It yields the following insight for the explicit Euler method and λ ∈ C:
The sequence generated by the explicit Euler method (11.2.7) for the model problem (12.1.5) decays
(exponentially) if and only if |1 + λh| < 1.

[Fig. 438: the disk {z ∈ C : |1 + z| < 1} in the complex plane (green). The green region marks the values
of λh for which the explicit Euler method will produce exponentially decaying solutions.]
Now we can conjecture what happens in Ex. 12.1.35: the eigenvalues λ_{1/2} = −½α ± ı√(β − ¼α²) of
M have a very large (in modulus) negative real part. Since ode45 can be expected to behave as if it
integrated ż = λ2 z, it faces a severe timestep constraint, if exponential blow-up is to be avoided, see
Ex. 12.1.3. Thus stepsize control must resort to tiny timesteps.
(12.1.44) Extended model problem analysis for explicit Runge-Kutta single step methods

We apply an explicit s-stage RK-SSM (→ Def. 11.4.9) described by the Butcher scheme with coefficients
A, b, c to the autonomous linear ODE ẏ = My, M ∈ C^{d,d}, and obtain (for the first step with timestep size h > 0)

kℓ = M(y0 + h ∑_{j=1}^{ℓ−1} aℓj kj) ,  ℓ = 1, . . . , s ,       y1 = y0 + h ∑_{ℓ=1}^{s} bℓ kℓ .        (12.1.45)
Now assume that M can be diagonalized, that is, (12.1.32) holds: V^{−1}MV = D with a diagonal matrix
D ∈ C^{d,d} containing the eigenvalues λi ∈ C of M on its diagonal. Then apply the substitutions

k̂ℓ := V^{−1}kℓ , ℓ = 1, . . . , s ,       ŷk := V^{−1}yk , k = 0, 1 ,

to (12.1.45). We infer that, if (yk)k is the sequence produced by an explicit RK-SSM applied to ẏ = My, then

yk = V [ yk^[1] , . . . , yk^[d] ]⊤ ,

where (yk^[i])k is the sequence generated by the same RK-SSM with the same sequence of timesteps for
the IVP ẏ = λi y, y(0) = (V^{−1}y0)i.
The RK-SSM generates uniformly bounded solution sequences (yk)_{k=0}^∞ for ẏ = My with diagonalizable
matrix M ∈ R^{d,d} with eigenvalues λ1, . . . , λd, if and only if it generates uniformly bounded
sequences for all the scalar ODEs ż = λi z, i = 1, . . . , d.
Hence, understanding the behavior of RK-SSM for autonomous scalar linear ODEs ẏ = λy with λ ∈ C
is enough to predict their behavior for general autonomous linear systems of ODEs.
Theorem 12.1.48. (Absolute) stability of explicit RK-SSM for linear systems of ODEs
The sequence (yk )k of approximations generated by an explicit RK-SSM (→ Def. 11.4.9) with
stability function S (defined in (12.1.18)) applied to the linear autonomous ODE ẏ = My, M ∈ C d,d ,
with uniform timestep h > 0 decays exponentially for every initial state y0 ∈ C d , if and only if
|S(λi h)| < 1 for all eigenvalues λi of M.
for any solution of ẏ = My. This is obvious from the representation formula (12.1.33).
We consider an explicit Runge-Kutta single step method with stability function S for the model linear scalar
IVP ẏ = λy, y(0) = y0, λ ∈ C. From Thm. 12.1.17 we learn that for uniform stepsize h > 0 we have
yk = S(λh)^k y0, and conclude that the sequence decays exponentially if and only if |S(λh)| < 1.
Hence, the modulus |S(λh)| tells us for which combinations of λ and stepsize h we achieve exponential
decay yk → 0 for k → ∞, which is the desirable behaviour of the approximations for Re λ < 0.
Let the discrete evolution Ψ for a single step method applied to the scalar linear ODE ẏ = λy,
λ ∈ C, be of the form

Ψ^h y = S(λh) y ,   y ∈ C , h > 0 ,

with a function S : C → C. Then the region of (absolute) stability of the single step method is given
by

SΨ := {z ∈ C : |S(z)| < 1} ⊂ C .
Of course, by Thm. 12.1.17, in the case of explicit RK-SSM the function S will coincide with their stability function
from (12.1.18).
We can easily combine the statement of Thm. 12.1.48 with the concept of a region of stability and conclude
that an explicit RK-SSM will generate exponentially decaying solutions for the linear ODE ẏ = My, M ∈
C^{d,d}, for every initial state y0 ∈ C^d, if and only if λi h ∈ SΨ for all eigenvalues λi of M.
The green domains ⊂ C depict the bounded regions of stability for some RK-SSM from Ex. 11.4.13.
[Three plots: the bounded regions of stability (green) in the complex z-plane, Re z on the horizontal and
Im z on the vertical axis, for the explicit RK-SSMs from Ex. 11.4.13.]
In general we have for a consistent RK-SSM (→ Def. 11.3.10) that its stability function satisfies S(z) =
1 + z + O(z²) for z → 0. Therefore, SΨ ≠ ∅ and the imaginary axis is tangent to SΨ at z = 0.
This section will reveal that the behavior observed in Ex. 12.0.1 and Ex. 12.1.3 is typical for a large class
of problems and that the model problem (12.1.5) really represents a “generic case”. This justifies the
attention paid to linear model problem analysis in Section 12.1.
In Ex. 11.5.1 we already saw an ODE model for the dynamics of a chemical reaction. Now we study an
abstract reaction.
reaction:   A + B ⇌ C   (forward rate k1, backward rate k2; fast reaction) ,
            A + C ⇌ D   (forward rate k3, backward rate k4; slow reaction) .        (12.2.2)

If cA(0) > cB(0) ➢ the 2nd reaction determines the overall long-term reaction dynamics.

Mathematical model: non-linear ODE involving the concentrations y(t) = (cA(t), cB(t), cC(t), cD(t))⊤:

ẏ := d/dt (cA, cB, cC, cD)⊤ = f(y) :=
( −k1 cA cB + k2 cC − k3 cA cC + k4 cD ,
  −k1 cA cB + k2 cC ,
   k1 cA cB − k2 cC − k3 cA cC + k4 cD ,
   k3 cA cC − k4 cD )⊤ .        (12.2.3)
[Fig. 439: concentrations cA(t), cC(t) and the corresponding approximations produced by the adaptive
integrator over t ∈ [0, 1]. Fig. 440: timestep used by the integrator over t ∈ [0, 1].]
Observations: After a fast initial transient phase, the solution shows only slow dynamics. Nevertheless,
the explicit adaptive integrator ode113 insists on using a tiny timestep. It behaves very much like ode45
in Ex. 12.0.1.
(12.2.7) provides a solution even for λ ≠ 0, if ‖y(0)‖₂ = 1, because in this case the term
λ(1 − ‖y‖₂) y vanishes identically on the solution trajectory.
[Fig. 441, Fig. 442: solution trajectories in the (y1, y2)-plane; they stay on the unit circle.]

We study the response of ode45 to different choices of λ with initial state y0 = (1, 0)⊤. According to the
above considerations this initial state should completely "hide the impact of λ from our view".
[Fig. 443: "ode45 for attractive limit cycle": solution components y1,k, y2,k (left axis) and timestep (right axis) over t ∈ [0, 7].
Fig. 444: "ode45 for rigid motion": solution components and timestep over t ∈ [0, 7].]
Thus, the term of the right hand side, which is multiplied by λ will always vanish on the exact solution
trajectory, which stays on the unit circle.
Nevertheless, ode45 is forced to use tiny timesteps by the mere presence of this term!
We want to find criteria that allow us to predict the massive problems haunting explicit single step methods
in the case of the non-linear IVPs of Ex. 12.0.1, Ex. 12.2.1, and Ex. 12.2.5. Recall that for linear IVPs of
the form ẏ = My, y(0) = y0, the model problem analysis of Section 12.1 tells us that, given knowledge of
the region of stability of the timestepping scheme, the eigenvalues of the matrix M ∈ C^{d,d} provide full
information about the timestep constraint we are going to face. Refer to Thm. 12.1.48 and § 12.1.49.
We start with a “phenomenological notion”, just a keyword to refer to the kind of difficulties presented by
the IVPs of Ex. 12.0.1, Ex. 12.2.1, Ex. 12.1.8, and Ex. 12.2.5.
z(0) = 0 ,   ż = f(y∗ + z) = f(y∗) + D f(y∗)z + R(y∗, z) ,   with ‖R(y∗, z)‖ = O(‖z‖²) .

This is obtained by Taylor expansion of f at y∗, see [?, Satz 7.5.2]. Hence, in a neighbourhood of a state
y∗ on a solution trajectory t ↦ y(t), the deviation z(t) = y(t) − y∗ satisfies

ż ≈ f(y∗) + D f(y∗)z .        (12.2.11)

The short-time evolution of y with y(0) = y∗ is approximately governed by the affine-linear ODE

ẏ = f(y∗) + D f(y∗)(y − y∗) .        (12.2.12)
We consider one step of a general s-stage RK-SSM according to Def. 11.4.9 for the autonomous ODE
ẏ = f(y), with smooth right hand side function f : D ⊂ R^d → R^d:

ki = f(y0 + h ∑_{j=1}^{i−1} aij kj) , i = 1, . . . , s ,       y1 = y0 + h ∑_{i=1}^{s} bi ki .
We perform linearization at y∗ := y0 and ignore all terms that are at least quadratic in the timestep size h:

ki ≈ f(y∗) + D f(y∗) h ∑_{j=1}^{i−1} aij kj , i = 1, . . . , s ,       y1 = y0 + h ∑_{i=1}^{s} bi ki .

The defining equations for the same RK-SSM applied to

ż = Mz + b ,   M := D f(y∗) ∈ R^{d,d} ,   b := f(y∗) ,

which agrees with (12.2.12) after the substitution z(t) = y(t) − y∗, are

ki = b + Mh ∑_{j=1}^{i−1} aij kj , i = 1, . . . , s ,       y1 = y0 + h ∑_{i=1}^{s} bi ki .

We find that for small timesteps

the discrete evolution of the RK-SSM for ẏ = f(y) in the state y∗ is close to the discrete
evolution of the same RK-SSM applied to the linearization (12.2.12) of the ODE in y∗.
wk = yk − y0 + M^{−1}b .

➣ The analysis of the behaviour of an RK-SSM for an affine-linear ODE can be reduced to understanding
its behaviour for a linear ODE with the same matrix.
for small timestep the behavior of an explicit RK-SSM applied to ẏ = f(y) close to the state y∗
is determined by the eigenvalues of the Jacobian D f(y∗ ).
In particular, if D f(y∗ ) has at least one eigenvalue whose modulus is large, then an exponential drift-off
of the approximate states yk away from y∗ can only be avoided for sufficiently small timestep, again a
timestep constraint.
An initial value problem for an autonomous ODE ẏ = f(y) will probably be stiff, if, for substantial
periods of time,

min{Re λ : λ ∈ σ(D f(y(t)))} ≪ 0 ,        (12.2.15)
max{Re λ : λ ∈ σ(D f(y(t)))} is of small modulus ,        (12.2.16)

where t ↦ y(t) is the solution trajectory and σ(M) is the spectrum of the matrix M, see Def. 9.1.1.

The condition (12.2.16) has to be read as "the real parts of all eigenvalues are below a bound of small
modulus". If this is not the case, then the exact solution will experience blow-up. It will change drastically
over very short periods of time and small timesteps will be required anyway in order to resolve this.
Thus, for λ ≫ 1, D f(y(t)) will always have an eigenvalue with large negative real part, whereas
the other eigenvalue is close to zero: the IVP is stiff.
Often one can already tell from the expected behavior of the solution of an IVP, which is often clear from
the modeling context, that one has to brace for stiffness.
Explicit Runge-Kutta single step methods cannot escape tight timestep constraints for stiff IVPs, which may
render them inefficient, see § 12.1.49. In this section we augment the class of Runge-Kutta
methods by timestepping schemes that can cope well with stiff IVPs.
We revisit the setting of Ex. 12.1.3 and again consider Euler methods for the decay IVP
ẏ = λy , y(0) = 1 , λ < 0 .
We apply both the explicit Euler method (11.2.7) and the implicit Euler method (11.2.13) with uniform
timesteps h = 1/N , N ∈ {5, 10, 20, 40, 80, 160, 320, 640} and monitor the error at final time T = 1 for
different values of λ.
Explicit Euler method (11.2.7) vs. implicit Euler method (11.2.13):

[Fig. 445: error of the explicit Euler method at final time T = 1 (Euclidean norm) vs. timestep h for
λ ∈ {−10, −30, −60, −90}: blow-up of yk for large timesteps h.
Fig. 446: error of the implicit Euler method at final time T = 1 vs. timestep h for the same values of λ,
together with an O(h) reference line: stable for all timesteps h > 0!]
We observe onset of convergence of the implicit Euler method already for large timesteps h.
We follow the considerations of § 12.1.4 and consider the implicit Euler method (11.2.13) for the model
problem (12.1.5): the recursion yk+1 = yk + hλ yk+1 yields yk = (1 − λh)^{−k} y0, and |1 − λh| > 1 for
every h > 0 whenever Re λ < 0.

No timestep constraint: qualitatively correct behaviour of (yk)k for Re λ < 0 and any h > 0!
As in § 12.1.41 this analysis can be extended to linear systems of ODEs ẏ = My, M ∈ C d,d , by means
of diagonalization.
As in § 12.1.30 and § 12.1.41 we assume that M can be diagonalized, that is, (12.1.32) holds: V^{−1}MV =
D with a diagonal matrix D ∈ C^{d,d} containing the eigenvalues of M on its diagonal. Next, apply the
decoupling by diagonalization idea to the recursion of the implicit Euler method. With zk := V^{−1}yk:

V^{−1}yk+1 = V^{−1}yk + h (V^{−1}MV)(V^{−1}yk+1)   ⇔   (zk+1)i = 1/(1 − λi h) · (zk)i        (12.3.6)

(≙ implicit Euler step for żi = λi zi).

Crucial insight:

For any timestep, the implicit Euler method generates exponentially decaying solution sequences
(yk)_{k=0}^∞ for ẏ = My with diagonalizable matrix M ∈ R^{d,d} with eigenvalues λ1, . . . , λd, if Re λi < 0
for all i = 1, . . . , d.
Thus we expect that the implicit Euler method will not face stability induced timestep constraints for stiff
problems (→ Notion 12.2.9).
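For a linear system ẏ = My a single implicit Euler step requires the solution of a linear system of equations with the matrix I − hM; a minimal EIGEN sketch (for a general non-linear right hand side a non-linear system has to be solved instead, see below):

#include <Eigen/Dense>

// One implicit Euler step y_new = (I - h*M)^{-1} * y for dy/dt = M*y.
// When many steps with the same h are taken, the LU decomposition of I - h*M
// should be computed once and reused.
Eigen::VectorXd implicit_euler_step(const Eigen::MatrixXd &M,
                                    const Eigen::VectorXd &y, double h) {
  const int d = M.rows();
  Eigen::MatrixXd A = Eigen::MatrixXd::Identity(d, d) - h * M;
  return A.partialPivLu().solve(y);
}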
Unfortunately the implicit Euler method is of first order only, see Ex. 11.3.18. This section presents an
algorithm for designing higher order single step methods generalizing the implicit Euler method.
Setting: We consider the general ordinary differential equation ẏ = f(t, y), f : I × D → R d locally
Lipschitz continuous, which guarantees the local existence of unique solutions of initial value problems,
see Thm. 11.1.32.
We define the single step method through specifying the first step y0 = y(t0 ) → y1 ≈ y(t1 ), where
y0 ∈ D is the initial step at initial time t0 ∈ I . We assume that the exact solution trajectory t 7→ y(t)
exists on [t0 , t1 ]. Use as a timestepping scheme on a temporal mesh (→ § 11.2.2) in the sense of
Def. 11.3.5 is straightforward.
Our choice (the "standard option"): the (componentwise) polynomial trial space V = (Ps)^d.

Recalling dim Ps = s + 1 from Thm. 5.2.2 we see that our choice makes the number N := d(s + 1) of
collocation conditions match the dimension of the trial space V.
Now we want to derive a concrete representation for the polynomial yh. We draw on concepts introduced
in Section 5.2.2. We define the collocation points as t0 + cj h, j = 1, . . . , s, for (normalized) collocation
nodes 0 ≤ c1 < c2 < · · · < cs ≤ 1.

In each of its d components, the derivative ẏh is a polynomial of degree s − 1: ẏh ∈ (Ps−1)^d. Hence, it
has the following representation, compare (5.2.13):

ẏh(t0 + τh) = ∑_{j=1}^{s} ẏh(t0 + cj h) Lj(τ) .        (12.3.10)

This yields the following formulas for the computation of y1, which characterize the s-stage collocation
single step method induced by the (normalized) collocation points cj ∈ [0, 1], j = 1, . . . , s:

ki = f(t0 + ci h, y0 + h ∑_{j=1}^{s} aij kj) ,   aij := ∫_0^{ci} Lj(τ) dτ ,
y1 := yh(t1) = y0 + h ∑_{i=1}^{s} bi ki ,        bi := ∫_0^{1} Li(τ) dτ .        (12.3.11)
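The coefficients aij and bi in (12.3.11) depend only on the nodes cj. A small EIGEN sketch that computes them by determining the monomial coefficients of each Lagrange polynomial Lj from a Vandermonde system and integrating exactly; this is adequate for the small numbers of stages considered here (the helper name is illustrative):

#include <Eigen/Dense>
#include <cmath>
#include <utility>

// Collocation coefficients a_ij = int_0^{c_i} L_j(t) dt and b_i = int_0^1 L_i(t) dt
// of (12.3.11) for given nodes c_1, ..., c_s in [0,1].
std::pair<Eigen::MatrixXd, Eigen::VectorXd> colloc_coeffs(const Eigen::VectorXd &c) {
  const int s = c.size();
  Eigen::MatrixXd V(s, s);                      // Vandermonde matrix, V(k,m) = c_k^m
  for (int k = 0; k < s; ++k)
    for (int m = 0; m < s; ++m) V(k, m) = std::pow(c(k), m);
  Eigen::MatrixXd A(s, s);
  Eigen::VectorXd b(s);
  for (int j = 0; j < s; ++j) {
    Eigen::VectorXd e = Eigen::VectorXd::Zero(s);
    e(j) = 1.0;                                 // L_j(c_k) = delta_{jk}
    Eigen::VectorXd x = V.partialPivLu().solve(e);   // monomial coefficients of L_j
    b(j) = 0.0;
    for (int m = 0; m < s; ++m) b(j) += x(m) / (m + 1);
    for (int i = 0; i < s; ++i) {
      A(i, j) = 0.0;
      for (int m = 0; m < s; ++m) A(i, j) += x(m) * std::pow(c(i), m + 1) / (m + 1);
    }
  }
  return std::make_pair(A, b);
}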
Note that, since arbitrary y0 ∈ D, t0 , t1 ∈ I were admitted, this defines a discrete evolution Ψ : I × I ×
D → R d by Ψt0 ,t1 y0 := yh (t1 ).
Note that (12.3.11) represents a generically non-linear system of s · d equations for the s · d components
of the vectors ki , i = 1, . . . , s. Usually, it will not be possible to obtain ki by a fixed number of evaluations
of f. For this reason the single step methods defined by (12.3.11) are called implicit.
With similar arguments as in Rem. 11.2.14 one can prove that for sufficiently small |t1 − t0 | a unique
solution for k1 , . . . , ks can be found.
Clearly, in the case d = 1, f(t, y) = f(t), y0 = 0, the computation of y1 boils down to the evaluation of a
quadrature formula on [t0, t1], because from (12.3.11) we get

y1 = h ∑_{i=1}^{s} bi f(t0 + ci h) ,   bi := ∫_0^{1} Li(τ) dτ ,        (12.3.14)

which is a polynomial quadrature formula (7.2.2) on [0, 1] with nodes cj transformed to [t0, t1] according
to (7.1.5).
We consider the scalar logistic ODE (11.1.6) with parameter λ = 10 (→ only mildly stiff), initial state
y0 = 0.01, T = 1.
Numerical integration by timestepping with uniform timestep h based on collocation single step method
(12.3.11).
➊ Equidistant collocation points cj = j/(s+1), j = 1, . . . , s.

We observe algebraic convergence with the empiric rates
s = 1 : p = 1.96 ,   s = 2 : p = 2.03 ,   s = 3 : p = 4.00 ,   s = 4 : p = 4.04 .

[Fig. 447: maximal error max_k |yh(tk) − y(tk)| vs. h for s = 1, . . . , 4.]

In this case we conclude the following (empiric) order (→ Def. 11.3.21) of the collocation single step
method:

(empiric) order = s for even s ,   s + 1 for odd s .
➋ Gauss points in [0, 1] as normalized collocation points cj, j = 1, . . . , s.

We observe algebraic convergence with the empiric rates
s = 1 : p = 1.96 ,   s = 2 : p = 4.01 ,   s = 3 : p = 6.00 ,   s = 4 : p = 8.02 .

[Fig. 448: maximal error max_k |yh(tk) − y(tk)| vs. h for s = 1, . . . , 4.]

Obviously, for the (empiric) order (→ Def. 11.3.21) of the Gauss collocation single step method it holds that

(empiric) order = 2s .

Note that the 1-stage Gauss collocation single step method is the implicit midpoint method from Section 11.2.3.
What we have observed in Exp. 12.3.15 reflects a fundamental result on collocation single step methods
as defined in (12.3.11).
Theorem 12.3.17. Order of collocation single step method [?, Satz 6.40]
Provided that f ∈ C p ( I × D ), the order (→ Def. 11.3.21) of an s-stage collocation single step
method according to (12.3.11) agrees with the order (→ Def. 7.3.1) of the quadrature formula on
[0, 1] with nodes c j and weights b j , j = 1, . . . , s.
➣ By Thm. 7.3.22 the s-stage Gauss collocation single step method whose nodes c j are chosen as the s
Gauss points on [0, 1] is of order 2s.
The notations in (12.3.11) have deliberately been chosen to allude to Def. 11.4.9: one merely has to let the
sum in the formula for the increments run up to s in order to capture (12.3.11).
Definition 12.3.18. General Runge-Kutta single step method (cf. Def. 11.4.9)
For coefficients bi, aij ∈ R and ci := ∑_{j=1}^{s} aij, i, j = 1, . . . , s, s ∈ N, the recursion

ki := f(t0 + ci h, y0 + h ∑_{j=1}^{s} aij kj) , i = 1, . . . , s ,       y1 := y0 + h ∑_{i=1}^{s} bi ki ,

defines an s-stage Runge-Kutta single step method (RK-SSM); the ki ∈ R^d are called increments.

Note: the computation of the increments ki may now require the solution of (non-linear) systems of equations of
size s · d (→ "implicit" method, cf. Rem. 12.3.12).
Many of the techniques and much of the theory discussed for explicit RK-SSMs carry over to general
(implicit) Runge-Kutta single step methods:
• Sufficient condition for consistence from Cor. 11.4.12
• Algebraic convergence for meshwidth h → 0 and the related concept of order (→ Def. 11.3.21)
• Embedded methods and algorithms for adaptive stepsize control from Section 11.5
This leads to the equivalent defining equations in "stage form" for an implicit RK-SSM:

gi = h ∑_{j=1}^{s} aij f(t0 + cj h, y0 + gj) ,       y1 = y0 + h ∑_{i=1}^{s} bi f(t0 + ci h, y0 + gi) .        (12.3.23)
We reformulate the increment equations in stage form (12.3.23) as a non-linear system of equations in
standard form F(x) = 0. The unknowns are the s · d components of the stage vectors gi, i = 1, . . . , s, as
defined in (12.3.22). With g := [g1; . . . ; gs] ∈ R^{s·d},

gi = h ∑_{j=1}^{s} aij f(t0 + cj h, y0 + gj)   ⇔   F(g) := g − h (A ⊗ I) [ f(t0 + c1 h, y0 + g1) ; . . . ; f(t0 + cs h, y0 + gs) ] = 0 ,

where I is the d × d identity matrix and ⊗ designates the Kronecker product introduced in Def. 1.4.17.

We compute an approximate solution of F(g) = 0 iteratively by means of the simplified Newton method
presented in Rem. 8.4.39. This is a Newton method with "frozen Jacobian". As g → 0 for h → 0, we
choose zero as the initial guess:

g^(0) := 0 ,   g^(k+1) := g^(k) − D F(0)^{−1} F(g^(k)) , k = 0, 1, 2, . . . .

Obviously, D F(0) → I for h → 0. Thus, D F(0) will be regular for sufficiently small h.

In each step of the simplified Newton method we have to solve a linear system of equations with coefficient
matrix D F(0). If s · d is large, an efficient implementation has to reuse the LU-decomposition of D F(0),
see Code 8.4.40 and Rem. 2.5.10.
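A sketch of one step of an implicit RK-SSM in which the stage equations F(g) = 0 are solved by the simplified Newton method with a frozen Jacobian of the form I − h (A ⊗ Df(y0)), as described above. For brevity the ODE is assumed to be autonomous; the functor interfaces for f and its Jacobian Df are assumptions.

#include <Eigen/Dense>

// One step y0 -> y1 of an s-stage implicit RK-SSM with Butcher data (A, b) for the
// autonomous ODE y' = f(y); the stage form F(g) = 0 is solved by the simplified
// Newton method, reusing the LU decomposition of the frozen Jacobian.
template <class Func, class Jac>
Eigen::VectorXd irk_step(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                         Func &&f, Jac &&Df, const Eigen::VectorXd &y0,
                         double h, int maxit = 10) {
  const int s = A.rows(), d = y0.size();
  const Eigen::MatrixXd J = Df(y0);                     // Jacobian frozen at y0
  Eigen::MatrixXd DF0 = Eigen::MatrixXd::Identity(s * d, s * d);
  for (int i = 0; i < s; ++i)                           // DF0 = I - h*(A kron J)
    for (int j = 0; j < s; ++j)
      DF0.block(i * d, j * d, d, d) -= h * A(i, j) * J;
  const auto lu = DF0.partialPivLu();                   // factorize once, reuse below
  Eigen::VectorXd g = Eigen::VectorXd::Zero(s * d);     // initial guess g = 0
  for (int it = 0; it < maxit; ++it) {
    Eigen::VectorXd fg(s * d), Fg(s * d);
    for (int i = 0; i < s; ++i)
      fg.segment(i * d, d) = f(y0 + g.segment(i * d, d));
    for (int i = 0; i < s; ++i) {
      Eigen::VectorXd acc = Eigen::VectorXd::Zero(d);
      for (int j = 0; j < s; ++j) acc += A(i, j) * fg.segment(j * d, d);
      Fg.segment(i * d, d) = g.segment(i * d, d) - h * acc;   // residual F(g)
    }
    Eigen::VectorXd dg = lu.solve(Fg);
    g -= dg;                                            // simplified Newton update
    if (dg.norm() < 1e-12 * (1.0 + g.norm())) break;    // crude termination criterion
  }
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) y1 += h * b(i) * f(y0 + g.segment(i * d, d));
  return y1;
}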
Model problem analysis for general Runge-Kutta single step methods (→ Def. 12.3.18) runs parallel to
that for explicit RK-methods as elaborated in Section 12.1, § 12.1.13. Familiarity with the techniques and
results of this section is assumed. The reader is asked to recall the concept of stability function from
Thm. 12.1.17, the diagonalization technique from § 12.1.44, and the definition of region of (absolute)
stability from Def. 12.1.51.
S(z) := 1 + z b⊤(I − zA)^{−1} 1 = det(I − zA + z 1 b⊤) / det(I − zA) ,   z := λh ,   1 = [1, . . . , 1]⊤ ∈ R^s
(the stability function).
We determine the Butcher schemes (12.3.20) for simple implicit RK-SSM and apply the formula from
Thm. 12.3.27 to compute their stability functions.
• Implicit Euler method:   Butcher scheme   1 | 1
                                              | 1
  ➣  S(z) = 1/(1 − z) .

• Implicit midpoint method:   Butcher scheme   1/2 | 1/2
                                                   | 1
  ➣  S(z) = (1 + ½z)/(1 − ½z) .

Their regions of stability SΨ as defined in Def. 12.1.51 can easily be found from the respective stability
functions:

[Fig. 449, Fig. 450: SΨ of the implicit Euler method (11.2.13) and of the implicit midpoint method (11.2.18);
the former is the exterior of the disk of radius 1 centred at z = 1, the latter is the left half-plane.]
From the determinant formula for the stability function S(z) we can conclude a generalization of Cor. 12.1.20.
For a consistent (→ Def. 11.3.10) s-stage general Runge-Kutta single step method according to
Def. 12.3.18 the stability function S is a non-constant rational function of the form S(z) = P(z)/Q(z)
with polynomials P ∈ Ps, Q ∈ Ps.
Of course, a rational function z ↦ S(z) can satisfy lim_{|z|→∞} |S(z)| < 1, as we have seen in Ex. 12.3.28.
As a consequence, the region of stability of an implicit RK-SSM need not be bounded.
(12.3.30) A-stability

A general RK-SSM with stability function S applied to the scalar linear IVP ẏ = λy, y(0) = y0 ∈ C,
λ ∈ C, with uniform timestep h > 0 will yield the sequence (yk)_{k=0}^∞ defined by

yk = S(z)^k y0 ,   z = λh .        (12.3.31)

Hence, the next property of an RK-SSM guarantees that the sequence of approximations decays exponentially
whenever the exact solution of the model problem IVP (12.1.5) does so. A Runge-Kutta single step
method is called A-stable, if

C− := {z ∈ C : Re z < 0} ⊂ SΨ .        (SΨ ≙ region of stability, Def. 12.1.51)
From Ex. 12.3.28 we conclude that both the implicit Euler method and the implicit midpoint method are
A-stable.
A-stable Runge-Kutta single step methods will not be affected by stability induced timestep constraints
when applied to stiff IVP (→ Notion 12.2.9).
In order to reproduce the qualitative behavior of the exact solution, a single step method when applied to
the scalar linear IVP ẏ = λy, y(0) = y0 ∈ C, λ ∈ C, with uniform timestep h > 0,
Regions of stability of Gauss collocation single step methods, see Exp. 12.3.15:
[Figs. 451–453: regions of stability of Gauss collocation single step methods, shown as level lines of
|S(z)| in the complex plane (Re z horizontal, Im z vertical).]
Theorem 12.3.35. Region of stability of Gauss collocation single step methods [?, Satz 6.44]
The s-stage Gauss collocation single step methods defined by (12.3.11), with the nodes cj given by the
s Gauss points on [0, 1], feature the "ideal" stability domain:

SΨ = C− .        (12.3.34)
whose solution essentially is the smooth function t 7→ sin(2πt). Applying the criteria (12.2.15) and
(12.2.16) we immediately see that this IVP is extremely stiff.
We solve it with different implicit RK-SSMs on [0, 1] with the large uniform timestep h = 1/20.
[Fig. 454: exact solution y(t) and numerical solutions computed with the implicit Euler method
("Impliziter Euler") and Gauss collocation RK-SSMs ("Kollokations RK-ESV") with s = 1, . . . , 4 stages.
Fig. 455: Re(S(z)) for z ∈ [−1000, 0] for the implicit Euler method and the Gauss collocation RK-SSMs
with s = 1, . . . , 4, compared with exp(z).]
We observe that Gauss collocation RK-SSMs incur a huge discretization error, whereas the simple implicit
Euler method provides a perfect approximation!
The stability functions of the Gauss collocation RK-SSMs satisfy
\[ \lim_{|z|\to\infty} |S(z)| = 1\,. \]
Hence, when they are applied to ẏ = λy with λ < 0 of extremely large modulus, they produce sequences that decay only very slowly or even oscillate, which misses the very rapid decay of the exact solution. The stability function of the implicit Euler method, by contrast, is S(z) = (1 − z)^{−1} and satisfies lim_{|z|→∞} S(z) = 0, which means fast exponential decay of the y_k.
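A quick numerical illustration of this difference (using the standard stability functions S(z) = (1 + z/2)/(1 − z/2) of the implicit midpoint method, the 1-stage Gauss collocation SSM, and S(z) = (1 − z)^{−1} of the implicit Euler method): at z = λh = −1000
\[ S_{\mathrm{MP}}(-1000) = \frac{1 - 500}{1 + 500} \approx -0.996\,, \qquad S_{\mathrm{IE}}(-1000) = \frac{1}{1001} \approx 10^{-3}\,, \]
so the midpoint approximations are damped by less than one percent per step (and alternate in sign), while the implicit Euler approximations are annihilated almost immediately.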
(12.3.37) L-stability
In light of what we learned in the previous experiment we can now state what we expect from the stability function of a Runge-Kutta method that is suitable for stiff IVPs (→ Notion 12.2.9).

Consider a Runge-Kutta single step method (→ Def. 12.3.18) described by the Butcher scheme
\[ \begin{array}{c|c} \mathbf{c} & A \\ \hline & \mathbf{b}^{\top} \end{array}\,. \]
Assume that A ∈ ℝ^{s,s} is regular, which can be fulfilled only for an implicit RK-SSM.
For a rational function S(z) = P(z)/Q(z) the limit for |z| → ∞ exists and can easily be expressed through the leading coefficients of the polynomials P and Q. In particular one finds:
\[ \text{If } \mathbf{b}^{\top} = (A)_{s,:} \ (\text{last row of } A) \;\Rightarrow\; S(-\infty) = 0\,. \qquad (12.3.43) \]
A closer look at the coefficient formulas of (12.3.11) reveals that the algebraic condition (12.3.43) will automatically be satisfied for a collocation single step method with c_s = 1!
There is a family of s-point quadrature formulas on [0, 1] with a node located at 1 and (maximal) order 2s − 1: the Gauss-Radau formulas. They induce the L-stable Gauss-Radau collocation single step methods of order 2s − 1 according to Thm. 12.3.17.
Butcher schemes of the Gauss-Radau collocation single step methods for s = 1, 2, 3:
\[
\begin{array}{c|c} 1 & 1 \\ \hline & 1 \end{array}
\qquad
\begin{array}{c|cc}
\tfrac{1}{3} & \tfrac{5}{12} & -\tfrac{1}{12} \\
1 & \tfrac{3}{4} & \tfrac{1}{4} \\ \hline
  & \tfrac{3}{4} & \tfrac{1}{4}
\end{array}
\qquad
\begin{array}{c|ccc}
\frac{4-\sqrt{6}}{10} & \frac{88-7\sqrt{6}}{360} & \frac{296-169\sqrt{6}}{1800} & \frac{-2+3\sqrt{6}}{225} \\
\frac{4+\sqrt{6}}{10} & \frac{296+169\sqrt{6}}{1800} & \frac{88+7\sqrt{6}}{360} & \frac{-2-3\sqrt{6}}{225} \\
1 & \frac{16-\sqrt{6}}{36} & \frac{16+\sqrt{6}}{36} & \frac{1}{9} \\ \hline
  & \frac{16-\sqrt{6}}{36} & \frac{16+\sqrt{6}}{36} & \frac{1}{9}
\end{array}
\]
The stability functions of these Gauss-Radau collocation single step methods are of the form
\[ S(z) = \frac{P(z)}{Q(z)}\,, \quad P \in \mathcal{P}_{s-1}\,,\ Q \in \mathcal{P}_s\,, \]
so that indeed S(−∞) = 0.
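For instance, for the 2-stage Gauss-Radau method of the tableau above a routine evaluation of the determinant formula (carried out here for illustration) gives
\[ S(z) = \frac{1 + \tfrac{z}{3}}{1 - \tfrac{2z}{3} + \tfrac{z^2}{6}}\,, \]
with deg P = 1 < deg Q = 2, so that S(z) → 0 as |z| → ∞, confirming L-stability.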
Plots: Re(S(z)) and level lines of |S(z)| (levels 0.4, 0.7, 0.9, 1, 1.1, 1.5; Re on the horizontal, Im on the vertical axis) in the complex plane for the Gauss-Radau collocation single step methods with different numbers of stages.
We compare the sequences generated by 1-stage and 2-stage Gauss collocation and Gauss-Radau
collocation SSMs, respectively (uniform timestep).
Two plots (equidistant mesh, h = 0.016667; t on the horizontal, y on the vertical axis): the exact solution y(t) together with the approximations of the Gauss collocation SSMs with s = 1, 2 (left) and of the Gauss-Radau ("RADAU") collocation SSMs with s = 1, 2 (right).
The second-order Gauss collocation SSM (implicit midpoint method) suffers from spurious oscillations when homing in on the stable stationary state y = 1. The explanation from Exp. 12.3.36 also applies to this example.
The fourth-order Gauss method is already so accurate that potential overshoots when approaching y = 1
are damped fast enough.
Remember that we compute approximate solutions anyway, and the increments are weighted with the stepsize h ≪ 1, see Def. 12.3.18. So there is no point in determining them with high accuracy!

Idea: use only a fixed small number of Newton steps to solve for the k_i, i = 1, …, s.

✦ We consider an initial value problem for the logistic ODE, see Ex. 11.1.5.

One Newton step (8.4.1) applied to the defining equation F(y) := y − hf(y) − y_k = 0 of the implicit Euler method with initial guess y_k yields
\[ y_{k+1} = y_k + \bigl(\mathbf{I} - h\,Df(y_k)\bigr)^{-1} h\,f(y_k)\,. \]
Note: for a linear ODE with f(y) = Ay, A ∈ ℝ^{d,d}, we recover the original implicit Euler method!

Observation: approximate evaluation of the defining equation for y_{k+1} preserves first-order convergence.
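A minimal self-contained sketch of this semi-implicit (linearly implicit) Euler step, applied to the logistic ODE treated as a 1×1 system; this is not one of the lecture codes, and the values λ = 50, y_0 = 0.1, N = 50 below are placeholders chosen only for illustration:

#include <Eigen/Dense>
#include <iostream>
#include <vector>

// Semi-implicit Euler: y_{k+1} = y_k + (I - h*Df(y_k))^{-1} * h * f(y_k),
// i.e. one Newton step for the defining equation of the implicit Euler method
template <class Func, class Jac>
std::vector<Eigen::VectorXd> semiImplicitEuler(Func &&f, Jac &&Df, Eigen::VectorXd y0,
                                               double T, unsigned N) {
  const double h = T / N;
  const long d = y0.size();
  std::vector<Eigen::VectorXd> y{y0};
  for (unsigned k = 0; k < N; ++k) {
    Eigen::MatrixXd M = Eigen::MatrixXd::Identity(d, d) - h * Df(y.back());
    y.push_back(y.back() + M.lu().solve(h * f(y.back())));
  }
  return y;
}

int main() {
  // Logistic ODE y' = lambda*y*(1-y) as a 1D system; large lambda makes it stiff
  const double lambda = 50.0;
  auto f = [lambda](const Eigen::VectorXd &y) {
    return (lambda * y.array() * (1.0 - y.array())).matrix().eval();
  };
  auto Df = [lambda](const Eigen::VectorXd &y) {
    return Eigen::MatrixXd((lambda * (1.0 - 2.0 * y.array())).matrix().asDiagonal());
  };
  Eigen::VectorXd y0(1); y0 << 0.1;
  auto y = semiImplicitEuler(f, Df, y0, 1.0, 50);
  std::cout << "y(1) ~ " << y.back()(0) << std::endl;  // exact solution tends to 1
}

Despite the large λ the sequence approaches the stationary state y = 1 without any stability problems.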
✦ Now: implicit midpoint method (11.2.18), uniform timestep, and approximate computation of y_{k+1} by one Newton step with initial guess y_k.

Linearizing the increment equations of a general implicit RK-SSM around y_0 in the same spirit leads to
\[ k_i = f(y_0) + h\,Df(y_0)\Bigl(\sum_{j=1}^{s} a_{ij}\,k_j\Bigr)\,, \quad i = 1, \dots, s\,. \qquad (12.4.2) \]
The good news is that all results about stability derived from model problem analysis (→ Section 12.1)
remain valid despite linearization of the increment equations:
Linearization does nothing for linear ODEs ➢ stability function (→ Thm. 12.3.27) not affected!
The bad news is that the preservation of the order observed in Ex. 12.4.1 will no longer hold in the general
case.
✦ 2-stage Gauss-Radau collocation RK-SSM ("RADAU"), order = 3, see Ex. 12.3.44.
✦ Increments computed from the linearized equations (12.4.2) ("semi-implicit RADAU").
✦ We monitor the error through err := max_{j=1,…,n} |y_j − y(t_j)|.

Fig. 462: err versus timestep h (doubly logarithmic) for RADAU (s = 2) and semi-implicit RADAU, together with reference slopes O(h^3) and O(h^2).
We have just seen that the simple linearization according to (12.4.2) will degrade the order of implicit
RK-SSMs and leads to a substantial loss of accuracy. This is not an option.
Yet, the idea behind (12.4.2) has been refined. One does not start from a known RK-SSM, but introduces
general coefficients for structurally linear increment equations.
\[
(\mathbf{I} - h\,a_{ii}\,\mathbf{J})\,k_i = f\Bigl(y_0 + h \sum_{j=1}^{i-1} (a_{ij} + d_{ij})\,k_j\Bigr) - h\,\mathbf{J} \sum_{j=1}^{i-1} d_{ij}\,k_j\,, \qquad \mathbf{J} = Df(y_0)\,, \qquad (12.4.6)
\]
\[
y_1 := y_0 + h \sum_{j=1}^{s} b_j\,k_j\,.
\]
Single step methods of this form are known as Rosenbrock-Wanner (ROW) methods.
Then the coefficients aij , dij , and bi are determined from order conditions by solving large non-linear
systems of equations.
In each step s linear systems with coefficient matrices I − haii J have to be solved. For methods used in
practice one often demands that aii = γ for all i = 1, . . . , s. As a consequence, we have to solve s linear
systems with the same coefficient matrix I − hγJ ∈ ℝ^{d,d}, which permits us to reuse LU-factorizations,
see Rem. 2.5.10.
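The following sketch shows how one step of such a ROW method can exploit the common coefficient matrix; it is not a lecture code, the coefficient arrays a, d, b and the demo data in main are placeholders rather than a particular ROW method, and no error control is included:

#include <Eigen/Dense>
#include <iostream>
#include <vector>

// One step of a ROW-type method (12.4.6) with a_ii = gamma for all i:
// a single LU-factorization of I - h*gamma*J is reused for all s stages (Rem. 2.5.10)
template <class Func>
Eigen::VectorXd rowStep(Func &&f, const Eigen::MatrixXd &J, const Eigen::VectorXd &y0,
                        double h, double gamma, const Eigen::MatrixXd &a,
                        const Eigen::MatrixXd &d, const Eigen::VectorXd &b) {
  const int s = b.size();
  const long n = y0.size();
  Eigen::PartialPivLU<Eigen::MatrixXd> lu(Eigen::MatrixXd::Identity(n, n) -
                                          h * gamma * J);  // factor once, O(n^3)
  std::vector<Eigen::VectorXd> k;
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd arg = y0;                         // y0 + h*sum_j (a_ij + d_ij)*k_j
    Eigen::VectorXd corr = Eigen::VectorXd::Zero(n);  // sum_j d_ij*k_j
    for (int j = 0; j < i; ++j) {
      arg += h * (a(i, j) + d(i, j)) * k[j];
      corr += d(i, j) * k[j];
    }
    k.push_back(lu.solve(f(arg) - h * J * corr));     // each stage only O(n^2)
    y1 += h * b(i) * k[i];
  }
  return y1;
}

int main() {
  // Demo with s = 1, gamma = 1, b = [1], a = d = [0]; for f(y) = -y this reduces to
  // the semi-implicit Euler step from Ex. 12.4.1
  Eigen::MatrixXd a0 = Eigen::MatrixXd::Zero(1, 1), d0 = Eigen::MatrixXd::Zero(1, 1);
  Eigen::VectorXd b(1); b << 1.0;
  Eigen::MatrixXd J(1, 1); J << -1.0;
  auto f = [](const Eigen::VectorXd &y) { return Eigen::VectorXd(-y); };
  Eigen::VectorXd y(1); y << 1.0;
  y = rowStep(f, J, y, 0.1, 1.0, a0, d0, b);
  std::cout << "one step: " << y(0) << std::endl;  // 1 - 0.1/1.1 = 0.9091
}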
A ROW method is the basis for the standard integrator that MATLAB offers for stiff problems:
opts = odeset('abstol',atol,'reltol',rtol,'Jacobian',J);
[t,y] = ode23s(odefun,tspan,y0,opts);

ode23s ≙ integrator for stiff IVPs; Ψ ≙ RK-method of order 2, Ψ̃ ≙ RK-method of order 3.
Many relevant ordinary differential equations feature a right hand side function that is the sum of two (or more) terms. Consider an autonomous IVP with a right hand side function that can be split in an additive fashion:
\[ \dot{y} = f(y) + g(y)\,, \qquad y(0) = y_0\,. \]
Let us introduce the evolution operators (→ Def. 11.1.39) for both summands:

(Continuous) evolution maps:  Φ^t_f ↔ ODE ẏ = f(y),  Φ^t_g ↔ ODE ẏ = g(y).

Temporarily we assume that both Φ^t_f and Φ^t_g are available in the form of analytic formulas or highly accurate approximations.
Idea: build single step methods (→ Def. 11.3.5) based on the following discrete evolutions:
\[ \Psi^h = \Phi^h_g \circ \Phi^h_f \qquad \text{(Lie-Trotter splitting)}\,, \qquad (12.5.3) \]
\[ \Psi^h = \Phi^{h/2}_f \circ \Phi^h_g \circ \Phi^{h/2}_f \qquad \text{(Strang splitting)}\,. \qquad (12.5.4) \]
(Figs. 463, 464: composition diagrams for (12.5.3) and (12.5.4), starting from y_0.)
Note that over many timesteps the Strang splitting approach is not more expensive than Lie-Trotter split-
ting, because the actual implementation of (12.5.4) should be done as follows:
\[
\begin{aligned}
y_{1/2} &:= \Phi^{h/2}_f y_0\,, & y_1 &:= \Phi^h_g\, y_{1/2}\,, \\
y_{3/2} &:= \Phi^h_f\, y_1\,, & y_2 &:= \Phi^h_g\, y_{3/2}\,, \\
y_{5/2} &:= \Phi^h_f\, y_2\,, & y_3 &:= \Phi^h_g\, y_{5/2}\,, \\
&\ \ \vdots & &\ \ \vdots
\end{aligned}
\]
because Φ^{h/2}_f ∘ Φ^{h/2}_f = Φ^h_f. This means that a Strang splitting SSM differs from a Lie-Trotter splitting SSM essentially only in the first and the last (half-)step.
We consider the following IVP whose right hand side function is the sum of two functions for which the ODEs can be solved analytically:
\[ \dot{y} = \underbrace{\lambda y (1 - y)}_{=: f(y)} + \underbrace{\sqrt{1 - y^2}}_{=: g(y)}\,, \qquad y(0) = 0\,. \]
\[ \Phi^t_f y = \frac{1}{1 + (y^{-1} - 1)\,e^{-\lambda t}}\,, \quad t > 0\,,\ y \in\ ]0, 1] \qquad \text{(logistic ODE (11.1.6))}, \]
\[ \Phi^t_g y = \begin{cases} \sin\bigl(t + \arcsin(y)\bigr)\,, & \text{if } t + \arcsin(y) < \tfrac{\pi}{2}\,, \\ 1\,, & \text{else,} \end{cases} \qquad t > 0\,,\ y \in [0, 1]\,. \]
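A compact sketch of both splitting schemes for this IVP based on the exact evolutions above (not a lecture code; the number of steps N = 100 is arbitrary, and Φ^t_f is extended by the fixed point 0 so that the initial value y(0) = 0 can be used):

#include <cmath>
#include <iostream>

int main() {
  const double lambda = 1.0, T = 1.0;
  const double PI = std::acos(-1.0);
  // exact evolution of y' = lambda*y*(1-y) (logistic ODE); 0 is a fixed point
  auto Phi_f = [lambda](double t, double y) {
    return (y == 0.0) ? 0.0 : 1.0 / (1.0 + (1.0 / y - 1.0) * std::exp(-lambda * t));
  };
  // exact evolution of y' = sqrt(1 - y^2)
  auto Phi_g = [PI](double t, double y) {
    const double s = t + std::asin(y);
    return (s < PI / 2.0) ? std::sin(s) : 1.0;
  };
  const unsigned N = 100;
  const double h = T / N;
  double yLT = 0.0, yST = 0.0;  // initial value y(0) = 0
  for (unsigned k = 0; k < N; ++k) {
    yLT = Phi_g(h, Phi_f(h, yLT));                    // Lie-Trotter splitting (12.5.3)
    yST = Phi_f(h / 2, Phi_g(h, Phi_f(h / 2, yST)));  // Strang splitting (12.5.4)
  }
  std::cout << "Lie-Trotter: " << yLT << ", Strang: " << yST << std::endl;
}

Halving h and comparing against a reference value at T = 1 should reproduce the first- and second-order convergence reported below.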
Numerical experiment: for T = 1, λ = 1, we compare the two splitting methods on uniform timesteps with a very accurate reference solution computed by

f = @(t,x) lambda*x*(1-x)+sqrt(1-x^2);
options = odeset('reltol',1.0e-10,'abstol',1.0e-12);
[t,yex] = ode45(f,[0,1],y0,options);

Fig. 465: error |y(T) − y_h(T)| at the final time T = 1 versus timestep h (doubly logarithmic) for Lie-Trotter splitting and Strang splitting, with reference slopes O(h) and O(h²).
We observe algebraic convergence of the two splitting methods, of order 1 for (12.5.3) and of order 2 for (12.5.4).

The single step methods defined by (12.5.3) and (12.5.4) are of order (→ Def. 11.3.21) 1 and 2, respectively.
Of course, the assumption that ẏ = f(y) and ẏ = g(y) can be solved exactly will hardly ever be met. However, it should be clear that a "sufficiently accurate" approximation of the evolution maps Φ^h_g and Φ^h_f is all we need.
Again we consider the IVP of Ex. 12.5.5 and inexact splitting methods based on different single step methods for the two ODEs corresponding to the summands.
-2
10 LTS-Eul explicit Euler method (11.2.7) → Ψ hh,g ,
Ψhh, f + Lie-Trotter splitting (12.5.3)
-3
10 SS-Eul explicit Euler method (11.2.7) → Ψ hh,g ,
Ψhh, f + Strang splitting (12.5.4)
|y(T)-y (T)|
-4
10
method (11.2.7) ◦ exact evolution Φ hg ◦
implicit Euler method (11.2.13)
-5
10 LTS-Eul LTS-EMP explicit midpoint method (11.2.18) →
SS-Eul
SS-EuEI
Ψhh,g , Ψhh, f + Lie-Trotter splitting (12.5.3)
-6
LTS-EMP
SS-EMP SS-EMP explicit midpoint method (11.4.7) → Ψ hh,g ,
10
Fig. 466
10
-2 -1
10 Ψhh, f + Strang splitting (12.5.4)
Zeitschrittweite h
☞ The order of splitting methods may be (but need not be) limited by the order of the SSMs used for Φ^h_f, Φ^h_g.
“Splittable” ODEs
Fig. 467: solution y(t) computed by ode45, see Ex. 12.0.1, together with the timestep sizes used (annotation in the plot: "small perturbation"). Fig. 468: solutions (y_k) of the inexact splitting methods LT-Eulex (h = 0.04 and h = 0.02) and ST-MPRexpl (h = 0.05).
Total number of timesteps: ode45: 152; LT-Eulex, h = 0.04: 25; LT-Eulex, h = 0.02: 50; ST-MPRexpl, h = 0.05: 20.
Details of the methods:

LT-Eulex: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → explicit Euler method (11.2.7), combined by Lie-Trotter splitting (12.5.3).
ST-MPRexpl: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → explicit midpoint rule (11.4.7), combined by Strang splitting (12.5.4).
We observe that this splitting scheme can cope well with the stiffness of the problem, because the stiff
term on the right hand side is integrated exactly.
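A sketch of the LT-Eulex scheme along these lines (not a lecture code; the right hand side is assumed to be ẏ = λy(1 − y) + α sin y as suggested by the method descriptions above, and the values of λ, α, and the initial value are placeholders, not those of Ex. 12.0.1):

#include <cmath>
#include <iostream>

int main() {
  // Placeholder parameters, NOT those of Ex. 12.0.1
  const double lambda = 500.0, alpha = 1.0, T = 1.0, h = 0.04;
  // exact evolution of the stiff term y' = lambda*y*(1-y); 0 is a fixed point
  auto Phi_f = [lambda](double t, double y) {
    return (y == 0.0) ? 0.0 : 1.0 / (1.0 + (1.0 / y - 1.0) * std::exp(-lambda * t));
  };
  // explicit Euler step (11.2.7) for the non-stiff term y' = alpha*sin(y)
  auto Psi_g = [alpha](double t, double y) { return y + t * alpha * std::sin(y); };
  double y = 0.01;  // placeholder initial value
  for (double t = 0.0; t < T - h / 2; t += h)
    y = Psi_g(h, Phi_f(h, y));  // Lie-Trotter splitting (12.5.3)
  std::cout << "y(T) ~ " << y << std::endl;
}

No stepsize restriction is triggered by the large λ, because the stiff logistic term is propagated by its exact evolution.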
In the numerical treatment of partial differential equations one commonly encounters ODEs of the form
\[
\dot{y} = f(y) := -\mathbf{A}y + \begin{bmatrix} g(y_1) \\ \vdots \\ g(y_d) \end{bmatrix}, \qquad \mathbf{A} = \mathbf{A}^{\top} \in \mathbb{R}^{d,d} \text{ positive definite (→ Def. 1.1.8)}\,, \qquad (12.5.13)
\]
with state space D = ℝ^d, where λ_min(A) ≈ 1, λ_max(A) ≈ d², and the derivative of g : ℝ → ℝ is bounded. Then IVPs for (12.5.13) will be stiff, since the Jacobian
\[
Df(y) = -\mathbf{A} + \begin{bmatrix} g'(y_1) & & \\ & \ddots & \\ & & g'(y_d) \end{bmatrix} \in \mathbb{R}^{d,d}
\]
will have eigenvalues “close to zero” and others that are large (in modulus) and negative. Hence, D f(y)
will satisfy the criteria (12.2.15) and (12.2.16) for any state y ∈ ℝ^d.
This suggests a splitting approach based on the additive decomposition in (12.5.13); a sketch is given after this list.

• For the linear ODE ẏ = −Ay we have to use an L-stable (→ Def. 12.3.38) single step method, for instance a second-order implicit Runge-Kutta method. Its increments can be obtained by solving a linear system of equations, whose coefficient matrix will be the same for every step, if uniform timesteps are used.

• The ODE ẏ = (g(y_1), …, g(y_d))^⊤ boils down to decoupled scalar ODEs ẏ_j = g(y_j), j = 1, …, d. For them we can use an inexpensive explicit RK-SSM like the explicit trapezoidal method (11.4.6). According to our assumptions on g these ODEs are not haunted by stiffness.
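A minimal sketch of one possible realization of this strategy (not a lecture code): Lie-Trotter splitting (12.5.3) combining the L-stable implicit Euler method for the linear part with the explicit trapezoidal method (11.4.6) for the decoupled scalar ODEs; as a concrete stand-in, A is taken to be a scaled 1D finite-difference Laplacian and g(x) = sin x.

#include <Eigen/Sparse>
#include <Eigen/SparseCholesky>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const int d = 100;
  const double h = 0.01, T = 1.0;
  auto g = [](double x) { return std::sin(x); };  // stand-in for g with bounded g'
  // A: scaled 1D finite-difference Laplacian, s.p.d. with lambda_max ~ d^2
  std::vector<Eigen::Triplet<double>> trip;
  const double s = static_cast<double>(d) * d;
  for (int i = 0; i < d; ++i) {
    trip.emplace_back(i, i, 2.0 * s);
    if (i > 0) trip.emplace_back(i, i - 1, -s);
    if (i < d - 1) trip.emplace_back(i, i + 1, -s);
  }
  Eigen::SparseMatrix<double> A(d, d);
  A.setFromTriplets(trip.begin(), trip.end());
  Eigen::SparseMatrix<double> I(d, d);
  I.setIdentity();
  // Implicit Euler for y' = -A*y: y <- (I + h*A)^{-1} y; factor I + h*A only once
  Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> solver(
      Eigen::SparseMatrix<double>(I + h * A));
  Eigen::VectorXd y = Eigen::VectorXd::Constant(d, 0.5);
  for (double t = 0.0; t < T - h / 2; t += h) {
    y = solver.solve(y);           // L-stable step for the stiff linear part
    for (int j = 0; j < d; ++j) {  // explicit trapezoidal (11.4.6) for y_j' = g(y_j)
      const double k1 = g(y[j]), k2 = g(y[j] + h * k1);
      y[j] += 0.5 * h * (k1 + k2);
    }
  }
  std::cout << "y_0(T) ~ " << y[0] << std::endl;
}

Only one sparse Cholesky factorization is needed for the whole integration, while the nonlinear part costs a few function evaluations per component and step.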