Numerical Methods for Computational Science and Engineering
Important links:
• Lecture Git repository: https://gitlab.math.ethz.ch/NumCSE/NumCSE.git
(Clone this repository to get access to most of the C++ codes in the lecture document and
homework problems. ➙ Git guide)
• Lecture recording: http://www.video.ethz.ch/lectures/d-math/2016/autumn/401-0663-00L.html
• Tablet notes: http://www.sam.math.ethz.ch/~grsam/HS16/NumCSE/NCSE16_Notes/
• Homework problems: https://people.math.ethz.ch/~grsam/HS16/NumCSE/NCSEProblems.pdf
Contents
0 Introduction
0.0.1 Focus of this course
0.0.2 Goals
0.0.3 To avoid misunderstandings
0.0.4 Reporting errors
0.0.5 Literature
0.1 Specific information
0.1.1 Assistants and exercise classes
0.1.2 Study center
0.1.3 Assignments
0.1.4 Information on examinations
0.2 Programming in C++11
0.2.1 Function Arguments and Overloading
0.2.2 Templates
0.2.3 Function Objects and Lambda Functions
0.2.4 Multiple Return Values
0.2.5 A Vector Class
0.3 Creating Plots with MathGL
0.3.1 MathGL Documentation (by J. Gacon)
0.3.2 MathGL Installation
0.3.3 Corresponding Plotting Functions of MATLAB and MathGL
0.3.4 The Figure Class
0.3.4.1 Introductory example
0.3.4.2 Figure Methods
9 Eigenvalues
9.1 Theory of eigenvalue problems
9.2 “Direct” Eigensolvers
9.3 Power Methods
9.3.1 Direct power method
9.3.2 Inverse Iteration [?, Sect. 7.6], [?, Sect. 5.3.2]
9.3.3 Preconditioned inverse iteration (PINVIT)
9.3.4 Subspace iterations
9.3.4.1 Orthogonalization
9.3.4.2 Ritz projection
9.4 Krylov Subspace Methods
Index
Symbols
Examples
Glossary
Chapter 0
Introduction
✄ on (efficient and stable) implementation in C++ based on the numerical linear algebra library Eigen (a Domain Specific Language embedded into C++)
• issues of high-performance computing (HPC, shared and distributed memory parallelisation, vectorization)
☞ 401-0686-10L High Performance Computing for Science and Engineering (HPCSE, Profs. M. Troyer
and P. Koumoutsakos)
263-2800-00L Design of Parallel and High-Performance Computing (Prof. T. Höfler)
However, note that these other courses partly rely on knowledge of elementary numerical methods, which
is covered in this course.
(0.0.2) Prerequisites
This course will take for granted basic knowledge of linear algebra, calculus, and programming, that you
should have acquired during your first year at ETH.
Fig. 1 (sketch): the topics of this course (eigenvalue problems, numerical integration of ODEs, linear systems of equations, least squares problems, interpolation, quadrature) resting on the foundations of analysis, linear algebra, and programming in C++.
They are vastly different in terms of ideas, design, analysis, and scope of application. They are the items in a toolbox, some only loosely related by the common purpose of being building blocks for codes for numerical simulation.
Despite the diverse nature of the individual topics covered in this course, some depend on others for
providing essential building blocks. The following directed graph tries to capture these relationships. The
arrows have to be read as “uses results or algorithms of”.
0. Introduction, 0. Introduction 9
NumCSE, AT’15, Prof. Ralf Hiptmair c SAM, ETH Zurich, 2015
[Dependency graph among the main topics: quadrature (∫ f(x) dx, Chapter 7), eigenvalues (Ax = λx, Chapter 9), Krylov methods (Chapter 10), least squares (‖Ax − b‖ → min, Chapter 3), function approximation (Chapter 6), non-linear least squares (‖F(x)‖ → min, Section 8.6).]
Any one-semester course “Numerical methods for CSE” will cover only selected chapters and sec-
tions of this document. Only topics addressed in class or in homework problems will be relevant
for the final exam!
I am a student of computer science. After the exam, may I safely forget everything I have learned in this
mandatory “numerical methods” course? No, because it is highly likely that other courses or projects
will rely on the contents of this course:
[Sketch: follow-up fields and the topics from this course they rely on, among them: singular value decomposition and least squares (computational statistics, machine learning); function approximation, numerical quadrature, numerical integration, interpolation (numerical methods for PDEs); least squares (computer graphics); eigensolvers and sparse linear systems (graph theoretic algorithms).]
Hardly anyone will need everything covered in this course, but most of you will need something.
0.0.2 Goals
These course materials are neither a textbook nor comprehensive lecture notes.
They are meant to be supplemented by explanations given in class.
✦ the lecture material is not designed to be self-contained, but is to be studied beside attending the
course or watching the course videos,
✦ this document is not meant for mere reading, but for working with,
✦ turn pages all the time and follow the numerous cross-references,
✦ study the relevant section of the course material when doing homework problems,
✦ study referenced literature to refresh prerequisite knowledge and for alternative presentation of the
material (from a different angle, maybe), but be careful about not getting confused or distracted by
information overload.
• understand another third when making a serious effort to solve the homework problems,
• hopefully understand the remaining third when studying for the examination after the end of
the course.
As the documents will always be in a state of flux, they will inevitably and invariably teem with small errors,
mainly typos and omissions.
Please report errors in the lecture material through the Course Wiki!
When reporting an error, please specify the section and the number of the paragraph, remark, equation,
etc. where it hides. You need not give a page number.
0.0.5 Literature
Parts of the following textbooks may be used as supplementary reading for this course. References to
relevant sections will be provided in the course material.
✦ [?] U. Ascher and C. Greif, A First Course in Numerical Methods, SIAM, Philadelphia, 2011.
✦ [?] W. Dahmen and A. Reusken, Numerik für Ingenieure und Naturwissenschaftler, Springer, Heidelberg, 2006.
Good reference for large parts of this course; provides a lot of simple examples and lucid explana-
tions, but also rigorous mathematical treatment.
(Target audience: undergraduate students in science and engineering)
Available for download at PDF
✦ [?] M. Hanke-Bourgeois, Grundlagen der Numerischen Mathematik und des Wissenschaftlichen Rechnens, Mathematische Leitfäden, B.G. Teubner, Stuttgart, 2002.
Gives detailed description and mathematical analysis of algorithms and relies on MATLAB. Profound
treatment of theory way beyond the scope of this course. (Target audience: undergraduates in
mathematics)
Classical introductory numerical analysis text with many examples and detailed discussion of algo-
rithms. (Target audience: undergraduates in mathematics and engineering)
Can be obtained from website.
Modern discussion of numerical methods with profound treatment of theoretical aspects (Target
audience: undergraduate students in mathematics).
✦ [?] W. Gander, M.J. Gander, and F. Kwok, Scientific Computing, Texts in Computational Science and Engineering, Springer, 2014.
Essential prerequisite for this course is a solid knowledge in linear algebra and calculus. Familiarity with
the topics covered in the first semester courses is taken for granted, see
✦ [?] K. Nipp and D. Stoffer, Lineare Algebra, vdf Hochschulverlag, Zürich, 5 ed., 2002.
✦ [?] M. Gutknecht, Lineare Algebra, lecture notes, SAM, ETH Zürich, 2009, available online.
✦ [?] M. Struwe, Analysis für Informatiker, lecture notes, ETH Zürich, 2009, available online.
Though the assistants' email addresses are provided above, their use should be restricted to cases of emergency:
In general refrain from sending email messages to the lecturer or the assistants. They will not
be answered!
Questions should be asked in class (in public or during the break in private), in the tutorials, or
in the study center hours.
0.1.3 Assignments
A steady and persistent effort spent on homework problems is essential for success in this course.
You should expect to spend 4-6 hours per week on trying to solve the homework problems. Since many
involve small coding projects, the time it will take an individual student to arrive at a solution is hard to
predict.
✦ The weekly assignments will be a few problems from the NCSE Problem Collection available online
as PDF. The particular problems to be solved will be communicated on Friday every week.
Please note that this problem collection is being compiled during this semester. Thus, make sure
that you obtain the most current version every week.
✦ Some or all of the problems of an assignment sheet will be discussed in the tutorial classes on
Monday 10 days after the problems have been assigned.
✦ A few problems on each sheet will be marked as core problems. Every participant of the course is
strongly advised to try and solve at least the core problems.
✦ If you want your tutor to examine your solution of the current problem sheet, please put it into the
plexiglass trays in front of HG G 53/54 by the Thursday after the publication. You should submit your
codes using the online submission interface. This is voluntary, but feedback on your performance
on homework problems can be important.
✦ Please clearly mark the homework problems that you want your tutor to inspect.
✦ You are encouraged to hand-in incomplete and wrong solutions, you can receive valuable feedback
even on incomplete attempts.
C++ codes for both the classroom and homework problems are made available through a git repository
also accessible through Gitlab (Link):
The Gitlab toplevel page gives a short introduction into the repository for the course and provides a link to
online sources of information about Git.
Download is possible via Git or as a zip archive. Which method you choose is up to you, but note that updating via Git is more convenient.
➣ Shell command to download the git repository:
> git clone https://gitlab.math.ethz.ch/NumCSE/NumCSE.git
Updating the repository to fetch upstream changes is then possible by executing > git pull inside the
NumCSE folder.
Note that by default participants of the course will have read access only. However, if you want to contribute corrections and enhancements of lecture or homework codes, you are invited to submit a merge request. Beforehand you have to inform your tutor so that a personal Gitlab account can be set up for you.
For instructions on how to compile assignments or lecture codes see the README file.
Dates:
The term exams are regarded as central elements and as such are graded on a pass/fail basis.
Admission to the main exam is conditional on passing at least one term exam
Only students who could not take part in one of the term exams for cogent reasons like illness (doc-
tor’s certificate required!) may take part in the make-up term exam. Please contact Daniele Casati
([email protected]) by email, if you think that you are eligible for the make-up term exam,
and attach all required documentation. You will be informed, whether you are admitted.
Only students who have failed both term exams can take part in the repetition term exam in spring next
year. This is their only chance to be admitted to the main exam in Summer 2017.
Thursday January 26, 2017, 9:00 - 12:00, HG G 1
✦ Dry-run for computer based examination:
TBA, registration via course website
✦ Subjects of examination:
• All topics, which have been addressed in class or in a homework problem (including the home-
work problems not labelled as “core problems”)
✦ Lecture documents will be available as PDF during the examination. The corresponding final version
of the lecture documents will be made available on TBA
✦ You may bring a summary of up to 10 pages A4 in your own handwriting. No printouts and copies
are allowed.
• Everybody who passed at least one of the term exams, the make-up term exam, or the repetition
term exam for last year’s course and wants to repeat the main exam, will be allowed to do so.
• Bonus points earned in term exams in last year’s course can be taken into account for this course’s
main exam.
• If you are going to repeat the main exam, but also want to earn a bonus through this year’s term
exams, please declare this intention before the mid-term exam.
C++11 is the current ANSI/ISO standard for the programming language C++. On the one hand, it offers a wealth of features and possibilities. On the other hand, this can be confusing and even prone to inconsistencies. A major cause of inconsistent design is the requirement of backward compatibility with the C programming language and the earlier standard C++98.
However, C++ has become the main language in computational science and engineering and high per-
formance computing. Therefore this course relies on C++ to discuss the implementation of numerical
methods.
• a collection of abstract data containers and basic algorithms provided by the Standard Template Library (STL).
Supplementary reading. A popular book for learning C++ that has been upgraded to include the features of the new C++11 standard is [?].
The book [?] gives a comprehensive presentation of the new features of C++11 compared to earlier
versions of C++.
There are plenty of online reference pages for C++, for instance http://en.cppreference.com
and http://www.cplusplus.com/.
• We use the command line build tool CMake, see web page.
• The compilers supporting all features of C++ needed for this course are Clang and GCC. Both are open source projects and free. CMake will automatically select a suitable compiler on your system (Linux or Mac OS X).
• A command line tool for debugging is lldb, see short introduction by Till Ehrengruber, student of
CSE@ETH.
The following sections highlight a few particular aspects of C++11 that may be important for code devel-
opment in this course.
Argument types are an integral part of a function declaration in C++. Hence the following functions are
different
int* f(int);             // use this in the case of a single numeric argument
double f(int*);          // use only if a pointer to an integer is given
void f(const MyClass &); // use when called for a MyClass object
and the compiler selects the function to be used depending on the type of the arguments following rather
sophisticated rules, refer to overload resolution rules. Complications arise, because implicit type conver-
sions have to be taken into account. In case of ambiguity a compile-time error will be triggered. Functions
cannot be distinguished by return type!
For member functions (methods) of classes an additional distinction can be introduced by the const spec-
ifier:
struct MyClass {
  double f(double);       // use for a mutable object of type MyClass
  double f(double) const; // use this version for a constant object
  ...
};
The second version of the method f is invoked for constant objects of type MyClass.
In C++ unary and binary operators like =, ==, +, -, *, /, +=, -=, *=, /=, %, &&, ||, etc. are regarded
as functions with a fixed number of arguments (one or two). For built-in numeric and logic types they are
defined already. They can be extended to any other type, for instance
MyClass operator+(const MyClass &, const MyClass &);
MyClass operator+(const MyClass &, double);
MyClass operator+(const MyClass &); // unary + !
The same selection rules as for function overloading apply. Of course, operators can also be introduced
as class member functions.
C++ gives complete freedom to overload operators. However, the semantics of the new operators should
be close to the customary use of the operator.
If the argument is declared to be passed by value, a temporary copy of the argument is created through the copy constructor or the move constructor of MyClass when f is invoked. The new temporary object is a local variable inside the function body. If the argument is passed by (non-const) reference instead, the argument is passed to the scope of the function and can be changed inside the function; no copies are created. If one wants to avoid the creation of temporary objects, which may be costly, but also wants to indicate that the argument will not be modified inside f, then the declaration should read
void f(const MyClass &x); // Argument x passed by constant reference.
In the pass-by-value case, if the object passed as the argument is a temporary whose scope is about to end, or if std::move() tags it as disposable, the move constructor of MyClass is invoked, which will usually do a shallow copy only. Refer to Code 0.2.22 for an example.
0.2.2 Templates
The template mechanism supports parameterization of definitions of classes and functions by type. An example of a function template is
template <typename ScalarType, typename VectorType>
VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y)
{ return (alpha*x+y); }
Depending on the concrete types of the arguments the compiler will instantiate particular versions of this function, for instance saxpy<float,double>, when alpha is of type float and both x and y are of type double. In this case the return type will be double (the deduced VectorType).
For the above example the compiler will be able to deduce the types ScalarType and VectorType
from the arguments. The programmer can also specify the types directly through the < >-syntax as in
saxpy<double, double>(a, x, y);
if an instantiation for all arguments of type double is desired. In case the arguments do not supply enough information about the type parameters, specifying (some of) them through < > is mandatory.
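As a quick, self-contained illustration (not taken from the lecture codes), the function template can be exercised with std::valarray<double>, which happens to support the required operations alpha*x+y:

  #include <iostream>
  #include <valarray>

  template <typename ScalarType, typename VectorType>
  VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y)
  { return (alpha*x+y); }

  int main() {
    std::valarray<double> x{1.0, 2.0, 3.0}, y{1.0, 1.0, 1.0};
    // automatic type deduction: ScalarType = double, VectorType = std::valarray<double>
    std::valarray<double> z = saxpy(2.0, x, y);
    // explicit instantiation through the < >-syntax
    std::valarray<double> w = saxpy<double, std::valarray<double>>(2.0, x, y);
    for (double v : z) std::cout << v << ' ';  // prints: 3 5 7
    std::cout << std::endl;
    return 0;
  }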
A class template defines a class depending on one or more type parameters, for instance
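(The class template used in the original listing is not reproduced in this extract; the following is a hypothetical sketch that is merely consistent with the usage shown below, all member names being assumptions.)

  template <typename T>
  class MyClsTempl {
  public:
    MyClsTempl(void) {}                  // default constructor
    MyClsTempl(const T &t) : data(t) {}  // constructor taking an object of type T
    // member function template: the type U is deduced from the call arguments
    template <typename U>
    T memfn(const T &t, const U &u) const { return data; }
  private:
    T data;                              // data member of the parameter type
  };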
Types MyClsTempl<T> for a concrete choice of T are instantiated when a corresponding object is de-
clared, for instance via
double x = 3.14;
MyClass myobj;                        // default construction of an object
MyClsTempl<double> tinstd;            // instantiation for T = double
MyClsTempl<MyClass> mytinst(myobj);   // instantiation for T = MyClass
MyClass ret = mytinst.memfn(myobj,x); // instantiation of member function for
                                      // U = double, automatic type deduction
The types spawned by a template for different parameter types have nothing to do with each other.
The parameter types for a template have to provide all type definitions, member functions, operators, and data needed for the instantiation (“compilation”) of the class or function template.
A function object is an object of a type that provides an overloaded “function call” operator (). Function objects can be implemented in two different ways:
(I) through special classes like the following, which realizes a function R → R
class MyFun {
public:
  ...
  double operator() (double x) const; // evaluation operator
  ...
};
The evaluation operator can take more than one argument and need not be declared const.
(II) through lambda functions, anonymous function objects defined in place with the syntax
[<capture list>] (<arguments>) -> <return type> { body; }
where <capture list> is a list of variables from the local scope to be passed to the lambda function; an & indicates passing by reference,
<arguments> is a comma separated list of function arguments complete with types,
<return type> is an optional return type; often the compiler will be able to deduce the return type from the definition of the function.
Function classes should be used when the function is needed in different places, whereas lambda functions are preferable for short functions intended for single use.
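The listing referred to in the next paragraph is not included in this extract; a minimal sketch of the same idea (summing up the entries of a vector through a lambda function that captures a local variable by reference) could read:

  #include <algorithm>
  #include <iostream>
  #include <vector>

  int main() {
    std::vector<double> v{1.2, 2.3, 3.4};
    double sum = 0.0;
    // the lambda captures the local variable 'sum' by reference ([&sum]) and
    // can therefore change its value in the surrounding scope
    std::for_each(v.begin(), v.end(), [&sum](double x) { sum += x; });
    std::cout << "sum = " << sum << std::endl; // prints: sum = 6.9
    return 0;
  }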
In this code the lambda function captures the local variable sum by reference, which enables the lambda
function to change its value in the surrounding scope.
The special class std::function provides types for general polymorphic function wrappers.
std::function<return type(arg types)>
void stdfunctiontest(void) {
  // Vector of function objects with a particular signature
  std::vector<std::function<double(double, double)>> fnvec;
  // Store reference to a regular function
  fnvec.push_back(binop);
  // Store a lambda function
  fnvec.push_back([](double x, double y) -> double { return y / x; });
  for (auto fn : fnvec) { std::cout << fn(3, 2) << std::endl; }
}
In C++ this is possible by using the tuple utility. For instance, the following function computes the minimal and maximal element of a vector and also returns its cumulative sum. It returns all these values.
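The function itself is not reproduced in this extract; a sketch of such a function (the name extcumsum and the exact return type are assumptions) might be:

  #include <algorithm>
  #include <numeric>
  #include <tuple>
  #include <vector>

  // returns minimum, maximum, and the vector of cumulative (partial) sums
  // of the entries of a vector, packed into a single tuple
  std::tuple<double, double, std::vector<double>>
  extcumsum(const std::vector<double> &v) {
    auto mm = std::minmax_element(v.begin(), v.end()); // iterators to min/max
    std::vector<double> cs(v.size());
    std::partial_sum(v.begin(), v.end(), cs.begin());  // v[0], v[0]+v[1], ...
    return std::tuple<double, double, std::vector<double>>(*mm.first, *mm.second, cs);
  }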
This code snippet shows how to extract the individual components of the tuple returned by the previous
function.
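That snippet is not part of this extract either; assuming the sketch of extcumsum given above, the extraction could look like this:

  #include <iostream>
  #include <tuple>
  #include <vector>

  void tupledemo(const std::vector<double> &v) {
    double minv, maxv;        // variables receiving the first two components
    std::vector<double> cs;   // receives the vector of partial sums
    // std::tie builds a tuple of references that is assigned the returned tuple
    std::tie(minv, maxv, cs) = extcumsum(v);
    std::cout << "min = " << minv << ", max = " << maxv << std::endl;
    // alternatively, single components can be read off with std::get
    auto t = extcumsum(v);
    std::cout << "total sum = " << std::get<2>(t).back() << std::endl;
  }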
Be careful: many temporary objects might be created! A demonstration of this hidden cost is given in
Exp. 0.2.39.
Since C++ is an object oriented programming language, datatypes defined by classes play a pivotal role in
every C++ program. Here, we demonstrate the main ingredients of a class definition and other important
facilities of C++ for the class MyVector meant for objects representing vectors from R n . The codes can
be found in ➺ GITLAB.
55   // Euclidean norm
56   double norm(void) const;
57   // Euclidean inner product
58   double operator*(const MyVector &) const;
59   // Output operator
60   friend std::ostream &
61   operator<<(std::ostream &, const MyVector &mv);
62
Note the use of a public static data member dbg in Line 63 that can be used to control debugging output
by setting MyVector::dbg = true or MyVector::dbg = false.
The class MyVector uses a C-style array and dynamic memory management with new and delete to
store the vector components. This is for demonstration purposes only and not recommended.
Arrays in C++
In C++ use the STL container std::vector<T> for storing data in contiguous memory locations.
C++11 code 0.2.17: Constructor for constant vector, also default constructor, see Line 28
MyVector::MyVector(std::size_t _n, double _a) : n(_n), data(nullptr) {
  if (dbg) cout << "{Constructor MyVector(" << _n << ") called" << '}' << endl;
  if (n > 0) { data = new double[n]; std::fill_n(data, n, _a); }
}
This constructor can also serve as default constructor (a constructor that can be invoked without any
argument), because defaults are supplied for all its arguments.
The following two constructors initialize a vector from sequential containers according to the conventions
of the STL.
C++11 code 0.2.18: Templated constructors copying vector entries from an STL container
1 template <typename Container>
2 MyVector::MyVector(const Container &v) : n(v.size()), data(nullptr) {
3   if (dbg) cout << "{MyVector(length " << n
4                 << ") constructed from container" << '}' << endl;
5   if (n > 0) {
6     double *tmp = (data = new double[n]);
7     for (auto i : v) *tmp++ = i; // foreach loop
8   }
9 }
Note the use of the new C++11 facility of a “foreach loop” iterating through a container in Line 7.
C++11 code 0.2.19: Constructor initializing vector from STL iterator range
template <typename Iterator>
MyVector::MyVector(Iterator first, Iterator last) : n(0), data(nullptr) {
  n = std::distance(first, last);
  if (dbg) cout << "{MyVector(length " << n
                << ") constructed from range" << '}' << endl;
  if (n > 0) {
    data = new double[n];
    std::copy(first, last, data);
  }
}
The copy constructor listed next relies on the STL algorithm std::copy to copy the elements of an
existing object into a newly created object. This takes n operations.
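The copy constructor itself is not contained in this extract; a sketch consistent with the description (a deep copy of all n entries via std::copy; the debug message is an assumption modeled on the traces below) would be:

  MyVector::MyVector(const MyVector &mv) : n(mv.n), data(nullptr) {
    if (dbg) cout << "{Copy construction of MyVector(length " << n << ")}" << endl;
    if (n > 0) {
      data = new double[n];
      std::copy(mv.data, mv.data + n, data); // deep copy, n operations
    }
  }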
An important new feature of C++11 is move semantics which helps avoid expensive copy operations. The
following implementation just performs a shallow copy of pointers and, thus, for large n is much cheaper
than a call to the copy constructor from Code 0.2.21. The source vector is left in an empty vector state.
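The move constructor is likewise not reproduced here; a typical implementation matching the description (shallow copy of the pointer, source left as an empty vector) might read:

  MyVector::MyVector(MyVector &&mv) : n(mv.n), data(mv.data) {
    if (dbg) cout << "{Move construction of MyVector(length " << n << ")}" << endl;
    mv.n = 0;          // leave the source object in an empty vector state
    mv.data = nullptr; // so that its destructor does not free the stolen data
  }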
The following code demonstrates the use of std::move() to mark a vector object as disposable and
allow the compiler the use of the move constructor. The code also uses left multiplication with a scalar,
see Code 0.2.35.
This code produces the following output. We observe that v1 is empty after its data have been “stolen” by
v2.
{MyVector(length 8) constructed from container}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Move construction of MyVector(length 8)}
v1 = [ ]
v2 = [2.4, 4.6, 6.8, 9, 11.2, 13.4, 15.6, 17.8]
v3 = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9]
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
We observe that the object v1 is reset after having been moved to v3.
! Use std::move only for special purposes like above and only if an object has a move constructor. Otherwise a ’move’ will trigger a plain copy operation. In particular, do not use std::move on objects at the end of their scope, e.g., within return statements.
The next operator effects copy assignment of an rvalue MyVector object to an lvalue MyVector. This
involves O(n) operations.
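The listing of this copy assignment operator is not included in this extract; a sketch of a typical implementation could be:

  MyVector &MyVector::operator=(const MyVector &mv) {
    if (this == &mv) return *this;   // nothing to do on self-assignment
    if (n != mv.n) {                 // reallocate only if the sizes differ
      delete[] data;
      n = mv.n;
      data = (n > 0) ? new double[n] : nullptr;
    }
    std::copy(mv.data, mv.data + n, data); // O(n) copy of the entries
    return *this;
  }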
C++11 code 0.2.27: Type conversion operator: copies contents of vector into STL vector
MyVector::operator std::vector<double>() const {
  if (dbg) cout << "{Conversion to std::vector, length = " << n << '}' << endl;
  return std::vector<double>(data, data + n);
}
The bracket operator [] can be used to fetch and set vector components. Note that index range checking
is performed; an exception is thrown for invalid indices. The following code also gives an example of
operator overloading as discussed in § 0.2.3.
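The corresponding listing is missing from this extract; a sketch of the pair of access operators (range checking via std::out_of_range from <stdexcept> is an assumption) might be:

  double &MyVector::operator[](std::size_t i) {
    if (i >= n) throw std::out_of_range("MyVector: index out of range");
    return data[i]; // returns a reference, hence usable as lvalue
  }

  double MyVector::operator[](std::size_t i) const {
    if (i >= n) throw std::out_of_range("MyVector: index out of range");
    return data[i];
  }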
Componentwise direct comparison of vectors. Can be dangerous in numerical codes, cf. Rem. 1.5.36.
The transform method applies a function to every vector component and overwrites it with the value
returned by the function. The function is passed as an object of a type providing a ()-operator that accepts
a single argument convertible to double and returns a value convertible to double.
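The transform member function is not reproduced in this extract; a sketch (to be placed inside the class definition of MyVector) could be:

  template <typename Functor>
  MyVector &transform(Functor &&f) {
    // apply f to every component and overwrite it with the returned value
    for (std::size_t i = 0; i < n; ++i) data[i] = f(data[i]);
    return *this;
  }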
The following code demonstrates the use of the transform method in combination with
1. a function object of the following type
The output is
8 operations, mv transformed = [3.2, 4.3, 5.4, 6.5, 7.6, 8.7, 9.8, 10.9]
8 operations, mv transformed = [5.2, 6.3, 7.4, 8.5, 9.6, 10.7, 11.8, 12.9]
Final vector = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9]
Operator overloading provides the “natural” vector operations in R n both in place and with a new vector
created for the result.
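The corresponding listings are not included in this extract; a sketch of the typical in-place/out-of-place pair for vector addition (error handling via std::length_error is an assumption) might read:

  MyVector &MyVector::operator+=(const MyVector &mv) {
    if (n != mv.n) throw std::length_error("MyVector: size mismatch in +=");
    for (std::size_t i = 0; i < n; ++i) data[i] += mv.data[i]; // in place
    return *this;
  }

  MyVector MyVector::operator+(const MyVector &mv) const {
    MyVector tmp(*this); // new vector created for the result
    tmp += mv;
    return tmp;
  }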
C++11 code 0.2.35: Non-member function for left multiplication with a scalar
MyVector operator*(double alpha, const MyVector &mv) {
  if (MyVector::dbg) cout << "{operator a*, MyVector of length "
                          << mv.n << '}' << endl;
  MyVector tmp(mv); tmp *= alpha;
  return (tmp);
}
Adopting the notation of some linear algebra texts, the operator * has been chosen to designate the Euclidean inner product of two vectors of equal length:
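The member function is declared in the header excerpt above (double operator*(const MyVector &) const); a sketch of its implementation (the error handling is an assumption):

  double MyVector::operator*(const MyVector &mv) const {
    if (n != mv.n) throw std::length_error("MyVector: size mismatch in dot product");
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += data[i] * mv.data[i]; // sum of products
    return s;
  }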
At least for debugging purposes every reasonably complex class should be equipped with output function-
ality.
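The output operator itself is not part of this extract; a sketch matching the bracketed output format seen in the traces above (the exact formatting is an assumption) is:

  std::ostream &operator<<(std::ostream &o, const MyVector &mv) {
    o << "[ ";
    for (std::size_t i = 0; i < mv.n; ++i)
      o << mv.data[i] << ((i + 1 < mv.n) ? ", " : " ");
    return o << "]";
  }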
The following code highlights the use of operator overloading to obtain readable and compact expressions
for vector arithmetic.
We run the code and trace calls. This is printed to the console:
{MyVector(length 8) constructed from container}
{MyVector(length 8) constructed from container}
{dot *, MyVector of length 8}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator -=, MyVector of length 8}
{norm: MyVector of length 8}
{operator /, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
Several temporary objects are created and destroyed and quite a few copy operations take place. The situation would be even worse without move semantics: if we had not supplied a move constructor, a few more copy operations would have been triggered. Even worse, the frequent copying of data runs a high risk of cache misses. This is certainly not an efficient way to do elementary vector operations, though it looks elegant at first glance.
Gram-Schmidt orthonormalization has been taught in linear algebra and its theory will be revisited in
§ 1.5.1. Here we use this simple algorithm from linear algebra to demonstrate the use of the vector class
MyVector defined in Code 0.2.13.
The templated function gramschmidt takes a sequence of vectors stored in a std::vector object. The actual vector type is passed as a template parameter. It has to supply length and norm member functions as well as the in-place arithmetic operations -=, /=, and assignment. Note the use of the highlighted methods of the std::vector class.
template <class Vec>
std::vector<Vec> gramschmidt(const std::vector<Vec> &A, double eps = 1E-14) {
  const int k = A.size();      // no. of vectors to orthogonalize
  const int n = A[0].length(); // length of vectors
  cout << "gramschmidt orthogonalization for " << k << ' ' << n << "-vectors" << endl;
  std::vector<Vec> Q({A[0] / A[0].norm()}); // output vectors
  for (int j = 1; (j < k) && (j < n); ++j) {
    Q.push_back(A[j]);
    for (int l = 0; l < j; ++l) Q.back() -= (A[j] * Q[l]) * Q[l];
    if (Q.back().norm() < eps * A[j].norm()) { // premature termination?
      Q.pop_back(); break;
    }
    Q.back() /= Q.back().norm(); // normalization
  }
  return (Q); // return at end of local scope
}
This driver program calls a function that initializes a sequence of vectors and then orthonormalizes them
by means of the Gram-Schmidt algorithm. Eventually orthonormality of the computed vectors is tested.
Please pay attention to
C++11 code 0.2.44: Initialization of a set of vectors through a functor with two arguments
template <typename Functor>
std::vector<myvec::MyVector>
initvectors(std::size_t n, std::size_t k, Functor &&f) {
  std::vector<MyVector> A{};
  for (int j = 0; j < k; ++j) {
    A.push_back(MyVector(n));
    for (int i = 0; i < n; ++i)
      (A.back())[i] = f(i, j);
  }
  return (A);
}
MathGL is a huge open-source plotting library for scientific graphics. It can be used with many programming languages, in particular also with C++. Mainly we will use the Figure library (implemented by J. Gacon, student of CSE@ETH) introduced below (Section 0.3.4).
However, for some special plots using MathGL directly can be necessary. A full documentation can be found at http://mathgl.sourceforge.net/doc_en/index.html.
First of all note that the MathGL plot commands do not take std::vectors or Eigen vectors as arguments but only mglData.
NOTE: The Figure environment takes care of all this conversion and formatting!
For Eigen::RowVectorXd we must first rearrange the vector, as the data() method returns a pointer to the column-major data.
If you're using Linux the easiest way to install MathGL is via the command line: apt-get install mathgl (Ubuntu/Debian) or dnf install mathgl (Fedora).
MathGL offers a subset of MATLAB's large array of plotting functions. The following tables list corresponding commands/methods.

Plotting in 1-D (MATLAB command → MathGL equivalent):
• axis([0,5,-2,2]) → gr.Ranges(0,5,-2,2). Default in MATLAB: autofit (axis auto); default in MathGL: x=-1:1, y=-1:1. Workaround: gr.Ranges(x.Minimal(), x.Maximal(), y.Minimal(), y.Maximal())
• axis([0,5,-inf,inf]) → gr.Range('x',0,5)
• axis([-inf,inf,-2,2]) → gr.Range('y',-2,2)
• xlabel('x-axis') → gr.Label('x', "x-axis")
• ylabel('y-axis') → gr.Label('y', "y-axis")
• legend('sin(x)', 'x^2') → gr.AddLegend("sin(x)","b"); gr.AddLegend("\\x^2", "g"); gr.Legend()
• legend('exp(x)') → gr.AddLegend("exp(x)","b")
• legend('boxoff') → gr.Legend(1,1,"")
• legend('x','Location','northwest') → gr.AddLegend("x","b"); gr.Legend(0,1)
• legend('cos(x)','Orientation','horizontal') → gr.AddLegend("cos(x)","b"); gr.Legend("#-")
Legend alignment in MathGL is given by a position pair from (0,1) (0.5,1) (1,1) / (0,0.5) (0.5,0.5) (1,0.5) / (0,0) (0.5,0) (1,0); values larger than 1 will give a position outside of the graph. Default is (1,1).
• plot(y) → gr.Plot(y)
• plot(t,y) → gr.Plot(t,y)
• plot(t0,y0,t1,y1) → gr.Plot(t0,y0); gr.Plot(t1,y1)
• plot(t,y,'b+') → gr.Plot(t,y,"b+")
• print('myfig','-depsc') → gr.WriteEPS("myfig.eps")
• print('myfig','-dpng') → gr.WritePNG("myfig.png") (compile with flag -lpng)
• title('Plot title') → gr.Title("Plot title") (title high above plot), or gr.Subplot(1,1,0,"<_") followed by gr.Title("Plot title") (title directly above plot)
Plotting in 2-D (MATLAB command → MathGL equivalent):
• colorbar → gr.Colorbar()
• mesh(Z) → gr.Mesh(Z)
• mesh(X,Y,Z) → gr.Mesh(X,Y,Z)
• surface(Z) → gr.Surf(Z)
• surface(X,Y,Z) → gr.Surf(X,Y,Z)
• pcolor(Z) → gr.Tile(Z)
• pcolor(X,Y,Z) → gr.Tile(X,Y,Z)
• plot3(X,Y,Z) → gr.Plot(X,Y,Z)
Additionally, you have to add gr.Rotate(50,60) before the plot command for MathGL to create a 3-D box, otherwise the result is 2-D.
The Figure library is an interface to MathGL. By taking care of formatting and layout it allows a very simple, fast and easy use of the powerful plotting library.
This library depends on MathGL (and optionally on Eigen), so the installation requires a working version of these dependencies.
This short example code shows how the Figure class can be used.

int main() {
  std::vector<double> x(10), y(10);
  for (int i = 0; i < 10; ++i) {
    x[i] = i; y[i] = std::exp(-0.2 * i) * std::cos(i);
  }
  Eigen::VectorXd u = Eigen::VectorXd::LinSpaced(500, 0, 9),
                  v = (u.array().cos() * (-0.2 * u).array().exp()).matrix();
  mgl::Figure fig;
  fig.plot(x, y, " +r").label("Sample Data");
  fig.plot(u, v, "b").label("Function");
  fig.legend();
  fig.save("plot.eps");
  return 0;
}
Definition:
void grid(const bool &on = true,
          const std::string &gridType = "-",
          const std::string &gridCol = "h")
Restrictions: None.
Examples:
mgl::Figure fig;
fig.plot(x, y);
fig.grid();               // set grid
fig.save("plot.eps");

mgl::Figure fig;
fig.plot(x, y);
fig.grid(true, "!", "h"); // grey (-> h) fine (-> !) mesh
fig.save("plot.eps");
Definition:
void xlabel(const std::string &label,
            const double &pos = 0)
Restrictions: None.
Examples:
mgl::Figure fig;
fig.plot(x, y, "g+"); // 'g+' equals MATLAB '+-g'
fig.xlabel("Linear x axis");
fig.save("plot.eps");

mgl::Figure fig;
fig.xlabel("Logarithmic x axis"); // no restrictions on call order
fig.setlog(true, true);
fig.plot(x, y, "g+");
fig.save("plot.eps");
Definition:
void ylabel(const std::string &label,
            const double &pos = 0)
Restrictions: None.
This method adds a legend to a plot. Legend entries have to be defined by the label method given after
the plot command.
Definition:
void legend(const double &xPos = 1,
            const double &yPos = 1)
Restrictions: None.
Examples:
mgl::Figure fig;
fig.plot(x0, y0).label("My Function");
fig.legend();          // 'activate' legend
fig.save("plot");

mgl::Figure fig;
fig.plot(x0, y0).label("My Function");
fig.legend(0.5, 0.25); // set position to (0.5, 0.25)
fig.save("plot");

mgl::Figure fig;
fig.plot(x0, y0).label("My Function");
fig.save("plot"); // legend won't appear as legend() hasn't been called
Restrictions: All plots will use the latest setlog options or default if none have been set.
Examples:
mgl::Figure fig;
fig.setlog(true, false); // -> semilogx
fig.plot(x0, y0);
fig.setlog(false, true); // -> semilogy
fig.plot(x1, y1);
fig.setlog(true, true);  // -> loglog
fig.plot(x2, y2);
fig.save("plot.eps");    // ATTENTION: all plots will use loglog scale

mgl::Figure fig;
fig.plot(x, y);
fig.save("plot.eps");    // -> default (= linear) scaling
Definition:
template <typename yVector>
void plot(const yVector &y,
          const std::string &style = "")
Restrictions: xVector and yVector must have a size() method, which returns the size of the vec-
tor and a data() method, which returns a pointer to the first element in the vector.
Furthermore x and y must have same length.
Examples:
mgl::Figure fig;
fig.plot(x, y, "g;"); // green and dashed linestyle
fig.save("data.eps");

mgl::Figure fig;
fig.plot(x, y);       // OK - style is optional
fig.save("data.eps");

mgl::Figure fig;
Definition:
template <typename xVector, typename yVector, typename zVector>
void plot3(const xVector &x,
           const yVector &y,
           const zVector &z,
           const std::string &style = "")
Examples:
mgl::Figure fig;
fig.plot3(x, y, z);
fig.save("trajectories.eps");
Definition:
void fplot(const std::string &function,
           const std::string &style = "")
Restrictions: None.
Examples:
mgl::Figure fig;
fig.fplot("(3*x^2 - 4.5/x)*exp(-x/1.3)");
fig.fplot("5*sin(5*x)*exp(-x)", "r").label("5sin(5x)*e^{-x}");
fig.ranges(0.5, 5, -5, 5); // be sure to set ranges for fplot!
fig.save("plot.eps");

mgl::Figure fig;
fig.plot(x, y, "b").label("Benchmark");
fig.fplot("x^2", "k;").label("O(x^2)");
// here we don't set the ranges as it uses the range given by the
// x,y data and we use fplot to draw a reference line O(x^2)
fig.save("runtimes.eps");
Definition:
void ranges(const double &xMin,
            const double &xMax,
            const double &yMin,
            const double &yMax)
Restrictions: xMin < xMax, yMin < yMax and ranges must be > 0 for axis in logarithmic scale.
Examples:
mgl::Figure fig;
fig.ranges(-1, 1, -1, 1);
fig.plot(x, y, "b");

mgl::Figure fig;
fig.plot(x, y, "b");
fig.ranges(0, 2.3, 4, 5); // ranges can be called before or after 'plot'

mgl::Figure fig;
fig.ranges(-1, 1, 0, 5);
fig.setlog(true, true);   // will run but MathGL will throw a warning
fig.plot(x, y, "b");
Saves the graphics currently stored in a figure object to file. The default format is EPS.
Definition:
void save(const std::string &file)
Examples:
mgl::Figure fig;
fig.save("plot.eps"); // OK

mgl::Figure fig;
fig.save("plot");     // OK - will be saved as plot.eps

mgl::Figure fig;
fig.save("plot.png"); // OK - but needs -lpng flag!
Definition:
template <typename Matrix>
MglPlot &spy(const Matrix &A, const std::string &style = "b");
Restrictions: None.
Examples:
mgl::Figure fig;
fig.spy(A); // e.g. A = Eigen::MatrixXd or Eigen::SparseMatrix<double>
Restrictions: The vectors x and y must have the same dimensions and the matrix T must be of dimension N × 3, with N being the number of triangles.
Examples:
mgl::Figure fig;
fig.triplot(T, x, y, "b?"); // '?' enumerates all vertices of the mesh

mgl::Figure fig;
fig.triplot(T, x, y, "bF"); // 'F' will yield a solid background
fig.triplot(T, x, y, "k");  // draw black mesh on top
Definition:
void title(const std::string &text)
Restrictions: None.
Line colors (upper-case letters will give a darker version of the lower-case version):
blue b, green g, red r, cyan c, magenta m, yellow y, gray h, green-blue l, sky-blue n, orange q, green-yellow e, blue-violet u, purple p.
Line styles:
none (empty), solid -, dashed ;, small dashed =, long dashed |, dotted :, dash-dotted j, small dash-dotted i. "None" is used as follows: "r*" gives red stars without any lines.
Line markers (symbol: style character):
plus +, circle o, diamond d, dot ., upward triangle ^, downward triangle v, left triangle <, right triangle >, circled dot #., boxed plus #+, boxed cross #x.
This chapter heavily relies on concepts and techniques from linear algebra as taught in the 1st semester
introductory course. Knowledge of the following topics from linear algebra will be taken for granted and
they should be refreshed in case of gaps:
• Operations involving matrices and vectors [?, Ch. 2]
• Computations with block-structured matrices
• Linear systems of equations: existence and uniqueness of solutions [?, Sects. 1.2, 3.3]
• Gaussian elimination [?, Ch. 2]
• LU-decomposition and its connection with Gaussian elimination [?, Sect. 2.4]
The lowest level of real arithmetic available on computers consists of the elementary operations “+”, “−”, “∗”, “\”, “^”, usually implemented in hardware. The next level comprises computations on finite arrays of real numbers, the elementary linear algebra operations (BLAS). On top of them we build complex algorithms involving iterations and approximations.
Elementary operations in R
Hardly anyone will contemplate implementing elementary operations on binary data formats themselves; similarly, well tested and optimised code libraries should be used for all elementary linear algebra operations in simulation codes. This chapter will introduce you to such libraries and how to use them smartly.
Contents
1.1 Fundamentals
1.1.1 Notations
1.1.2 Classes of matrices
1.1 Fundamentals
1.1.1 Notations
The notations in this course try to adhere to established conventions. Since these may not be universal,
idiosyncrasies cannot be avoided completely. Notations in textbooks may be different, beware!
In this course, K will designate either R (real numbers) or C (complex numbers); complex arithmetic [?,
Sect. 2.5] plays a crucial role in many applications, for instance in signal processing.
K^n =ˆ vector space of column vectors with n components in K.
Unless stated otherwise, in mathematical formulas vector components are indexed from 1!
✎ two notations: x = [x_1 ... x_n]^T → x_i, i = 1, ..., n, and x ∈ K^n → (x)_i, i = 1, ..., n
✦ Selecting sub-vectors:
✎ notation: x = [x_1 ... x_n]^T ➣ (x)_{k:l} = (x_k, ..., x_l)^T, 1 ≤ k ≤ l ≤ n
✦ j-th unit vector: e_j = [0, ..., 1, ..., 0]^T, (e_j)_i = δ_ij, i, j = 1, ..., n.
✎ notation: Kronecker symbol δ_ij := 1, if i = j, δ_ij := 0, if i ≠ j.
The colon (:) range notation is inspired by MATLAB's matrix addressing conventions, see Section 1.2.1. (A)_{k:l,r:s} is a matrix of size (l − k + 1) × (s − r + 1).
✦ Transposed matrix:
A^T = [ a_11 ... a_1m ; ... ; a_n1 ... a_nm ]^T := [ a_11 ... a_n1 ; ... ; a_1m ... a_nm ] ∈ K^{m,n}
Most matrices occurring in mathematical modelling have a special structure. This section presents a few
of these. More will come up throughout the remainder of this chapter; see also [?, Sect. 4.3].
The creation of special matrices can usually be done by special commands or functions in the various
languages or libraries dedicated to numerical linear algebra, see § 1.2.5, § 1.2.13.
A little terminology to quickly refer to matrices whose non-zero entries occupy special locations:
Definition 1.1.8. Symmetric positive definite (s.p.d.) matrices → [?, Def. 3.31], [?, Def. 1.22]
M ∈ K^{n,n}, n ∈ N, is symmetric (Hermitian) positive definite (s.p.d.), if
M = M^H and ∀x ∈ K^n: x^H M x > 0 ⇔ x ≠ 0.
Lemma 1.1.9. Necessary conditions for s.p.d. → [?, Satz 3.33], [?, Prop. 1.18]
To compute the minimum of a C^2-function iteratively by means of Newton's method (→ Sect. 8.4), a linear system of equations with the s.p.d. Hessian as system matrix has to be solved in each step.
The solution of many equations in science and engineering boils down to finding the minimum of some (energy, entropy, etc.) function, which accounts for the prominent role of s.p.d. linear systems in applications.
• We consider two matrices A, B ∈ R n,m , both with at most N ∈ N non-zero entries. What is the
maximal number of non-zero entries of A + B?
Whenever algorithms involve matrices and vectors (in the sense of linear algebra) it is advisable to rely on
suitable code libraries or numerical programming environments.
1.2.1 MATLAB
Many textbooks, for instance [?] and [?], rely on MATLAB to demonstrate the actual implementation of numerical algorithms. So did earlier versions of this course. The current version has dumped MATLAB and, hence, this section can be skipped safely.
In its basic form MATLAB is an interpreted scripting language without strict type-binding. This, together with its uniform IDE across many platforms, makes it a very popular tool for rapid prototyping and testing in CSE.
• MATLAB documentation accessible through the Help menu or through this link,
• MATLAB's help facility through the commands help <function> or doc <function>,
• A concise MATLAB primer, one of many available online, see also here
➣ In MATLAB vectors are represented as n × 1-matrices (column vectors) or 1 × n-matrices (row vectors).
Note: The treatment of vectors as special matrices is consistent with the basic operations from matrix
calculus.
☞ v = size(A) yields a row vector v of length 2 with v(1) containing the number of rows and v(2) containing the number of columns of the matrix A.
☞ numel(A) returns the total number of entries of A; if A is a (row or column) vector, we get the length
of A.
Access (rvalue & lvalue) to components of a vector and entries of a matrix in MATLAB is possible through the ()-operator:
☞ r = v(i): retrieve i-th entry of vector v. i must be an integer and smaller or equal numel(v).
☞ r = A(i,j): get matrix entry (A)i,j for two (valid) integer indices i and j.
☞ r = A(end-1,end-2): get matrix entry (A)n−1,m−2 of an n × m-matrix A.
! In case the matrix A is too small to contain an entry (A)i,j , write access to A(i,j) will automatically
trigger a dynamic adjustment of the matrix size to hold the accessed entry. The other new entries are
filled with zeros.
% Caution: matrices are dynamically expanded when
% out of range entries are accessed
M = [1,2,3;4,5,6]; M(4,6) = 1.0; M,

Output:
M =
  1 2 3 0 0 0
  4 5 6 0 0 0
  0 0 0 0 0 0
  0 0 0 0 0 1
For any two (row or column) vectors I, J of positive integers A(I,J) selects the submatrix
[(A)_{i,j}]_{i∈I, j∈J} ∈ K^{♯I,♯J}.
A(I,J) can be used as both r-value and l-value; in the former case the maximal components of I and J have to be smaller than or equal to the corresponding matrix dimensions, lest MATLAB issue the error message Index exceeds matrix dimensions. In the latter case, the size of the matrix is grown, if needed, see § 1.2.2.
Inside square brackets [ ] the following two matrix construction operators can be used:
• ,-operator =
ˆ adding another matrix to the right (horizontal concatenation)
• ;-operator =
ˆ adding another matrix at the bottom (vertical concatenation)
(The ,-operator binds more strongly than the ;-operator!)
! Matrices joined by the ,-operator must have the same number of rows.
Matrices concatenated vertically must have the same number of columns
✄ Filling a small 3 × 2 matrix with rows [1 2], [3 4], [5 6]:
A = [1,2;3,4;5,6];
or, column by column,
A = [[1;3;5],[2;4;6]];
✄ Initialization of vectors in MATLAB:
column vectors x = [1;2;3];
row vectors y = [1,2,3];
✄ Building a matrix from blocks:

% MATLAB script demonstrating the construction of a matrix from blocks
A = [1,2;3,4]; B = [5,6;7,8];
C = [A,B;-B,A], % use concatenation

Output:
C =
   1  2  5  6
   3  4  7  8
  -5 -6  1  2
  -7 -8  3  4
In MATLAB v = (a:s:b), where a, b, s are real numbers, creates a row vector as initialised by the following code:
i f ((b >= a) && (s > 0))
v = [a]; w h i l e (v( end )+s <= b), v = [v,v( end )+s]; end
Examples:
>> v = (3:-0.5:-0.3)
v = 3.0000 2.5000 2.0000 1.5000 1.0000 0.5000 0
>> v = (1:2.5:-13)
v = Empty matrix: 1-by-0
In general we could also pass a matrix as “loop index vector”. In this case the loop variable will run through the columns of the matrix:
% MATLAB loop over the columns of a matrix
M = [1,2,3;4,5,6];
for i = M; i, end

Output: i takes the values [1;4], [2;5], [3;6], i.e. the columns of M.
✦ A' =ˆ Hermitian transpose of a matrix A; transposing without complex conjugation is done by transpose(A).
✦ triu(A) and tril(A) return the upper and lower triangular parts of a matrix A as r-value (copy): if A ∈ K^{m,n}, then (triu(A))_{i,j} = (A)_{i,j} if i ≤ j, and 0 else; (tril(A))_{i,j} = (A)_{i,j} if i ≥ j, and 0 else.
✦ diag(A) for a matrix A ∈ K^{m,n}, min{m,n} ≥ 2, returns the column vector [(A)_{i,i}]_{i=1,...,min{m,n}} ∈ K^{min{m,n}}.
1.2.2 Python
Python is a widely used general-purpose and open source programming language. Together with packages like NumPy and Matplotlib it delivers functionality similar to MATLAB for free. For interactive computing IPython can be used. All those packages belong to the SciPy ecosystem.
Python features good documentation and several scientific distributions are available (e.g. Anaconda, Enthought) which contain the most important packages. On most Linux distributions the SciPy ecosystem is also available in the software repository, as well as many other packages including for example the Spyder IDE delivered with Anaconda.
A good introductory tutorial to numerical Python are the SciPy lectures. The full documentation of NumPy and SciPy can be found here. For former MATLAB users there is also a guide. The scripts in these lecture notes follow the official Python style guide.
Note that in Python we have to import the numerical packages explicitly before use. This is normally done at the beginning of the file with lines like import numpy as np and from matplotlib import pyplot as plt. Those import statements are often skipped in these lecture notes to focus on the actual computations. But you can always assume the import statements as given here, e.g. np.ravel(A) is a call to a NumPy function and plt.loglog(x, y) is a call to a Matplotlib pyplot function.
Python is not used in the current version of the lecture. Nevertheless a few Python codes are supplied in order to convey similarities and differences to implementations in MATLAB and C++.
The basic numeric data type in Python is NumPy's n-dimensional array. Vectors are normally implemented as 1D arrays and no distinction is made between row and column vectors. Matrices are represented as 2D arrays.
There are many possibilities listed in the documentation for how to create, index and manipulate arrays.
An important difference to MATLAB is that all arithmetic operations are normally performed element-wise, e.g. A * B is not the matrix-matrix product but element-wise multiplication (in MATLAB: A.*B). Also A * v does a broadcasted element-wise product. For the matrix product one has to use np.dot(A, B) or A.dot(B) explicitly.
1.2.3 Eigen
Currently, the most widely used programming language for the development of new simulation software in scientific and industrial high-performance computing is C++. In this course we are going to use and discuss Eigen as an example of a C++ library for numerical linear algebra (an “embedded” domain specific language: DSL).
Eigen is a header-only C++ template library designed to enable easy, natural and efficient numerical linear algebra: it provides data structures and a wide range of operations for matrices and vectors, see below. Eigen also implements many more fundamental algorithms (see the documentation page or the discussion below).
Eigen relies on expression templates to allow the efficient evaluation of complex expressions involving matrices and vectors. Refer to the example given in the Eigen documentation for details.
Here Scalar is the underlying scalar type of the matrix entries, which must support the usual operations
’+’, ’-’, ’*’, ’/’ and the compound assignments ’+=’, ’*=’, etc. Usually the scalar type will be either double, float, or complex<>.
The cardinal template arguments RowsAtCompileTime and ColsAtCompileTime can pass a fixed
size of the matrix, if it is known at compile time. There is a specialization selected by the template argument
Eigen::Dynamic supporting variable size “dynamic” matrices.
Note that in Line 23 we could have relied on automatic type deduction via auto vectprod = ....
However, often it is safer to forgo this option and specify the type directly.
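For concreteness, the following minimal sketch (not one of the lecture codes; variable names are chosen purely for illustration) shows how these template parameters translate into fixed-size and dynamic-size matrix declarations:

#include <Eigen/Dense>
#include <complex>

int main() {
  // Fixed-size 3x3 matrix of doubles: dimensions known at compile time
  Eigen::Matrix<double, 3, 3> A;
  A.setZero();
  // Dynamic-size matrix of complex numbers: dimensions fixed only at runtime
  Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> B(4, 5);
  B.setOnes();
  // Dynamic column vector = dynamic matrix with a single column
  Eigen::Matrix<double, Eigen::Dynamic, 1> v(10);
  v.setLinSpaced(10, 0.0, 1.0);
  return 0;
}

Fixed-size matrices are stored without heap allocation, which pays off for small matrices of known size.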
The following convenience data types are provided by E IGEN, see documentation:
• MatrixXd =
ˆ generic variable size matrix with double precision entries
• VectorXd, RowVectorXd = ˆ dynamic column and row vectors
(= dynamic matrices with one dimension equal to 1)
• MatrixNd with N = 2, 3, 4 for small fixed size square N × N -matrices (type double)
• VectorNd with N = 2, 3, 4 for small column vectors with fixed length N .
The d in the type name may be replaced with i (for int), f (for float), and cd (for complex<double>)
to select another basic scalar type.
All matrix types feature the methods cols(), rows(), and size(), returning the number of columns, the number of rows,
and the total number of entries, respectively.
Access to individual matrix entries and vector components, both as Rvalue and Lvalue, is possible through
the ()-operator taking two arguments of type index_t. If only one argument is supplied, the matrix is
accessed as a linear array according to its memory layout. For vectors, that is, matrices where one
dimension is fixed to 1, the []-operator can replace () with one argument, see Line 21 of Code 1.2.12.
The entry access operator (int i,int j) allows the most direct setting of matrix entries; there is
hardly any runtime penalty.
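As a small illustration (a hypothetical snippet, not taken from the lecture repository), entry access in both l-value and r-value roles looks as follows:

#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A(2, 3);
  A << 1, 2, 3,
       4, 5, 6;
  A(1, 2) = 7.0;                 // l-value access: set entry in row 1, column 2
  std::cout << A(0, 1) << '\n';  // r-value access, prints 2
  // Single-index access follows the (default column major) memory layout:
  std::cout << A(2) << '\n';     // third stored entry, i.e. A(0,1) = 2
  Eigen::VectorXd v(3);
  v << 1, 2, 3;
  v[1] = -1.0;                   // the []-operator is allowed for vectors only
  return 0;
}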
Of course, in EIGEN dedicated functions take care of the initialization of the special matrices (identity, zero, and diagonal matrices) introduced earlier:
Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n,n);
Eigen::MatrixXd O = Eigen::MatrixXd::Zero(n,m);
Eigen::MatrixXd D = d_vector.asDiagonal();
A versatile way to initialize a matrix relies on a combination of the operators << and ,, which allows the
construction of a matrix from blocks:
MatrixXd mat3(6,6);
mat3 <<
MatrixXd::Constant(4,2,1.5), // top row, first block
MatrixXd::Constant(4,3,3.5), // top row, second block
MatrixXd::Constant(4,1,7.5), // top row, third block
MatrixXd::Constant(2,4,2.5), // bottom row, left block
MatrixXd::Constant(2,2,4.5); // bottom row, right block
The matrix is filled top to bottom, left to right; the block dimensions have to match (as in MATLAB).
The method block(int i,int j,int p,int q) returns a reference to the submatrix with upper left
corner at position (i, j) and size p × q.
The methods row(int i) and col(int j) provide a reference to the corresponding row and column of
the matrix. Even more specialised access methods are
topLeftCorner(p,q), bottomLeftCorner(p,q),
topRightCorner(p,q), bottomRightCorner(p,q),
topRows(q), bottomRows(q),
leftCols(p), and rightCols(q),
with obvious purposes.
C++11 code 1.2.16: Demonstration code for access to matrix blocks in EIGEN ➺ GITLAB
2  template <typename MatType>
3  void blockAccess(Eigen::MatrixBase<MatType> &M)
4  {
5    using index_t = typename Eigen::MatrixBase<MatType>::Index;
6    using entry_t = typename Eigen::MatrixBase<MatType>::Scalar;
7    const index_t nrows(M.rows()); // No. of rows
8    const index_t ncols(M.cols()); // No. of columns
9
10   cout << "Matrix M = " << endl << M << endl; // Print matrix
11   // Block size half the size of the matrix
12   index_t p = nrows/2, q = ncols/2;
13   // Output submatrix with left upper entry at position (i,i)
14   for (index_t i = 0; i < min(p,q); i++)
15     cout << "Block(" << i << ',' << i << ',' << p << ',' << q
16          << ") = " << M.block(i,i,p,q) << endl;
17   // l-value access: modify sub-matrix by adding a constant
18   M.block(1,1,p,q) += Eigen::MatrixBase<MatType>::Constant(p,q,1.0);
19   cout << "M = " << endl << M << endl;
20   // r-value access: extract sub-matrix
21   MatrixXd B = M.block(1,1,p,q);
22   cout << "Isolated modified block = " << endl << B << endl;
23   // Special sub-matrices
24   cout << p << " top rows of m = " << M.topRows(p) << endl;
25   cout << p << " bottom rows of m = " << M.bottomRows(p) << endl;
26   cout << q << " left cols of m = " << M.leftCols(q) << endl;
27   cout << q << " right cols of m = " << M.rightCols(q) << endl;
28   // r-value access to upper triangular part
29   const MatrixXd T = M.template triangularView<Upper>(); //
30   cout << "Upper triangular part = " << endl << T << endl;
31   // l-value access to lower triangular part
32   M.template triangularView<Lower>() *= -1.5; //
33   cout << "Matrix M = " << endl << M << endl;
34 }
E IGEN offers views for access to triangular parts of a matrix, see Line 29 and Line 32, according to
M.triangularView<XX>()
where XX can stand for one of the following: Upper, Lower, StrictlyUpper, StrictlyLower,
UnitUpper, UnitLower, see documentation.
For column and row vectors references to sub-vectors can be obtained by the methods head(int
length), tail(int length), and segment(int pos,int length).
Note: Unless the preprocessor switch NDEBUG is set, E IGEN performs range checks on all indices.
Since operators like M ATLAB’s .* are not available, E IGEN uses the Array concept to furnish entry-wise
operations on matrices. An E IGEN-Array contains the same data as a matrix, supports the same meth-
ods for initialisation and access, but replaces the operators of matrix arithmetic with entry-wise actions.
Matrices and arrays can be converted into each other by the array() and matrix() methods, see
documentation for details.
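A minimal sketch (hypothetical code, not from the lecture repository) of switching between the matrix and array views for entry-wise operations:

#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(3, 3);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(3, 3);
  // Entry-wise product and quotient via the Array interface
  Eigen::MatrixXd C = (A.array() * B.array()).matrix();  // like A.*B in MATLAB
  Eigen::MatrixXd D = (A.array() / B.array()).matrix();  // like A./B
  // Entry-wise functions are also available on arrays
  Eigen::MatrixXd E = A.array().exp().matrix();
  std::cout << C.norm() << ' ' << D.norm() << ' ' << E.norm() << std::endl;
  return 0;
}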
The application of a functor (→ Section 0.2.3) to all entries of a matrix can also be done via the
unaryExpr() method of a matrix:
// Apply a lambda function to all entries of a matrix
auto fnct = [](double x) { return (x + 1.0/x); };
cout << "f(m1) = " << endl << m1.unaryExpr(fnct) << endl;
☞ E IGEN is used as one of the base libraries for the Robot Operating System (ROS), an open source
project with strong ETH participation.
☞ The geometry processing library libigl uses EIGEN as its basic linear algebra engine. At ETH Zurich it is being
used and developed at the Interactive Geometry Lab and the Advanced Technologies Lab.
All numerical libraries store the entries of a (generic = dense) matrix A ∈ K m,n in a linear array of length
mn (or longer). Accessing entries entails suitable index computations.
Two natural options for “vectorisation” of a matrix A ∈ K^{m,n}:
row major:    (A)_{i,j} ↔ A_arr(n*(i-1) + (j-1)) ,
column major: (A)_{i,j} ↔ A_arr((j-1)*m + (i-1)) .
Both in MATLAB and EIGEN the single index access operator relies on the linear data layout. In MATLAB:
A = [1 2 3;4 5 6;7 8 9]; A(:)'   % column major: returns 1 4 7 2 5 8 3 6 9
In P YTHON the default data layout is row major, but it can be explicitly set. Further, array transposition
does not change any data, but only the memory order and array shape.
In E IGEN the data layout can be controlled by a template argument; default is column major.
C++11 code 1.2.22: Single index access of matrix entries in EIGEN ➺ GITLAB
2  void storageOrder(int nrows=6, int ncols=7)
3  {
4    cout << "Different matrix storage layouts in Eigen" << endl;
5    // Template parameter ColMajor selects column major data layout
6    Matrix<double,Dynamic,Dynamic,ColMajor> mcm(nrows,ncols);
7    // Template parameter RowMajor selects row major data layout
8    Matrix<double,Dynamic,Dynamic,RowMajor> mrm(nrows,ncols);
9    // Direct initialization; lazy option: use int as index type
10   for (int l=1, i=0; i < nrows; i++)
11     for (int j=0; j < ncols; j++, l++)
12       mcm(i,j) = mrm(i,j) = l;
13
14   cout << "Matrix mrm = " << endl << mrm << endl;
15   cout << "mcm linear = ";
16   for (int l=0; l < mcm.size(); l++) cout << mcm(l) << ',';
17   cout << endl;
18
18
The function call storageOrder(3,3), cf. Code 1.2.22, yields the output
1 Different matrix storage layouts in Eigen
2 Matrix mrm =
3 1 2 3
4 4 5 6
5 7 8 9
6 mcm linear = 1,4,7,2,5,8,3,6,9,
7 mrm linear = 1,2,3,4,5,6,7,8,9,
Mapping a column-major matrix to a column vector with the same number of entries is called vectorization
or linearization in numerical linear algebra, in symbols
vec : K^{n,m} → K^{n·m} ,   vec(A) = [ (A)_{:,1} ; (A)_{:,2} ; … ; (A)_{:,m} ] .   (1.2.24)
MATLAB offers the built-in command reshape for changing the dimensions of a matrix A ∈ K^{m,n}:
B = reshape(A,k,l); % error, in case kl ≠ mn
This command creates a k × l-matrix by just reinterpreting the linear array of entries of A as data for
a matrix with k rows and l columns. Regardless of the size and entries of the matrices the following test
will always produce an all-true result in equal:
if (prod(size(A)) ~= (k*l)), error('Size mismatch'); end
B = reshape(A,k,l);
equal = (B(:) == A(:));
N UM P Y offers the function np.reshape for changing the dimensions of a matrix A ∈ K m,n :
# read elements of A in row major order (default)
B = np.reshape(A, (k, l)) # error, in case kl ≠ mn
B = np.reshape(A, (k, l), order=’C’) # same as above
# read elements of A in column major order
B = np.reshape(A, (k, l), order=’F’)
# read elements of A as stored in memory
B = np.reshape(A, (k, l), order=’A’)
This command creates a k × l-array by reinterpreting the array of entries of A as data for an array
with k rows and l columns. The order in which the elements of A are read can be set by the order
argument: row major (default, ’C’), column major (’F’), or A’s internal storage order, i.e. row major if
A is row major and column major if A is column major (’A’).
If you need a reshaped view of a matrix’s data in EIGEN you can obtain it via the raw data array belonging
to the matrix. Then use this information to create a matrix view by means of Map → documentation.
The function producing the output below has to be called with a mutable (l-value) matrix object. A sample output is printed next:
1 Matrix M =
2 0 −1 −2 −3 −4 −5 −6
3 1 0 −1 −2 −3 −4 −5
4 2 1 0 −1 −2 −3 −4
5 3 2 1 0 −1 −2 −3
6 4 3 2 1 0 −1 −2
7 5 4 3 2 1 0 −1
8 reshaped to 2x21 =
9 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1 −6 −4 −2
10 1 3 5 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1
11 Scaled (!) matrix M =
12 −0 1.5 3 4.5 6 7.5 9
13 − 1.5 −0 1.5 3 4.5 6 7.5
14 −3 − 1.5 −0 1.5 3 4.5 6
15 − 4.5 −3 − 1.5 −0 1.5 3 4.5
16 −6 − 4.5 −3 − 1.5 −0 1.5 3
17 − 7.5 −6 − 4.5 −3 − 1.5 −0 1.5
18 Matrix S =
19 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1 −6 −4 −2
20 1 3 5 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1
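The code that produced the output above is not reproduced here. The following is only a minimal sketch, under the assumption of the default column major layout and an even number of entries, of how such a reshaping view can be set up with Eigen::Map (function and variable names are made up for illustration):

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

void reshapeDemo(MatrixXd &M) {  // must be a mutable (l-value) matrix
  std::cout << "Matrix M = " << std::endl << M << std::endl;
  // Reinterpret the raw (column major) data of M as a 2 x (size/2) matrix
  Eigen::Map<MatrixXd> S(M.data(), 2, M.size() / 2);
  std::cout << "reshaped to 2x" << M.size() / 2 << " = " << std::endl << S << std::endl;
  // S is only a *view*: scaling the view also scales the entries of M
  S *= -1.5;
  std::cout << "Scaled (!) matrix M = " << std::endl << M << std::endl;
}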
Modern CPUs feature several levels of memory (registers, L1 cache, L2 cache, . . . , main memory) of
different latency, bandwidth, and size. Frequently accessing memory locations with widely different
addresses results in many cache misses and will considerably slow down the CPU.
MATLAB-code 1.2.30: Timing for row and column oriented matrix access in MATLAB
1  % Timing for row/column operations on matrices
2  % We conduct K runs in order to reduce the risk of skewed measurements
3  % due to OS activity during the MATLAB run.
4  K = 3; res = [];
5  for n=2.^(4:13)
6    A = randn(n,n);
7
8    t1 = realmax;
9    for k=1:K, tic;
10     for j = 1:n-1, A(:,j+1) = A(:,j+1) - A(:,j); end;
11     t1 = min(toc,t1);
12   end
13   t2 = realmax;
14   for k=1:K, tic;
15     for i = 1:n-1, A(i+1,:) = A(i+1,:) - A(i,:); end;
16     t2 = min(toc,t2);
17   end
18   res = [res; n, t1, t2];
19 end
20
29 figure; loglog(res(:,1),res(:,2),'r+', res(:,1),res(:,3),'m*');
30 xlabel('{\bf n}','fontsize',14);
31 ylabel('{\bf runtime [s]}','fontsize',14);
32 legend('A(:,j+1) = A(:,j+1) - A(:,j)','A(i+1,:) = A(i+1,:) - A(i,:)',...
33        'location','northwest');
34 print -depsc2 '../PICTURES/accessrtlog.eps';
C++11 code 1.2.31: Timing for row and column oriented matrix access for EIGEN ➺ GITLAB
2  void rowcolaccesstiming(void)
3  {
4    const int K = 3; // Number of repetitions
PYTHON-code 1.2.32: Timing for row and column oriented matrix access in PYTHON
1  import numpy as np
2  import timeit
3  from matplotlib import pyplot as plt
4
5  def col_wise(A):
6      for j in range(A.shape[1] - 1):
7          A[:, j + 1] -= A[:, j]
8
9  def row_wise(A):
10     for i in range(A.shape[0] - 1):
11         A[i + 1, :] -= A[i, :]
12
16 k = 3
17 res = []
18 for n in 2**np.mgrid[4:14]:
19     A = np.random.normal(size=(n, n))
20
29 plt.figure()
30 plt.plot(ns, t1s, '+', label='A[:, j + 1] -= A[:, j]')
31 plt.plot(ns, t2s, 'o', label='A[i + 1, :] -= A[i, :]')
32 plt.xlabel(r'n')
33 plt.ylabel(r'runtime [s]')
34 plt.legend(loc='upper left')
35 plt.savefig('../PYTHON_PICTURES/accessrtlin.eps')
36
37 plt.figure()
38 plt.loglog(ns, t1s, '+', label='A[:, j + 1] -= A[:, j]')
39 plt.loglog(ns, t2s, 'o', label='A[i + 1, :] -= A[i, :]')
40 plt.xlabel(r'n')
41 plt.ylabel(r'runtime [s]')
42 plt.legend(loc='upper left')
43 plt.savefig('../PYTHON_PICTURES/accessrtlog.eps')
44
45 plt.show()
Fig. 28: runtime [s] vs. matrix size n (doubly logarithmic scale) for the four access patterns:
MATLAB ‘A(:,j+1) = A(:,j+1) - A(:,j)’, MATLAB ‘A(i+1,:) = A(i+1,:) - A(i,:)’, EIGEN row access, EIGEN column access.
Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz × 4
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3, -DNDEBUG
The compiler flags -O3 and -DNDEBUG are essential. The C++ code would be significantly slower if the default compiler options were used!
For both the MATLAB and EIGEN codes we observe a glaring discrepancy of the CPU time required for accessing
entries of a matrix in rowwise or columnwise fashion. This reflects the impact of features of the underlying
hardware architecture, like cache size and memory bandwidth:
Interpretation of timings: Since matrices in MATLAB are stored column major, all the matrix elements in a
column occupy contiguous memory locations, which will all reside in the cache together. Hence, column
oriented access will mainly operate on data in the cache even for large matrices. Conversely, row oriented
access addresses matrix entries that are stored in distant memory locations, which incurs frequent cache
misses (cache thrashing).
The impact of hardware architecture on the performance of algorithms will not be taken into account in
this course, because hardware features tend to be both intricate and ephemeral. However, for modern
high performance computing it is essential to adapt implementations to the hardware on which the code is
supposed to run.
First we refresh the basic rules of vector and matrix calculus. Then we will learn about a very old program-
ming interface for simple dense linear algebra operations.
What you should know from linear algebra [?, Sect. 2.2]:
✦ vector space operations in matrix space K m,n (addition, multiplication with scalars)
✦ dot product: x, y ∈ K^n, n ∈ N:   x·y := x^H y = ∑_{i=1}^{n} x̄_i y_i ∈ K
(in EIGEN: x.dot(y) or x.adjoint()*y, x, y =ˆ column vectors)
✦ tensor product: x ∈ K^m, y ∈ K^n:   xy^H = [ x_i ȳ_j ]_{i=1,...,m; j=1,...,n} ∈ K^{m,n}
(in EIGEN: x*y.adjoint(), x, y =ˆ column vectors)
✦ All are special cases of the matrix product:
A ∈ K^{m,n}, B ∈ K^{n,k} :   AB = [ ∑_{j=1}^{n} a_{ij} b_{jl} ]_{i=1,...,m; l=1,...,k} ∈ K^{m,k} .   (1.3.1)
Recall from linear algebra basic properties of the matrix product: for all K-matrices A, B, C (of suitable
sizes) and α, β ∈ K:
associative: (AB)C = A(BC) ,
bi-linear: (αA + βB)C = α(AC) + β(BC) ,   C(αA + βB) = α(CA) + β(CB) ,
non-commutative: AB ≠ BA in general .
Fig. 29: visualisation of the dimensions in the matrix product: an m × n matrix times an n × k matrix yields an m × k matrix.
To understand what is going on when forming a matrix product, it is often useful to decompose it into
matrix×vector operations in one of the following two ways:
A ∈ K^{m,n}, B ∈ K^{n,k}:
AB = [ A(B)_{:,1}  …  A(B)_{:,k} ]   (matrix assembled from columns),
AB = [ (A)_{1,:}B ; … ; (A)_{m,:}B ]   (matrix assembled from rows).   (1.3.4)
A “mental image” of matrix multiplication is useful for telling special properties of product matrices.
For instance, zero blocks of the product matrix can be predicted easily in the following situations using the
idea explained in Rem. 1.3.3 (try to understand how):
Fig. 30, Fig. 31: visualisation of how zero blocks in one of the factors lead to predictable zero blocks in the product matrix.
A clear understanding of matrix multiplication enables you to “see”, which parts of a matrix factor matter
in a product:
Fig. 32: visualisation of which parts of a matrix factor matter in the product when the other factor contains a zero block.
“Seeing” the structure/pattern of a matrix product: [spy plots of the factor matrices and of their products, generated by the code below]
These nice renderings of the so-called patterns of matrices, that is, the distribution of their non-zero
entries, have been created by a special EIGEN/Figure command for visualizing the structure of a matrix:
fig.spy(M)
8  int main() {
9    int n = 100;
10   MatrixXd A(n,n), B(n,n); A.setZero(); B.setZero();
11   A.diagonal() = VectorXd::LinSpaced(n,1,n);
12   A.col(n-1) = VectorXd::LinSpaced(n,1,n);
13   A.row(n-1) = RowVectorXd::LinSpaced(n,1,n);
14   B = A.colwise().reverse();
15   MatrixXd C = A*A, D = A*B;
16   mgl::Figure fig1, fig2, fig3, fig4;
17   fig1.spy(A); fig1.save("Aspy_cpp");
18   fig2.spy(B); fig2.save("Bspy_cpp");
19   fig3.spy(C); fig3.save("Cspy_cpp");
20   fig4.spy(D); fig4.save("Dspy_cpp");
21   return 0;
22 }
This code also demonstrates the use of diagonal(), col(), row() for L-value access to parts of a
matrix.
The following result is useful when dealing with matrix decompositions that often involve triangular matri-
ces.
Lemma 1.3.9. Group of regular diagonal/triangular matrices
If A and B are both diagonal / upper triangular / lower triangular, then AB and, for regular A, also A^{−1}
are diagonal / upper triangular / lower triangular, respectively.
It is important to know the different effect of multiplying with a diagonal matrix from left or right:
Fig. 33: runtime [s] vs. vector length n (doubly logarithmic scale) for the multiplication with a scaling (diagonal) matrix, cf. the timing codes below.
C++11 code 1.3.11: Timing multiplication with scaling matrix in EIGEN ➺ GITLAB
2  int nruns = 3, minExp = 2, maxExp = 14;
3  MatrixXd tms(maxExp-minExp+1,4);
4  for (int i = 0; i <= maxExp-minExp; ++i) {
5    Timer tbad, tgood, topt; // timer class
6    int n = std::pow(2, minExp + i);
7    VectorXd d = VectorXd::Random(n,1), x = VectorXd::Random(n,1), y(n);
8    for (int j = 0; j < nruns; ++j) {
9      MatrixXd D = d.asDiagonal(); //
8  nruns = 3
9  res = []
10 for n in 2**np.mgrid[2:15]:
11     d = np.random.uniform(size=n)
12     x = np.random.uniform(size=n)
13
Hardly surprisingly, the component-wise multiplication of the two vectors is much faster than the intermediate
initialisation of a diagonal matrix (mainly populated by zeros) followed by the computation of a matrix×vector
product. Nevertheless, such blunders keep on haunting numerical codes. Do not rely solely on EIGEN’s
optimizations!
Simple operations on rows/columns of matrices, cf. what was done in Exp. 1.2.29, can often be expressed
as multiplication with special matrices: for instance, given A ∈ K^{n,m} we obtain B by adding row (A)_{j,:} to
row (A)_{j+1,:}, 1 ≤ j < n.
Realisation through matrix product:   B = T A ,  with the transformation matrix T ∈ K^{n,n} equal to the
identity matrix except for one additional unit entry, (T)_{j+1,j} = 1.
The matrix multiplying A from the left is a specimen of a transformation matrix, a matrix that coincides
with the identity matrix I except for a single off-diagonal entry.
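As a small illustration (hypothetical snippet, not one of the lecture codes), the direct row operation and its realisation as a matrix product can be compared in EIGEN:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main() {
  const int n = 4, j = 1;           // add row j to row j+1 (0-based indices here)
  MatrixXd A = MatrixXd::Random(n, n);
  // Transformation matrix: identity plus a single off-diagonal unit entry
  MatrixXd T = MatrixXd::Identity(n, n);
  T(j + 1, j) = 1.0;
  MatrixXd B = T * A;               // B differs from A only in row j+1
  // Direct row operation for comparison
  MatrixXd C = A;
  C.row(j + 1) += C.row(j);
  std::cout << (B - C).norm() << std::endl;  // ~0 up to roundoff
  return 0;
}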
A vector space (V, K, +, ·), where V is additionally equipped with a bi-linear and associative “multiplica-
tion” is called an algebra. Hence, the vector space of square matrices K n,n with matrix multiplication is an
algebra with unit element I.
Given matrix dimensions M, N, K ∈ N and block sizes 1 ≤ n < N (n′ := N − n), 1 ≤ m < M (m′ :=
M − m), 1 ≤ k < K (k′ := K − k), we start from the following matrices:
A11 ∈ K^{m,n} ,  A12 ∈ K^{m,n′} ,  A21 ∈ K^{m′,n} ,  A22 ∈ K^{m′,n′} ,
B11 ∈ K^{n,k} ,  B12 ∈ K^{n,k′} ,  B21 ∈ K^{n′,k} ,  B22 ∈ K^{n′,k′} .
These matrices serve as sub-matrices or matrix blocks and are assembled into larger matrices
A = [ A11 A12 ; A21 A22 ] ∈ K^{M,N} ,   B = [ B11 B12 ; B21 B22 ] ∈ K^{N,K} .
It turns out that the matrix product AB can be computed by the same formula as the product of simple
2 × 2-matrices:
[ A11 A12 ; A21 A22 ][ B11 B12 ; B21 B22 ] = [ A11B11 + A12B21   A11B12 + A12B22 ; A21B11 + A22B21   A21B12 + A22B22 ] .   (1.3.16)
Fig. 34: visualisation of the block partitioning of A ∈ K^{M,N} and B ∈ K^{N,K} and of the resulting block-wise matrix product.
Bottom line: one can compute with block-structured matrices in almost (∗) the same ways as with matrices
with real/complex entries, see [?, Sect. 1.3.3].
(∗): you must not use the commutativity of multiplication (because matrix multiplication is not
! commutative).
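The block product formula (1.3.16) can be checked directly with EIGEN's block access methods; the following is a small sketch with made-up sizes and names, not one of the lecture codes:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main() {
  const int M = 5, N = 4, K = 6, m = 2, n = 3, k = 2;  // dimensions and block sizes
  MatrixXd A = MatrixXd::Random(M, N), B = MatrixXd::Random(N, K);
  // Sub-matrices (blocks) of A and B
  MatrixXd A11 = A.topLeftCorner(m, n),        A12 = A.topRightCorner(m, N - n);
  MatrixXd A21 = A.bottomLeftCorner(M - m, n), A22 = A.bottomRightCorner(M - m, N - n);
  MatrixXd B11 = B.topLeftCorner(n, k),        B12 = B.topRightCorner(n, K - k);
  MatrixXd B21 = B.bottomLeftCorner(N - n, k), B22 = B.bottomRightCorner(N - n, K - k);
  // Assemble the product block by block according to (1.3.16)
  MatrixXd C(M, K);
  C.topLeftCorner(m, k)             = A11 * B11 + A12 * B21;
  C.topRightCorner(m, K - k)        = A11 * B12 + A12 * B22;
  C.bottomLeftCorner(M - m, k)      = A21 * B11 + A22 * B21;
  C.bottomRightCorner(M - m, K - k) = A21 * B12 + A22 * B22;
  std::cout << (C - A * B).norm() << std::endl;  // ~0 up to roundoff
  return 0;
}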
BLAS (Basic Linear Algebra Subprograms) is a specification (API) that prescribes a set of low-level rou-
tines for performing common linear algebra operations such as vector addition, scalar multiplication, dot
products, linear combinations, and matrix multiplication. They are the de facto low-level routines for linear
algebra libraries (Wikipedia).
The BLAS API is standardised by the BLAS technical forum and, due to its history dating back to the 70s,
follows conventions of FORTRAN 77, see the Quick Reference Guide for examples. However, wrappers for
other programming languages are available. CPU manufacturers and/or developers of operating systems
usually supply highly optimised implementations:
• OpenBLAS: open source implementation with some general optimisations, available under BSD
license.
• ATLAS (Automatically Tuned Linear Algebra Software): open source BLAS implementation with
auto-tuning capabilities. Comes with C and FORTRAN interfaces and is included in Linux distribu-
tions.
• Intel MKL (Math Kernel Library): commercial, highly optimised BLAS implementation available for all
Intel CPUs. Used by most proprietary simulation software and also by MATLAB.
6  t1 = realmax;
7  % loop based implementation (no BLAS)
8  for l=1:nruns
9    tic;
10   for i=1:n, for j=1:n
11     for k=1:n, C(i,j) = C(i,j) + A(i,k)*B(k,j); end
12   end, end
13   t1 = min(t1, toc);
14 end
15 t2 = realmax;
16 % dot product based implementation (BLAS level 1)
17 for l=1:nruns
18   tic;
19   for i=1:n
20     for j=1:n, C(i,j) = dot(A(i,:),B(:,j)); end
21   end
22   t2 = min(t2, toc);
23 end
24 t3 = realmax;
25 % matrix-vector based implementation (BLAS level 2)
26 for l=1:nruns
27   tic;
28   for j=1:n, C(:,j) = A*B(:,j); end
29   t3 = min(t3, toc);
30 end
31 t4 = realmax;
32 % BLAS level 3 matrix multiplication
33 for l=1:nruns
34   tic; C = A*B; t4 = min(t4, toc);
35 end
36 times = [times; n t1 t2 t3 t4];
37 end
38
39 figure('name','mmtiming');
40 loglog(times(:,1),times(:,2),'r+-',...
41        times(:,1),times(:,3),'m*-',...
42        times(:,1),times(:,4),'b^-',...
43        times(:,1),times(:,5),'kp-');
44 title('Timings: Different implementations of matrix multiplication');
45 xlabel('matrix size n','fontsize',14);
46 ylabel('time [s]','fontsize',14);
47 legend('loop implementation','dot product implementation',...
48        'matrix-vector implementation','BLAS gemm (MATLAB *)',...
49        'location','northwest');
50
51 print -depsc2 '../PICTURES/mvtiming.eps';
Fig. 35: time [s] vs. matrix size n (doubly logarithmic scale) for the four implementations of matrix multiplication.
Platform: Mac OS X 10.6, Intel Core 7, 2.66 GHz, L2 256 kB, L3 4 MB, Mem 4 GB, MATLAB 7.10.0 (R2010a).
In MATLAB we can achieve a tremendous gain in execution speed by relying on compact matrix/vector
operations that invoke efficient BLAS routines.
Advice: avoid loops in MATLAB and replace them with vectorised operations.
To some extent the same applies to EIGEN code; a corresponding timing script is given here:
34      C = A * B;
35      t4.stop();
36    }
37    timings(p,0)=n; timings(p,1)=t1.min(); timings(p,2)=t2.min();
38    timings(p,3)=t3.min(); timings(p,4)=t4.min();
39  }
40  std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
41  // Plotting
42  mgl::Figure fig;
43  fig.setFontSize(4);
44  fig.setlog(true, true);
45  fig.plot(timings.col(0), timings.col(1), "+r-").label("loop implementation");
46  fig.plot(timings.col(0), timings.col(2), "*m-").label("dot-product implementation");
47  fig.plot(timings.col(0), timings.col(3), "^b-").label("matrix-vector implementation");
48  fig.plot(timings.col(0), timings.col(4), "ok-").label("Eigen matrix product");
49  fig.xlabel("matrix size n"); fig.ylabel("time [s]");
50  fig.legend(0.05, 0.95); fig.save("mmtiming");
51 }
Fig. 36: time [s] vs. matrix size n (doubly logarithmic scale) for the EIGEN implementations. Note that loops are not punished as severely as in MATLAB!
The same applies to PYTHON code; a corresponding timing script is given here:
6  def mm_loop_based(A, B, C):
7      m, n = A.shape
8      _, p = B.shape
9      for i in range(m):
10         for j in range(p):
11             for k in range(n):
12                 C[i, j] += A[i, k] * B[k, j]
13     return C
14
15 def mm_blas1(A, B, C):
16     m, n = A.shape
17     _, p = B.shape
18     for i in range(m):
19         for j in range(p):
20             C[i, j] = np.dot(A[i, :], B[:, j])
21     return C
22
23 def mm_blas2(A, B, C):
24     m, n = A.shape
25     _, p = B.shape
26     for i in range(m):
27         C[i, :] = np.dot(A[i, :], B)
28     return C
29
30 def mm_blas3(A, B, C):
31     C = np.dot(A, B)
32     return C
33
34 def main():
35     nruns = 3
36     res = []
37     for n in 2**np.mgrid[2:11]:
38         print('matrix size n = {}'.format(n))
39         A = np.random.uniform(size=(n, n))
40         B = np.random.uniform(size=(n, n))
41         C = np.random.uniform(size=(n, n))
42
...     label='matrix-vector implementation')
58     plt.loglog(ns, tblas3s, '^', label='BLAS gemm (np.dot)')
59     plt.legend(loc='upper left')
60     plt.savefig('../PYTHON_PICTURES/mvtiming.eps')
61     plt.show()
62
63 if __name__ == '__main__':
64     main()
BLAS routines are grouped into “levels” according to the amount of data and computation involved (asymp-
totic complexity, see Section 1.4.1 and [?, Sect. 1.1.12]):
• Level 1: vector operations such as scalar products and vector norms.
asymptotic complexity O(n) (with n =ˆ vector length),
e.g.: dot product: ρ = x⊤y
• Level 2: matrix-vector operations such as matrix-vector multiplications.
asymptotic complexity O(mn) (with (m, n) =ˆ matrix size),
e.g.: matrix×vector update: y = αAx + βy
• Level 3: matrix-matrix operations such as matrix additions or multiplications.
asymptotic complexity often O(mnk) (with (m, n, k) =ˆ matrix sizes),
e.g.: matrix product: C = AB
(EIGEN counterparts of these three operations are sketched below.)
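For orientation only, here is a small sketch (not part of BLAS itself, names and sizes made up) of EIGEN one-liners performing operations of the corresponding levels:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

int main() {
  const int m = 3, n = 4, k = 5;
  MatrixXd A = MatrixXd::Random(m, n), B = MatrixXd::Random(n, k);
  VectorXd x = VectorXd::Random(n), z = VectorXd::Random(n), y = VectorXd::Random(m);
  const double alpha = 2.0, beta = 0.5;
  double rho = x.dot(z);           // level 1: dot product, O(n)
  y = alpha * A * x + beta * y;    // level 2: gemv-type update, O(mn)
  MatrixXd C = A * B;              // level 3: gemm-type product, O(mnk)
  std::cout << rho << "\n" << y.transpose() << "\n" << C << std::endl;
  return 0;
}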
Syntax of BLAS calls:
The functions have been implemented for different types, and are distinguished by the first letter of the
function name. E.g. sdot is the dot product implementation for single precision and ddot for double
precision.
xDOT(N,X,INCX,Y,INCY)
– x ∈ {S, D}, scalar type: S =ˆ type float, D =ˆ type double
– N =ˆ length of vector (modulo stride INCX)
– X =ˆ vector x: array of type x
– INCX =ˆ stride for traversing vector X
– Y =ˆ vector y: array of type x
– INCY =ˆ stride for traversing vector Y
• vector operations y = αx + y
xAXPY(N,ALPHA,X,INCX,Y,INCY)
– x ∈ {S, D, C, Z}: S =ˆ type float, D =ˆ type double, C =ˆ type complex
– N =ˆ length of vector (modulo stride INCX)
– ALPHA =ˆ scalar α
– X =ˆ vector x: array of type x
– INCX =ˆ stride for traversing vector X
– Y =ˆ vector y: array of type x
– INCY =ˆ stride for traversing vector Y
✦ BLAS LEVEL 2: matrix-vector operations, asymptotic complexity O(mn), (m, n) =ˆ matrix size
xGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
– x ∈ {S, D, C, Z}, scalar type: S =ˆ type float, D =ˆ type double, C =ˆ type complex
– M, N =ˆ size of matrix A
– ALPHA =ˆ scalar parameter α
– A =ˆ matrix A stored in a linear array of length M · N (column major arrangement),
(A)_{i,j} = A[ M·(j − 1) + i ]   (FORTRAN-style 1-based indexing).
– X =ˆ vector x: array of type x
– INCX =ˆ stride for traversing vector X
– BETA =ˆ scalar parameter β
– Y =ˆ vector y: array of type x
– INCY =ˆ stride for traversing vector Y
• BLAS LEVEL 3: matrix-matrix operations, asymptotic complexity O(mnk), (m, n, k) =ˆ matrix sizes
The BLAS calling syntax seems queer in light of modern object oriented programming paradigms, but it
is a legacy of FORTRAN77, which was (and partly still is) the programming language in which BLAS libraries are implemented.
It is a very common situation in scientific computing that one has to rely on old codes and libraries imple-
mented in an old-fashioned style.
When calling BLAS library functions from C, all arguments have to be passed by reference (as pointers),
in order to comply with the argument passing mechanism of FORTRAN77, which is the model followed by
BLAS.
14 int main() {
15   const int n = 5;      // length of vector
16   const int incx = 1;   // stride
17   const int incy = 1;   // stride
18   double alpha = 2.5;   // scaling factor
19
24   for (size_t i = 0; i < n; i++) {
25     x[i] = 3.1415 * i;
26     y[i] = 1.0 / (double)(i+1);
27   }
28
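The excerpt above omits the declaration of the vectors and the actual BLAS call. A self-contained sketch (not the original lecture code) might look as follows; it assumes the common FORTRAN name-mangling convention with a trailing underscore (daxpy_) and linking against a BLAS library (e.g. with -lblas):

#include <iostream>

// Declaration of the FORTRAN77 BLAS routine daxpy: y <- alpha*x + y.
// All arguments are passed by reference (as pointers), as required by FORTRAN77.
extern "C" void daxpy_(const int *n, const double *alpha,
                       const double *x, const int *incx,
                       double *y, const int *incy);

int main() {
  const int n = 5, incx = 1, incy = 1;
  const double alpha = 2.5;
  double x[n], y[n];
  for (int i = 0; i < n; i++) {
    x[i] = 3.1415 * i;
    y[i] = 1.0 / (double)(i + 1);
  }
  daxpy_(&n, &alpha, x, &incx, y, &incy);  // every argument passed as a pointer
  for (int i = 0; i < n; i++) std::cout << y[i] << ' ';
  std::cout << std::endl;
  return 0;
}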
When using EIGEN in a mode that includes an external BLAS library, all these calls are wrapped into EIGEN
methods.
Example 1.3.24 (Using Intel Math Kernel Library (Intel MKL) from E IGEN)
The Intel Math Kernel Library is a highly optimized math library for Intel processors and can be called
directly from EIGEN (see “Using Intel® Math Kernel Library from Eigen”) when the correct compiler flags are used.
C++-code 1.3.25: Timing of matrix multiplication in EIGEN for MKL comparison ➺ GITLAB
2  //! script for timing different implementations of matrix multiplications
3  void mmeigenmkl() {
4    int nruns = 3, minExp = 6, maxExp = 13;
5    MatrixXd timings(maxExp-minExp+1,2);
6    for (int p = 0; p <= maxExp-minExp; ++p) {
7      Timer t1; // timer class
8      int n = std::pow(2, minExp + p);
9      MatrixXd A = MatrixXd::Random(n,n);
10     MatrixXd B = MatrixXd::Random(n,n);
11     MatrixXd C = MatrixXd::Zero(n,n);
12     for (int q = 0; q < nruns; ++q) {
13       t1.start();
14       C = A * B;
15       t1.stop();
16     }
17     timings(p,0)=n; timings(p,1)=t1.min();
18   }
19   std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
20 }
Timing results:
n E IGEN sequential [s] E IGEN parallel [s] MKL sequential [s] MKL parallel [s]
64 1.318e-04 1.304e-04 6.442e-05 2.401e-05
128 7.168e-04 2.490e-04 4.386e-04 1.336e-04
256 6.641e-03 1.987e-03 3.000e-03 1.041e-03
512 2.609e-02 1.410e-02 1.356e-02 8.243e-03
1024 1.952e-01 1.069e-01 1.020e-01 5.728e-02
2048 1.531e+00 8.477e-01 8.581e-01 4.729e-01
4096 1.212e+01 6.635e+00 7.075e+00 3.827e+00
8192 9.801e+01 6.426e+01 5.731e+01 3.598e+01
Fig. 37: runtime [s] vs. matrix size n; Fig. 38: runtime scaled by the problem size vs. matrix size n (both doubly logarithmic), for EIGEN sequential, EIGEN parallel, MKL sequential, and MKL parallel matrix multiplication.
Large scale numerical computations require immense resources and execution time of numerical codes
often becomes a central concern. Therefore, much emphasis has to be put on
1. designing algorithms that produce a desired result with (nearly) minimal computational effort (de-
fined precisely below),
4. implementing codes that make optimal use of hardware resources and capabilities,
While Item 2–Item 4 are out of the scope of this course and will be treated in more advanced lectures,
Item 1 will be a recurring theme.
The following definition encapsulates what is regarded as a measure for the “cost” of an algorithm in
computational mathematics.
The computational effort required by a numerical code amounts to the number of elementary operations
(additions, subtractions, multiplications, divisions, square roots) executed in a run.
Fifty years ago counting elementary operations provided good predictions of runtimes, but nowadays this
is no longer true.
! The computational effort involved in a run of a numerical code is only loosely related
to overall execution time on modern computers.
This is conspicuous in Exp. 1.2.29, where algorithms incurring exactly the same computational effort took
different times to execute.
The reason is that on today’s computers a key bottleneck for fast execution is latency and bandwidth of
memory, cf. the discussion at the end of Exp. 1.2.29 and [?]. Thus, concepts like I/O-complexity [?, ?]
might be more appropriate for gauging the efficiency of a code, because they take into account the pattern
of memory access.
The concept of computational effort from Def. 1.4.1 is still useful in a particular context:
• Problem size parameters in numerical linear algebra usually are the lengths and dimensions of the
vectors and matrices that an algorithm takes as inputs.
• Worst case indicates that the maximum effort over a set of admissible data is taken into account.
We write F(n) = O(G(n)) for two functions F, G : N → R if there exist a constant C > 0 and
n∗ ∈ N such that
F(n) ≤ C G(n)   ∀ n ≥ n∗ .
More generally, F(n₁, . . . , n_k) = O(G(n₁, . . . , n_k)) for two functions F, G : N^k → R means the
existence of a constant C > 0 and a threshold value n∗ ∈ N such that
F(n₁, . . . , n_k) ≤ C G(n₁, . . . , n_k)   whenever n₁, . . . , n_k ≥ n∗ .
Of course, the definition of the Landau symbol leaves ample freedom for stating meaningless bounds;
an algorithm that runs with linear complexity O(n) can be correctly labelled as possessing O(exp(n))
complexity.
Yet, whenever the Landau notation is used to describe asymptotic complexities, the bounds have to be
sharp in the sense that no function with slower asymptotic growth will be possible inside the O. To make
this precise we stipulate the following.
Whenever the asymptotic complexity of an algorithm is stated as O(nα log β n exp(γnδ )) with non-
negative parameters α, β, γ, δ ≥ 0 in terms of the problem size parameter n, we take for granted
that choosing a smaller value for any of the parameters will no longer yield a valid (or provable)
asymptotic bound.
In particular
✦ complexity O(n) means that the complexity is not O(nα ) for any α < 1,
✦ complexity O(exp(n)) excludes asymptotic complexity O(n p ) for any p ∈ R.
Terminology: If the asymptotic complexity of an algorithm is O(n p ) with p = 1, 2, 3 we say that it is of
“linear”, “quadratic”, and “cubic” complexity, respectively.
§ 1.4.2 warned us that computational effort and, thus, asymptotic complexity, of an algorithm for a concrete
problem on a particular platform may not have much to do with the actual runtime (the blame goes to
memory hierarchies, internal pipelining, vectorisation, etc.).
To a certain extent, the asymptotic complexity allows us to predict the dependence of the runtime of a
particular implementation of an algorithm on the problem size (for large problems).
For instance, an algorithm with asymptotic complexity O(n2 ) is likely to take 4× as much time when the
problem size is doubled.
If the conjecture holds true, then the points (ni , ti ) will approximately lie on a straight line with slope
α in a doubly logarithmic plot (which can be created in M ATLAB by the loglog plotting command and
in E IGEN with the Figure-command fig.setlog(true, true);).
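A tiny sketch (with made-up timing data) of how the exponent α in a conjectured runtime law t(n) ≈ C n^α can be estimated from measured pairs (n_i, t_i):

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical measured (n_i, t_i) pairs, e.g. from a timing experiment
  std::vector<double> n = {64, 128, 256, 512, 1024};
  std::vector<double> t = {2.1e-4, 8.5e-4, 3.4e-3, 1.4e-2, 5.5e-2};
  // If t ~ C*n^alpha, then alpha ~ log(t_{i+1}/t_i) / log(n_{i+1}/n_i)
  for (std::size_t i = 0; i + 1 < n.size(); ++i)
    std::cout << "alpha estimate: "
              << std::log(t[i+1] / t[i]) / std::log(n[i+1] / n[i]) << std::endl;
  return 0;
}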
Performing elementary BLAS-type operations through simple (nested) loops, we arrive at the following
obvious complexity bounds:
operation           description                                      #mul/div   #add/sub    asymptotic complexity
dot product         (x ∈ R^n, y ∈ R^n) ↦ x^H y                       n          n − 1       O(n)
tensor product      (x ∈ R^m, y ∈ R^n) ↦ x y^H                       nm         0           O(mn)
matrix product(∗)   (A ∈ R^{m,n}, B ∈ R^{n,k}) ↦ AB                  mnk        mk(n − 1)   O(mnk)
(∗): The O(mnk) complexity bound applies to “straightforward” matrix multiplication according to
(1.3.1).
For m = n = k there are (sophisticated) variants with better asymptotic complexity, e.g., the divide-and-conquer
Strassen algorithm [?] with asymptotic complexity O(n^{log₂ 7}):
Start from A, B ∈ K^{n,n} with n = 2ℓ, ℓ ∈ N. The idea relies on the block matrix product (1.3.16) with
A_{ij}, B_{ij} ∈ K^{ℓ,ℓ}, i, j ∈ {1, 2}. Let C := AB be partitioned accordingly: C = [ C₁₁ C₁₂ ; C₂₁ C₂₂ ]. Then tedious
elementary computations reveal
C₁₁ = Q₀ + Q₃ − Q₄ + Q₆ ,
C₂₁ = Q₁ + Q₃ ,
C₁₂ = Q₂ + Q₄ ,
C₂₂ = Q₀ + Q₂ − Q₁ + Q₅ ,
where each Q_k is a product of certain sums and differences of the blocks A_{ij}, B_{ij}.
Beside a considerable number of matrix additions (computational effort O(n²)) it takes only 7 multiplications
of matrices of size n/2 to compute C! Strassen’s algorithm boils down to the recursive application
of these formulas for n = 2^k, k ∈ N.
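The products Q₀, …, Q₆ are not spelled out above; the following sketch uses the standard Strassen products, which are consistent with the combination formulas quoted for C₁₁, C₁₂, C₂₁, C₂₂. It is an illustrative implementation (not a lecture code) and assumes that n is a power of two:

#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Recursive Strassen multiplication for square matrices with n = 2^k.
MatrixXd strassen(const MatrixXd &A, const MatrixXd &B) {
  const int n = A.rows();
  if (n <= 64) return A * B;  // fall back to the ordinary product for small n
  const int l = n / 2;
  MatrixXd A11 = A.topLeftCorner(l, l),    A12 = A.topRightCorner(l, l),
           A21 = A.bottomLeftCorner(l, l), A22 = A.bottomRightCorner(l, l),
           B11 = B.topLeftCorner(l, l),    B12 = B.topRightCorner(l, l),
           B21 = B.bottomLeftCorner(l, l), B22 = B.bottomRightCorner(l, l);
  // The seven Strassen products (standard choice, matching the formulas above)
  MatrixXd Q0 = strassen(A11 + A22, B11 + B22);
  MatrixXd Q1 = strassen(A21 + A22, B11);
  MatrixXd Q2 = strassen(A11, B12 - B22);
  MatrixXd Q3 = strassen(A22, B21 - B11);
  MatrixXd Q4 = strassen(A11 + A12, B22);
  MatrixXd Q5 = strassen(A21 - A11, B11 + B12);
  MatrixXd Q6 = strassen(A12 - A22, B21 + B22);
  MatrixXd C(n, n);
  C.topLeftCorner(l, l)     = Q0 + Q3 - Q4 + Q6;
  C.topRightCorner(l, l)    = Q2 + Q4;
  C.bottomLeftCorner(l, l)  = Q1 + Q3;
  C.bottomRightCorner(l, l) = Q0 + Q2 - Q1 + Q5;
  return C;
}

int main() {
  const int n = 256;  // power of two
  MatrixXd A = MatrixXd::Random(n, n), B = MatrixXd::Random(n, n);
  std::cout << (strassen(A, B) - A * B).norm() << std::endl;  // small residual
  return 0;
}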
A refined algorithm of this type can achieve complexity O(n^{2.36}), see [?].
In computations involving matrices and vectors the complexity of algorithms can often be reduced by performing
the operations in a particular order:
We consider the multiplication with a rank-1 matrix. Matrices with rank 1 can always be obtained as the
tensor product of two vectors, that is, the matrix product of a column vector and a row vector. Given
a ∈ K^m, b ∈ K^n, x ∈ K^n we may compute the vector y = ab⊤x in two ways:
y = (ab⊤)x ,  (1.4.12)             y = a(b⊤x) .  (1.4.13)
T = (a*b.transpose())*x;           t = a*b.dot(x);
➤ complexity O(mn)                 ➤ complexity O(n + m) (“linear complexity”)
[Fig.: runtime vs. problem size for the two evaluation orders (1.4.12) and (1.4.13); platform details omitted.]
The asymptotic complexity of the straightforward evaluation of triu(AB⊤)x, when supplied with two low-rank matrices A, B ∈ K^{n,p}, p ≪ n, in terms of n → ∞
obviously is O(n²), because an intermediate n × n-matrix AB⊤ is built.
First, consider the case of a tensor product (= rank-1) matrix, that is, p = 1, A ↔ a = [a₁, . . . , a_n]⊤ ∈ K^n,
B ↔ b = [b₁, . . . , b_n]⊤ ∈ K^n. Then
y = triu(ab⊤)x = [ a₁b₁  a₁b₂  ⋯  a₁b_n ; 0  a₂b₂  ⋯  a₂b_n ; ⋮ ; 0  ⋯  0  a_nb_n ] x
              = diag(a₁, . . . , a_n) · T · [ b₁x₁ ; ⋮ ; b_nx_n ] ,
with the upper triangular matrix of ones T, (T)_{i,j} = 1 for i ≤ j and 0 otherwise.
Thus, the core problem is the fast multiplication of a vector with the upper triangular matrix T, described
in EIGEN syntax by Eigen::MatrixXd::Ones(n,n).triangularView<Eigen::Upper>().
Note that multiplication of a vector x with T yields the vector of partial sums of the components of x, accumulated
starting from the last component. This can be achieved by invoking the special C++ command std::partial_sum
from the numeric header (documentation). We also observe that
AB⊤ = ∑_{ℓ=1}^{p} (A)_{:,ℓ} ((B)_{:,ℓ})⊤ ,
so that the computations for the special case p = 1 discussed above can simply be reused p times!
C++11 code 1.4.16: Efficient multiplication with the upper triangular part of a rank-p matrix in EIGEN ➺ GITLAB
2  //! Computation of y = triu(AB^T)x
3  //! Efficient implementation with backward cumulative sum
4  //! (partial_sum)
5  template <class Vec, class Mat>
6  void lrtrimulteff(const Mat& A, const Mat& B, const Vec& x, Vec& y) {
7    const int n = A.rows(), p = A.cols();
8    assert(n == B.rows() && p == B.cols()); // size mismatch
9    for (int l = 0; l < p; ++l) {
10     Vec tmp = (B.col(l).array() * x.array()).matrix().reverse();
11     std::partial_sum(tmp.data(), tmp.data()+n, tmp.data());
12     y += (A.col(l).array() * tmp.reverse().array()).matrix();
13   }
14 }
This code enjoys the obvious complexity of O( pn) for p, n → ∞, p < n. The code offers an example of
a function templated with its argument types, see § 0.2.5. The types Vec and Mat must fit the concept of
E IGEN vectors/matrices.
The next concept from linear algebra is important in the context of computing with multi-dimensional arrays.
A function evaluating (A ⊗ B)x directly, when invoked with two matrices A ∈ K^{m,n} and B ∈ K^{l,k} and a vector x ∈ K^{nk},
will suffer an asymptotic complexity of O(m · n · l · k), determined by the size of the intermediate dense
matrix A ⊗ B ∈ K^{ml,nk}.
The idea is to form the products Bx j , j = 1, . . . , n, once, and then combine them linearly with coefficients
given by the entries in the rows of A:
C++11 code 1.4.19: Efficient multiplication of Kronecker product with vector in EIGEN ➺ GITLAB
2  //! @brief Multiplication of Kronecker product with vector y = (A ⊗ B)x. Elegant way using reshape
3  //! WARNING: using Matrix::Map we assume the matrix is in ColMajor format,
//!          *beware* you may incur bugs if the matrix is in RowMajor instead
4  //! @param[in] A Matrix m × n
5  //! @param[in] B Matrix l × k
6  //! @param[in] x Vector of dim nk
7  //! @param[out] y Vector y = kron(A,B)*x of dim ml
8  template <class Matrix, class Vector>
9  void kronmultv(const Matrix &A, const Matrix &B, const Vector &x, Vector &y) {
10   unsigned int m = A.rows(); unsigned int n = A.cols();
11   unsigned int l = B.rows(); unsigned int k = B.cols();
12   // 1st matrix mult. computes the products Bx_j
13   // 2nd matrix mult. combines them linearly with the coefficients of A
14   Matrix t = B * Matrix::Map(x.data(), k, n) * A.transpose(); //
15   y = Matrix::Map(t.data(), m*l, 1);
16 }
Recall the reshaping of a matrix in E IGEN in order to understand this code: Rem. 1.2.27.
The asymptotic complexity of this code is determined by the two matrix multiplications in Line 14. This
yields the asymptotic complexity O(lkn + mnl ) for l, k, m, n → ∞.
Note that different reshaping is used in the P YTHON code due to the default row major storage order.
From linear algebra [?, Sect. 4.4] or Ex. 0.2.41 we recall the fundamental algorithm of Gram-Schmidt
orthogonalisation of an ordered finite set {a1 , . . . , ak }, k ∈ N, of vectors aℓ ∈ K n :
Input: {a₁, . . . , a_k} ⊂ K^n   (GS)
1: q₁ := a₁ / ‖a₁‖₂   % 1st output vector
2: for j = 2, . . . , k do
3: {  % Orthogonal projection
      q_j := a_j
4:    for ℓ = 1, 2, . . . , j − 1 do
5:    { q_j ← q_j − (a_j · q_ℓ) q_ℓ }
6:    if ( q_j = 0 ) then STOP
7:    else { q_j ← q_j / ‖q_j‖₂ }
8: }
Output: {q₁, . . . , q_j}
In linear algebra we have learnt that, if it does not STOP prematurely, this algorithm will compute
orthonormal vectors q₁, . . . , q_k satisfying
Span{q₁, . . . , q_ℓ} = Span{a₁, . . . , a_ℓ} ,   (1.5.2)
for all ℓ ∈ {1, . . . , k}. More precisely, if a₁, . . . , a_ℓ, ℓ ≤ k, are linearly independent, then the Gram-Schmidt
algorithm will not terminate before the (ℓ + 1)-th step.
✎ Notation: ‖·‖₂ =ˆ Euclidean norm of a vector ∈ K^n
The following code implements the Gram-Schmidt orthonormalization of a set of vectors passed as the
columns of a matrix A ∈ R n,k .
We will soon learn the rationale behind the odd test in Line 13.
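Code 1.5.3 itself is not reproduced at this point (it is available from the GITLAB repository); the following EIGEN sketch merely illustrates the algorithm and a termination test of the kind referred to above (the tolerance 1e-9 is taken from the PYTHON excerpt below):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Sketch of Gram-Schmidt orthonormalisation of the columns of A,
// following the pseudocode above (not the original Code 1.5.3).
MatrixXd gramschmidt_sketch(const MatrixXd &A) {
  MatrixXd Q = A;
  Q.col(0).normalize();                      // first output vector
  for (int j = 1; j < A.cols(); ++j) {
    // orthogonal projection onto the span of the previously computed q's
    Q.col(j) -= Q.leftCols(j) * (Q.leftCols(j).adjoint() * A.col(j));
    // termination test with a relative tolerance (cf. the remark on Line 13)
    if (Q.col(j).norm() <= 1e-9 * A.col(j).norm()) {
      return Q.leftCols(j);                  // premature termination
    }
    Q.col(j).normalize();
  }
  return Q;
}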
6 nq = np.linalg.norm(q)
7 i f nq < 1e-9 * np.linalg.norm(A[:, j]):
8 br eak
9 Q = np.column_stack([Q, q / nq])
10 return Q
Note the different loop range due to the zero-based indexing in P YTHON.
This property can easily be tested numerically, for instance by computing Q⊤Q for a matrix
Q = [q₁, . . . , q_k] ∈ R^{n,k}.
C++11 code 1.5.7: Wrong result from Gram-Schmidt orthogonalisation in EIGEN ➺ GITLAB
2  void gsroundoff(MatrixXd& A) {
3    // Gram-Schmidt orthogonalization of columns of A, see Code 1.5.3
4    MatrixXd Q = gramschmidt(A);
5    // Test orthonormality of columns of Q, which should be an
6    // orthogonal matrix according to theory
7    cout << setprecision(4) << fixed << "I = "
8         << endl << Q.transpose()*Q << endl;
9    // EIGEN's stable internal Gram-Schmidt orthogonalization by
10   // QR-decomposition, see Rem. 1.5.9 below
11   HouseholderQR<MatrixXd> qr(A.rows(),A.cols()); //
12   qr.compute(A); MatrixXd Q1 = qr.householderQ(); //
13   // Test orthonormality
14   cout << "I1 = " << endl << Q1.transpose()*Q1 << endl;
15   // Check orthonormality and span property (1.5.2)
16   MatrixXd R1 = qr.matrixQR().triangularView<Upper>();
17   cout << scientific << "A-Q1*R1 = " << endl << A-Q1*R1 << endl;
18 }
We test the orthonormality of the output vectors of Gram-Schmidt orthogonalization for a special matrix
A ∈ R^{10,10}, a so-called Hilbert matrix, defined by (A)_{i,j} = (i + j − 1)^{−1}. Then Code 1.5.7 produces the
following output:
I =
1.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 1.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 1.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 -0.0000 -0.0000 1.0000 0.0000 -0.0008 -0.0007 -0.0007 -0.0006
0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 -0.0540 -0.0430 -0.0360 -0.0289
-0.0000 -0.0000 -0.0000 -0.0000 -0.0008 -0.0540 1.0000 0.9999 0.9998 0.9996
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0430 0.9999 1.0000 1.0000 0.9999
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0360 0.9998 1.0000 1.0000 1.0000
-0.0000 -0.0000 -0.0000 -0.0000 -0.0006 -0.0289 0.9996 0.9999 1.0000 1.0000
Obviously, the vectors produced by the function gramschmidt fail to be orthonormal, contrary to the
predictions of rigorous results from linear algebra!
However, Line 11, Line 12 of Code 1.5.7 demonstrate another way to orthonormalize the columns of a
matrix using E IGEN’s built-in class template HouseholderQR:
I1 =
1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000
0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 1.0000 0.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 0.0000
-0.0000 -0.0000 0.0000 -0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 1.0000 0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000
-0.0000 0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000 1.0000 -0.0000
0.0000 -0.0000 0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 -0.0000 1.0000
Now we observe apparently perfect orthogonality (1.5.6) of the columns of the matrix Q1 in Code 1.5.7.
There is another algorithm that reliably yields the theoretical output of Gram-Schmidt orthogonalization.
Computers cannot compute “properly” in R : numerical computations may not respect the laws of
analysis and linear algebra!
In Code 1.5.7 we saw the use of the EIGEN class HouseholderQR<MatrixType> for the purpose of
Gram-Schmidt orthogonalisation. The underlying theory and algorithms will be explained later in Section
3.3.3. There we will have the following insight:
➣ Up to signs the columns of the matrix Q available from the QR-decomposition of A are the same
vectors as produced by the Gram-Schmidt orthogonalisation of the columns of A.
Code 1.5.7 demonstrates a case where a desired result can be obtained by two algebraically equiv-
alent computations, that is, they yield the same result in a mathematical sense. Yet, when im-
! plemented on a computer, the results can be vastly different. One algorithm may produce junk
(“unstable algorithm”), whereas the other lives up to the expectations (“stable algorithm”)
Supplement to Exp. 1.5.5: despite its ability to produce orthonormal vectors, we get as output for D=A-Q1*R1
in Code 1.5.7:
D =
2.2204e-16 3.3307e-16 3.3307e-16 1.9429e-16 1.9429e-16 5.5511e-17 1.3878e-16 6.9389e-17 8.3267e-17 9.7145e-17
0.0000e+00 1.1102e-16 8.3267e-17 5.5511e-17 0.0000e+00 5.5511e-17 -2.7756e-17 0.0000e+00 0.0000e+00 4.1633e-17
-5.5511e-17 5.5511e-17 2.7756e-17 5.5511e-17 0.0000e+00 0.0000e+00 0.0000e+00 -1.3878e-17 1.3878e-17 1.3878e-17
0.0000e+00 5.5511e-17 2.7756e-17 2.7756e-17 0.0000e+00 1.3878e-17 -1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 1.3878e-17 4.1633e-17
-2.7756e-17 2.7756e-17 1.3878e-17 4.1633e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 2.7756e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17 2.0817e-17
0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.0817e-17 2.7756e-17
1.3878e-17 1.3878e-17 1.3878e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 6.9389e-18 -6.9389e-18 1.3878e-17
0.0000e+00 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 0.0000e+00 0.0000e+00 1.3878e-17 1.3878e-17
➥ The computed QR-decomposition apparently fails to meet the exact algebraic requirements stipulated
by Thm. 3.3.9. However, note the tiny size of the “defect”.
The two different QR-decompositions (3.3.3.1) and (3.3.3.1) of a matrix A ∈ K n,k , n, k ∈ N, can be
computed in E IGEN as follows:
HouseholderQR<MatrixXd> qr(A);
// Full QR-decomposition (3.3.3.1)
Q = qr.householderQ();
// Economical QR-decomposition (3.3.3.1)
thinQ = qr.householderQ() *
MatrixXd::Identity(A.rows(), s t d ::min(A.rows(), A.cols()));
The returned matrices Q and R correspond to the QR-factors Q and R as defined above. See the
discussion of whether the “economical” decomposition is truly economical: some expression template
magic is involved.
The two different QR-decompositions (3.3.3.1) and (3.3.3.1) of a matrix A ∈ K^{n,k}, n, k ∈ N, can be
computed in PYTHON as follows:
1 Q, R = np.linalg.qr(A, mode=’reduced’) # Economical (3.3.3.1)
2 Q, R = np.linalg.qr(A, mode=’complete’) # Full (3.3.3.1)
The returned matrices Q and R correspond to the QR-factors Q and R as defined above.
The reason, why computers must fail to execute exact computations with real numbers is clear:
Computer = finite automaton  ➢  can handle only finitely many numbers, not R: the machine numbers, set M
The set of machine numbers M cannot be closed under elementary arithmetic operations
+, −, ·, /, that is, when adding, multiplying, etc., two machine numbers the result may not belong
to M.
The results of elementary operations with operands in M have to be mapped back to M, an oper-
ation called rounding.
The impact of roundoff means that mathematical identities may not carry over to the computational realm.
As we have seen above in Exp. 1.5.5: computers cannot compute “properly”!
numerical computations ≠ analysis / linear algebra
This introduces a new and important aspect in the study of numerical algorithms!
Now we give a brief sketch of the internal structure of machine numbers ∈ M:
machine number ∈ M :   x = ± 0.d₁d₂…d_m · B^E ,   leading digit d₁ never = 0 ,
with m digits for the mantissa and a fixed number of digits for the exponent E.
Clearly, there is a largest element of M and two that are closest to zero. These are mainly determined by
the range for the exponent E, cf. Def. 1.5.15, and in C++ they can be queried via the
std::numeric_limits<double>::max() and std::numeric_limits<double>::min()
functions. Other properties of arithmetic types can be queried accordingly from the numeric_limits header.
From Def. 1.5.15 it is clear that there are equi-spaced sections of M and that the gaps between machine
numbers are bigger for larger numbers: starting at B^{e_min − 1} the spacing is B^{e_min − m}, in the next section
B^{e_min − m + 1}, then B^{e_min − m + 2}, and so on. The gap around 0 is partly filled with non-normalized numbers.
Non-normalized numbers violate the lower bound for the mantissa in Def. 1.5.15.
(1.5.18) IEEE standard 754 for machine numbers → [?], [?, Sect. 2.4], → link
No surprise: for modern computers B = 2 (binary system), the other parameters of the universally
implemented machine number system are
single precision : m = 24∗ ,E ∈ {−125, . . . , 128} ➣ 4 bytes
double precision : m = 53∗ ,E ∈ {−1021, . . . , 1024} ➣ 8 bytes
∗: including bit indicating sign
The standardisation of machine numbers is important, because it ensures that the same numerical algo-
rithm, executed on different computers will nevertheless produce the same result.
Output:
1 inf
2 0
3 inf
4 -nan
E = e_max, M ≠ 0   =ˆ  NaN = Not a number → exception
E = e_max, M = 0   =ˆ  Inf = Infinity → overflow
E = 0              =ˆ  non-normalized numbers → underflow
E = 0, M = 0       =ˆ  number 0
☞ C++ does not always fulfill the requirements of the IEEE 754 standard and it needs to be checked
with std::numeric_limits<T>::is_iec559.
8  int main() {
9    cout << std::numeric_limits<double>::is_iec559 << endl
10        << std::defaultfloat << numeric_limits<double>::min() << endl
11        << std::hexfloat << numeric_limits<double>::min() << endl
12        << std::defaultfloat << numeric_limits<double>::max() << endl
13        << std::hexfloat << numeric_limits<double>::max() << endl;
14 }
1 true
2 2.22507e−308
Output: 3 0010000000000000
4 1.79769e+308
5 7fefffffffffffff
Output:
1 2.22044604925031e−16
2 6.75015598972095e−14
3 −1.60798663273454e−09
Can you devise a similar calculation, whose result is even farther off zero? Apparently the rounding that
inevitably accompanies arithmetic operations in M can lead to results that are far away from the true
result.
ε_abs := |x − x̃| ,     ε_rel := |x − x̃| / |x| .
The number of correct (significant, valid) digits of an approximation x̃ of x ∈ K is defined through the
relative error:
If ε_rel := |x − x̃| / |x| ≤ 10^{−ℓ}, then x̃ has ℓ correct digits, ℓ ∈ N₀.
(Recall that argmin x F( x ) is the set of arguments of a real valued function F that makes it attain its (global)
minimum.)
Of course, ➊ above is not possible in a strict sense, but the effect of both steps can be realised and yields
a floating point realization of ⋆ ∈ {+, −, ·, /}.
Let us denote by EPS the largest relative error (→ Def. 1.5.24) incurred through rounding:
EPS := max_{x ∈ I\{0}}  |rd(x) − x| / |x| ,   (1.5.30)
For machine numbers according to Def. 1.5.15 EPS can be computed from the defining parameters B
(base) and m (length of mantissa) [?, p. 24]:
However, when studying roundoff errors, we do not want to delve into the intricacies of the internal repre-
sentation of machine numbers. This can be avoided by just using a single bound for the relative error due
to rounding, and, thus, also for the relative error potentially suffered in each elementary operation.
There is a small positive number EPS, the machine precision, such that for the elementary arithmetic operations ⋆ ∈ {+, −, ·, /} and “hard-wired” functions∗ f ∈ {exp, sin, cos, log, …} it holds that
x ⋆̃ y = (x ⋆ y)(1 + δ) ,   f̃(x) = f(x)(1 + δ)   ∀ x, y ∈ M ,  with |δ| ≤ EPS.
Output:
1 2.22044604925031 e−16
Knowing the machine precision can be important for checking the validity of computations or coding ter-
mination conditions for iterative approximations.
cout.precision(25);
double eps = numeric_limits<double>::epsilon();
cout << fixed << 1.0 + 0.5*eps << endl
     << 1.0 - 0.5*eps << endl
     << (1.0 + 2/eps) - 2/eps << endl;

In fact, the following “definition” of EPS is sometimes used: EPS is the smallest positive number ∈ M for which 1 +̃ EPS ≠ 1 (in M).

Output:
1 1.0000000000000000000000000
2 0.9999999999999998889776975
3 0.0000000000000000000000000
We find that 1 +̃ EPS = 1 actually complies with the “axiom” of roundoff error analysis, Ass. 1.5.32:
1 = (1 + EPS)(1 + δ)  ⇒  |δ| = EPS/(1 + EPS) < EPS ,
2/EPS = (1 + 2/EPS)(1 + δ)  ⇒  |δ| = EPS/(2 + EPS) < EPS .
!  Do we have to worry about these tiny roundoff errors?
Since results of numerical computations are almost always polluted by roundoff errors:
Tests like if (x == 0) are pointless and even dangerous, if x contains the result of a numerical computation.
!  Remedy: test if (abs(x) < eps*s) ..., where s ≙ a positive number, compared to which |x| should be small.
We saw a first example of this practice in Code 1.5.3, Line 13.
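A minimal sketch of this pattern (a hypothetical helper, not one of the lecture codes; the scale s is an assumption supplied by the caller):

#include <cmath>
#include <limits>
// Decide whether a computed value x is "numerically zero" relative to a
// problem-dependent scale s (e.g. the norm of the input data).
bool isNumericallyZero(double x, double s) {
  const double eps = std::numeric_limits<double>::epsilon();
  return std::abs(x) < eps * s;   // never test x == 0.0 for computed x
}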
1 5.68175492717434 e−322
Output: 2 3.15248510554597
A simple example teaching how to avoid overflow during the computation of the norm of a 2D vector [?, Ex. 2.9]:
r = √(x² + y²) = |x| · √(1 + (y/x)²) , if |x| ≥ |y| ,
r = √(x² + y²) = |y| · √(1 + (x/y)²) , if |y| > |x| .
Straightforward evaluation of √(x² + y²) overflows when |x| > √(max |M|) or |y| > √(max |M|); with the rescaled formulas the argument of the square root stays of moderate size. ➢ no overflow!
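A short C++ sketch of this rescaling (not one of the lecture codes):

#include <cmath>
// Overflow-avoiding evaluation of r = sqrt(x^2 + y^2): scale by the larger
// of |x|, |y| so that the argument of sqrt stays close to 1.
double norm2d(double x, double y) {
  double ax = std::abs(x), ay = std::abs(y);
  if (ax == 0.0 && ay == 0.0) return 0.0;
  if (ax >= ay) { double q = ay / ax; return ax * std::sqrt(1.0 + q * q); }
  else          { double q = ax / ay; return ay * std::sqrt(1.0 + q * q); }
}
// (The C++11 standard library provides std::hypot(x, y) for the same purpose.)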
1.5.4 Cancellation
In general, predicting the impact of roundoff errors on the result of a multi-stage computation is very diffi-
cult, if possible at all. However, there is a constellation that is particularly prone to dangerous amplification
of roundoff and still can be detected easily.
The following simple E IGEN code computes the real roots of a quadratic polynomial p(ξ ) = ξ 2 + αξ + β
by the discriminant formula
p(ξ₁) = p(ξ₂) = 0 ,   ξ_{1,2} = ½ (−α ± √D) ,  if D := α² − 4β ≥ 0 .   (1.5.40)
C++11-code 1.5.41: Discriminant formula for the real roots of p(ξ) = ξ² + αξ + β ➺ GITLAB
//! C++ function computing the zeros of a quadratic polynomial
//! ξ ↦ ξ² + αξ + β by means of the familiar discriminant
//! formula ξ_{1,2} = ½(−α ± √(α² − 4β)). However
//! this implementation is vulnerable to round-off! The zeros are
//! returned in a column vector
Vector2d zerosquadpol(double alpha, double beta) {
  Vector2d z;
  // (remaining steps sketched here; the complete listing is on GITLAB)
  double D = std::pow(alpha, 2) - 4 * beta;   // discriminant
  if (D < 0) throw "no real zeros";
  double wD = std::sqrt(D);
  z << (-alpha - wD) / 2, (-alpha + wD) / 2;  // both roots from (1.5.40)
  return z;
}
This formula is applied to the quadratic polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) after its coefficients α, β have been computed from γ, which will have introduced small relative roundoff errors (of size EPS).
Observation: for large γ the computed small root may be fairly inaccurate (large relative error), whereas the large root is obtained with a small relative error; the relative errors in the coefficients lead to a “wrong” small root.
In order to understand why the small root is much more severely affected by roundoff, note that its com-
putation involves the subtraction of two large numbers, if γ is large. This is the typical situation, in which
cancellation occurs.
We look at the exact subtraction of two almost equal positive numbers both of which have small relative
errors (red boxes) with respect to some desired exact value (indicated by blue boxes). The result of the
subtraction will be small, but the errors may add up during the subtraction, ultimately constituting a large
fraction of the result.
Cancellation ≙ subtraction of almost equal numbers
(➤ extreme amplification of relative errors)
[Fig. 41: the (absolute) errors of the two operands carry over unchanged into the small result of the subtraction, where they make up a large fraction. The roundoff error introduced by the subtraction itself is negligible.]
We consider two positive numbers x, y of about the same size, afflicted with relative errors ≈ 10⁻⁷. This means that their seventh decimal digits are perturbed, indicated here by ∗. When we subtract the two numbers, the perturbed digits are shifted to the left (the leading digits cancel and zeros are padded at the end), resulting in a possible relative error of ≈ 10⁻³.
Again, this example demonstrates that cancellation wreaks havoc through error amplification, not through
the roundoff error due to the subtraction.
Example 1.5.45 (Cancellation when evaluating difference quotients → [?, Sect. 8.2.6], [?, Ex. 1.3])
f′(x) = lim_{h→0} (f(x + h) − f(x)) / h .
This suggests the following approximation of the derivative by a difference quotient with small but finite h > 0:
f′(x) ≈ (f(x + h) − f(x)) / h   for |h| ≪ 1 .
Results from analysis tell us that the approximation error should tend to zero for h → 0. More precise quantitative information is provided by the Taylor formula for a twice continuously differentiable function [?, p. 5]:
(f(x + h) − f(x))/h − f′(x) = ½ h f″(ξ)   for some ξ = ξ(x, h) ∈ [min{x, x + h}, max{x, x + h}] .   (1.5.47)
We investigate the approximation of the derivative by difference quotients for f = exp, x = 0, and
different values of h > 0:
[Fig. 42: relative error of the difference quotient approximation as a function of h.]
Line 16 (of the corresponding code): the literal "1.1" instead of 1.1 prevents the conversion to double.
Obvious culprit: cancellation when computing the numerator of the difference quotient for small |h|
leads to a strong amplification of inevitable errors introduced by the evaluation of the transcendent
exponential function.
We witness the competition of two opposite effects: Smaller h results in a better approximation of the
derivative by the difference quotient, but the impact of cancellation is the stronger the smaller |h|.
Approximation error  | f′(x) − (f(x + h) − f(x))/h |  → 0  as h → 0 ,
Impact of roundoff  → ∞  as h → 0 .
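The competition of the two effects can be reproduced with a few lines of C++ (a sketch, not the lecture's code; the exact numbers depend on the platform):

#include <cmath>
#include <iostream>
// One-sided difference quotients for f = exp at x = 0 (exact derivative 1):
// the error first decreases with h and then grows again once cancellation in
// exp(x+h) - exp(x) dominates.
int main() {
  const double x = 0.0;
  for (double h = 1e-1; h > 1e-16; h /= 10.0) {
    double dq = (std::exp(x + h) - std::exp(x)) / h;   // difference quotient
    std::cout << "h = " << h << ", error = " << std::abs(dq - std::exp(x)) << "\n";
  }
}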
In order to provide a rigorous underpinning for our conjecture, in this example we embark on our first
roundoff error analysis merely based on the “Axiom of roundoff analysis” Ass. 1.5.32: As in the computa-
tional example above we study the approximation of f ′ ( x ) = e x for f = exp, x ∈ R .
(Note that the estimate for the term (e^h − 1)/h is a particular case of (1.5.47).)
relative error:  |e^x − d̃f| / e^x ≈ h + 2·EPS/h  →  min  for  h = √(2·EPS) ,
where d̃f denotes the computed difference quotient.
In double precision: √(2·EPS) = 2.107342425544702 · 10⁻⁸.
In the numerical experiment of Ex. 1.5.45 we computed the relative error of the result by subtraction, see Code 1.5.48. Of course, massive cancellation will occur! Do we have to worry?
In this case cancellation can be tolerated, because we are interested only in the magnitude of the relative error. Even if that magnitude is itself affected by a large relative error, the information is not compromised.
For example, if the relative error has the exact value 10⁻⁸ but can be computed only with a huge relative error of 10%, then the perturbed value still lies in the range [0.9·10⁻⁸, 1.1·10⁻⁸]. Therefore it still has the correct magnitude and still permits us to determine the number of valid digits correctly.
The matrix created by the M ATLAB command A = hilb(10), the so-called Hilbert matrix, has columns
that are almost linearly dependent.
a² − b² = (a + b)(a − b) ,  a, b ∈ ℝ .
We evaluate this term by means of two algebraically equivalent algorithms for the input data a = 1.3, b = 1.2 in 2-digit decimal arithmetic with standard rounding. (“Algebraically equivalent” means that the two algorithms produce the same results in the absence of roundoff errors.)

Algorithm A                        Algorithm B
x := a ·̃ a = 1.7 (rounded)         x := a +̃ b = 2.5 (exact)
y := b ·̃ b = 1.4 (rounded)         y := a −̃ b = 0.1 (exact)
x −̃ y = 0.30 (exact)               x ·̃ y = 0.25 (exact)
Algorithm B produces the exact result, whereas Algorithm A fails to do so. Is this pure coincidence or an
indication of the superiority of algorithm B? This question can be answered by roundoff error analysis. We
demonstrate the approach for the two algorithms A & B and general input a, b ∈ ℝ.
Roundoff error analysis heavily relies on Ass. 1.5.32 and dropping terms of “higher order” in the machine
precision, that is terms that behave like O(EPSq ), q > 1. It involves introducing the relative roundoff error
for every elementary operation through a factor (1 + δ), |δ| ≤ EPS.
Algorithm A:
x = a²(1 + δ₁) ,  y = b²(1 + δ₂) ,
f̃ = (a²(1 + δ₁) − b²(1 + δ₂))(1 + δ₃) = f + a²δ₁ − b²δ₂ + (a² − b²)δ₃ + O(EPS²) ,
|f̃ − f| / |f| ≤ EPS · (a² + b² + |a² − b²|) / |a² − b²| + O(EPS²) = EPS · (1 + (a² + b²)/|a² − b²|) + O(EPS²) ,   (1.5.53)
where the O(EPS²) terms will be neglected.
For a ≈ b the relative error of the result of Algorithm A will be much larger than the machine
precision EPS. This reflects cancellation in the last subtraction step.
Algorithm B:
x = (a + b)(1 + δ₁) ,  y = (a − b)(1 + δ₂) ,
f̃ = (a + b)(a − b)(1 + δ₁)(1 + δ₂)(1 + δ₃) = f + (a² − b²)(δ₁ + δ₂ + δ₃) + O(EPS²) ,
|f̃ − f| / |f| ≤ |δ₁ + δ₂ + δ₃| + O(EPS²) ≤ 3·EPS + O(EPS²) .   (1.5.54)
The reason is that input data and initial intermediate results are usually not as much tainted by roundoff errors as numbers computed after many steps.
The following examples demonstrate a few fundamental techniques for steering clear of cancellation by
using alternative formulas that yield the same value (in exact arithmetic), but do not entail subtracting two
numbers of almost equal size.
Example 1.5.56 (Stable discriminant formula → Ex. 1.5.39, [?, Ex. 2.10])
If ξ 1 and ξ 2 are the two roots of the quadratic polynomial p(ξ ) = ξ 2 + αξ + β, then ξ 1 · ξ 2 = β (Vieta’s
formula). Thus once we have computed a root, we can obtain the other by simple division.
Idea:
➊ Depending on the sign of α compute “stable root” without cancellation.
➋ Compute other root from Vieta’s formula (avoiding subtraction)
➥ Invariably, we add numbers with the same sign in Line 17 and Line 21.
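The lecture's Code 1.5.42 is not reproduced here; the following is a minimal sketch of the same idea (the sign case distinction guarantees same-sign addition, the second root comes from Vieta's formula):

#include <cmath>
#include <Eigen/Dense>
// Stable computation of the real roots of p(xi) = xi^2 + alpha*xi + beta,
// assuming the discriminant D = alpha^2 - 4*beta is nonnegative.
Eigen::Vector2d zerosquadpolstab(double alpha, double beta) {
  Eigen::Vector2d z;
  double wD = std::sqrt(alpha * alpha - 4 * beta);  // sqrt of discriminant
  if (alpha >= 0) {
    double t = 0.5 * (-alpha - wD);   // "stable" root: same-sign addition
    z << t, beta / t;                 // other root via Vieta: xi1*xi2 = beta
  } else {
    double t = 0.5 * (-alpha + wD);   // "stable" root
    z << beta / t, t;
  }
  return z;
}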
[Fig. 44: roundoff in the computation of the zeros of a parabola — relative error in ξ₁ versus γ (range 0…1000, scale ×10⁻¹¹) for the unstable Code 1.5.42 and the stable variant.]
Observation: the new code also computes the small root of the polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) (expanded into monomials) with small relative error, uniformly in γ.
[Fig. 45: relative error of the naive evaluation of 1 − cos(x) as a function of x ∈ [10⁻⁷, 1] (log-log scale); the error grows dramatically as x → 0 due to cancellation.]
Analytic manipulations offer ample opportunity to rewrite expressions in equivalent form immune to
cancellation.
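An example of such a rewriting, matching the situation of Fig. 45 (a sketch, not one of the lecture codes): 1 − cos x = 2 sin²(x/2), which involves no subtraction of nearly equal numbers.

#include <cmath>
// Naive form: cancellation for |x| << 1.
double one_minus_cos_naive(double x)  { return 1.0 - std::cos(x); }
// Analytically equivalent, cancellation-free form.
double one_minus_cos_stable(double x) { double s = std::sin(0.5 * x); return 2.0 * s * s; }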
[Figs. 46, 47: regular n-gon inscribed in the unit circle, split into n isosceles triangles with apex angle α_n := 2π/n.]
Area of the n-gon:
A_n = n · cos(α_n/2) · sin(α_n/2) = (n/2) · sin α_n = (n/2) · sin(2π/n) .
A recursion formula for A_n (doubling n in every step) is derived from the half-angle identity
sin(α_n/2) = √((1 − cos α_n)/2) = √((1 − √(1 − sin² α_n)) / 2) .
Initial approximation: A₆ = (3/2)√3 .
The approximation deteriorates after applying the recursion formula many times:
n | A_n | A_n − π | sin α_n
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589794 0.500000000000000
24 3.105828541230250 -0.035764112359543 0.258819045102521
48 3.132628613281237 -0.008964040308556 0.130526192220052
96 3.139350203046872 -0.002242450542921 0.065403129230143
192 3.141031950890530 -0.000560702699263 0.032719082821776
384 3.141452472285344 -0.000140181304449 0.016361731626486
768 3.141557607911622 -0.000035045678171 0.008181139603937
1536 3.141583892148936 -0.000008761440857 0.004090604026236
3072 3.141590463236762 -0.000002190353031 0.002045306291170
6144 3.141592106043048 -0.000000547546745 0.001022653680353
12288 3.141592516588155 -0.000000137001638 0.000511326906997
24576 3.141592618640789 -0.000000034949004 0.000255663461803
49152 3.141592645321216 -0.000000008268577 0.000127831731987
98304 3.141592645321216 -0.000000008268577 0.000063915865994
196608 3.141592645321216 -0.000000008268577 0.000031957932997
393216 3.141592645321216 -0.000000008268577 0.000015978966498
786432 3.141593669849427 0.000001016259634 0.000007989485855
1572864 3.141592303811738 -0.000000349778055 0.000003994741190
3145728 3.141608696224804 0.000016042635011 0.000001997381017
6291456 3.141586839655041 -0.000005813934752 0.000000998683561
12582912 3.141674265021758 0.000081611431964 0.000000499355676
25165824 3.141674265021758 0.000081611431964 0.000000249677838
50331648 3.143072740170040 0.001480086580246 0.000000124894489
100663296 3.159806164941135 0.018213511351342 0.000000062779708
201326592 3.181980515339464 0.040387861749671 0.000000031610136
402653184 3.354101966249685 0.212509312659892 0.000000016660005
805306368 4.242640687119286 1.101048033529493 0.000000010536712
1610612736 6.000000000000000 2.858407346410207 0.000000007450581
sin(α_n/2) = √((1 − cos α_n)/2) = √((1 − √(1 − sin² α_n)) / 2) :
for α_n ≪ 1 we have √(1 − sin² α_n) ≈ 1  ➤  cancellation in the numerator!
We arrive at an equivalent formula not vulnerable to cancellation, essentially using the identity (a + b)(a − b) = a² − b² in order to eliminate the difference of square roots in the numerator:
sin(α_n/2) = √((1 − √(1 − sin² α_n)) / 2)
           = √( (1 − √(1 − sin² α_n)) (1 + √(1 − sin² α_n)) / (2 (1 + √(1 − sin² α_n))) )
           = √( (1 − (1 − sin² α_n)) / (2 (1 + √(1 − sin² α_n))) )
           = sin α_n / √( 2 (1 + √(1 − sin² α_n)) ) .
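Both recursions fit into a few lines of C++ (a sketch under the assumptions above, not one of the lecture codes):

#include <cmath>
#include <iostream>
// Unstable and stable recursion for s_n := sin(alpha_n), alpha_n = 2*pi/n;
// the area of the inscribed n-gon is A_n = n/2 * s_n. Doubling n in each
// step uses the half-angle formula in its two equivalent forms.
int main() {
  double s_unstable = std::sqrt(3.0) / 2.0;   // sin(alpha_6)
  double s_stable   = s_unstable;
  double n = 6.0;
  for (int k = 0; k < 30; ++k) {
    // unstable: subtraction of nearly equal numbers once alpha_n is small
    s_unstable = std::sqrt((1.0 - std::sqrt(1.0 - s_unstable * s_unstable)) / 2.0);
    // stable: algebraically equivalent, no cancellation
    s_stable = s_stable / std::sqrt(2.0 * (1.0 + std::sqrt(1.0 - s_stable * s_stable)));
    n *= 2.0;
    std::cout << n << ": unstable A_n = " << n / 2.0 * s_unstable
              << ", stable A_n = " << n / 2.0 * s_stable << "\n";
  }
}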
Using the stable recursion, we observe better approximation for polygons with more corners:
n | A_n | A_n − π | sin α_n
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589793 0.500000000000000
24 3.105828541230249 -0.035764112359544 0.258819045102521
48 3.132628613281238 -0.008964040308555 0.130526192220052
96 3.139350203046867 -0.002242450542926 0.065403129230143
192 3.141031950890509 -0.000560702699284 0.032719082821776
384 3.141452472285462 -0.000140181304332 0.016361731626487
768 3.141557607911857 -0.000035045677936 0.008181139603937
1536 3.141583892148318 -0.000008761441475 0.004090604026235
3072 3.141590463228050 -0.000002190361744 0.002045306291164
6144 3.141592105999271 -0.000000547590522 0.001022653680338
12288 3.141592516692156 -0.000000136897637 0.000511326907014
24576 3.141592619365383 -0.000000034224410 0.000255663461862
49152 3.141592645033690 -0.000000008556103 0.000127831731976
98304 3.141592651450766 -0.000000002139027 0.000063915866118
196608 3.141592653055036 -0.000000000534757 0.000031957933076
393216 3.141592653456104 -0.000000000133690 0.000015978966540
786432 3.141592653556371 -0.000000000033422 0.000007989483270
1572864 3.141592653581438 -0.000000000008355 0.000003994741635
3145728 3.141592653587705 -0.000000000002089 0.000001997370818
6291456 3.141592653589271 -0.000000000000522 0.000000998685409
12582912 3.141592653589663 -0.000000000000130 0.000000499342704
25165824 3.141592653589761 -0.000000000000032 0.000000249671352
50331648 3.141592653589786 -0.000000000000008 0.000000124835676
100663296 3.141592653589791 -0.000000000000002 0.000000062417838
201326592 3.141592653589794 0.000000000000000 0.000000031208919
402653184 3.141592653589794 0.000000000000001 0.000000015604460
805306368 3.141592653589794 0.000000000000001 0.000000007802230
1610612736 3.141592653589794 0.000000000000001 0.000000003901115
[Fig. 48: approximation error |A_n − π| versus n (log-log scale) for the unstable and the stable recursion; the stable recursion reaches errors near machine precision, while the unstable one eventually drifts away from π.]
A related example: approximating exp(x) by summing its Taylor series yields the approximation ẽxp(x) with the following results:
x | approximation ẽxp(x) | exp(x) | |exp(x) − ẽxp(x)| / exp(x)
-20 6.1475618242e-09 2.0611536224e-09 1.982583033727893
-18 1.5983720359e-08 1.5229979745e-08 0.049490585500089
-16 1.1247503300e-07 1.1253517472e-07 0.000534425951530
-14 8.3154417874e-07 8.3152871910e-07 0.000018591829627
-12 6.1442105142e-06 6.1442123533e-06 0.000000299321453
-10 4.5399929604e-05 4.5399929762e-05 0.000000003501044
-8 3.3546262812e-04 3.3546262790e-04 0.000000000662004
-6 2.4787521758e-03 2.4787521767e-03 0.000000000332519
-4 1.8315638879e-02 1.8315638889e-02 0.000000000530724
-2 1.3533528320e-01 1.3533528324e-01 0.000000000273603
0 1.0000000000e+00 1.0000000000e+00 0.000000000000000
2 7.3890560954e+00 7.3890560989e+00 0.000000000479969
4 5.4598149928e+01 5.4598150033e+01 0.000000001923058
6 4.0342879295e+02 4.0342879349e+02 0.000000001344248
8 2.9809579808e+03 2.9809579870e+03 0.000000002102584
10 2.2026465748e+04 2.2026465795e+04 0.000000002143799
12 1.6275479114e+05 1.6275479142e+05 0.000000001723845
14 1.2026042798e+06 1.2026042842e+06 0.000000003634135
16 8.8861105010e+06 8.8861105205e+06 0.000000002197990
18 6.5659968911e+07 6.5659969137e+07 0.000000003450972
20 4.8516519307e+08 4.8516519541e+08 0.000000004828737
[Fig. 49: value of the k-th summand x^k/k! of the exponential series for x = −20, plotted against the index k (scale ×10⁷).]
Observation: the summands reach magnitudes of roughly 4·10⁷ with alternating sign, while the result exp(−20) ≈ 2·10⁻⁹. The tiny result emerges from the near-cancellation of huge intermediate terms, so their roundoff errors completely swamp it.
A simple remedy for negative arguments:
exp(x) = 1 / exp(−x) ,  if x < 0 .
Recall the Taylor expansion formula in one dimension for a function f that is m + 1 times continuously differentiable in a neighborhood of x [?, Satz 5.5.1]:
f(x + h) = Σ_{k=0}^{m} (1/k!) f^(k)(x) h^k + R_m(x, h) ,   R_m(x, h) = (1/(m+1)!) f^(m+1)(ξ) h^(m+1) ,
for some ξ ∈ [min{x, x + h}, max{x, x + h}] and for all sufficiently small |h|. Here R_m(x, h) is called the remainder term and f^(k) denotes the k-th derivative of f.
Cancellation in (1.5.66) can be avoided by replacing exp(a), a > 0, with a suitable Taylor expansion around a = 0 and then dividing by a:
(exp(a) − 1)/a = Σ_{k=0}^{m} (1/(k+1)!) a^k + R_m(a) ,   R_m(a) = (1/(m+1)!) exp(ξ) a^m  for some 0 ≤ ξ ≤ a .
For a similar discussion see [?, Ex. 2.12].
Issue: How to choose the number m of terms to be retained in the Taylor expansion? We have to pick
m large enough such that the relative approximation error remains below a prescribed threshold tol. To
estimate the relative approximation error, we use the expression for the remainder together with the simple
estimate (exp(a) − 1)/a > 1 for all a > 0:
rel. err. = | (e^a − 1)/a − Σ_{k=0}^{m} a^k/(k+1)! | / ((e^a − 1)/a) ≤ (1/(m+1)!) exp(ξ) a^m ≤ (1/(m+1)!) exp(a) a^m .
For a = 10⁻³ we get
m     | 1          | 2          | 3          | 4          | 5
bound | 1.0010e-03 | 5.0050e-07 | 1.6683e-10 | 4.1708e-14 | 8.3417e-18
Hence, keeping m = 3 terms is enough for achieving about 10 valid digits.
In the corresponding MATLAB code the truncated Taylor expansion v = 1.0 + (1.0/2 + 1.0/6*a)*a is used for small |a|, while the straightforward expression v = (exp(a)-1.0)/a is used otherwise.
[Fig. 50: relative error of the straightforward evaluation (exp(a)-1.0)/a versus the argument a (log-log scale, errors ranging from ≈ 10⁻¹⁶ up to ≈ 10⁻⁶ as a → 0); the error is computed by comparison with MATLAB's built-in stable evaluation of exp(x) − 1.]
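In C++ the same cancellation problem can be sidestepped with the standard-library function std::expm1 (a sketch, not one of the lecture codes):

#include <cmath>
// Evaluation of g(a) = (exp(a) - 1)/a for a != 0.
double g_naive(double a)  { return (std::exp(a) - 1.0) / a; }   // cancellation for |a| << 1
double g_stable(double a) { return std::expm1(a) / a; }         // std::expm1 evaluates exp(a)-1 accurately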
We have seen that a particular “problem” can be tackled by different “algorithms”, which produce different
results due to roundoff errors. This section will clarify what distinguishes a “good” algorithm from a rather
abstract point of view.
Note: In this course, both the data space X and the result space Y will always be subsets of finite dimen-
sional vector spaces.
We consider the “problem” of computing the product Ax for a given matrix A ∈ K^{m,n} and a given vector x ∈ K^n.
➣ • data space X = K^{m,n} × K^n (input is a matrix and a vector),
  • result space Y = K^m (space of column vectors),
  • problem function F : X → Y, F(A, x) := Ax.
Norms provide tools for measuring errors. Recall from linear algebra and calculus [?, Sect. 4.3], [?,
Sect. 6.1]:
All norms on the vector space K^n, n ∈ ℕ, are equivalent in the sense that for any two norms ‖·‖₁ and ‖·‖₂ we can always find a constant C > 0 such that
‖v‖₁ ≤ C ‖v‖₂   ∀ v ∈ K^n .   (1.5.72)
Of course, the constant C will usually depend on n and the norms under consideration.
For the vector norms introduced above, explicit expressions for the constants “C” are available: for all x ∈ K^n
‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂ ,   (1.5.73)
‖x‖_∞ ≤ ‖x‖₂ ≤ √n ‖x‖_∞ ,   (1.5.74)
‖x‖_∞ ≤ ‖x‖₁ ≤ n ‖x‖_∞ .   (1.5.75)
The matrix space K m,n is a vector space, of course, and can also be equipped with various norms. Of
particular importance are norms induced by vector norms on K n and K m .
Given vector norms ‖·‖₁ and ‖·‖₂ on K^n and K^m, respectively, the associated matrix norm is defined by
M ∈ K^{m,n} :   ‖M‖ := sup_{x ∈ K^n \ {0}} ‖Mx‖₂ / ‖x‖₁ .
By virtue of the definition the matrix norms enjoy an important property: they are sub-multiplicative, ‖AB‖ ≤ ‖A‖‖B‖.
✎ notations for matrix norms of square matrices associated with the standard vector norms:
‖x‖₂ → ‖M‖₂ ,  ‖x‖₁ → ‖M‖₁ ,  ‖x‖_∞ → ‖M‖_∞
Rather simple formulas are available for the matrix norms induced by the vector norms ‖·‖₁ and ‖·‖_∞:
➢ matrix norm ↔ ‖·‖₁ = column sum norm   ‖M‖₁ := max_{j=1,…,n} Σ_{i=1}^{m} |m_ij| ,   (1.5.80)
➢ matrix norm ↔ ‖·‖_∞ = row sum norm     ‖M‖_∞ := max_{i=1,…,m} Σ_{j=1}^{n} |m_ij| .
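These formulas are easy to evaluate directly; a small Eigen sketch (not one of the lecture codes):

#include <Eigen/Dense>
#include <iostream>
int main() {
  Eigen::MatrixXd M(2, 3);
  M << 1, -2, 3,
      -4,  5, -6;
  double norm1   = M.cwiseAbs().colwise().sum().maxCoeff(); // ||M||_1   (max column sum)
  double normInf = M.cwiseAbs().rowwise().sum().maxCoeff(); // ||M||_inf (max row sum)
  std::cout << norm1 << " " << normInf << std::endl;        // prints 9 and 15
}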
Sometimes special formulas for the Euclidean matrix norm come in handy [?, Sect. 2.3.3]:
A ∈ K^{n,n}, A = A^H   ⇒   ‖A‖₂ = max_{x ≠ 0} |x^H A x| / ‖x‖₂² .
Proof. Recall from linear algebra: Hermitian matrices (a special class of normal matrices) enjoy unitary
similarity to diagonal matrices:
Hence, both expressions in the statement of the lemma agree with the largest modulus of eigenvalues of
A.
✷
For A ∈ K^{m,n} the Euclidean matrix norm ‖A‖₂ is the square root of the largest (in modulus) eigenvalue of A^H A.
For a normal matrix A ∈ K^{n,n} (that is, A satisfies A^H A = A A^H) the Euclidean matrix norm agrees with the largest modulus of its eigenvalues.
When we talk about an “algorithm” we have in mind a concrete code function in M ATLAB or C++; the
only way to describe an algorithm is through a piece of code. We assume that this function defines
another mapping F e : X → Y on the data space of the problem. Of course, we can only feed data to the
M ATLAB/C++-function, if they can be represented in the set M of machine numbers. Hence, implicit in the
e is the assumption that input data are subject to rounding before passing them to the code
definition of F
function proper.
Problem:   F : X ⊂ ℝ^n → Y ⊂ ℝ^m
Algorithm: F̃ : X̃ → Ỹ , with data and results consisting of machine numbers (X̃ ⊂ M).
[Stable algorithm]
We write w(x), x ∈ X , for the computational effort (→ Def. 1.4.1) required by the algorithm for input x.
An algorithm Fe for solving a problem F : X 7→ Y is numerically stable if for all x ∈ X its result Fe(x)
(possibly affected by roundoff) is the exact result for “slightly perturbed” data:
Here EPS should be read as machine precision according to the “Axiom” of roundoff analysis Ass. 1.5.32.
[Fig. 52: illustration of Def. 1.5.85 in the data space (X, ‖·‖_X) and the result space (Y, ‖·‖_Y): the exact data x is mapped by F to the exact result y and by F̃ to the computed result F̃(x); stability means F̃(x) = F(x̃) for some slightly perturbed data x̃.]
Terminology: Def. 1.5.85 introduces stability in the sense of backward error analysis.
Sloppily speaking, the impact of roundoff (∗) on a stable algorithm is of the same order of magnitude
as the effect of the inevitable perturbations due to rounding the input data.
(∗) In some cases the definition of Fe will also involve some approximations as in Ex. 1.5.65. Then the
above statement also includes approximation errors.
Now consider a given code function that purports to provide a stable implementation of x ↦ Ax for A ∈ K^{m,n}, x ∈ K^n, cf. Ex. 1.5.68. How can we verify this claim for particular data? Both K^{m,n} and K^n are equipped with the Euclidean norm.
The task is, given the vector y ∈ K^m returned by the function, to find conditions on y that ensure the existence of an Ã ∈ K^{m,n} such that
Ãx = y   and   ‖Ã − A‖₂ ≤ C·mn·EPS·‖A‖₂ ,   (1.5.87)
Choose
Ã = A + (z x^⊤)/‖x‖₂² ,   z := y − Ax ∈ K^m ,
so that Ãx = y, and we find
‖Ã − A‖₂ = (1/‖x‖₂²) · sup_{w ∈ K^n \ {0}} ‖z (x^⊤ w)‖₂ / ‖w‖₂ ≤ ‖z‖₂‖x‖₂ / ‖x‖₂² = ‖y − Ax‖₂ / ‖x‖₂ .
Hence, in principle, stability of an algorithm for computing Ax is confirmed if for every x ∈ X the computed result y = y(x) satisfies
‖y(x) − Ax‖₂ ≤ C·mn·EPS·‖A‖₂‖x‖₂
with a small constant C > 0 independent of data and problem size.
A problem shows sensitive dependence on the data, if small perturbations of input data lead to large
perturbations of the output. Such problems are also called ill-conditioned. For such problems stability of
an algorithm is easily accomplished.
Learning Outcomes
Contents
2.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
2.2 Theory: Linear systems of equations . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.2.1 Existence and uniqueness of solutions . . . . . . . . . . . . . . . . . . . . . . 128
2.2.2 Sensitivity of linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
2.3 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
2.3.1 Basic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
2.3.2 LU-Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.3.3 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
2.4 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.5 Survey: Elimination solvers for linear systems of equations . . . . . . . . . . . . 164
2.6 Exploiting Structure when Solving Linear Systems . . . . . . . . . . . . . . . . . . 168
2.7 Sparse Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
2.7.1 Sparse matrix storage formats . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
2.7.2 Sparse matrices in M ATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
2.7.3 Sparse matrices in E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
2.7.4 Direct Solution of Sparse Linear Systems of Equations . . . . . . . . . . . . 191
2.7.5 LU-factorization of sparse matrices . . . . . . . . . . . . . . . . . . . . . . . . 195
2.7.6 Banded matrices [?, Sect. 3.7] . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
2.8 Stable Gaussian elimination without pivoting . . . . . . . . . . . . . . . . . . . . 209
2.1 Preface
(Terminology: A ≙ system matrix / coefficient matrix, b ≙ right-hand-side vector)
Linear systems with rectangular system matrices A ∈ K^{m,n}, called “overdetermined” for m > n and “underdetermined” for m < n, are discussed later in the course.
Linear systems of equations are ubiquitous in computational science: they are encountered
• with discrete linear models in network theory (see Ex. 2.1.3), control, statistics;
• in the case of discretized boundary value problems for ordinary and partial differential equations (→
course “Numerical methods for partial differential equations”, 4th semester);
Example 2.1.3 (Nodal analysis of (linear) electric circuit [?, Sect. 4.7.1])
Now we study a very important application of numerical simulation, where (large, sparse) linear systems
of equations play a central role: Numerical circuit analysis. We begin with linear circuits in the frequency
domain, which are directly modelled by complex linear systems of equations. Later we tackle circuits
with non-linear elements, see Ex. 8.0.1, and, finally, will learn about numerical methods for computing the
transient (time-dependent) behavior of circuits, see Ex. 11.1.13.
Modeling of simple linear circuits takes only elementary physical laws as covered in any introductory
course of physics (or even in secondary school physics). There is no sophisticated physics or mathematics
involved.
Node (ger.: Knoten) ≙ junction of wires.  ☞ Number the nodes 1, …, n.
[Fig. 54: circuit diagram with numbered nodes ➀, ➁, ➂, … and circuit elements such as C₁, R₁, ….]
Unknowns: nodal potentials U_k, k = 1, …, n.
(Some may be known: grounded nodes (➅ in Fig. 54), voltage sources (➀ in Fig. 54).)
Constitutive relations for circuit elements (in the frequency domain with angular frequency ω > 0):
• Ohmic resistor:   I = U/R ,  [R] = 1 VA⁻¹   ➤  I_kj = R⁻¹ (U_k − U_j) ,
• capacitor:        I = ıωC·U ,  capacitance [C] = 1 AsV⁻¹   ➤  I_kj = ıωC (U_k − U_j) ,
• coil/inductor:    I = U/(ıωL) ,  inductance [L] = 1 VsA⁻¹   ➤  I_kj = −ı ω⁻¹ L⁻¹ (U_k − U_j) .
✎ notation: ı ≙ imaginary unit, ı := √−1, ı = exp(ı π/2).
Here we face the special case of a linear circuit: all relationships between branch currents and voltages are of the form
I_kj = α_kj (U_k − U_j)  with a coefficient α_kj ∈ ℂ .
The concrete value of α_kj is determined by the circuit element connecting node k and node j.
These constitutive relations are derived by assuming a harmonic time-dependence of all quantities, which is termed circuit analysis in the frequency domain (AC mode):
u(t) = Re{U exp(ıωt)} ,   i(t) = Re{I exp(ıωt)} .   (2.1.6)
Here U, I ∈ ℂ are called complex amplitudes. This implies for temporal derivatives (denoted by a dot):
du/dt (t) = Re{ıωU exp(ıωt)} ,   di/dt (t) = Re{ıωI exp(ıωt)} .   (2.1.7)
For a capacitor the total charge is proportional to the applied voltage: q(t) = C u(t); since i(t) = q̇(t), this gives i(t) = C u̇(t). For a coil the voltage is proportional to the rate of change of the current: u(t) = L (di/dt)(t). Combined with (2.1.6) and (2.1.7) this leads to the above constitutive relations.
No equations for nodes ➀ and ➅, because these nodes are connected to the “outside world” so that the
Kirchhoff current law (2.1.4) does not hold (from a local perspective). This is fitting, because the voltages
in these nodes are known anyway.
[ ıωC₁ + 1/R₁ + 1/R₂ − ı/(ωL)    −1/R₁            ı/(ωL)                     −1/R₂                      ] [U₂]   [ ıωC₁·U ]
[ −1/R₁                          1/R₁ + ıωC₂      0                          −ıωC₂                      ] [U₃] = [ 0      ]
[ ı/(ωL)                         0                1/R₅ − ı/(ωL) + 1/R₄       −1/R₄                      ] [U₄]   [ U/R₅   ]
[ −1/R₂                          −ıωC₂            −1/R₄                      1/R₂ + ıωC₂ + 1/R₄ + 1/R₃  ] [U₅]   [ 0      ]
This is a linear system of equations with complex coefficients: A ∈ C4,4 , b ∈ C4 . For the algorithms to
be discussed below this does not matter, because they work alike for real and complex numbers.
Known from linear algebra [?, Sect. 1.2], [?, Sect. 1.3]:
A ∈ K^{n,n} invertible / regular  :⇔  ∃₁ B ∈ K^{n,n} :  AB = BA = I .
B ≙ inverse of A  (✎ notation: B = A⁻¹)
New, recall a few concepts from linear algebra needed to state criteria for the invertibility of a matrix.
Given A ∈ K^{m,n}, the range/image (space) of A is the subspace of K^m spanned by the columns of A:
R(A) := {Ax : x ∈ K^n} ⊂ K^m .
The kernel/nullspace of A is
N(A) := {z ∈ K^n : Az = 0} .
Definition 2.2.3. Rank of a matrix → [?, Sect. 2.4], [?, Sect. 1.5]
The rank of a matrix M ∈ K m,n , denoted by rank(M), is the maximal number of linearly indepen-
dent rows/columns of M.
Equivalently, rank(A) = dim R(A).
Theorem 2.2.4. Criteria for invertibility of matrix → [?, Sect. 2.3 & Cor. 3.8]
A square matrix A ∈ K n,n is invertible/regular if one of the following equivalent conditions is satis-
fied:
1. ∃B ∈ K n,n : BA = AB = I,
2. x ↦ Ax defines a bijective linear mapping (automorphism) of K^n,
3. the columns of A are linearly independent (full column rank),
4. the rows of A are linearly independent (full row rank),
5. det A ≠ 0 (non-vanishing determinant),
6. rank(A) = n (full rank).
Now recall our notion of “problem” from § 1.5.67 as a function F mapping data in a data space X to a result in a result space Y. Concretely, for n × n linear systems of equations:
F : X := K^{n,n}_* × K^n → Y := K^n ,   F(A, b) := A⁻¹b ,
✎ notation: (open) set of regular matrices ⊂ K^{n,n}:
K^{n,n}_* := {A ∈ K^{n,n} : A regular/invertible → Def. 2.2.1} .
Always avoid computing the inverse of a matrix (which can almost always be avoided)!
In particular, never ever even contemplate using x = A.inverse()*b to solve the linear
! system of equations Ax = b, cf. Exp. 2.4.13. The next sections present a sound way to do this.
Before we examine sensitivity for linear systems of equations, we look at the simpler problem of matrix×vector
multiplication.
F : K n → K n , x 7→ Ax ,
that is, now we consider only the vector x as data.
Goal: Estimate relative perturbations in F(x) due to relative perturbations in x.
We assume that K^n is equipped with some vector norm (→ Def. 1.5.70) and we use the induced matrix norm (→ Def. 1.5.76) on K^{n,n}. Using linearity and the elementary estimate ‖Mx‖ ≤ ‖M‖‖x‖, which is a direct consequence of the definition of an induced matrix norm, we obtain for y = Ax, ∆y = A∆x (note that ‖x‖ = ‖A⁻¹y‖ ≤ ‖A⁻¹‖‖y‖):
‖∆y‖ / ‖y‖ ≤ ‖A‖‖∆x‖ / (‖A⁻¹‖⁻¹‖x‖) = ‖A‖‖A⁻¹‖ · ‖∆x‖ / ‖x‖ .   (2.2.8)
Now we study the sensitivity of the problem of finding the solution of a linear system of equations Ax = b, A ∈ ℝ^{n,n} regular, b ∈ ℝ^n, see § 2.1.1. We write x̃ for the solution of the perturbed linear system.
(normwise) relative error:   ǫ_r := ‖x − x̃‖ / ‖x‖
(‖·‖ ≙ suitable vector norm, e.g., maximum norm ‖·‖_∞)
Ax = b  ↔  (A + ∆A)x̃ = b + ∆b   ⇒   (A + ∆A)(x̃ − x) = ∆b − ∆A x .   (2.2.9)
‖(I + B)⁻¹‖ = sup_{x ∈ ℝⁿ\{0}} ‖(I + B)⁻¹x‖ / ‖x‖ = sup_{y ∈ ℝⁿ\{0}} ‖y‖ / ‖(I + B)y‖ ≤ 1/(1 − ‖B‖) .
Proof (of Thm. 2.2.10). Lemma 2.2.11 ➣ ‖(A + ∆A)⁻¹‖ ≤ ‖A⁻¹‖ / (1 − ‖A⁻¹∆A‖), and with (2.2.9):
‖∆x‖ ≤ ‖A⁻¹‖/(1 − ‖A⁻¹∆A‖) · (‖∆b‖ + ‖∆A x‖) ≤ ‖A⁻¹‖‖A‖/(1 − ‖A⁻¹‖‖∆A‖) · ( ‖∆b‖/(‖A‖‖x‖) + ‖∆A‖/‖A‖ ) · ‖x‖ .
Note that the term ‖A‖‖A⁻¹‖ occurs frequently. Therefore it has been given a special name, the condition number: cond(A) := ‖A⁻¹‖‖A‖. With it, Thm. 2.2.10 yields
ǫ_r := ‖x − x̃‖ / ‖x‖ ≤ cond(A)·δ_A / (1 − cond(A)·δ_A) ,   δ_A := ‖∆A‖ / ‖A‖ .   (2.2.13)
✦ If cond(A) ≫ 1, small perturbations in A can lead to large relative errors in the solution of the LSE.
✦ If cond(A) ≫ 1, a stable algorithm (→ Def. 1.5.85) can produce solutions with large relative error!
Recall Thm. 2.2.10: for regular A ∈ K^{n,n}, small ∆A, and a generic vector/matrix norm ‖·‖:
Ax = b ,  (A + ∆A)x̃ = b + ∆b   ⇒   ‖x − x̃‖/‖x‖ ≤ cond(A) / (1 − cond(A)‖∆A‖/‖A‖) · ( ‖∆b‖/‖b‖ + ‖∆A‖/‖A‖ ) .   (2.2.14)
cond(A) ≫ 1 ➣ small relative changes of the data A, b may effect huge relative changes in the solution.
Terminology: if cond(A) ≫ 1, the linear system of equations is called ill-conditioned.
Solving a 2 × 2 linear system of equations amounts to finding the intersection of two lines in the coordinate plane. This relationship allows a geometric view of the “sensitivity of a linear system”:
L_i = {x ∈ ℝ² : x^⊤ n_i = d_i} ,   n_i ∈ ℝ², d_i ∈ ℝ ,  i = 1, 2 .
LSE for finding the intersection:   [ n₁^⊤ ; n₂^⊤ ] x = [ d₁ ; d₂ ] ,  with A := [ n₁^⊤ ; n₂^⊤ ] and b := [ d₁ ; d₂ ] ,
n_i ≙ (unit) normal vectors, d_i ≙ distances of the lines from the origin.
The following code investigates the condition numbers of the matrix A = [ 1, cos φ ; 0, sin φ ] that arises when computing the intersection of two lines enclosing the angle φ.
In Line 10 we compute the condition number of A with respect to the Euclidean vector norm using special
E IGEN built-in functions.
Line 13 evaluates the condition number of a matrix for the maximum norm, recall Ex. 1.5.78.
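A hedged Eigen sketch of such a computation (not the lecture's code): the 2-norm condition number via singular values and the max-norm condition number via the row sum norms of A and A⁻¹.

#include <cmath>
#include <Eigen/Dense>
#include <iostream>
int main() {
  double phi = 0.1;                         // angle enclosed by the two lines
  Eigen::Matrix2d A;
  A << 1.0, std::cos(phi), 0.0, std::sin(phi);
  // cond_2(A) = sigma_max / sigma_min
  Eigen::JacobiSVD<Eigen::Matrix2d> svd(A);
  double cond2 = svd.singularValues()(0) / svd.singularValues()(1);
  // cond_inf(A) = ||A||_inf * ||A^{-1}||_inf  (row sum norms)
  auto rowSumNorm = [](const Eigen::Matrix2d &M) {
    return M.cwiseAbs().rowwise().sum().maxCoeff();
  };
  double condInf = rowSumNorm(A) * rowSumNorm(A.inverse());
  std::cout << cond2 << " " << condInf << std::endl;
}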
[Fig. 55: condition numbers of A (2-norm and max-norm) plotted against the angle enclosed by n₁, n₂ (angles 0…1.6, values up to ≈ 140).]
We observe a blow-up of the condition numbers (with respect to both norms) as the angle enclosed by the two lines shrinks. This corresponds to a large sensitivity of the location of the intersection point in the case of glancing incidence.
2.3 Gaussian Elimination
Supplementary reading. In case you cannot remember the main facts about Gaussian elimination, please consult your introductory linear algebra material.
Wikipedia: Although the method is named after mathematician Carl Friedrich Gauss, the earliest presentation of it can be found in the important Chinese mathematical text Jiuzhang suanshu or The Nine Chapters on the Mathematical Art, dated approximately 150 B.C., and commented on by Liu Hui in the 3rd century.
➀ (Forward) elimination:
[1  1  0] [x₁]   [ 4]          x₁ + x₂        =  4
[2  1 −1] [x₂] = [ 1]   ←→    2x₁ + x₂ − x₃   =  1
[3 −1 −1] [x₃]   [−3]          3x₁ − x₂ − x₃   = −3 .
Elimination steps on the augmented matrix (pivot row marked, pivot element bold):
[1  1  0 |  4]     [1  1  0 |  4]     [1  1  0 |  4]     [1  1  0 |  4]
[2  1 −1 |  1]  ➤  [0 −1 −1 | −7]  ➤  [0 −1 −1 | −7]  ➤  [0 −1 −1 | −7]
[3 −1 −1 | −3]     [3 −1 −1 | −3]     [0 −4 −1 |−15]     [0  0  3 | 13]  =: U (augmented)
➣ transformation of the LSE to upper triangular form.
More general:
a₁₁x₁ + a₁₂x₂ + ⋯ + a₁ₙxₙ = b₁
a₂₁x₁ + a₂₂x₂ + ⋯ + a₂ₙxₙ = b₂
  ⋮        ⋮              ⋮     ⋮
aₙ₁x₁ + aₙ₂x₂ + ⋯ + aₙₙxₙ = bₙ
[Sketch: successive elimination steps turn the matrix into upper triangular form; in each step all entries below the current pivot ∗ are annihilated, and the pivot row is highlighted.]
∗ ≙ pivot (necessarily ≠ 0 ➙ here: assumption),  highlighted row ≙ pivot row.
In the k-th step (starting from A ∈ K^{n,n}, 1 ≤ k < n, pivot row a_k·^⊤):
transformation:  Ax = b  ➤  A′x = b′ ,
with
a′_ij := { a_ij − (a_ik/a_kk)·a_kj ,  for k < i, j ≤ n ;   0 ,  for k < i ≤ n, j = k ;   a_ij  else } ,
b′_i  := { b_i − (a_ik/a_kk)·b_k ,   for k < i ≤ n ;   b_i  else } ,   (2.3.2)
and the multipliers  l_ik := a_ik / a_kk .
Here we give a direct E IGEN implementation of Gaussian elimination for LSE Ax = b (grossly inefficient!).
Line 9: right hand side vector set as last column of matrix, facilitates simultaneous row transformations of
matrix and r.h.s.
Variable fac =
ˆ multiplier
Line 24: extract solution from last column of transformed matrix.
Forward elimination: three nested loops (note: the compact vector operation in line 15 involves another loop from i + 1 to m).
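Since Code 2.3.4 is described above but not reproduced here, the following is a minimal sketch of such an augmented-matrix forward elimination, directly following the update formula (2.3.2) (no pivoting, all pivots assumed nonzero):

#include <Eigen/Dense>
// Forward elimination on the augmented matrix Ab = [A, b] of size n x (n+1).
void forwardElimination(Eigen::MatrixXd &Ab) {
  const int n = Ab.rows();
  for (int k = 0; k < n - 1; ++k) {              // loop over pivot rows
    for (int i = k + 1; i < n; ++i) {
      const double l_ik = Ab(i, k) / Ab(k, k);   // multiplier l_ik = a_ik / a_kk
      Ab.row(i).tail(n - k) -= l_ik * Ab.row(k).tail(n - k); // update a_ij and b_i
      Ab(i, k) = 0.0;                            // entry below the pivot is eliminated
    }
  }
}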
Computational cost (↔ number of elementary operations) of Gaussian elimination [?, Sect. 1.3]:
elimination:        Σ_{i=1}^{n−1} (n − i)(2(n − i) + 3) = n(n − 1)(⅔n + 7/6)  Ops. ,   (2.3.6)
back substitution:  Σ_{i=1}^{n} (2(n − i) + 1) = n²  Ops. .
➣ asymptotic complexity (→ Sect. 1.4) of Gaussian elimination (without pivoting) for a generic LSE Ax = b, A ∈ ℝ^{n,n}:   ⅔n³ + O(n²) = O(n³) .
C++11 code 2.3.8: Measuring runtimes of Code 2.3.4 vs. E IGEN lu()-operator vs. MKL ➺ GITLAB
//! Eigen code for timing numerical solution of linear systems
MatrixXd gausstiming() {
  std::vector<int> n = {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
  int nruns = 3;
  MatrixXd times(n.size(), 3);
  for (int i = 0; i < n.size(); ++i) {
    Timer t1, t2; // timer class
    MatrixXd A = MatrixXd::Random(n[i], n[i]) + n[i]*MatrixXd::Identity(n[i], n[i]);
    VectorXd b = VectorXd::Random(n[i]);
    VectorXd x(n[i]);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x = A.lu().solve(b); t1.stop(); // Eigen implementation
#ifndef EIGEN_USE_MKL_ALL // only test own algorithm without MKL
      if (n[i] <= 4096) { // prevent long runs (braces added so that the whole timed block is guarded)
        t2.start(); gausselimsolve(A, b, x); t2.stop(); // own Gauss elimination
      }
#endif
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
  }
  return times;
}
[Fig. 56: execution time [s] versus matrix size n (log-log scale) for the Eigen lu() solver, gausselimsolve, the MKL solver (sequential and parallel), and an O(n³) reference line. Platform: ubuntu 14.04 LTS, i7-3517U CPU @ 1.90GHz × 4.]
n Code 2.3.4 [s] E IGEN lu() [s] MKL sequential [s] MKL parallel [s]
8 6.340e-07 1.140e-06 3.615e-06 2.273e-06
16 2.662e-06 3.203e-06 9.603e-06 1.408e-05
32 1.617e-05 1.331e-05 1.603e-05 2.495e-05
64 1.214e-04 5.836e-05 5.142e-05 7.416e-05
128 2.126e-03 3.180e-04 2.041e-04 3.176e-04
256 3.464e-02 2.093e-03 1.178e-03 1.221e-03
512 3.954e-01 1.326e-02 7.724e-03 8.175e-03
1024 4.822e+00 9.073e-02 4.457e-02 4.864e-02
2048 5.741e+01 6.260e-01 3.347e-01 3.378e-01
4096 5.727e+02 4.531e+00 2.644e+00 1.619e+00
8192 - 3.510e+01 2.064e+01 1.360e+01
A concise list of libraries for numerical linear algebra and related problems can be found here.
In Code 2.3.4 the right-hand-side vector b was first appended to the matrix A as its rightmost column, and then forward elimination and back substitution were carried out on the resulting matrix. The same idea applies to a linear system with k right-hand sides collected in a matrix B ∈ K^{n,k}:
AX = B  ⇔  X = A⁻¹B ,
asymptotic complexity:  O(n²(n + k)) .
C++11 code 2.3.10: Gaussian elimination with multiple r.h.s. → Code 2.3.4 ➺ GITLAB
//! Gauss elimination without pivoting, X = A⁻¹B
//! A must be an n × n-matrix, B an n × m-matrix
//! Result is returned in matrix X
void gausselimsolvemult(const MatrixXd &A, const MatrixXd &B, MatrixXd &X) {
  int n = A.rows(), m = B.cols();
  MatrixXd AB(n, n + m); // Augmented matrix [A, B]
  AB << A, B;
  // Forward elimination, do not forget the B part of the matrix
  for (int i = 0; i < n - 1; ++i) {
    double pivot = AB(i, i);
    for (int k = i + 1; k < n; ++k) {
      double fac = AB(k, i) / pivot;
      AB.block(k, i + 1, 1, m + n - i - 1) -= fac * AB.block(i, i + 1, 1, m + n - i - 1);
    }
  }
  // Back substitution
  AB.block(n - 1, n, 1, m) /= AB(n - 1, n - 1);
  for (int i = n - 2; i >= 0; --i) {
    for (int l = i + 1; l < n; ++l) {
      AB.block(i, n, 1, m) -= AB.block(l, n, 1, m) * AB(i, l);
    }
    AB.block(i, n, 1, m) /= AB(i, i);
  }
  X = AB.rightCols(m);
}
Next two remarks: For understanding or analyzing special variants of Gaussian elimination, it is useful to
be aware of
• the effects of elimination steps on the level of matrix blocks, cf. Rem. 1.3.15,
• and of the recursive nature of Gaussian elimination.
Block perspective (first step of Gaussian elimination with pivot α 6= 0), cf. (2.3.2):
A := [ α  c^⊤ ; d  C ]   →   A′ := [ α  c^⊤ ; 0  C′ ] ,   C′ := C − (d c^⊤)/α .   (2.3.12)
(The update of C is a rank-1 modification.)
Adding a tensor product of two vectors to a matrix is called a rank-1 modification of that matrix.
In this code the Gaussian elimination is carried out in situ: the matrix A is replaced with the transformed
matrices during elimination. If the matrix is not needed later this offers maximum efficiency. Notice that
the recursive call is omitted!
Recall “principle” from Ex. 1.3.15: deal with block matrices (“matrices of matrices”) like regular matrices
(except for commutativity of multiplication!).
[A₁₁ A₁₂ | b₁]   ❶→   [A₁₁  A₁₂                  | b₁               ]   ❷→   [I 0 | A₁₁⁻¹(b₁ − A₁₂S⁻¹b_S)]
[A₂₁ A₂₂ | b₂]         [0    A₂₂ − A₂₁A₁₁⁻¹A₁₂    | b₂ − A₂₁A₁₁⁻¹b₁  ]         [0 I | S⁻¹b_S              ] ,
where S := A₂₂ − A₂₁A₁₁⁻¹A₁₂ (Schur complement, see Rem. 2.3.34), b_S := b₂ − A₂₁A₁₁⁻¹b₁.
We can read off the solution of the block-partitioned linear system from the above Gaussian elimination:
[A₁₁ A₁₂ ; A₂₁ A₂₂] [x₁ ; x₂] = [b₁ ; b₂]   ⇒   x₂ = S⁻¹b_S ,  x₁ = A₁₁⁻¹(b₁ − A₁₂S⁻¹b_S) .   (2.3.15)
2.3.2 LU-Decomposition
A matrix factorization (ger. Matrixzerlegung) expresses a general matrix A as product of two special
(factor) matrices. Requirements for these special matrices define the matrix factorization.
Matrix factorizations
☞ often capture the essence of algorithms in compact form (here: Gaussian elimination),
☞ are important building blocks for complex algorithms,
☞ are key theoretical tools for algorithm analysis.
In this section: forward elimination step of Gaussian elimination will be related to a special matrix factor-
ization, the so-called LU-decomposition or LU-factorization.
Supplementary reading. The LU-factorization should be well known from the introductory
linear algebra course. In case you need to refresh your knowledge, please consult one of the
following:
[Sketch: a single row transformation annihilates one entry below the diagonal.]
Ex. 1.3.13: row transformations can be realized by multiplication from the left with suitable transformation matrices. By multiplying these transformation matrices we can emulate the effect of successive row transformations through left multiplication with a single matrix T:
A  →(row transformations)→  A′   ⇔   TA = A′ .
Now we want to determine the T for the forward elimination step of Gaussian elimination.
Example 2.3.16 (Gaussian elimination and LU-factorization → [?, Sect. 2.4], [?, II.4], [?,
Sect. 3.1])
Multipliers are stored in place of the eliminated entries (L-part to the left of the bar, augmented U-part to the right):
[1     | 1  1  0 |  4]        [1     | 1  1  0 |  4]
[2 1   | 0 −1 −1 | −7]   ➤    [2 1   | 0 −1 −1 | −7]
[3 0 1 | 0 −4 −1 |−15]        [3 4 1 | 0  0  3 | 13]
so that L = [1 0 0 ; 2 1 0 ; 3 4 1] and U = [1 1 0 ; 0 −1 −1 ; 0 0 3].
(▪ ≙ pivot row, pivot element bold, multipliers printed in red in the lecture document.)
Details: link between Gaussian elimination and matrix factorization → Ex. 2.3.16
(row transformation = multiplication with elimination matrix)
[  1      0  ⋯  0 ] [a₁]   [a₁]
[−a₂/a₁   1       ] [a₂]   [0 ]
[−a₃/a₁      1    ] [a₃] = [0 ]      provided a₁ ≠ 0 .   (2.3.17)
[   ⋮           ⋱ ] [⋮ ]   [⋮ ]
[−aₙ/a₁         1 ] [aₙ]   [0 ]
Gaussian forward elimination thus amounts to
A = L₁ · ⋯ · L_{n−1} · U   with elimination matrices L_i, i = 1, …, n − 1, and an upper triangular matrix U ∈ ℝ^{n,n}.
[1           ] [1           ]   [1           ]
[l₂ 1        ] [0  1        ]   [l₂ 1        ]
[l₃ 0  1     ] [0  h₃ 1     ] = [l₃ h₃ 1     ]
[⋮        ⋱  ] [⋮         ⋱ ]   [⋮         ⋱ ]
[lₙ 0  …   1 ] [0  hₙ 0   1 ]   [lₙ hₙ 0   1 ]
Hence the forward elimination, if it does not break down, yields a factorization of A into a normalized lower triangular matrix L and an upper triangular matrix U, [?, Thm. 3.2.1], [?, Thm. 2.10], [?, Sect. 3.1].
Algebraically equivalent = ˆ when carrying out the forward elimination in situ as in Code 2.3.4 and storing
the multipliers in a lower triangular matrix as in Ex. 2.3.16, then the latter will contain the L-factor and the
original matrix will be replaced with the U-factor.
Definition 2.3.18. LU-decomposition / LU-factorization
Given a square matrix A ∈ K^{n,n}, an upper triangular matrix U ∈ K^{n,n} and a normalized lower triangular matrix L ∈ K^{n,n} (→ Def. 1.1.5) form an LU-decomposition / LU-factorization of A, if A = LU.
[Sketch: product of a normalized lower triangular matrix (unit diagonal) and an upper triangular matrix.]
n = 1: the assertion is trivial.
n − 1 → n: the induction hypothesis ensures the existence of a normalized lower triangular matrix L̃ and a regular upper triangular matrix Ũ such that à = L̃Ũ, where à is the upper left (n − 1) × (n − 1) block of A:
A = [ Ã    b ] = [ L̃    0 ] [ Ũ  y ] =: LU .
    [ a^⊤  α ]   [ x^⊤  1 ] [ 0  ξ ]
Then solve
➊ L̃y = b        → provides y ∈ K^{n−1} ,
➋ x^⊤Ũ = a^⊤    → provides x ∈ K^{n−1} ,
➌ x^⊤y + ξ = α  → provides ξ ∈ K .
Regularity of A implies ξ ≠ 0 (why?), so that U will be regular, too.
Regular upper triangular matrices and normalized lower triangular matrices form matrix groups (→ Lemma 1.3.9).
Their only common element is the identity matrix.
L1 U1 = L2 U2 ⇒ L2−1 L1 = U2 U1−1 = I .
A direct way to determine the factor matrices of the LU-decomposition [?, Sect. 3.1], [?, Sect. 3.3.3]: we study the entries of the product of a normalized lower triangular and an upper triangular matrix, see Def. 1.1.5:
(LU)_{ij} = Σ_{k=1}^{min{i,j}} l_ik u_kj ,  with l_ii = 1 .
This reveals how to compute the entries of L and U sequentially: we start with the top row of U, which agrees with that of A, and then work our way towards the bottom right corner, alternating between rows of U and columns of L (→ Fig. 57).
It is instructive to compare this code with a simple implementation of the matrix product of a normalized
lower triangular and an upper triangular matrix. From this perspective the LU-factorization looks like the
“inversion” of matrix multiplication:
  A = MatrixXd::Zero(n, n);
  for (int k = 0; k < n; ++k) {
    for (int j = k; j < n; ++j)
      A(k, j) = U(k, j) + (L.block(k, 0, 1, k) * U.block(0, j, k, 1))(0, 0);
    for (int i = k + 1; i < n; ++i)
      A(i, k) = (L.block(i, 0, 1, k) * U.block(0, k, k, 1))(0, 0) + L(i, k) * U(k, k);
  }
}
Observe: Solving for entries L(i,k) of L and U(k,j) of U in the multiplication of an upper tri-
angular and normalized lower triangular matrix (Code 2.3.24) yields the algorithm for LU-factorization
(Code 2.3.23).
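Since Code 2.3.23 itself is not reproduced above, here is a sketch in its spirit (an assumption, not the original listing): solving the identities of the multiplication code for U(k,j) and L(i,k), with the calling convention lufak(A, L, U) used later in this chapter and no pivoting required.

#include <Eigen/Dense>
// Doolittle-type computation of the LU-factors of A (no pivoting).
void lufak(const Eigen::MatrixXd &A, Eigen::MatrixXd &L, Eigen::MatrixXd &U) {
  const int n = A.rows();
  L = Eigen::MatrixXd::Identity(n, n);   // normalized lower triangular factor
  U = Eigen::MatrixXd::Zero(n, n);
  for (int k = 0; k < n; ++k) {
    for (int j = k; j < n; ++j)          // k-th row of U
      U(k, j) = A(k, j) - (L.block(k, 0, 1, k) * U.block(0, j, k, 1))(0, 0);
    for (int i = k + 1; i < n; ++i)      // k-th column of L
      L(i, k) = (A(i, k) - (L.block(i, 0, 1, k) * U.block(0, k, k, 1))(0, 0)) / U(k, k);
  }
}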
A  −→  combined in-situ storage of the factors: the entries of A are replaced with the entries of L (strict lower triangle) and U (upper triangle).
In light of the close relationship between Gaussian elimination and LU-factorization there will also be a
recursive version of LU-factorization.
Refer to (2.3.12) to understand lurec: the rank-1 modification of the lower (n − 1) × (n − 1)-block of
the matrix is done in lines Line 7-Line 8 of the code.
➣ asymptotic complexity: (in leading order) the same as for Gaussian elimination.
However, the perspective of LU-factorization reveals that the solution of linear systems of equations can be split into two separate phases with different asymptotic complexity in terms of the number n of unknowns:
• setup phase: compute the LU-decomposition A = LU — cost O(n³),
• solve phase: forward and backward substitution Lz = b, Ux = z — cost O(n²).
Gauss elimination and LU-factorization for the solution of a linear system of equations (→ § 2.3.30) are
equivalent and only differ in the ordering of the steps.
Because in the case of LU-factorization the expensive forward elimination and the less expensive (for-
ward/backward) substitutions are separated, which sometimes can be exploited to reduce computational
cost, as highlighted in Rem. 2.5.10 below.
Principal minor ≙ left upper block of a matrix.
The following “visual rule” helps identify the structure of the LU-factors of a matrix:
(2.3.33) [block sketch: A = LU partitioned so that the leading blocks of L and U multiply to the leading block of A]
The left-upper blocks of both L and U in the LU-factorization of A depend only on the corresponding
left-upper block of A!
Natural in light of the close connection between matrix multiplication and matrix factorization, cf. the
relationship between matrix factorization and matrix multiplication found in § 2.3.21:
Block matrix multiplication (1.3.16) ∼ block LU-decomposition: with A₁₁ ∈ K^{n,n} regular, A₁₂ ∈ K^{n,m}, A₂₁ ∈ K^{m,n}, A₂₂ ∈ K^{m,m}:
[ A₁₁ A₁₂ ]   [ I          0 ] [ A₁₁ A₁₂ ]
[ A₂₁ A₂₂ ] = [ A₂₁A₁₁⁻¹   I ] [ 0    S  ] ,   S := A₂₂ − A₂₁A₁₁⁻¹A₁₂  (Schur complement).   (2.3.35)
2.3.3 Pivoting
Idea (in linear algebra): Avoid zero pivot elements by swapping rows
MatrixXd A(2, 2);
A << 5.0e-17, 1.0,
     1.0,     1.0;
VectorXd b(2), x2(2);
b << 1.0, 2.0;
VectorXd x1 = A.fullPivLu().solve(b);
gausselimsolve(A, b, x2);   // see Code 2.3.10
MatrixXd L(2, 2), U(2, 2);
lufak(A, L, U);             // see Code 2.3.23
VectorXd z = L.lu().solve(b);
VectorXd x3 = U.lu().solve(z);
cout << "x1 =\n" << x1 << "\nx2 =\n" << x2 << "\nx3 =\n" << x3 << std::endl;

Output:
x1 = (1, 1)ᵀ ,  x2 = (0, 1)ᵀ ,  x3 = (0, 1)ᵀ .
For the exact data
A = [ ǫ  1 ; 1  1 ] ,  b = [ 1 ; 2 ]   ⇒   x = 1/(1 − ǫ) · [ 1 ; 1 − 2ǫ ] ≈ [ 1 ; 1 ]   for |ǫ| ≪ 1 .
So x1 agrees with the true solution, while x2 and x3 do not. What is going wrong here? Needed: insight into roundoff errors, which we already have → Section 1.5.3.
Armed with knowledge about the behavior of machine numbers and roundoff errors we can now under-
stand what is going on in Ex. 2.3.36
LU-factorization without row swapping:
A = [ ǫ  1 ; 1  1 ]   ⇒   L = [ 1  0 ; 1/ǫ  1 ] ,  U = [ ǫ  1 ; 0  1 − 1/ǫ ] ,  Ũ = [ ǫ  1 ; 0  −1/ǫ ]  in M .   (2.3.37)
Solution of L̃Ũx = b:  x = (2, 1 − 2ǫ)ᵀ  (meaningless result!)
LU-factorization after swapping rows:
A = [ 1  1 ; ǫ  1 ]   ⇒   L = [ 1  0 ; ǫ  1 ] ,  U = [ 1  1 ; 0  1 − ǫ ] = Ũ  in M .   (2.3.38)
Solution of L̃Ũx = b:  x = (1 + 2ǫ, 1 − 2ǫ)ᵀ  (sufficiently accurate result!)
no row swapping, → (2.3.37):  L̃Ũ = A + E  with  E = [ 0  0 ; 0  −1 ]   ➤ unstable!
after row swapping, → (2.3.38):  L̃Ũ = A + E  with  E = [ 0  0 ; 0  ǫ ]   ➤ stable!
Introduction to the notion of stability → Section 1.5.5, Def. 1.5.85, see also [?, Sect. 2.3].
A = [1  2  2]  ➊→ [2 −3  2]  ➋→ [2 −3    2]  ➌→ [2 −3    2]  ➍→ [2 −3    2     ]
    [2 −3  2]      [1  2  2]      [0  3.5  1]      [0 25.5 −1]      [0 25.5 −1    ]
    [1 24  0]      [1 24  0]      [0 25.5 −1]      [0  3.5  1]      [0  0    1.1373]
C++11 code 2.3.41: Gaussian elimination with pivoting: extension of Code 2.3.4 ➺ GITLAB
2  //! Solving an LSE Ax = b by Gaussian elimination with partial pivoting
3  //! A must be an n × n-matrix, b an n-vector
4  void gepiv(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
5    int n = A.rows();
6    MatrixXd Ab(n, n+1);
7    Ab << A, b;
8    // Forward elimination by rank-1 modification, see Rem. 2.3.11
9    for (int k = 0; k < n-1; ++k) {
10     int j; double p; // p = relatively largest pivot, j = pivot row index
       // ... (remaining steps in the GITLAB version)
Choice of pivot row index j (Line 11 of the code): relatively largest pivot [?, Sect. 2.5],
j ∈ {k, …, n}  such that  |a_jk| / max{|a_jl| , l = k, …, n}  →  max .   (2.3.42)
➣ LU-factorization with pivoting? Of course, just by rearranging the operations of Gaussian forward elim-
ination with pivoting.
Line 6: Find the relatively largest pivot element p and the index j of the corresponding row of the matrix,
see (2.3.42)
Line 7: If the pivot element is still very small relative to the norm of the matrix, then we have encountered
an entire column that is close to zero. The matrix is (close to) singular and LU-factorization does
not exist.
Line 9: Swap the first and the j-th row of the matrix.
Line 11: Call the routine for the lower right (n − 1) × (n − 1)-block of the matrix after subtracting suitable
multiples of the first row from the other rows, cf. Rem. 2.3.11 and Rem. 2.3.27.
Line 12: Reassemble the parts of the LU-factors. The vector of multipliers yields a column of L, see
Ex. 2.3.16.
Remark 2.3.45 (Rationale for partial pivoting policy (2.3.42) → [?, Page 47])
permutation (1, 2, 3, 4) ↦ (1, 3, 2, 4)   ≙   P = [1 0 0 0]
                                                  [0 0 1 0]
                                                  [0 1 0 0]
                                                  [0 0 0 1] .
Note:
✦ P⊤ = P−1 for any permutation matrix P (→ permutation matrices orthogonal/unitary)
✦ Pπ A effects π -permutation of rows of A ∈ K n,m
✦ APπ effects π -permutation of columns of A ∈ K m,n
Lemma 2.3.47. Existence of LU-factorization with pivoting → [?, Thm. 3.25], [?, Thm. 4.4]
For any regular A ∈ K n,n there is a permutation matrix (→ Def. 2.3.46) P ∈ K n,n , a normalized
lower triangular matrix L ∈ K n,n , and a regular upper triangular matrix U ∈ K n,n (→ Def. 1.1.5),
such that PA = LU .
Every regular matrix A ∈ K^{n,n} admits a row permutation, encoded by a permutation matrix, such that the upper left (n − 1) × (n − 1) block A′ of the permuted matrix is regular (why?).
By the induction assumption there is a permutation matrix P′ ∈ K^{n−1,n−1} such that P′A′ possesses an LU-factorization P′A′ = L′U′. With suitable x, y ∈ K^{n−1}, γ ∈ K, and P collecting both permutations, we can write
PA = [ P′  0 ] [ A′   x ] = [ L′U′  P′x ] = [ L′   0 ] [ U′  d ] ,
     [ 0   1 ] [ y^⊤  γ ]   [ y^⊤   γ   ]   [ c^⊤  1 ] [ 0   α ]
if we choose
d = (L′)⁻¹ P′x ,   c = (U′)⁻^⊤ y ,   α = γ − c^⊤ d ,
which is possible since L′ and U′ are regular.
A = [1  2  2]  ➊→ [2 −3  2]  ➋→ [2 −3    2]  ➌→ [2 −3    2]  ➍→ [2 −3    2     ]
    [2 −3  2]      [1  2  2]      [0  3.5  1]      [0 25.5 −1]      [0 25.5 −1    ]
    [1 24  0]      [1 24  0]      [0 25.5 −1]      [0  3.5  1]      [0  0    1.1373]
U = [2 −3    2     ] ,   L = [1    0      0] ,   P = [0 1 0]
    [0 25.5 −1    ]          [0.5  1      0]         [0 0 1]
    [0  0    1.1373]          [0.5  0.1373 1]         [1 0 0] .
Two permutations: in step ➊ swap rows #1 and #2, in step ➌ swap rows #2 and #3. Apply these swaps to
the identity matrix and you will recover P. See also [?, Ex. 3.30].
E IGEN provides various functions for computing the LU-decomposition of a given matrix. They all perform
the factorization in-situ → Rem. 2.3.26:
A  −→  packed storage of the factors (L in the strict lower triangle, U in the upper triangle). The resulting matrix can be retrieved and used to recover the LU-factors, as demonstrated in the next code snippet.
Note that for solving a linear system of equations by means of LU-decomposition (the standard algorithm)
we never have to extract the LU-factors.
Any kind of pivoting only involves comparisons and row/column permutations, but no arithmetic operations
on the matrix entries. This makes the following observation plausible:
The LU-factorization of A ∈ K n,n with partial pivoting by § 2.3.43 is numerically equivalent to the LU-
factorization of PA without pivoting (→ Code in § 2.3.21), when P is a permutation matrix gathering
the row swaps entailed by partial pivoting.
numerically equivalent =
ˆ same result when executed with the same machine arithmetic
The above statement means that whenever we study the impact of roundoff errors on LU-
factorization it is safe to consider only the basic version without pivoting, because we can always
assume that row swaps have been conducted beforehand.
It will turn out that when investigating the stability of algorithms meant to solve linear systems of equations,
a key quantity is the residual.
r := b − Ax̃ ,  where x̃ denotes the computed (approximate) solution of Ax = b.
Assume that you have downloaded a direct solver for a general (dense) linear system of equations Ax =
b, A ∈ K n,n regular, b ∈ K n . When given the data A and b it returns the perturbed solution x̃. How
can we tell that x̃ is the exact solution of a linear system with slightly perturbed data (in the sense of a
tiny relative error of size ≈ EPS, EPS the machine precision, see § 1.5.29)? That is, how can we tell that x̃
is an acceptable solution in the sense of backward error analysis, cf. Def. 1.5.85? A similar question was
explored in Ex. 1.5.86 for matrix×vector multiplication.
➊ x − x̃ accounted for by perturbation of the right hand side:
$$\mathbf{A}\mathbf{x} = \mathbf{b}\,,\quad \mathbf{A}\widetilde{\mathbf{x}} = \mathbf{b} + \Delta\mathbf{b} \;\Rightarrow\; \Delta\mathbf{b} = \mathbf{A}\widetilde{\mathbf{x}} - \mathbf{b} =: -\mathbf{r} \quad\text{(residual, Def. 2.4.1)}\,.$$
Hence, x̃ can be accepted as a solution, if ‖r‖ / ‖b‖ ≤ C·n³·EPS for some small constant C ≈ 1, see
Def. 1.5.85. Here, ‖·‖ can be any vector norm on K n .
➋ x − x̃ accounted for by perturbation of the system matrix:
$$\mathbf{A}\mathbf{x} = \mathbf{b}\,,\qquad (\mathbf{A} + \Delta\mathbf{A})\,\widetilde{\mathbf{x}} = \mathbf{b}\,.$$
Try a perturbation of the form ∆A = u x̃ H , u ∈ K n :
$$\mathbf{u} = \frac{\mathbf{r}}{\|\widetilde{\mathbf{x}}\|_2^2} \;\Rightarrow\; \Delta\mathbf{A} = \frac{\mathbf{r}\,\widetilde{\mathbf{x}}^H}{\|\widetilde{\mathbf{x}}\|_2^2}\,.$$
As in Ex. 1.5.86 we find
$$\frac{\|\Delta\mathbf{A}\|_2}{\|\mathbf{A}\|_2} = \frac{\|\mathbf{r}\|_2}{\|\mathbf{A}\|_2\,\|\widetilde{\mathbf{x}}\|_2} \le \frac{\|\mathbf{r}\|_2}{\|\mathbf{A}\widetilde{\mathbf{x}}\|_2}\,.$$
Thus, x̃ is acceptable in the sense of backward error analysis, if ‖r‖ / ‖Ax̃‖ ≤ C·n³·EPS.
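Both acceptance criteria are straightforward to check once a candidate solution x̃ is at hand. The following minimal EIGEN sketch implements the residual-based test; the function name, the default constant C and the use of n³ are illustrative choices following the heuristics above.

#include <Eigen/Dense>
#include <limits>
#include <cmath>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Backward-error acceptance test based on the relative residual:
// accept xt as solution of A*x = b if ||b - A*xt|| <= C * n^3 * EPS * ||b||.
bool acceptable(const MatrixXd& A, const VectorXd& b, const VectorXd& xt,
                double C = 1.0) {
  const double eps = std::numeric_limits<double>::epsilon();
  const double n = static_cast<double>(A.rows());
  const VectorXd r = b - A * xt;   // residual
  return r.norm() <= C * std::pow(n, 3) * eps * b.norm();
}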
The roundoff error analysis of Gaussian elimination based on Ass. 1.5.32 is rather involved. Here we merely
summarise the results: a profound roundoff analysis of Gaussian elimination/LU-factorization can be found
in [?, Sect. 3.3 & 3.5] and [?, Sect. 9.3]. A less rigorous, but more lucid discussion is given in [?, Lecture 22].
Let A ∈ R n,n be regular and let A(k) ∈ R n,n , k = 1, . . . , n − 1, denote the intermediate matrices arising
in the k-th step of § 2.3.43 (Gaussian elimination with partial pivoting) when carried out with exact
arithmetic.
For the approximate solution x̃ ∈ R n of the LSE Ax = b, b ∈ R n , computed as in § 2.3.43 (based
on machine arithmetic with machine precision EPS, → Ass. 1.5.32), there is a ∆A ∈ R n,n with
(A + ∆A)x̃ = b, whose size is controlled by n³·EPS and the growth factor ρ of the entries of the
intermediate matrices A(k) .
If ρ is “small”, the computed solution of an LSE can be regarded as the exact solution of an LSE with “slightly
perturbed” system matrix (perturbations of size O(n³ EPS)).
For $n = 10$:
$$a_{ij} = \begin{cases} 1, & \text{if } i = j \ \text{or}\ j = n,\\ -1, & \text{if } i > j,\\ 0 & \text{else,} \end{cases}
\qquad
\mathbf{A} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & 1 & 0 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & 1 & 0 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & 1 & 0 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & 1 & 1\\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & 1
\end{pmatrix}.$$
Partial pivoting does not trigger row permutations!
$$\mathbf{A} = \mathbf{L}\mathbf{U}\,,\qquad
l_{ij} = \begin{cases} 1, & \text{if } i = j,\\ -1, & \text{if } i > j,\\ 0 & \text{else,}\end{cases}
\qquad
u_{ij} = \begin{cases} 1, & \text{if } i = j,\\ 2^{\,i-1}, & \text{if } j = n,\\ 0 & \text{else.}\end{cases}$$
C++11 code 2.4.6: Gaussian elimination for “Wilkinson system” in E IGEN ➺ GITLAB
MatrixXd res(100, 2);
for (int n = 10; n <= 100 * 10; n += 10) {
  MatrixXd A(n, n);  A.setIdentity();
  A.triangularView<StrictlyLower>().setConstant(-1);
  A.rightCols<1>().setOnes();
  VectorXd x = VectorXd::Constant(n, -1).binaryExpr(
      VectorXd::LinSpaced(n, 1, n),
      [](double x, double y) { return std::pow(x, y); });
  double relerr = (A.lu().solve(A * x) - x).norm() / x.norm();
  res(n / 10 - 1, 0) = n;  res(n / 10 - 1, 1) = relerr;
}
// ... different solver (e.g. colPivHouseholderQr()), plotting
(∗) If cond2 (A) is huge, then big errors in the solution of a linear system can be caused by small
perturbations of either the system matrix or the right hand side vector, see (2.4.12) and the message
of Thm. 2.2.10, (2.2.13). In this case, a stable algorithm can obviously produce a grossly “wrong”
solution, as was already explained after (2.2.13).
Hence, lack of stability of Gaussian elimination will only become apparent for linear systems with
well-conditioned system matrices.
Fig. 58: cond₂(A) as a function of n. Fig. 59: relative error (Euclidean norm) of the solutions computed by
Gaussian elimination and by QR-decomposition, together with the relative residual norm, as functions of n.
These observations match Thm. 2.4.4, because in this case we encounter an exponential growth of ρ =
ρ(n), see Ex. 2.4.5.
Observation: In practice ρ (almost) always grows only mildly (like O(√n)) with n.
Discussion in [?, Lecture 22]: growth factors larger than the order O(√n) are exponentially rare in certain
relevant classes of random matrices.
Fig. 60: relative error as a function of the matrix size n. Recall the statement made above about the
“improbability” of matrices for which Gaussian elimination with partial pivoting becomes unstable.
In the discussion of numerical stability (→ Def. 1.5.85, Rem. 1.5.88) we have seen that a stable algorithm
may produce results with large errors for ill-conditioned problems. The conditioning of the problem of
solving a linear system of equations is determined by the condition number (→ Def. 2.2.12) of the system
matrix, see Thm. 2.2.10.
Hence, for an ill-conditioned linear system, whose system matrix is beset with a huge condition number,
(stable) Gaussian elimination may return “solutions” with large errors. This will be demonstrated in this
experiment.
$$\mathbf{A} = \mathbf{u}\mathbf{v}^\top + \epsilon\,\mathbf{I}\,,\qquad
\mathbf{u} = \tfrac{1}{3}\,(1,2,3,\ldots,10)^\top,\qquad
\mathbf{v} = \bigl(-1,\tfrac12,-\tfrac13,\tfrac14,\ldots,\tfrac1{10}\bigr)^\top.$$
Fig. 61: cond(A) and the relative error of the computed solution as functions of ε.
The practical stability of Gaussian elimination is reflected by the size of a particular vector that can easily
be computed after the elimination solver has finished:
$$(\mathbf{A} + \Delta\mathbf{A})\,\widetilde{\mathbf{x}} = \mathbf{b} \;\Rightarrow\; \mathbf{r} = \mathbf{b} - \mathbf{A}\widetilde{\mathbf{x}} = \Delta\mathbf{A}\,\widetilde{\mathbf{x}} \;\Rightarrow\; \|\mathbf{r}\| \le \|\Delta\mathbf{A}\|\,\|\widetilde{\mathbf{x}}\|\,,$$
for any vector norm ‖·‖. This means that, if a direct solver for an LSE is stable in the sense of backward
error analysis, that is, if the perturbed solution could be obtained as the exact solution for a system matrix
with a small relative perturbation, then the residual will be (relatively) small.
Fig. 62: relative error and relative residual as functions of ε.
Observations (w.r.t. the ‖·‖∞-norm):
✦ for ε ≪ 1: large relative error in the computed solution x̃
✦ small residuals for any ε
How can a large relative error be reconciled with a small relative residual ?
$$\mathbf{A}\mathbf{x} = \mathbf{b} \;\leftrightarrow\; \mathbf{A}\widetilde{\mathbf{x}} \approx \mathbf{b}:$$
$$\mathbf{A}(\mathbf{x}-\widetilde{\mathbf{x}}) = \mathbf{r} \;\Rightarrow\; \|\mathbf{x}-\widetilde{\mathbf{x}}\| \le \bigl\|\mathbf{A}^{-1}\bigr\|\,\|\mathbf{r}\|\,,\qquad
\mathbf{A}\mathbf{x} = \mathbf{b} \;\Rightarrow\; \|\mathbf{b}\| \le \|\mathbf{A}\|\,\|\mathbf{x}\|$$
$$\Rightarrow\quad \frac{\|\mathbf{x}-\widetilde{\mathbf{x}}\|}{\|\mathbf{x}\|} \le \|\mathbf{A}\|\,\bigl\|\mathbf{A}^{-1}\bigr\|\,\frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}\,. \qquad (2.4.12)$$
➣ If cond(A) := ‖A‖ ‖A⁻¹‖ ≫ 1, then a small relative residual may not imply a small relative error.
Also recall the discussion in Exp. 2.4.9.
An important justification for Rem. 2.2.6 is conveyed by this experiment. We again consider the nearly
singular matrix from Ex. 2.4.10.
Fig. 63: relative residual as a function of ε for three ways of solving the LSE: Gaussian elimination,
multiplication with the inverse, and computation of the inverse.
All direct (∗) solver algorithms for square linear systems of equations Ax = b with given matrix A ∈
K n,n , right hand side vector b ∈ K n and unknown x ∈ K n rely on variants of Gaussian elimination
with pivoting, see Section 2.3.3. Sophisticated, optimised and verified implementations are available in
numerical libraries like LAPACK/MKL.
(∗): a direct solver terminates after a predictable finite number of elementary operations for every admis-
sible input.
Therefore, familiarity with details of Gaussian elimination is not required, but one must know when and
how to use the library functions and one must be able to assess the computational effort they involve.
We repeat the reasoning of § 2.3.5: Gaussian elimination for a general (dense) matrix invariably involves
three nested loops of length n, see Code 2.3.4, Code 2.3.41, which leads to an asymptotic complexity of O(n³).
The constant hidden in the Landau symbol can be expected to be rather small (≈ 1), as is clear from
(2.3.6).
The costs for solving are substantially lower if certain properties of the matrix A are known. This is clear
if A is diagonal or orthogonal/unitary. It is also true for triangular matrices (→ Def. 1.1.5), because the
corresponding systems can be solved by simple back substitution or forward elimination. We recall the
observation made in § 2.3.30.
Sometimes, the coefficient matrix of a linear system of equations is known to have certain analytic proper-
ties that a direct solver can exploit to perform elimination more efficiently. These properties may even be
impossible to detect by an algorithm, because matrix entries that should vanish exactly might have been
perturbed due to roundoff.
In this numerical experiment we study the gain in efficiency achievable by making the direct solver aware of
important matrix properties.
C++11 code 2.5.7: Direct solver applied to an upper triangular matrix ➺ GITLAB
//! Eigen code: assessing the gain from using special properties
//! of system matrices in Eigen
MatrixXd timing() {
  std::vector<int> n = {16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
  int nruns = 3;
  MatrixXd times(n.size(), 3);
  for (int i = 0; i < n.size(); ++i) {
    Timer t1, t2;  // timer class
    MatrixXd A = VectorXd::LinSpaced(n[i], 1, n[i]).asDiagonal();
    A += MatrixXd::Ones(n[i], n[i]).triangularView<Upper>();
    VectorXd b = VectorXd::Random(n[i]);
    VectorXd x1(n[i]), x2(n[i]);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x1 = A.lu().solve(b); t1.stop();
      t2.start(); x2 = A.triangularView<Upper>().solve(b); t2.stop();
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
  }
  return times;
}
Fig. 64: runtimes of the naive lu() solver and of the triangularView-based solver as functions of the matrix size n.
This can be reduced to one line, as the solvers can also be used as methods acting on matrices:
Eigen::VectorXd x = A.solverType().solve(b);
A full list of solvers can be found here, in the E IGEN documentation. The next code demonstrates a few of
the available decompositions that can serve as the basis for a linear solver:
  x = A.fullPivLu().solve(b); // total pivoting
}
The different decompositions trade speed for stability and accuracy: fully pivoted and QR-based decompositions
also work for nearly singular matrices, for which the standard LU-factorization may no longer be reliable.
Both E IGEN and M ATLAB provide functions that return decompositions of matrices, here the LU-decomposition
(→ Section 2.3.2):
EIGEN: MatrixXd A(n,n); auto ludec = A.lu();
MATLAB: [L,U] = lu(A)
Based on the precomputed decompositions, a linear system of equations with coefficient matrix A ∈ K n,n
can be solved with asymptotic computational effort O(n2 ), cf. § 2.3.30.
The following example illustrates a special situation, in which matrix decompositions can curb computa-
tional cost:
A concrete example is the so-called inverse power iteration, see Chapter 9, for which a skeleton code is sketched below.
x∗
x ∗ : = A − 1 x ( k ) , x ( k + 1) : = , k = 0, 1, 2, . . . , (2.5.13)
kx∗ k2
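The skeleton code itself is not spelled out here; the following minimal EIGEN sketch of (2.5.13) illustrates the point: the LU-decomposition of A is computed once, and every iteration then reuses it at a cost of only O(n²). Function name, argument list and the fixed iteration count are illustrative assumptions.

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Inverse power iteration (2.5.13): the factorization is computed once,
// each step only needs forward/backward substitution (O(n^2)).
VectorXd invpowit(const MatrixXd& A, VectorXd x, unsigned int maxit = 100) {
  Eigen::PartialPivLU<MatrixXd> ludec(A);  // LU-decomposition: O(n^3), done once
  for (unsigned int k = 0; k < maxit; ++k) {
    VectorXd xs = ludec.solve(x);          // x* = A^{-1} x^{(k)}
    x = xs / xs.norm();                    // normalization
  }
  return x;
}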
(A related caveat: storing an intermediate result like A+B in an auto variable will spawn an auxiliary object of a
“strange” type determined by the expression template mechanism, so such results should be assigned to explicitly
typed matrices.)
Triangular linear systems are linear systems of equations whose system matrix is a triangular matrix (→
Def. 1.1.5).
Thm. 2.5.3 tells us that (dense) triangular linear systems can be solved by backward/forward elimination
with O(n²) asymptotic computational effort (n ≙ number of unknowns), compared to an asymptotic
complexity of O(n³) for solving a generic (dense) linear system of equations (→ Thm. 2.5.2).
This is the simplest case where exploiting special structure of the system matrix leads to faster algorithms
for the solution of a special class of linear systems.
Remember that, thanks to the possibility to compute the matrix product in a block-wise fashion (→ § 1.3.15),
Gaussian elimination can be conducted on the level of matrix blocks. We recall Rem. 2.3.14 and Rem. 2.3.34.
Using block matrix multiplication (applied to the matrix×vector product in (2.6.3)) we find an equivalent
way to write the block partitioned linear system of equations:
$$\mathbf{A}_{11}\mathbf{x}_1 + \mathbf{A}_{12}\mathbf{x}_2 = \mathbf{b}_1\,,\qquad \mathbf{A}_{21}\mathbf{x}_1 + \mathbf{A}_{22}\mathbf{x}_2 = \mathbf{b}_2\,. \qquad (2.6.4)$$
We assume that A11 is regular (invertible) so that we can solve for x1 from the first equation.
The resulting ℓ × ℓ linear system of equations for the unknown vector x2 is called the Schur complement
system for (2.6.3).
Unless A has a special structure that allows the efficient solution of linear systems with system matrix
A11 , the Schur complement system is mainly of theoretical interest.
$$\mathbf{A} = \begin{pmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^\top & \alpha \end{pmatrix} \qquad (2.6.6)$$
Fig. 65: spy plot of the arrow matrix (nz = 31).
We can apply the block partitioning (2.6.3) with k = n and ℓ = 1 to a linear system Ax = y with system
matrix A and obtain A11 = D, which can be inverted easily, provided that all diagonal entries of D are
different from zero. In this case
$$\mathbf{A}\mathbf{x} = \begin{pmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^\top & \alpha \end{pmatrix}\begin{pmatrix} \mathbf{x}_1 \\ \xi \end{pmatrix} = \mathbf{y} := \begin{pmatrix} \mathbf{y}_1 \\ \eta \end{pmatrix}\,, \qquad (2.6.7)$$
$$\xi = \frac{\eta - \mathbf{b}^\top\mathbf{D}^{-1}\mathbf{y}_1}{\alpha - \mathbf{b}^\top\mathbf{D}^{-1}\mathbf{c}}\,,\qquad \mathbf{x}_1 = \mathbf{D}^{-1}(\mathbf{y}_1 - \xi\,\mathbf{c})\,. \qquad (2.6.8)$$
These formulas make sense, if D is regular and α − b⊤D⁻¹c ≠ 0, which is another condition for the
invertibility of A.
Using the formula (2.6.8) we can solve the linear system (2.6.7) with an asymptotic complexity O(n)!
This superior speed compared to Gaussian elimination applied to the (dense) linear system is evident in
runtime measurements.
C++11 code 2.6.9: Dense Gaussian elimination applied to arrow system ➺ GITLAB
VectorXd arrowsys_slow(const VectorXd& d, const VectorXd& c, const VectorXd& b,
                       const double alpha, const VectorXd& y) {
  int n = d.size();
  MatrixXd A(n + 1, n + 1); A.setZero();
  A.diagonal().head(n) = d;
  A.col(n).head(n) = c;
  A.row(n).head(n) = b;
  A(n, n) = alpha;
  return A.lu().solve(y);
}
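For comparison, a minimal sketch of an O(n) solver based directly on the formulas (2.6.8); it is referred to below under the name arrowsys_fast in the timing code, though the actual Code 2.6.10 may differ in details.

// Solving the arrow system via (2.6.8) with O(n) effort: only vector
// operations and divisions by the diagonal entries of D are needed.
VectorXd arrowsys_fast(const VectorXd& d, const VectorXd& c, const VectorXd& b,
                       const double alpha, const VectorXd& y) {
  int n = d.size();
  const VectorXd z = c.cwiseQuotient(d);          // z = D^{-1} c
  const VectorXd w = y.head(n).cwiseQuotient(d);  // w = D^{-1} y_1
  const double xi = (y(n) - b.dot(w)) / (alpha - b.dot(z));
  VectorXd x(n + 1);
  x.head(n) = w - xi * z;                          // x_1 = D^{-1}(y_1 - xi*c)
  x(n) = xi;
  return x;
}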
C++11 code 2.6.11: Runtime measurement of Code 2.6.9 vs. Code 2.6.10 vs. sparse tech-
niques ➺ GITLAB
MatrixXd arrowsystiming() {
  std::vector<int> n = {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
  int nruns = 3;
  MatrixXd times(n.size(), 6);
  for (int i = 0; i < n.size(); ++i) {
    Timer t1, t2, t3, t4;  // timer class
    double alpha = 2;
    VectorXd b = VectorXd::Ones(n[i], 1);
    VectorXd c = VectorXd::LinSpaced(n[i], 1, n[i]);
    VectorXd d = -b;
    VectorXd y = VectorXd::Constant(n[i] + 1, -1).binaryExpr(
        VectorXd::LinSpaced(n[i] + 1, 1, n[i] + 1),
        [](double x, double y) { return std::pow(x, y); });
    VectorXd x1(n[i] + 1), x2(n[i] + 1), x3(n[i] + 1), x4(n[i] + 1);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x1 = arrowsys_slow(d, c, b, alpha, y); t1.stop();
      t2.start(); x2 = arrowsys_fast(d, c, b, alpha, y); t2.stop();
      t3.start();
      x3 = arrowsys_sparse<SparseLU<SparseMatrix<double>>>(d, c, b, alpha, y);
      t3.stop();
      t4.start();
      x4 = arrowsys_sparse<BiCGSTAB<SparseMatrix<double>>>(d, c, b, alpha, y);
      t4.stop();
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
    times(i, 3) = t3.min(); times(i, 4) = t4.min(); times(i, 5) = (x4 - x3).norm();
  }
  return times;
}
Fig. 66: runtimes [s] of arrowsys_slow and arrowsys_fast as functions of the matrix size n
(Ubuntu 14.04 LTS, gcc 4.8.4, -O3). No comment!
The vector based implementation of the solver of Code 2.6.10 can be vulnerable to roundoff errors, be-
cause, upon closer inspection, the algorithm turns out to be equivalent to Gaussian elimination without
pivoting, cf. Section 2.3.3, Ex. 2.3.36.
Given a regular matrix A ∈ K n,n , let us assume that at some point in a code we are in a position to solve
any linear system Ax = b “fast”, because an LU-decomposition of A is already available. Now a single
entry of A is changed:
$$\widetilde{\mathbf{A}} = \mathbf{A} + z\,\mathbf{e}_{i^*}\mathbf{e}_{j^*}^\top\,. \qquad (2.6.15)$$
(Recall: e_i ≙ the i-th unit vector.) The question is whether we can reuse some of the computations spent on
solving Ax = b in order to solve Ãx̃ = b with less effort than entailed by a direct Gaussian elimination
from scratch.
We may also consider a matrix modification affecting a single row: given z ∈ K n and a row index i* ∈ {1, ..., n},
$$\widetilde{\mathbf{A}} \in \mathbb{K}^{n,n}:\quad \widetilde a_{ij} = \begin{cases} a_{ij}\,, & \text{if } i \ne i^*,\\ a_{ij} + (\mathbf{z})_j\,, & \text{if } i = i^*, \end{cases}
\qquad\text{that is,}\qquad \widetilde{\mathbf{A}} = \mathbf{A} + \mathbf{e}_{i^*}\mathbf{z}^\top\,. \qquad (2.6.16)$$
Both matrix modifications (2.6.14) and (2.6.16) represent rank-1-modifications of A. A generic rank-1-
modification reads
$$\mathbf{A} \in \mathbb{K}^{n,n} \;\mapsto\; \widetilde{\mathbf{A}} := \mathbf{A} + \mathbf{u}\mathbf{v}^H\,,\qquad \mathbf{u}, \mathbf{v} \in \mathbb{K}^n\,, \qquad (2.6.17)$$
with uv^H a general rank-1-matrix.
Idea:
Block elimination of an extended linear system, see § 2.6.2
$$(\mathbf{A} + \mathbf{u}\mathbf{v}^H)\,\widetilde{\mathbf{x}} = \mathbf{b} \;\Leftrightarrow\; \widetilde{\mathbf{A}}\widetilde{\mathbf{x}} = \mathbf{b}\,. \qquad (2.6.19)$$
Hence, we have solved the modified LSE, once we have found the component x̃ of the solution of the
extended linear system (2.6.18). We do block elimination again, now getting rid of x̃ first, which yields the
other Schur complement system
$$(1 + \mathbf{v}^H\mathbf{A}^{-1}\mathbf{u})\,\xi = \mathbf{v}^H\mathbf{A}^{-1}\mathbf{b}\,. \qquad (2.6.20)$$
$$\mathbf{A}\widetilde{\mathbf{x}} = \mathbf{b} - \frac{\mathbf{u}\mathbf{v}^H\mathbf{A}^{-1}}{1 + \mathbf{v}^H\mathbf{A}^{-1}\mathbf{u}}\,\mathbf{b}\,. \qquad (2.6.21)$$
The generalization of this formula to rank-k-perturbations is given by the Sherman-Morrison-Woodbury formula:
for U, V ∈ K n,k ,
$$(\mathbf{A} + \mathbf{U}\mathbf{V}^H)^{-1}\mathbf{b} = \mathbf{A}^{-1}\mathbf{b} - \mathbf{A}^{-1}\mathbf{U}\bigl(\mathbf{I} + \mathbf{V}^H\mathbf{A}^{-1}\mathbf{U}\bigr)^{-1}\mathbf{V}^H\mathbf{A}^{-1}\mathbf{b}\,,$$
if I + V^H A⁻¹ U is regular.
We use this result to solve Ãx̃ = b with Ã from (2.6.17) more efficiently than straightforward elimination
could deliver, provided that an LU-factorization A = LU is already available.
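A minimal EIGEN sketch of this reuse for a rank-1-modification (2.6.17) with real data: the precomputed PartialPivLU factorization of A is applied twice, so the modified system is solved with O(n²) effort. The function name and signature are illustrative.

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Solving (A + u*v^T) x = b via (2.6.21), reusing a precomputed
// LU-decomposition of A: only two O(n^2) substitutions are needed.
VectorXd smw_solve(const Eigen::PartialPivLU<MatrixXd>& ludec,
                   const VectorXd& u, const VectorXd& v, const VectorXd& b) {
  const VectorXd z = ludec.solve(b);    // z = A^{-1} b
  const VectorXd w = ludec.solve(u);    // w = A^{-1} u
  const double alpha = 1.0 + v.dot(w);  // must be != 0 for invertibility
  return z - w * (v.dot(z) / alpha);
}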
Many linear systems with system matrices that differ in a single entry only have to be solved when we
want to determine the dependence of the total impedance of a (linear) circuit on the parameters of a
single component.
Fig. 67: a large (linear) electric circuit (modelling → Ex. 2.1.3) built from resistors, capacitors and inductors,
one of which is a “continuously varying” resistance Rx.
Sought: dependence of (certain) branch currents on the resistance Rx
(➣ currents for many different values of Rx).
Only a few entries of the nodal analysis matrix A (→ Ex. 2.1.3) are affected by variation of Rx !
(If Rx connects nodes i & j ⇒ only entries aii , a jj , aij , a ji of A depend on Rx )
A ∈ K m,n , m, n ∈ N, is called sparse, if nnz(A) ≪ mn.
Sloppy parlance: matrix sparse :⇔ “almost all” entries = 0 / “only a few percent of” the entries ≠ 0.
A matrix with enough zeros that it pays to take advantage of them should be treated as sparse.
More precisely, for a family of matrices A(l) ∈ K n_l,m_l sparsity means
$$\lim_{l\to\infty}\frac{\operatorname{nnz}(\mathbf{A}^{(l)})}{n_l\, m_l} = 0\,.$$
See Ex. 2.1.3 for the description of a linear electric circuit by means of a linear system of equations for
nodal voltages. For large circuits the system matrices will invariably be huge and sparse.
Remark 2.7.5 (Sparse matrices from the discretization of linear partial differential equations)
☛ spatial discretization of linear boundary value problems for partial differential equations by means
of finite element (FE), finite volume (FV), or finite difference (FD) methods (→ 4th semester course
“Numerical methods for PDEs”).
Sparse matrix storage formats for storing a “sparse matrix” A ∈ K m,n are designed to achieve two objec-
tives:
➊ Amount of memory required is only slightly more than nnz(A) scalars.
➋ Computational effort for matrix×vector multiplication is proportional to nnz(A).
In this section we see a few schemes used by numerical libraries.
In the case of a sparse matrix A ∈ K m,n , the triplet (coordinate list, COO) format stores triplets (i, j, α_{i,j}), 1 ≤ i ≤ m, 1 ≤ j ≤ n:
struct TripletMatrix {
  std::size_t m, n;             // number of rows and columns
  std::vector<std::size_t> I;   // row indices
  std::vector<std::size_t> J;   // column indices
  std::vector<scalar_t> a;      // values associated with index pairs
};
We write “≥”, because repetitions of index pairs (i, j) are allowed. The matrix entry (A)_{i,j} is defined to
be the sum of all values α_{i,j} associated with the index pair (i, j). The next code clearly demonstrates this
summation.
Note that this code assumes that the result vector y has the appropriate length; no index checks are
performed.
Code 2.7.7: computational effort is proportional to the number of triplets. (This might be much larger than
nnz(A) in case of many repetitions of triplets.)
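Code 2.7.7 is not reproduced here; a matrix×vector product in triplet format consistent with the above description can be sketched as follows. It reuses the TripletMatrix struct from above with scalar_t = double; repeated index pairs are summed automatically.

#include <vector>
#include <cstddef>

// y = A*x for A in triplet format; values of repeated index pairs add up.
// Assumes y.size() == A.m and x.size() == A.n; no index checks performed.
void tripletMatVec(const TripletMatrix& A, const std::vector<scalar_t>& x,
                   std::vector<scalar_t>& y) {
  for (scalar_t& yi : y) yi = 0;                  // clear result vector
  for (std::size_t k = 0; k < A.a.size(); ++k)    // loop over all triplets
    y[A.I[k]] += A.a[k] * x[A.J[k]];
}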
The CRS format for a sparse matrix A = (aij ) ∈ K n,n keeps the data in three contiguous arrays:
val ↔ values a_ij of the non-zero entries, col_ind ↔ their column indices j, row_ptr ↔ positions in val
where the rows start. Example:
$$\mathbf{A} = \begin{pmatrix}
10 & 0 & 0 & 0 & -2 & 0\\
3 & 9 & 0 & 0 & 0 & 3\\
0 & 7 & 8 & 7 & 0 & 0\\
3 & 0 & 8 & 7 & 5 & 0\\
0 & 8 & 0 & 9 & 9 & 13\\
0 & 4 & 0 & 0 & 2 & -1
\end{pmatrix}$$
val-vector:     10 −2 3 9 3 7 8 7 3 8 7 5 8 9 9 13 4 2 −1
col_ind-array:  1 5 1 2 6 2 3 4 1 3 4 5 2 4 5 6 2 5 6
row_ptr-array:  1 3 6 9 13 17 20
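A CRS matrix×vector product traverses val row by row; a minimal sketch with 0-based indexing (the arrays in the example above are written with 1-based indices):

#include <vector>
#include <cstddef>

// y = A*x for A in CRS format (0-based indices): row_ptr has n+1 entries,
// row_ptr[i] .. row_ptr[i+1]-1 index row i inside val and col_ind.
void crsMatVec(const std::vector<double>& val,
               const std::vector<std::size_t>& col_ind,
               const std::vector<std::size_t>& row_ptr,
               const std::vector<double>& x, std::vector<double>& y) {
  const std::size_t n = row_ptr.size() - 1;
  y.assign(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
      y[i] += val[k] * x[col_ind[k]];
}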
Supplementary reading. A detailed discussion of sparse matrix formats and how to work with
them efficiently is given in [?]. An interesting related article on M ATLAB-central can be found here.
Matrices have to be created explicitly in sparse format by one of the following commands in order to let
M ATLAB know that the CCS format is to be used for internal representation.
The CCS internal data format used by M ATLAB has an impact on the speed of access operations, cf.
Exp. 1.2.29 for a similar effect.
Fig. 69: spy plot of the sparse test matrix (nz = 32). Fig. 70: access times [s] as functions of the size n of
the sparse quadratic matrix.
t = [];
for i=1:20
  n = 2^i; m = n/2;
  A = spdiags(repmat([-1 2 5],n,1),[-n/2,0,n/2],n,n);
  % ... (timing of row and column access)
end

figure;
loglog(t(:,1),t(:,3),'r+-', t(:,1),t(:,4),'b*-',...
       t(:,1),t(1,3)*t(:,1)/t(1,1),'k-');
xlabel('{\bf size n of sparse quadratic matrix}','fontsize',14);
ylabel('{\bf access time [s]}','fontsize',14);
legend('row access','column access','O(n)','location','northwest');
print -depsc2 '../PICTURES/sparseaccess.eps';
M ATLAB uses compressed column storage (CCS), which entails O(n) searches for index j in the index
array when accessing all elements of a matrix row. Conversely, access to a column does not involve any
search operations.
Note the use of the MATLAB command repmat in the above code. It can be used to
build structured matrices. Consult the MATLAB documentation for details.
We study different ways to set a few non-zero entries of a sparse matrix. The first code just uses the
()-operator to set matrix entries.
The second and third code rely on an intermediate triplet format (→ § 2.7.6) to build the sparse matrix
and finally pass this to M ATLAB’s sparse function.
for n=2.^(8:14)
  t1 = 1000; for k=1:K, fprintf('sparse1, %d, %d\n',n,k); tic; sparse1; t1 = min(t1,toc); end
  t2 = 1000; for k=1:K, fprintf('sparse2, %d, %d\n',n,k); tic; sparse2; t2 = min(t2,toc); end
  t3 = 1000; for k=1:K, fprintf('sparse3, %d, %d\n',n,k); tic; sparse3; t3 = min(t3,toc); end
  r = [r; n, t1, t2, t3];
end

loglog(r(:,1),r(:,2),'r*', r(:,1),r(:,3),'m+', r(:,1),r(:,4),'b^');
xlabel('{\bf matrix size n}','fontsize',14);
ylabel('{\bf time [s]}','fontsize',14);
legend('Initialization I','Initialization II','Initialization III',...
       'location','northwest');
print -depsc2 '../PICTURES/sparseinit.eps';
Fig. 71: runtimes [s] of the three initialization variants (Initialization I, II, III) as functions of the matrix size n.
☛ It is grossly inefficient to initialize a matrix in CCS format (→ Ex. 2.7.9) by setting individual entries
one after another, because this usually entails moving large chunks of memory to create space for
new non-zero entries.
Instead, calls like
sparse(dat(1:k,1),dat(1:k,2),dat(1:k,3),n,n);
where the matrix dat holds the entries in triplet format (→ § 2.7.6),
allow MATLAB to allocate memory and initialize the arrays in one sweep.
We study a sparse matrix A ∈ R n,n initialized by setting some of its (off-)diagonals with M ATLAB’s spdiags
function:
A = spdiags([(1:n)’,ones(n,1),(n:-1:1)’],...
[-floor(n/3),0,floor(n/3)],n,n);
Spy plots of A and of A*A; Fig. 74: runtime of the sparse matrix multiplication (tic/toc timing) as a function of n.
When extracting a single entry from a sparse matrix, this entry will be stored in sparse format though it is
a mere number! This will considerably slow down all operations on that entry.
Change in Indexing for Sparse Matrix Access. Now subscripted reference into a sparse matrix always
returns a sparse matrix. In previous versions of MATLAB, using a double scalar to index into a sparse
matrix resulted in full scalar output.
Eigen can handle sparse matrices in the standard Compressed Row Storage (CRS) and Compressed
Column Storage (CCS) format, see Ex. 2.7.9 and the documentation:
#include <Eigen/Sparse>
Eigen::SparseMatrix<int, Eigen::ColMajor> Asp(rows, cols);    // CCS format
Eigen::SparseMatrix<double, Eigen::RowMajor> Bsp(rows, cols); // CRS format
As already discussed in Exp. 2.7.13, sparse matrices must not be filled by setting entries through index-
pair access. As in MATLAB, also for E IGEN the matrix should first be assembled in triplet format, from
which a sparse matrix is built. E IGEN offers special facilities for handling triplets.
unsigned int row_idx = 2;
unsigned int col_idx = 4;
double value = 2.5;
Eigen::Triplet<double> triplet(row_idx, col_idx, value);
std::cout << '(' << triplet.row() << ',' << triplet.col()
          << ',' << triplet.value() << ')' << std::endl;
As shown, a Triplet object offers the access member functions row(), col(), and value() to
fetch the row index, column index, and scalar value stored in a Triplet.
The statement that entry-wise initialization of sparse matrices is not efficient has to be qualified in EIGEN.
Entries can be set, provided that enough space for each row (in RowMajor format) is reserved in advance.
This is done by the reserve() method, which takes an integer vector of maximal expected numbers
of non-zero entries per row:
insert(i,j) sets an entry of the sparse matrix, which is rather efficient, provided that enough space
has been reserved. coeffRef(i,j) gives l-value and r-value access to any matrix entry, creating a
non-zero entry, if needed: costly!
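A minimal sketch contrasting the two initialization routes just described (triplet list followed by setFromTriplets() versus reserve() plus insert()); the matrix contents and sizes are chosen only for illustration.

#include <Eigen/Sparse>
#include <vector>

// Two ways to initialize a sparse n x n matrix in EIGEN (RowMajor storage).
void initDemo(int n) {
  // Route 1: collect triplets, then build the matrix in one sweep.
  std::vector<Eigen::Triplet<double>> triplets;
  for (int i = 0; i < n; ++i) triplets.emplace_back(i, i, 2.0);
  Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
  A.setFromTriplets(triplets.begin(), triplets.end());

  // Route 2: reserve space per row, then insert entries directly.
  Eigen::SparseMatrix<double, Eigen::RowMajor> B(n, n);
  B.reserve(Eigen::VectorXi::Constant(n, 1)); // expect 1 non-zero per row
  for (int i = 0; i < n; ++i) B.insert(i, i) = 2.0;
  B.makeCompressed();
}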
The usual matrix operations are supported for sparse matrices; addition and subtraction may involve only
sparse matrices stored in the same format. These operations may incur large hidden costs and have to
be used with care!
We study the runtime behavior of the initialization of a sparse matrix in Eigen parallel to the tests from
Exp. 2.7.13. We use the methods described above.
Runtimes (in ms) for the initialization of a banded matrix (with 5 non-zero diagonals, that is, a maximum of
5 non-zero entries per row) using different techniques in EIGEN: triplet-based initialization, coeffRef with
space reserved, and coeffRef without space reserved.
Observation: insufficient advance allocation of memory massively slows down the set-up of a sparse
matrix in the case of direct entry-wise initialization.
Reason: massive internal copying of data is required to create space for “unexpected” entries.
This example demonstrates that sparse linear systems of equations naturally arise in the handling of
triangulations.
The points in N are also called the nodes of the mesh, the triangles the cells, and all line segments
connecting two nodes and occurring as a side of a triangle form the set of edges. We always assume a
consecutive numbering of the nodes and cells of the triangulation (starting from 1, M ATLAB’s convention).
Fig. 76, Fig. 77: examples of planar triangulations.
M ATLAB data structure for describing a triangulation with N nodes and M cells:
• column vector x ∈ R N : x-coordinates of nodes
• column vector y ∈ R N : y-coordinates of nodes
• M × 3-matrix T whose rows contain the index numbers of the vertices of the cells.
(This matrix is a so-called triangle-node incidence matrix.)
The Figure library provides the function triplot for drawing planar triangulations:
// These indices refer to the ordering of the coordinates as given in the
// vectors x and y.
Eigen::MatrixXi T(11, 3);
T << 7, 1, 2,   5, 6, 2,   4, 1, 7,   6, 7, 2,
     6, 4, 7,   6, 5, 0,   3, 6, 0,   8, 4, 3,
     3, 4, 6,   8, 1, 4,   9, 1, 8;
// Call the Figure plotting routine, draw mesh with blue edges,
// red vertices and a numbering/ordering
mgl::Figure fig1; fig1.setFontSize(8);
fig1.ranges(0.0, 1.05, 0.0, 1.05);
fig1.triplot(T, x, y, "b?");  // drawing triangulation with numbers
fig1.plot(x, y, "*r");        // mark vertices
fig1.save("meshplot_cpp");
Fig. 78: the triangulation drawn by the code above.
The cells of a mesh may be rather distorted triangles (with very large and/or small angles), which is usually
not desirable. We study an algorithm for smoothing a mesh without changing the planar domain covered
by it.
Every edge that is adjacent to only one cell is a boundary edge of the triangulation. Nodes that are
endpoints of boundary edges are boundary nodes.
✎ Notation: Γ ⊂ {1, ..., N} ≙ set of indices of boundary nodes.
✎ Notation: p^i = (p^i_1, p^i_2) ∈ R² ≙ coordinate vector of node ♯i, i = 1, ..., N
   (p^i_1 ↔ x(i), p^i_2 ↔ y(i) in MATLAB).
We define
S(i ) := { j ∈ {1, . . . , N } : nodes i and j are connected by an edge} , (2.7.27)
as the set of node indices of the “neighbours” of the node with index number i.
$$\mathbf{p}^i = \frac{1}{\sharp S(i)}\sum_{j\in S(i)}\mathbf{p}^j \quad\Leftrightarrow\quad \sharp S(i)\,p^i_d = \sum_{j\in S(i)} p^j_d\,,\ \ d = 1,2\,,\quad \text{for all } i \in \{1,\ldots,N\}\setminus\Gamma\,, \qquad (2.7.29)$$
that is, every interior node is located in the center of gravity of its neighbours.
The relations (2.7.29) correspond to the lines of a sparse linear system of equations! In order to state it,
we insert the coordinates of all nodes into a column vector z ∈ K2N , according to
$$z_i = \begin{cases} p^i_1\,, & \text{if } 1 \le i \le N\,,\\ p^{\,i-N}_2\,, & \text{if } N+1 \le i \le 2N\,. \end{cases} \qquad (2.7.30)$$
For the sake of ease of presentation, in the sequel we assume (which is not the case in usual triangulation
data) that interior nodes have index numbers smaller than that of boundary nodes.
From (2.7.27) we infer that the system matrix C ∈ R 2n,2N , n := N − ♯Γ, of that linear system has the
following structure:
$$\mathbf{C} = \begin{pmatrix} \mathbf{A} & \mathbf{O}\\ \mathbf{O} & \mathbf{A} \end{pmatrix}\,,\qquad
(\mathbf{A})_{i,j} = \begin{cases} \sharp S(i)\,, & \text{if } i = j\,,\\ -1\,, & \text{if } j \in S(i)\,,\\ 0 & \text{else,}\end{cases}
\qquad i \in \{1,\ldots,n\}\,,\ j \in \{1,\ldots,N\}\,. \qquad (2.7.31)$$
$$(2.7.29) \;\Leftrightarrow\; \mathbf{C}\mathbf{z} = \mathbf{0}\,. \qquad (2.7.32)$$
We partition the vector z into coordinates of nodes in the interior and of nodes on the boundary:
$$\mathbf{z} = \begin{pmatrix} \mathbf{z}^{\mathrm{int}}_1 \\ \mathbf{z}^{\mathrm{bd}}_1 \\ \mathbf{z}^{\mathrm{int}}_2 \\ \mathbf{z}^{\mathrm{bd}}_2 \end{pmatrix},\qquad
\mathbf{z}^\top = \bigl(\underbrace{z_1,\ldots,z_n}_{\mathbf{z}^{\mathrm{int}}_1},\ \underbrace{z_{n+1},\ldots,z_N}_{\mathbf{z}^{\mathrm{bd}}_1},\ \underbrace{z_{N+1},\ldots,z_{N+n}}_{\mathbf{z}^{\mathrm{int}}_2},\ \underbrace{z_{N+n+1},\ldots,z_{2N}}_{\mathbf{z}^{\mathrm{bd}}_2}\bigr).$$
This induces the following block partitioning of the linear system (2.7.32):
$$\begin{pmatrix} \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} & \mathbf{O} & \mathbf{O}\\ \mathbf{O} & \mathbf{O} & \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} \end{pmatrix}
\begin{pmatrix} \mathbf{z}^{\mathrm{int}}_1 \\ \mathbf{z}^{\mathrm{bd}}_1 \\ \mathbf{z}^{\mathrm{int}}_2 \\ \mathbf{z}^{\mathrm{bd}}_2 \end{pmatrix} = \mathbf{0}\,,
\qquad \mathbf{A}_{\mathrm{int}} \in \mathbb{R}^{n,n}\,,\quad \mathbf{A}_{\mathrm{bd}} \in \mathbb{R}^{n,N-n}\,,$$
which is equivalent to
$$\mathbf{A}_{\mathrm{int}}\mathbf{z}^{\mathrm{int}}_1 + \mathbf{A}_{\mathrm{bd}}\mathbf{z}^{\mathrm{bd}}_1 = \mathbf{0}\,,\qquad
\mathbf{A}_{\mathrm{int}}\mathbf{z}^{\mathrm{int}}_2 + \mathbf{A}_{\mathrm{bd}}\mathbf{z}^{\mathrm{bd}}_2 = \mathbf{0}\,. \qquad (2.7.33)$$
The linear system (2.7.33) holds the key to the algorithmic realization of mesh smoothing; when smoothing
the mesh
(i) the node coordinates belonging to interior nodes have to be adjusted to satisfy the equilibrium con-
dition (2.7.29), they are unknowns,
(ii) the coordinates of nodes located on the boundary are fixed, that is, their values are known.
This is a square linear system with an n × n system matrix, to be solved for two different right hand side
vectors. The matrix Aint is also known as the matrix of the combinatorial graph Laplacian.
We examine the sparsity pattern of the system matrices Aint for a sequence of triangulations created by
regular refinement.
We start from the triangulation of Fig. 78 and in turns perform regular refinement and smoothing (left ↔
Below we give spy plots of the system matrices Aint for the first three triangulations of the sequence:
Efficient Gaussian elimination for sparse matrices requires sophisticated algorithms that are encapsulated
in special types of solvers in E IGEN. Their calling syntax remains unchanged, however:
Eigen::SolverType<Eigen::SparseMatrix< double >> solver(A);
Eigen::VectorXd x = solver.solve(b);
C++-code 2.7.36: Function for solving a sparse LSE with E IGEN ➺ GITLAB
using SparseMatrix = Eigen::SparseMatrix<double>;
// Perform sparse elimination
void sparse_solve(const SparseMatrix& A, const VectorXd& b, VectorXd& x) {
  Eigen::SparseLU<SparseMatrix> solver(A);
  x = solver.solve(b);
}
The following codes initialize a sparse matrix, then perform an LU-factorization, and, finally, solve a sparse
linear system with a random right hand side vector.
  for (size_t l = 0; l < n; ++l)
    triplets.push_back(Eigen::Triplet<scalar_t>(l, l, 5.0));
  for (size_t l = 1; l < n; ++l) {
    triplets.push_back(Eigen::Triplet<scalar_t>(l - 1, l, 1.0));
    triplets.push_back(Eigen::Triplet<scalar_t>(l, l - 1, 1.0));
  }
  const size_t m = n / 2;
  for (size_t l = 0; l < m; ++l) {
    triplets.push_back(Eigen::Triplet<scalar_t>(l, l + m, 1.0));
    triplets.push_back(Eigen::Triplet<scalar_t>(l + m, l, 1.0));
  }
  SpMat M(n, n);
  M.setFromTriplets(triplets.begin(), triplets.end());
  M.makeCompressed();
  return M;
}
The compute method of the solver object triggers the actual sparse LU-decomposition. The solve
method then does forward and backward elimination, cf. § 2.3.30. It can be called multiple times, see
Rem. 2.5.10.
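A minimal sketch of this reuse pattern: the factorization is computed once by compute(), after which solve() is called for several right-hand sides (names and the way the right-hand sides are stored are illustrative).

#include <Eigen/Sparse>
using SpMat = Eigen::SparseMatrix<double>;

// One factorization, several right-hand sides: compute() triggers the sparse
// LU-decomposition once; each solve() only does forward/backward elimination.
void solveMany(const SpMat& A, const Eigen::MatrixXd& B, Eigen::MatrixXd& X) {
  Eigen::SparseLU<SpMat> solver;
  solver.compute(A);                    // sparse LU-decomposition of A
  X.resize(B.rows(), B.cols());
  for (int j = 0; j < B.cols(); ++j)
    X.col(j) = solver.solve(B.col(j));  // reuse the factorization
}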
In Ex. 2.6.5 we saw that applying the standard lu() solver to a sparse arrow matrix results in an extreme
waste of computational resources.
Yet, E IGEN can do much better! The main mistake was the creation of a dense matrix instead of storing
the arrow matrix in sparse format. There are E IGEN solvers which rely on particular sparse elimination
techniques. They still rely of Gaussian elimination with (partial) pivoting (→ Code 2.3.41), but take pains
to operate on non-zero entries only. This can greatly boost the speed of the elimination.
C++11 code 2.7.41: Invoking sparse elimination solver for arrow matrix ➺ GITLAB
template <class solver_t>
VectorXd arrowsys_sparse(const VectorXd& d, const VectorXd& c, const VectorXd& b,
                         const double alpha, const VectorXd& y) {
  int n = d.size();
  SparseMatrix<double> A(n + 1, n + 1);  // default: column-major
  VectorXi reserveVec = VectorXi::Constant(n + 1, 2);  // nnz per col
  reserveVec(n) = n + 1;  // last full col
  A.reserve(reserveVec);
  for (int j = 0; j < n; ++j) {  // initialize along cols for efficiency
    A.insert(j, j) = d(j);  // diagonal entries
    A.insert(n, j) = b(j);  // bottom row entries
  }
  for (int i = 0; i < n; ++i) {
    A.insert(i, n) = c(i);  // last col
  }
  A.insert(n, n) = alpha;  // bottom-right entry
  A.makeCompressed();
  return solver_t(A).solve(y);
}
Fig. 83: runtimes as functions of the matrix size n for arrowsys slow, arrowsys fast, arrowsys SparseLU and
arrowsys iterative.
Observation: the sparse elimination solver is much faster than lu() applied to the dense matrix.
The sparse solver is still slower than Code 2.6.10. The reason is that it is a
general algorithm that has to keep track of non-zero entries and has to be prepared to do pivoting.
Experiment 2.7.42 (Timing sparse elimination for the combinatorial graph Laplacian)
We consider a sequence of planar triangulations created by successive regular refinement (→ Def. 2.7.35)
of the planar triangulation of Fig. 78, see Ex. 2.7.23. We use different E IGEN and MKL sparse solver for
the linear system of equations (2.7.34) associated with each mesh.
Timing results: runtimes of Eigen SparseLU, Eigen SimplicialLDLT, Eigen ConjugateGradient, MKL PardisoLU,
and MKL PardisoLDLT, compared with the reference line O(n^1.5).
When solving linear systems of equations directly, dedicated sparse elimination solvers from
numerical libraries have to be used!
System matrices are passed to these algorithms in sparse storage formats (→ 2.7.1) to convey
information about zero entries.
STOP Never ever even think about implementing a general sparse elimination solver by yourself!
→ SuperLU (http://www.cs.berkeley.edu/~demmel/SuperLU.html),
→ UMFPACK (http://www.cise.ufl.edu/research/sparse/umfpack/), used by M ATLAB’s \,
C++11-code 2.7.44: Example code demonstrating the use of PARDISO with E IGEN ➺ GITLAB
void solveSparsePardiso(size_t n) {
  using SpMat = Eigen::SparseMatrix<double>;
  // Initialize a sparse matrix
  const SpMat M = initSparseMatrix<SpMat>(n);
  const Eigen::VectorXd b = Eigen::VectorXd::Random(n);
  Eigen::VectorXd x(n);
  // Initialization of the sparse direct solver based on the Pardiso library,
  // directly passing the matrix M to the solver.
  // Pardiso is part of the Intel MKL library, see also Ex. 1.3.24
  Eigen::PardisoLU<SpMat> solver(M);
  // The checks of Code 2.7.39 are omitted
  // solve the LSE
  x = solver.solve(b);
}
In Sect. 2.7.1 we have seen, how sparse matrices can be stored requiring O(nnz(A)) memory.
In Ex. 2.7.18 we found that (sometimes) matrix multiplication of sparse matrices can also be carried out
with optimal complexity, that is, with computational effort proportional to the total number of non-zero
entries of all matrices involved.
Does this carry over to the solution of linear systems of equations with sparse system matrices?
Section 2.7.4 says “Yes”, when sophisticated library routines are used. In this section, we examine some
aspects of Gaussian elimination ↔ LU-factorisation when applied in a sparse context.
We examine the following “sparse” matrix with a typical structure and inspect the pattern of the LU-factors
returned by E IGEN, see Code 2.7.46.
$$(\mathbf{A})_{ij} = \begin{cases} 3\,, & \text{if } i = j\,,\\ -1\,, & \text{if } |i-j| = 1 \ \text{or}\ |i-j| = n/2\,,\\ 0 & \text{else,}\end{cases}
\qquad \mathbf{A} \in \mathbb{R}^{n,n}\,,\ n \in \mathbb{N} \text{ even.}$$
Spy plots of A and of its LU-factors L and U: the factors contain many more non-zero entries than A itself.
Of course, in case the LU-factors of a sparse matrix possess many more non-zero entries than the matrix
itself, the effort for solving a linear system with direct elimination will increase significantly. This can be
quantified by means of the following concept:
A is called an “arrow matrix”, see the pattern of non-zero entries below and Ex. 2.6.5.
Recalling Rem. 2.3.32 it is easy to see that the LU-factors of A will be sparse and that their sparsity
patterns will be as depicted below. Observe that despite sparse LU-factors, A−1 will be densely populated.
Spy plots: pattern of A (nz = 31), pattern of A⁻¹ (nz = 121), pattern of L (nz = 21), pattern of U (nz = 21).
Recall the discussion in Ex. 2.6.5. Here we look at an arrow matrix in a slightly different form:
$$\mathbf{M} = \begin{pmatrix} \alpha & \mathbf{b}^\top\\ \mathbf{c} & \mathbf{D} \end{pmatrix}\,,\qquad
\alpha \in \mathbb{R}\,,\ \mathbf{b}, \mathbf{c} \in \mathbb{R}^{n-1}\,,\ \mathbf{D} \in \mathbb{R}^{n-1,n-1} \text{ a regular diagonal matrix (→ Def. 1.1.5).} \qquad (2.7.51)$$
Spy plots of the LU-factors of M (nz = 65 each): considerable fill-in occurs.
Now it comes as a surprise that the arrow matrix A from Ex. 2.6.5, (2.6.6), has sparse LU-factors!
Arrow matrix (2.6.6):
$$\mathbf{A} = \begin{pmatrix} \mathbf{D} & \mathbf{c}\\ \mathbf{b}^\top & \alpha \end{pmatrix}
= \underbrace{\begin{pmatrix} \mathbf{I} & 0\\ \mathbf{b}^\top\mathbf{D}^{-1} & 1 \end{pmatrix}}_{=:\mathbf{L}}
\cdot
\underbrace{\begin{pmatrix} \mathbf{D} & \mathbf{c}\\ 0 & \sigma \end{pmatrix}}_{=:\mathbf{U}}\,,
\qquad \sigma := \alpha - \mathbf{b}^\top\mathbf{D}^{-1}\mathbf{c}\,.$$
Idea: Transform A into the arrow form (2.6.6) (tip of the arrow at the bottom right) by row and column
permutations before performing the LU-decomposition.
Figs. 89, 90: spy plots of the matrix before and after the cyclic permutation (nz = 31 in both cases).
➣ Then LU-factorization (without pivoting) of the resulting matrix requires O(n) operations.
C++11 code 2.7.52: Permuting arrow matrix, see Figs. 89, 90 ➺ GITLAB
MatrixXd A(11, 11); A.setIdentity();
A.col(0).setOnes(); A.row(0) = RowVectorXd::LinSpaced(11, 11, 1);
// Permutation matrix (→ Def. 2.3.46) encoding cyclic permutation
MatrixXd P(11, 11); P.setZero();
P.topRightCorner(10, 10).setIdentity(); P(10, 0) = 1;
mgl::Figure fig1, fig2;
fig1.spy(A); fig1.setFontSize(4);
fig1.save("InvArrowSpy_cpp");
fig2.spy((P * A * P.transpose()).eval()); fig2.setFontSize(4);
fig2.save("ArrowSpy_cpp");
In Ex. 2.7.50 we found that permuting a matrix can make it amenable to Gaussian elimination/LU-decomposition
with much less fill-in (→ Def. 2.7.47). However, recall from Section 2.3.3 that pivoting, which may be essential
for achieving numerical stability, amounts to permuting the rows (or even columns) of the matrix.
Thus, we may face the awkward situation that pivoting tries to reverse the very permutation we applied to
minimize fill-in! The next example shows that this can happen for an arrow matrix.
Consider an arrow matrix of the type of Ex. 2.7.48, with small entries on the diagonal and (larger) entries 2
filling its dense last row: partial pivoting according to (2.3.42) will then trigger row swaps in every
elimination step.
The distributions of non-zero entries of the computed LU-factors (“spy-plots”) are as follows:
Spy plots: arrow matrix A (nz = 31), L factor (nz = 21), U factor (nz = 66, a fully populated upper triangle).
In this case the solution of an LSE with system matrix A ∈ R n,n of the above type by means of Gaussian
elimination with partial pivoting would incur costs of O(n³).
Banded matrices are a special class of sparse matrices (→ Notion 2.7.1) with extra structure: all non-zero
entries are confined to the main diagonal and a fixed number of super-diagonals and sub-diagonals
(in the sketched m×n example, the two bandwidths are 3 and 2).
We now examine a generalization of the concept of a banded matrix that is particularly useful in the context
of Gaussian elimination:
Definition 2.7.56. Matrix envelope
$$\mathbf{A} = \begin{pmatrix}
* & 0 & * & 0 & 0 & 0 & 0\\
0 & * & 0 & 0 & * & 0 & 0\\
* & 0 & * & 0 & 0 & 0 & *\\
0 & 0 & 0 & * & * & 0 & *\\
0 & * & 0 & * & * & * & 0\\
0 & 0 & 0 & 0 & * & * & 0\\
0 & 0 & * & * & 0 & 0 & *
\end{pmatrix}
\qquad
\begin{aligned}
\mathrm{bw}_1^R(\mathbf{A}) &= 0\\
\mathrm{bw}_2^R(\mathbf{A}) &= 0\\
\mathrm{bw}_3^R(\mathbf{A}) &= 2\\
\mathrm{bw}_4^R(\mathbf{A}) &= 0\\
\mathrm{bw}_5^R(\mathbf{A}) &= 3\\
\mathrm{bw}_6^R(\mathbf{A}) &= 1\\
\mathrm{bw}_7^R(\mathbf{A}) &= 4
\end{aligned}$$
env(A) ≙ the red entries; ∗ ≙ non-zero matrix entry a_ij ≠ 0.
Figs. 91, 92: spy plots (nz = 138 and nz = 121) illustrating the envelope of a sparse matrix and of its LU-factors.
Note: the envelope of the arrow matrix from Ex. 2.7.48 is just the set of index pairs of its non-zero entries.
Hence, the following theorem provides another reason for the sparsity of the LU-factors in that example.
Proof. (by induction, version II) Use block-LU-factorization, cf. Rem. 2.3.34 and the proof of Lemma 2.3.19:
$$\begin{pmatrix} \widetilde{\mathbf{A}} & \mathbf{b}\\ \mathbf{c}^\top & \alpha \end{pmatrix}
= \begin{pmatrix} \widetilde{\mathbf{L}} & 0\\ \mathbf{l}^\top & 1 \end{pmatrix}
\begin{pmatrix} \widetilde{\mathbf{U}} & \mathbf{u}\\ 0 & \xi \end{pmatrix}
\;\Rightarrow\; \widetilde{\mathbf{U}}^\top\mathbf{l} = \mathbf{c}\,,\quad \widetilde{\mathbf{L}}\mathbf{u} = \mathbf{b}\,. \qquad (2.7.59)$$
If $m^C_n(\mathbf{A}) = m$, then $b_1 = \cdots = b_{n-m} = 0$ (entries of b), and forward substitution in
$\widetilde{\mathbf{L}}\mathbf{u} = \mathbf{b}$ from (2.7.59) yields $u_1 = \cdots = u_{n-m} = 0$.
Thm. 2.7.58 immediately suggests a policy for saving computational effort when solving linear systems
whose system matrix A ∈ K n,n is sparse due to a small envelope, ♯env(A) ≪ n²:
Policy: Confine elimination to the envelope!
Envelope-aware LU-factorization:
      m(i) = i - j;
      break;
    }
  return m;
}
Asymptotic complexity of envelope aware forward substitution, cf. § 2.3.30, for Lx = y, L ∈ K n,n regular
lower triangular matrix is
O(# env(L)) !
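A minimal sketch of an envelope-aware forward substitution, assuming that the row bandwidths m_i(L) are available as an integer vector (as computed by the routine above); dense storage and 0-based indexing are used only for clarity of the sketch.

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd; using Eigen::VectorXi;

// Forward substitution L*x = y restricted to the envelope of L:
// in row i only columns j >= i - mr(i) can hold non-zero entries,
// so the inner loop visits O(#env(L)) entries in total.
VectorXd envForwardSubst(const MatrixXd& L, const VectorXi& mr, const VectorXd& y) {
  const int n = L.rows();
  VectorXd x(n);
  for (int i = 0; i < n; ++i) {
    double s = y(i);
    for (int j = i - mr(i); j < i; ++j) s -= L(i, j) * x(j);
    x(i) = s / L(i, i);
  }
  return x;
}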
Assumption: A ∈ K n,n is structurally symmetric.
The asymptotic complexity of envelope-aware LU-factorization is O(n · ♯env(A)).
Since by Thm. 2.7.58 fill-in is confined to the envelope, we need to store only the matrix entries a_ij,
(i, j) ∈ env(A), when computing the (in situ) LU-factorization of a structurally symmetric A ∈ K n,n .
Two arrays hold the envelope of A: scalar_t *val of size
$$P := n + \sum_{i=1}^{n} m_i(\mathbf{A})\,, \qquad (2.7.67)$$
and size_t *dptr of row pointers; row i of the envelope (ending with the diagonal entry a_ii) is stored
contiguously in val. For the 7×7 matrix A from above:
val:  a11 a22 a31 a32 a33 a44 a52 a53 a54 a55 a65 a66 a73 a74 a75 a76 a77   (indices 0 ... 16)
dptr: 0 1 2 5 6 10 12 17
Minimizing bandwidth/envelope:
Goal: Minimize mi (A),A = (aij ) ∈ R N,N , by permuting rows/columns of A
Recall: cyclic permutation of rows/columns of arrow matrix applied in Ex. 2.7.50. This can be viewed as
a drastic shrinking of the envelope:
Spy plots: envelope of the arrow matrix before and after the cyclic permutation (nz = 31 in both cases);
the permutation drastically shrinks the envelope.
Very desirable: a priori criteria, when Gaussian elimination/LU-factorization remains stable even
without pivoting. This can help avoid the extra work for partial pivoting and makes it possible to
exploit structure without worrying about stability.
This section will introduce classes of matrices that allow Gaussian elimination without pivoting. Fortu-
nately, linear systems of equations featuring system matrices from these classes are very common in
applications.
Example 2.8.1 (Diagonally dominant matrices from nodal analysis → Ex. 2.1.3)
Consider the nodal analysis (→ Ex. 2.1.3) of a linear resistive circuit with nodes ➀–➅ (node ➀ at potential U, node ➅ grounded):
$$\begin{aligned}
➁:&\quad R_{12}^{-1}(U_2-U_1) + R_{23}^{-1}(U_2-U_3) + R_{24}^{-1}(U_2-U_4) + R_{25}^{-1}(U_2-U_5) = 0,\\
➂:&\quad R_{23}^{-1}(U_3-U_2) + R_{35}^{-1}(U_3-U_5) = 0,\\
➃:&\quad R_{14}^{-1}(U_4-U_1) + R_{24}^{-1}(U_4-U_2) + R_{45}^{-1}(U_4-U_5) = 0,\\
➄:&\quad R_{25}^{-1}(U_5-U_2) + R_{35}^{-1}(U_5-U_3) + R_{45}^{-1}(U_5-U_4) + R_{56}^{-1}(U_5-U_6) = 0,\\
&\quad U_1 = U\,,\quad U_6 = 0\,.
\end{aligned}$$
$$\begin{pmatrix}
\frac{1}{R_{12}}+\frac{1}{R_{23}}+\frac{1}{R_{24}}+\frac{1}{R_{25}} & -\frac{1}{R_{23}} & -\frac{1}{R_{24}} & -\frac{1}{R_{25}}\\
-\frac{1}{R_{23}} & \frac{1}{R_{23}}+\frac{1}{R_{35}} & 0 & -\frac{1}{R_{35}}\\
-\frac{1}{R_{24}} & 0 & \frac{1}{R_{14}}+\frac{1}{R_{24}}+\frac{1}{R_{45}} & -\frac{1}{R_{45}}\\
-\frac{1}{R_{25}} & -\frac{1}{R_{35}} & -\frac{1}{R_{45}} & \frac{1}{R_{25}}+\frac{1}{R_{35}}+\frac{1}{R_{45}}+\frac{1}{R_{56}}
\end{pmatrix}
\begin{pmatrix} U_2\\ U_3\\ U_4\\ U_5 \end{pmatrix}
=
\begin{pmatrix} \frac{U}{R_{12}}\\ 0\\ \frac{U}{R_{14}}\\ 0 \end{pmatrix}$$
• $\sum_{j=1}^{n} a_{kj} \ge 0$, $k = 1, \ldots, n$, (2.8.3)
• A is regular. (2.8.4)
All these properties are obvious except for the fact that A is regular.
Proof of (2.8.4): By Thm. 2.2.4 it suffices to show that the nullspace of A is trivial: Ax = 0 ⇒ x = 0.
Pick x ∈ R n , Ax = 0, and i ∈ {1, . . . , n} so that
| xi | = max{| x j |, j = 1, . . . , n} .
Intermediate goal: show that all entries of x are the same
$$\mathbf{A}\mathbf{x} = \mathbf{0} \;\Rightarrow\; x_i = -\sum_{j\ne i}\frac{a_{ij}}{a_{ii}}\,x_j \;\Rightarrow\; |x_i| \le \sum_{j\ne i}\frac{|a_{ij}|}{|a_{ii}|}\,|x_j|\,. \qquad (2.8.5)$$
Hence, (2.8.6) combined with the estimate (2.8.5), which tells us that the maximum is smaller than or equal to
a mean, implies |x_j| = |x_i| for all j = 1, ..., n. Finally, the sign condition a_kj ≤ 0 for k ≠ j enforces
the same sign of all x_i. Thus we conclude, w.l.o.g., x_1 = x_2 = ⋯ = x_n. As
$$\exists\, i \in \{1,\ldots,n\}:\quad \sum_{j=1}^{n} a_{ij} > 0 \quad\text{(strict inequality)}\,,$$
the relation Ax = 0 then forces x_1 = ⋯ = x_n = 0, which proves the regularity of A.
A regular, diagonally dominant with positive diagonal
⟹ A has an LU-factorization ⇔ Gaussian elimination is feasible without pivoting(∗)
(∗): partial pivoting & diagonally dominant matrices ➣ triggers no row permutations !
$$\bigl|a^{(1)}_{ii}\bigr| - \sum_{\substack{j=2\\ j\ne i}}^{n}\bigl|a^{(1)}_{ij}\bigr|
= \Bigl|a_{ii} - \frac{a_{i1}}{a_{11}}a_{1i}\Bigr| - \sum_{\substack{j=2\\ j\ne i}}^{n}\Bigl|a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j}\Bigr|$$
$$\ge\; a_{ii} - \frac{|a_{i1}|\,|a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\ne i}}^{n}|a_{ij}| - \frac{|a_{i1}|}{a_{11}}\sum_{\substack{j=2\\ j\ne i}}^{n}|a_{1j}|$$
$$\ge\; a_{ii} - \frac{|a_{i1}|\,|a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\ne i}}^{n}|a_{ij}| - |a_{i1}|\,\frac{a_{11}-|a_{1i}|}{a_{11}}
\;\ge\; a_{ii} - \sum_{\substack{j=1\\ j\ne i}}^{n}|a_{ij}| \;\ge\; 0\,.$$
A regular, diagonally dominant ⇒ partial pivoting according to (2.3.42) selects i-th row in i-th step.
The class of symmetric positive definite (s.p.d.) matrices has been defined in Def. 1.1.8. They permit
stable Gaussian elimination without pivoting:
Equivalent to the assertion of the theorem: Gaussian elimination is feasible without pivoting
In fact, this theorem is a corollary of Lemma 2.3.19, because all principal minors of an s.p.d. matrix are
s.p.d. themselves.
$$\mathbf{A} = \begin{pmatrix} a_{11} & \mathbf{b}^\top\\ \mathbf{b} & \widetilde{\mathbf{A}} \end{pmatrix}
\;\xrightarrow[\text{Gaussian elimination}]{\text{1. step}}\;
\begin{pmatrix} a_{11} & \mathbf{b}^\top\\ \mathbf{0} & \widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}} \end{pmatrix}.$$
➣ to show: $\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}}$ s.p.d. (→ step of the induction argument)
✦ Evident: symmetry of $\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}} \in \mathbb{R}^{n-1,n-1}$
✦ As A is s.p.d. (→ Def. 1.1.8), for every $\mathbf{y} \in \mathbb{R}^{n-1}\setminus\{0\}$
$$0 < \begin{pmatrix} -\frac{\mathbf{b}^\top\mathbf{y}}{a_{11}}\\ \mathbf{y} \end{pmatrix}^{\!\top}
\begin{pmatrix} a_{11} & \mathbf{b}^\top\\ \mathbf{b} & \widetilde{\mathbf{A}} \end{pmatrix}
\begin{pmatrix} -\frac{\mathbf{b}^\top\mathbf{y}}{a_{11}}\\ \mathbf{y} \end{pmatrix}
= \mathbf{y}^\top\Bigl(\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}}\Bigr)\mathbf{y}\,.$$
➣ $\widetilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^\top}{a_{11}}$ is positive definite. ✷
The proof can also be based on the identities
$$\begin{pmatrix} (\mathbf{A})_{1:n-1,1:n-1} & (\mathbf{A})_{1:n-1,n}\\ (\mathbf{A})_{n,1:n-1} & (\mathbf{A})_{n,n} \end{pmatrix}
= \begin{pmatrix} \mathbf{L}_1 & \mathbf{0}\\ \mathbf{l}^\top & 1 \end{pmatrix}
\begin{pmatrix} \mathbf{U}_1 & \mathbf{u}\\ \mathbf{0} & \gamma \end{pmatrix}\,, \qquad (2.7.62)$$
$$\Rightarrow\quad (\mathbf{A})_{1:n-1,1:n-1} = \mathbf{L}_1\mathbf{U}_1\,,\quad \mathbf{L}_1\mathbf{u} = (\mathbf{A})_{1:n-1,n}\,,\quad \mathbf{U}_1^\top\mathbf{l} = (\mathbf{A})^\top_{n,1:n-1}\,,\quad \mathbf{l}^\top\mathbf{u} + \gamma = (\mathbf{A})_{n,n}\,,$$
noticing that the principal minor $(\mathbf{A})_{1:n-1,1:n-1}$ is also s.p.d. This allows a simple induction argument.
The next result gives a useful criterion for telling whether a given symmetric/Hermitian matrix is s.p.d.:
Proof. For A = A^H diagonally dominant, use the inequality between arithmetic and geometric mean (AGM),
ab ≤ ½(a² + b²):
$$\mathbf{x}^H\mathbf{A}\mathbf{x} = \sum_{i=1}^n a_{ii}|x_i|^2 + \sum_{i\ne j} a_{ij}\bar x_i x_j
\;\ge\; \sum_{i=1}^n a_{ii}|x_i|^2 - \sum_{i\ne j}|a_{ij}|\,|x_i|\,|x_j|$$
$$\overset{\mathrm{AGM}}{\ge} \sum_{i=1}^n a_{ii}|x_i|^2 - \tfrac12\sum_{i\ne j}|a_{ij}|\bigl(|x_i|^2 + |x_j|^2\bigr)$$
$$\ge\; \tfrac12\sum_{i=1}^n\Bigl\{a_{ii}|x_i|^2 - \sum_{j\ne i}|a_{ij}|\,|x_i|^2\Bigr\} + \tfrac12\sum_{j=1}^n\Bigl\{a_{jj}|x_j|^2 - \sum_{i\ne j}|a_{ij}|\,|x_j|^2\Bigr\}$$
$$\ge\; \sum_{i=1}^n |x_i|^2\Bigl(a_{ii} - \sum_{j\ne i}|a_{ij}|\Bigr) \;\ge\; 0\,.$$
Lemma 2.8.14. Cholesky decomposition for s.p.d. matrices → [?, Sect. 3.4], [?, Sect. II.5],
[?, Thm. 3.6]
For any s.p.d. A ∈ K n,n , n ∈ N, there is a unique upper triangular matrix R ∈ K n,n with rii > 0,
i = 1, . . . , n, such that A = RH R (Cholesky decomposition).
Proof: Consider the factorization $\mathbf{A} = \mathbf{L}\mathbf{D}\widetilde{\mathbf{U}}$, where D ≙ the diagonal of U
and $\widetilde{\mathbf{U}}$ ≙ a normalized upper triangular matrix (→ Def. 1.1.5). Then
$$\mathbf{A} = \mathbf{A}^\top \;\Rightarrow\; \mathbf{U} = \mathbf{D}\mathbf{L}^\top \;\Rightarrow\; \mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}^\top\,,$$
$$\mathbf{x}^\top\mathbf{A}\mathbf{x} > 0\ \ \forall\,\mathbf{x}\ne 0 \;\Rightarrow\; \mathbf{y}^\top\mathbf{D}\mathbf{y} > 0\ \ \forall\,\mathbf{y}\ne 0\,.$$
➤ D has a positive diagonal ➨ $\mathbf{R} = \sqrt{\mathbf{D}}\,\mathbf{L}^\top$. ✷
Formulas analogous to (2.3.22):
$$\mathbf{R}^H\mathbf{R} = \mathbf{A} \;\Rightarrow\; a_{ik} = \sum_{j=1}^{\min\{i,k\}} \bar r_{ji}\, r_{jk} =
\begin{cases}
\displaystyle \sum_{j=1}^{i-1} \bar r_{ji}\, r_{jk} + r_{ii}\, r_{ik}\,, & \text{if } i < k\,,\\[2mm]
\displaystyle \sum_{j=1}^{i-1} |r_{ji}|^2 + r_{ii}^2\,, & \text{if } i = k\,.
\end{cases} \qquad (2.8.15)$$
Computational costs (number of elementary arithmetic operations) of the Cholesky decomposition: $\tfrac16 n^3 + O(n^2)$
(➣ “half the costs” of LU-factorization, cf. the code in § 2.3.21, but this does not mean “twice as fast” in a
concrete implementation, because memory access patterns will have a crucial impact, see Rem. 1.4.8.)
Gains of efficiency hardly justify the use of Cholesky decomposition in modern numerical algorithms.
Savings in memory compared to standard LU-factorization (only one factor R has to be stored) offer a
stronger reason to prefer the Cholesky decomposition.
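The entry-wise relations (2.8.15) translate directly into an algorithm that computes the Cholesky factor R row by row. The following is only a sketch in the spirit of the lecture codes (real s.p.d. A assumed; the function name is not from the lecture):

#include <cmath>
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Cholesky factorization A = R^T R of a (real) s.p.d. matrix A,
// obtained by resolving the entry-wise relations (2.8.15); no pivoting.
MatrixXd choleskyfac(const MatrixXd &A) {
  const int n = A.rows();
  MatrixXd R = MatrixXd::Zero(n, n);
  for (int i = 0; i < n; ++i) {
    // diagonal entry: r_ii^2 = a_ii - sum_{j<i} |r_ji|^2
    R(i, i) = std::sqrt(A(i, i) - R.col(i).head(i).squaredNorm());
    // row i of R: r_ik = (a_ik - sum_{j<i} r_ji r_jk) / r_ii, k > i
    for (int k = i + 1; k < n; ++k)
      R(i, k) = (A(i, k) - R.col(i).head(i).dot(R.col(k).head(i))) / R(i, i);
  }
  return R;
}

In practice one relies on E IGEN's llt() method, which computes the same factorization with optimized memory access.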
The computation of Cholesky-factorization by means of the algorithm of Code 2.8.16 is numerically stable
(→ Def. 1.5.85)!
Reason: recall Thm. 2.4.4: Numerical instability of Gaussian elimination (with any kind of pivoting) mani-
fests itself in massive growth of the entries of intermediate elimination matrices A(k) .
Use the relationship between LU-factorization and Cholesky decomposition, which tells us that we only
have to monitor the growth of entries of intermediate upper triangular “Cholesky factorization matrices”
$A = (R^{(k)})^H R^{(k)}$.
Consider: Euclidean vector norm/matrix norm (→ Def. 1.5.76) ‖·‖₂
Computation of the Cholesky decomposition largely agrees with the computation of LU-factorization (with-
out pivoting). Using the latter together with forward and backward substitution (→ Sect. 2.3.2) to solve a
linear system of equations is algebraically and numerically equivalent to using Gaussian elimination with-
out pivoting.
From these equivalences we conclude:
solving an LSE via the Cholesky decomposition (plus forward and backward substitution) is numerically stable (→ Def. 1.5.85)
⇕
Gaussian elimination without pivoting is a numerically stable way to solve LSEs with s.p.d. system matrix.
Learning Outcomes
• A clear understanding of the algorithm of Gaussian elimination with and without pivoting (prerequisite
knowledge from linear algebra)
• Insight into the relationship between Gaussian elimination and LU-decomposition and the algorith-
mic relevance of LU-decomposition
Chapter 3
Direct Methods for Linear Least Squares Problems
In this chapter we study numerical methods for overdetermined linear systems of equations, that is, linear
systems with a “tall” rectangular system matrix
find x ∈ R^n :  “Ax = b” ,   b ∈ R^m ,  A ∈ R^{m,n} ,  m ≥ n .   (3.0.1)
In contrast to Chapter 1 we will mainly restrict ourselves to real linear systems in this chapter.
Note that the quotation marks in (3.0.1) indicate that this is not a well-defined problem in the sense of
§ 1.5.67; Ax = b does not define a mapping (A, b) ↦ x, because
Contents
3.0.1 Overdetermined Linear Systems of Equations: Examples . . . . . . . . . . . 215
3.1 Least Squares Solution Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.1.1 Least Squares Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.1.2 Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.1.3 Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . . . . . . . . . 225
3.1.4 Sensitivity of Least Squares Problem . . . . . . . . . . . . . . . . . . . . . . . 227
3.2 Normal Equation Methods [?, Sect. 4.2], [?, Ch. 11] . . . . . . . . . . . . . . . . . . 228
3.3 Orthogonal Transformation Methods [?, Sect. 4.4.2] . . . . . . . . . . . . . . . . . 232
3.3.1 Transformation Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.3.2 Orthogonal/Unitary Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
3.3.3 QR-Decomposition [?, Sect. 13], [?, Sect. 7.3] . . . . . . . . . . . . . . . . . . 233
3.3.3.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
You may think that overdetermined linear systems of equations are exotic, but this is not true. Rather they
are very common in mathematical models.
From first principles it is known that two physical quantities x ∈ R and y ∈ R (e.g., pressure and density
of an ideal gas) are related by a linear relationship y = αx + β with unknown parameters α, β ∈ R; plugging
in measured pairs (xᵢ, yᵢ), i = 1, . . . , m, yields the overdetermined linear system (3.0.4).
In practice, inevitable (“random”) measurement errors will thwart the solvability of (3.0.4), and for m > 2
the probability that a solution [α, β]⊤ exists is zero, see Rem. 3.1.2.
Known: without measurement errors the data would satisfy an affine linear relationship y = a⊤x + β, for
some a ∈ R^n, β ∈ R.
Plugging in the measured quantities gives yi = a⊤ xi + β, i = 1, . . . , m, a linear system of equations of
the form
$$\begin{pmatrix} \mathbf{x}_1^\top & 1\\ \vdots & \vdots\\ \mathbf{x}_m^\top & 1\end{pmatrix}
\begin{pmatrix} \mathbf{a}\\ \beta\end{pmatrix}
= \begin{pmatrix} y_1\\ \vdots\\ y_m\end{pmatrix}, \qquad (3.0.6)$$
overdetermined in case m > n + 1.
Gauss had already developed the foundations of his method in 1795, at the age of 18. It was based on
an idea of Pierre-Simon Laplace to sum up the absolute values of the errors so that the errors add up to
zero. Gauss instead took the squares of the errors and was able to drop this artificial additional requirement
on the errors.
Gauss then used the method intensively in his survey of the Kingdom of Hanover by triangulation. The
two-part work appeared in 1821 and 1823, followed by a supplement in 1826, as the Theoria combinationis
observationum erroribus minimis obnoxiae (theory of the combination of observations subject to the smallest
errors), in which Gauss was able to justify why his method was so successful compared with the others: the
method of least squares is optimal in a broad sense, that is, better than other methods.
We now extend Ex. 3.0.7 to planar triangulations, for which measured values for all internal angles are
available. We obtain an overdetermined system of equations by combining the following linear relations:
If the planar triangulation has N0 interior vertices and M cells, then we end up with 4M + N0 equations
for the 3M unknown angles.
Example 3.0.10 ((Relative) point locations from distances [?, Sect. 6.1])
Consider n points located on the real axis at unknown locations xi ∈ R , i = 1, . . . , n. At least we know
that xi < xi +1 , i = 1, . . . , n − 1.
Note that we can never expect a unique solution for x ∈ R^n, because adding a multiple of [1, 1, . . . , 1]⊤
to any solution will again yield a solution: A has a non-trivial kernel, N(A) = span{[1, 1, . . . , 1]⊤}.
Non-uniqueness can be cured by setting x1 := 0, thus removing one component of x.
If the measurements were perfect, we could then find x2 , . . . , xn from di −1,i , i = 2, . . . , n by solving a
standard (square) linear system of equations. However, as in Ex. 3.0.7, using much more information
through the overdetermined system (3.0.11) helps curb measurement errors.
Recall from linear algebra that Ax = b has a solution, if and only if the right hand side vector b lies in the
image (range space, → Def. 2.2.2) of the matrix A:
∃x ∈ R n : Ax = b ⇔ b ∈ R(A) . (3.1.1)
✎ Notation for important subspaces associated with a matrix A ∈ K m,n (→ Def. 2.2.2)
Remark 3.1.2 (Consistent right hand side vectors are highly improbable)
If R(A) 6= R m , then “almost all” perturbations of b (e.g., due to measurement errors) will destroy b ∈
R(A), because R(A) is a “set of measure zero” in R m .
For given A ∈ K^{m,n}, b ∈ K^m the vector x ∈ R^n is a least squares solution of the linear system of
equations Ax = b, if
$$\mathbf{x} \in \operatorname*{argmin}_{\mathbf{y}\in\mathbb{K}^n} \|A\mathbf{y}-\mathbf{b}\|_2
\quad\Leftrightarrow\quad
\|A\mathbf{x}-\mathbf{b}\|_2 = \inf_{\mathbf{y}\in\mathbb{K}^n}\|A\mathbf{y}-\mathbf{b}\|_2\;.$$
➨ A least squares solution is any vector x that minimizes the Euclidean norm of the residual r =
b − Ax, see Def. 2.4.1.
We write lsq(A, b) for the set of least squares solutions of the linear system of equations Ax = b,
A ∈ R m,n , b ∈ R m :
We consider the problem of parameter estimation for a linear model from Ex. 3.0.5:
y = a⊤ x + β , (3.1.6)
[?, Sect. 4.5]: In statistics we learn that the least squares estimate provides a maximum likelihood estimate,
if the measurement errors are uniformly and independently normally distributed.
Appealing to the geometric intuition gleaned from Fig. 96 we infer the orthogonality of b − Ax, x a least
squares solution of the overdetermined linear systems of equations Ax = b, to all columns of A:
Surprisingly, we have found a square linear system of equations satisfied by the least squares solution.
The next theorem gives the formal statement of this discovery. It also completely characterizes lsq(A, b)
and reveals a way to compute this set.
The vector x ∈ R n is a least squares solution (→ Def. 3.1.3) of the linear system of equations
Ax = b, A ∈ R m,n , b ∈ R m , if and only if it solves the normal equations
A⊤ Ax = A⊤ b . (3.1.11)
Note that the normal equations (3.1.11), A⊤A x = A⊤b, are an n × n square linear system of equations
with a symmetric positive semi-definite coefficient matrix A⊤A ∈ R^{n,n}.
➊: We first show that a least squares solution satisfies the normal equations. Let x ∈ R^n be a least
squares solution according to Def. 3.1.3. Pick an arbitrary d ∈ R^n \ {0} and define the function ϕ_d(τ) := ‖A(x + τd) − b‖₂², τ ∈ R.
Moreover, since every x ∈ lsq(A, b) is a minimizer of y ↦ ‖Ay − b‖₂², we conclude that τ ↦ ϕ_d(τ)
has a global minimum at τ = 0. Necessarily,
$$\frac{d\varphi_d}{d\tau}\Big|_{\tau=0} = 2\,\mathbf{d}^\top A^\top(A\mathbf{x}-\mathbf{b}) = 0\;.$$
Since this holds for any vector d 6= 0, we conclude (set d equal to all the Euclidean unit vectors in R n )
A⊤ (Ax − b) = 0 ,
➋: Let x be a solution of the normal equations. Then we find by tedious but straightforward computations
that ‖Ay − b‖₂ ≥ ‖Ax − b‖₂ for every y ∈ R^n. Since this holds for any y ∈ R^n, x must be a global minimizer of y ↦ ‖Ay − b‖₂!
✷
Example 3.1.13 (Normal equations for some examples from Section 3.0.1)
Given A and b it takes only elementary linear algebra operations to form the normal equations
A⊤ Ax = A⊤ b . (3.1.11)
• For § 5.7.7, A ∈ R^{m,2} given in (3.0.4) we obtain the normal equations linear system
$$\begin{pmatrix} x_1 & x_2 & \cdots & x_m\\ 1 & 1 & \cdots & 1\end{pmatrix}
\begin{pmatrix} x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1\end{pmatrix}
\begin{pmatrix} \alpha\\ \beta\end{pmatrix}
= \begin{pmatrix} \|\mathbf{x}\|_2^2 & \mathbf{1}^\top\mathbf{x}\\ \mathbf{1}^\top\mathbf{x} & m\end{pmatrix}
\begin{pmatrix} \alpha\\ \beta\end{pmatrix}
= \begin{pmatrix} \mathbf{x}^\top\mathbf{y}\\ \mathbf{1}^\top\mathbf{y}\end{pmatrix},$$
with 1 = [1, . . . , 1]⊤.
• In the case of Ex. 3.0.5 and the overdetermined m × (n + 1) linear system (3.0.6), the normal
equations read
$$\begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m\\ 1 & 1 & \cdots & 1\end{pmatrix}
\begin{pmatrix} \mathbf{x}_1^\top & 1\\ \vdots & \vdots\\ \mathbf{x}_m^\top & 1\end{pmatrix}
\begin{pmatrix} \mathbf{a}\\ \beta\end{pmatrix}
= \begin{pmatrix} XX^\top & X\mathbf{1}\\ \mathbf{1}^\top X^\top & m\end{pmatrix}
\begin{pmatrix} \mathbf{a}\\ \beta\end{pmatrix}
= \begin{pmatrix} X\mathbf{y}\\ \mathbf{1}^\top\mathbf{y}\end{pmatrix},
\qquad X := [\mathbf{x}_1, \dots, \mathbf{x}_m] \in \mathbb{R}^{n,m}.$$
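For illustration, the first (straight-line fit) system above can be assembled and solved directly; a minimal sketch with E IGEN (names are ad hoc, not from the lecture):

#include <Eigen/Dense>
using Eigen::VectorXd; using Eigen::Vector2d; using Eigen::Matrix2d;

// Fit y_i ≈ alpha*x_i + beta by solving the 2x2 normal equations above
Vector2d linefit(const VectorXd &x, const VectorXd &y) {
  Matrix2d N;                             // N = A^T A for A = [x 1]
  N << x.squaredNorm(), x.sum(),
       x.sum(),         double(x.size());
  const Vector2d rhs(x.dot(y), y.sum());  // A^T b
  return N.llt().solve(rhs);              // returns [alpha, beta]
}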
Thm. 3.1.10 together with Thm. 3.1.9 already confirms that the normal equations will always have a so-
lution and that lsq(A, b) is a subspace of R n parallel to N (A⊤ A). The next theorem gives even more
detailed information.
N(A⊤A) = N(A) ,   (3.1.19)
R(A⊤A) = R(A⊤) .   (3.1.20)
V⊥ := { x ∈ K^k : xᴴy = 0 ∀ y ∈ V } .
Az = 0 ⇒ A⊤ Az = 0 ⇔ z ∈ N (A⊤ A) .
If m ≥ n and N(A) = {0}, then the linear system of equations Ax = b, A ∈ R^{m,n}, b ∈ R^m, has
a unique least squares solution (→ Def. 3.1.3)
x = (A⊤A)⁻¹A⊤b ,   (3.1.23)
Hence the assumption N (A) = {0} of Cor. 3.1.22 is also called a full-rank condition, because the rank
of A is maximal.
that is, the manifest condition, that the points do not lie on a vertical line.
• In the case of Ex. 3.0.5 and the overdetermined m × (n + 1) linear system (3.0.6), we find
$$\operatorname{rank}\begin{pmatrix} \mathbf{x}_1^\top & 1\\ \vdots & \vdots\\ \mathbf{x}_m^\top & 1\end{pmatrix} = n + 1
\quad\Leftrightarrow\quad
\begin{minipage}{0.5\textwidth}there is a subset of $n+1$ points $\mathbf{x}_{i_1},\dots,\mathbf{x}_{i_{n+1}}$ such that $\{0,\mathbf{x}_{i_1},\dots,\mathbf{x}_{i_{n+1}}\}$ spans a non-degenerate $(n+1)$-simplex.\end{minipage}$$
If the system matrix A ∈ R^{m,n}, m ≥ n, of an overdetermined linear system arising from a mathematical
model fails to have full rank, this hints at inadequate modelling:
In this case parameters are redundant, because different sets of parameters yield the same output quan-
tities: the parameters are not “observable”.
J : R^n → R ,  J(y) := ‖Ay − b‖₂² .   (3.1.15)
From (3.1.15) and its explicit form as a polynomial in the vector components y_j we find the Hessian (→ Def. 8.4.11, [?,
Satz 7.5.3]) of J, which is the constant matrix 2A⊤A:
Thm. 3.1.18 implies that A⊤ A is positive definite (→ Def. 1.1.8) if and only if N (A) = {0}.
Therefore, by [?, Satz 7.5.3], under the full-rank condition J has a positive definite Hessian everywhere,
and a minimum at every stationary point of its gradient, that is, at every solution of the normal equations.
Another result from analysis tells us that real-valued C1 -functions on R n whose Hessian has positive
eigenvalues uniformly bounded away from zero are strictly convex. Hence, if A has full rank, the least
squares functional J from (3.1.15) is a strictly convex function.
Fig. 97
Now we are in a position to state precisely what we mean by solving an overdetermined (m ≥ n!) linear
system of equations Ax = b, A ∈ R m,n , b ∈ R m , provided that A has full (maximal) rank, cf. (3.1.25).
✎ A sloppy notation for the minimization problem (3.1.31) is ‖Ax − b‖₂ → min
As we have seen in Ex. 3.0.10, there can be many least squares solutions of Ax = b, in case N (A) 6=
{0}. We can impose another condition to single out a unique element of lsq(A, b):
➨ The generalized solution is the least squares solution with minimal norm.
Elementary geometry teaches that the minimal norm element of an affine subspace L (a plane) in Eu-
clidean space is the orthogonal projection of 0 onto L.
Visualization (Fig. 98): the minimal norm element x† of the affine space lsq(A, b) ⊂ R^n belongs to the
subspace lsq(A, b)⊥ of R^n that is orthogonal to lsq(A, b); it is the orthogonal projection of 0 onto lsq(A, b).
Since the space of least squares solutions of Ax = b is an affine subspace parallel to N (A)
lsq(A, b) = x0 + N (A) , x0 solves normal equations, (3.1.35)
the generalized solution x† of Ax = b is contained in N(A)⊥. Therefore, given a basis {v₁, . . . , v_k} ⊂
R^n of N(A)⊥, k := dim N(A)⊥, we can find y ∈ R^k such that x† = Vy, V := [v₁, . . . , v_k] ∈ R^{n,k}.
Plugging this representation into the normal equations and multiplying with V⊤ yields the reduced normal
equations
$$V^\top A^\top A V\, \mathbf{y} = V^\top A^\top \mathbf{b}\,. \qquad (3.1.36)$$
The very construction of V ensures N (AV) = {0} so that, by Thm. 3.1.18 the k × k linear system of
equations (3.1.36) has a unique solution. The next theorem summarizes our insights:
✎ notation: A† ∈ R^{n,m} =̂ pseudoinverse of A ∈ R^{m,n}
Note that the Moore-Penrose pseudoinverse does not depend on the choice of V.
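If a matrix V with linearly independent columns spanning N(A)⊥ is available, the reduced normal equations (3.1.36) yield the generalized solution directly. A minimal sketch (the computation of V itself, e.g. from an SVD of A, is not shown; names are ad hoc):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Generalized (minimal norm) least squares solution via the reduced
// normal equations (3.1.36): x = V*y with (AV)^T (AV) y = (AV)^T b
VectorXd lsqGeneralized(const MatrixXd &A, const MatrixXd &V,
                        const VectorXd &b) {
  const MatrixXd AV = A * V;   // N(AV) = {0} by construction of V
  const VectorXd y =
      (AV.transpose() * AV).llt().solve(AV.transpose() * b);
  return V * y;                // x lies in R(V) = N(A)^⊥
}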
Armed with the concept of generalized solution and the knowledge about its existence and uniqueness we
can state the most general linear least squares problem:
given: A ∈ R m,n , m, n ∈ N, b ∈ R m ,
find: x ∈ R n such that (3.1.38)
(i) k Ax − bk2 = inf{k Ay − bk2 : y ∈ R n },
(ii) k xk2 is minimal under the condition (i).
Recall Section 2.2.2, where we discussed the sensitivity of solutions of square linear systems, that is,
the impact of perturbations in the problem data on the result. Now we study how (small) changes in A
and b affect the unique (→ Cor. 3.1.22) least squares solution x of Ax = b in the case of A with full rank (⇔
N(A) = {0}).
Note: If the matrix A ∈ R m,n , m ≥ n, has full rank, then there is a c > 0 such that A + ∆A still has
full rank for all ∆A ∈ R m,n with k∆A k2 < c. Hence, “sufficiently small” perturbations will not destroy the
full-rank property of A. This is a generalization of the Perturbation Lemma 2.2.11.
For square linear systems the condition number of the system matrix (→ Def. 2.2.12) provided the key
gauge of sensitivity. To express the sensitivity of linear least squares problems we also generalize this
concept:
For a square regular matrix this agrees with its condition number according to Def. 2.2.12, which follows
from Cor. 1.5.82.
For m ≥ n, A ∈ R^{m,n}, rank(A) = n, let x ∈ R^n be the solution of the least squares problem
‖Ax − b‖₂ → min and x̂ the solution of the perturbed least squares problem ‖(A + ∆A)x̂ − b‖₂ → min. Then
$$\frac{\|\mathbf{x}-\hat{\mathbf{x}}\|_2}{\|\mathbf{x}\|_2}
\;\leq\; \Big(2\,\mathrm{cond}_2(A) + \mathrm{cond}_2^2(A)\,\frac{\|\mathbf{r}\|_2}{\|A\|_2\,\|\mathbf{x}\|_2}\Big)\,\frac{\|\Delta A\|_2}{\|A\|_2}\,,$$
where r = b − Ax is the residual. This means:
if ‖r‖₂ ≪ 1  ➤  condition of the least squares problem ≈ cond₂(A),
if ‖r‖₂ “large”  ➤  condition of the least squares problem ≈ cond₂²(A).
For instance, in a linear parameter estimation problem (→ Ex. 3.0.5) a small residual will be the conse-
quence of small measurement errors.
3.2 Normal Equation Methods [?, Sect. 4.2], [?, Ch. 11]
In fact, Cor. 3.1.22 suggests a simple algorithm for solving linear least squares problems of the form
(3.1.31) satisfying the full (maximal) rank condition rank(A) = n: it boils down to solving the normal
equations (3.1.11):
C++11 code 3.2.1: Solving a linear least squares problem via normal equations
//! Solving the overdetermined linear system of equations
//! Ax = b by solving the normal equations (3.1.11)
//! The least squares solution is returned by value
VectorXd normeqsolve(const MatrixXd &A, const VectorXd &b) {
  if (b.size() != A.rows()) throw runtime_error("Dimension mismatch");
  // Normal equations solved via Cholesky factorization
  VectorXd x = (A.transpose() * A).llt().solve(A.transpose() * b);
  return x;
}
By Thm. 2.8.11, for the s.p.d. matrix A⊤ A Gaussian elimination remains stable even without pivoting. This
is taken into account by requesting the Cholesky decomposition of A⊤ A by calling the method llt().
The problem size parameters for the linear least squares problem (3.1.31) are the matrix dimensions
m, n ∈ N, where n small & fixed, n ≪ m, is common.
In Section 1.4.2 and Thm. 2.5.2 we discussed the asymptotic complexity of the operations involved in steps
➊–➌ of the normal equation method:
step ➊ (form A⊤A): cost O(mn²),
step ➋ (form A⊤b): cost O(nm),
step ➌ (solve the n × n normal equations): cost O(n³),
giving a total cost of O(n²m + n³) for m, n → ∞.
Note that for small fixed n, n ≪ m, m → ∞ the computational effort scales linearly with m.
$$A = \begin{pmatrix} 1 & 1\\ \delta & 0\\ 0 & \delta\end{pmatrix}
\;\Rightarrow\;
A^\top A = \begin{pmatrix} 1+\delta^2 & 1\\ 1 & 1+\delta^2\end{pmatrix}.$$
Exp. 1.5.35: If δ ≈ √EPS, then 1 + δ² ≐ 1 in M. Hence the computed A⊤A will fail to
be regular, though rank(A) = 2 and cond₂(A) ≈ √2/δ ≈ √(2/EPS) is still moderate.
C++-code 3.2.5:
int main() {
  MatrixXd A(3, 2);
  // Inquire about machine precision → Ex. 1.5.33
  double eps = std::numeric_limits<double>::epsilon();
  // Initialization of the matrix → § 1.2.13
  A << 1, 1, sqrt(eps), 0, 0, sqrt(eps);
  // Output rank of A and of A^T A
  std::cout << "Rank of A: " << A.fullPivLu().rank() << std::endl
            << "Rank of A^T A: "
            << (A.transpose() * A).fullPivLu().rank() << std::endl;
  return 0;
}

Output:
Rank of A: 2
Rank of A^T*A: 1
A sparse ⇏ A⊤A sparse
The benefit of using (3.2.8) instead of the standard normal equations (3.1.11) is that sparsity is preserved.
However, the conditioning of the system matrix in (3.2.8) is not better than that of A⊤ A.
A more general substitution r := α−1 (Ax − b) with α > 0 may improve the conditioning for suitably
chosen parameter α
$$A^H A\mathbf{x} = A^H\mathbf{b}
\quad\Leftrightarrow\quad
B_\alpha\begin{pmatrix}\mathbf{r}\\\mathbf{x}\end{pmatrix}
:= \begin{pmatrix} -\alpha I & A\\ A^H & 0\end{pmatrix}
\begin{pmatrix}\mathbf{r}\\\mathbf{x}\end{pmatrix}
= \begin{pmatrix}\mathbf{b}\\ 0\end{pmatrix}. \qquad (3.2.9)$$
For m, n ≫ 1, A sparse, both (3.2.8) and (3.2.9) lead to large sparse linear systems of equations,
amenable to sparse direct elimination techniques, see Section 2.7.5.
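A minimal sketch of how the augmented system (3.2.9) can be set up as an E IGEN sparse matrix and handed to a sparse direct solver (the helper name and the choice of SparseLU are assumptions made here for illustration):

#include <Eigen/Sparse>
#include <vector>
using Eigen::SparseMatrix; using Eigen::VectorXd; using Eigen::Triplet;

// Solve the extended normal equations (3.2.9) for sparse A (m x n)
VectorXd extnormeqsolve(const SparseMatrix<double> &A,
                        const VectorXd &b, double alpha) {
  const int m = A.rows(), n = A.cols();
  std::vector<Triplet<double>> trp;
  for (int i = 0; i < m; ++i) trp.emplace_back(i, i, -alpha); // -alpha*I block
  for (int k = 0; k < A.outerSize(); ++k)
    for (SparseMatrix<double>::InnerIterator it(A, k); it; ++it) {
      trp.emplace_back(it.row(), m + it.col(), it.value());   // A block
      trp.emplace_back(m + it.col(), it.row(), it.value());   // A^H block
    }
  SparseMatrix<double> B(m + n, m + n);
  B.setFromTriplets(trp.begin(), trp.end());
  VectorXd rhs = VectorXd::Zero(m + n); rhs.head(m) = b;
  Eigen::SparseLU<SparseMatrix<double>> solver(B); // sparse direct solver
  const VectorXd z = solver.solve(rhs);
  return z.tail(n);  // the x-component of [r; x]
}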
In this example we explore empirically how the Euclidean condition number of the extended normal
equations (3.2.9) is influenced by the choice of α. Consider (3.2.8), (3.2.9) for
$$A = \begin{pmatrix} 1+\epsilon & 1\\ 1-\epsilon & 1\\ \epsilon & \epsilon\end{pmatrix}.$$
Fig. 99 plots cond₂(A), cond₂(AᴴA), cond₂(B) and cond₂(B_α) in dependence on ε (here α = ε‖A‖₂/√2).
Recall the rationale behind Gaussian elimination (→ Section 2.3, Ex. 2.3.1)
➥ By row transformations convert the LSE Ax = b to an equivalent (in terms of the set of solutions) LSE Ux = b̃,
which is easier to solve because it has triangular form.
Two questions: ➊ What linear least squares problems are “easy to solve” ?
➋ How can we arrive at them by equivalent transformations of (3.1.31) ?
Here we call two overdetermined linear systems Ax = b and Ãx = b̃ equivalent in the sense of (3.1.31),
if both have the same set of least squares solutions: lsq(A, b) = lsq(Ã, b̃), see (3.1.4).
Linear least squares problems (3.1.31) with upper triangular A are easy to solve!
$$\left\|\begin{pmatrix} R\\ 0\end{pmatrix}
\begin{pmatrix} x_1\\ \vdots\\ x_n\end{pmatrix}
- \begin{pmatrix} b_1\\ \vdots\\ b_m\end{pmatrix}\right\|_2 \to \min
\;\overset{(*)}{\Longrightarrow}\;
\mathbf{x} = R^{-1}\begin{pmatrix} b_1\\ \vdots\\ b_n\end{pmatrix} \;\hat{=}\; \text{least squares solution}$$
How can we draw the conclusion (∗)? Obviously, the components n + 1, . . . , m of the vector inside the
norm are fixed and do not depend on x. All we can do is to make the first components 1, . . . , n vanish, by
choosing a suitable x, see [?, Thm. 4.13]. Obviously, x = R−1 (b)1:n accomplishes this.
Note: since A has full rank n, the upper triangular part R ∈ R n,n of A is regular!
Answer to question ➋:
Idea: If we have a (transformation) matrix T ∈ R^{m,m} satisfying ‖Ty‖₂ = ‖y‖₂ for all y ∈ R^m, then
lsq(A, b) = lsq(Ã, b̃), where à = TA and b̃ = Tb.
The next section will characterize the class of eligible transformation matrices T.
From Thm. 3.3.5 we immediately conclude that, if a matrix Q ∈ K n,n is unitary/orthogonal, then
This section will answer the question whether and how it is possible to find orthogonal transformations that
convert any given matrix A ∈ R m,n , m ≥ n, rank(A) = n, to upper triangular form, as required for the
application of the “equivalence transformation idea” to full-rank linear least squares problems.
3.3.3.1 Theory
Input: {a1 , . . . , ak } ⊂ K n
Output: {q1 , . . . , qk } (assuming no premature termination!)
The span property (1.5.2) can be made more explicit in terms of the existence of linear combinations
$$\begin{aligned}
\mathbf{q}_1 &= t_{11}\mathbf{a}_1\\
\mathbf{q}_2 &= t_{12}\mathbf{a}_1 + t_{22}\mathbf{a}_2\\
\mathbf{q}_3 &= t_{13}\mathbf{a}_1 + t_{23}\mathbf{a}_2 + t_{33}\mathbf{a}_3\\
&\;\;\vdots\\
\mathbf{q}_k &= t_{1k}\mathbf{a}_1 + t_{2k}\mathbf{a}_2 + \cdots + t_{kk}\mathbf{a}_k
\end{aligned}
\qquad\Leftrightarrow\qquad
\exists\, T \in \mathbb{R}^{k,k} \text{ upper triangular: } Q = AT\,, \qquad (3.3.8)$$
where Q = [q₁, . . . , q_k] ∈ R^{m,k} (with orthonormal columns), A = [a₁, . . . , a_k] ∈ R^{m,k}. Note that
thanks to the linear independence of {a₁, . . . , a_k} and {q₁, . . . , q_k}, the matrix T = (t_{ij})_{i,j=1}^k ∈ R^{k,k} is regular.
Recall from Lemma 1.3.9 that inverses of regular upper triangular matrices are upper triangular again.
Thus, by (3.3.8), we have found an upper triangular R := T⁻¹ ∈ R^{k,k} such that A = QR.
Next “augmentation by zero”: add m − n zero rows at the bottom of R and complement columns of Q to
e ∈ R m,m :
an orthonormal basis of R m , which yields an orthogonal matrix Q
$$A = \widetilde Q\begin{pmatrix} R\\ 0\end{pmatrix}
\quad\Leftrightarrow\quad
\widetilde Q^\top A = \begin{pmatrix} R\\ 0\end{pmatrix}.$$
A = Q₀ · R₀ (“economical” QR-decomposition),
(ii) a unitary matrix Q ∈ K^{n,n} and a unique upper triangular R ∈ K^{n,k} with (R)_{i,i} > 0, i ∈
{1, . . . , n}, such that
A = QR ,  Q ∈ K^{n,n} ,  R ∈ K^{n,k} ,
A = QR .   (3.3.11)
Proof. We observe that R is regular, if A has full rank n. Since the regular upper triangular matrices form
a group under multiplication:
In theory, Gram-Schmidt orthogonalization (GS) can be used to compute the QR-factorization of a matrix
A ∈ R m,n , m ≥ n, rank(A) = n. However, as we saw in Exp. 1.5.5, Gram-Schmidt orthogonalization in
the form of Code 1.5.3 is not a stable algorithm.
There is a stable way to compute QR-decompositions, based on the accumulation of orthogonal transfor-
mations.
Corollary 3.3.13. Composition of orthogonal transformations
The product of two orthogonal/unitary matrices of the same size is again orthogonal/unitary.
Recall that this “annihilation of column entries” is the key operation in Gaussian forward elimination, where
it is achieved by means of non-unitary row transformations, see Sect. 2.3.2. Now we want to find a
counterpart of Gaussian elimination based on unitary row transformations on behalf of numerical stability.
In 2D there are two possible orthogonal transformations that make the 2nd component of a ∈ R² vanish, which,
in geometric terms, amounts to mapping the vector onto the x₁-axis.
Fig. 100: reflection at a line through the origin mapping a onto the x₁-axis.  Fig. 101: rotation by the angle ϕ,
$Q = \begin{pmatrix}\cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi\end{pmatrix}$, mapping a onto the x₁-axis.
Note that in each case we have two different length-preserving linear mappings at our disposal. This
flexibility will be important for curbing the impact of roundoff.
Both reflections and rotations are actually used in library routines and both are discussed in the sequel:
The following so-called Householder matrices effect the reflection of a vector into a multiple of the first unit
vector with the same length:
$$Q = H(\mathbf{v}) := I - 2\,\frac{\mathbf{v}\mathbf{v}^H}{\mathbf{v}^H\mathbf{v}}
\qquad\text{with}\qquad
\mathbf{v} = \tfrac12\big(\mathbf{a} \pm \|\mathbf{a}\|_2\,\mathbf{e}_1\big)\,. \qquad (3.3.16)$$
Orthogonality of these matrices can be established by direct computation.
With v := ½(a − b) (Fig. 102):
$$\mathbf{b} = \mathbf{a} - (\mathbf{a}-\mathbf{b})
= \mathbf{a} - 2\mathbf{v}\,\frac{\mathbf{v}^\top\mathbf{v}}{\mathbf{v}^\top\mathbf{v}}
= \mathbf{a} - 2\mathbf{v}\,\frac{\mathbf{v}^\top\mathbf{a}}{\mathbf{v}^\top\mathbf{v}}
= \mathbf{a} - 2\,\frac{\mathbf{v}\mathbf{v}^\top}{\mathbf{v}^\top\mathbf{v}}\,\mathbf{a}
= H(\mathbf{v})\,\mathbf{a}\,,$$
where we used (a − b)⊤(a − b) = (a − b)⊤(a − b + a + b) = 2(a − b)⊤a, valid because ‖a‖₂ = ‖b‖₂.
Suitable successive Householder transformations determined by the leftmost column (“target column”) of
shrinking bottom-right matrix blocks can be used to achieve upper triangular form R. Visualization of the
annihilation of the lower triangular matrix part for a square matrix: in each step all entries below the diagonal
of the current target column are mapped to zero, until the matrix is upper triangular.
Writing Q_ℓ for the Householder matrix used in the ℓ-th step we get
$$Q_{n-1}Q_{n-2}\cdots Q_1 A = R\,,$$
that is, the QR-factorization (QR-decomposition) of A ∈ C^{n,n}: A = QR with the orthogonal matrix
Q := Q₁ᴴ · · · · · Q_{n−1}ᴴ and R an upper triangular matrix.
We can also apply successive Householder transformation as outlined in § 3.3.15 to a matrix A ∈ R m,n
with m < n. If the first m columns of A are linearly independent, we obtain another variant of the QR-
decomposition:
A = QR ,  Q ∈ R^{m,m} ,  R ∈ R^{m,n} ,
In (3.3.16) the computation of the vector v can be prone to cancellation (→ Section 1.5.4), if the vector
a encloses a very small angle with the first unit vector, because in this case v can be very small and
beset with a huge relative error. This is a concern, because in the formula for the Householder matrix v is
2
normalized to unit length (division by k vk2 ).
Fortunately, two choices for v are possible in (3.3.16) and at most one can be affected by cancellation.
The right choice is
$$\mathbf{v} = \begin{cases}
\tfrac12\big(\mathbf{a} + \|\mathbf{a}\|_2\,\mathbf{e}_1\big), & \text{if } a_1 > 0\,,\\
\tfrac12\big(\mathbf{a} - \|\mathbf{a}\|_2\,\mathbf{e}_1\big), & \text{if } a_1 \leq 0\,.
\end{cases}$$
See [?, Sect. 19.1] and [?, Sect. 5.1.3] for a discussion.
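A minimal sketch of this choice and of applying the resulting reflection without ever forming H(v) (real case, ad-hoc function names):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Householder vector v = (a ± ||a||_2 e_1)/2 with the sign chosen
// according to the cancellation-avoiding rule above
VectorXd householdervec(const VectorXd &a) {
  VectorXd v = a;
  v(0) += (a(0) > 0 ? 1.0 : -1.0) * a.norm();
  return 0.5 * v;
}

// Apply H(v) = I - 2 v v^T / (v^T v) from (3.3.16) to a vector x
VectorXd applyhouseholder(const VectorXd &v, const VectorXd &x) {
  return x - 2.0 * (v.dot(x) / v.squaredNorm()) * v;
}

With these two helpers, applyhouseholder(householdervec(a), a) maps a onto ∓‖a‖₂e₁.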
The 2D rotation displayed in Fig. 101 can be embedded in an identity matrix. Thus, the following orthogonal
transformation, a Givens rotation, annihilates the k-th component of a vector a = [ a1 , . . . , an ]⊤ ∈ R n .
Here γ stands for cos( ϕ) and σ for sin( ϕ), ϕ the angle of rotation, see Fig. 101.
$$G_{1k}(a_1,a_k)\,\mathbf{a} :=
\begin{pmatrix}
\gamma & \cdots & \sigma & \cdots & 0\\
\vdots & \ddots & \vdots & & \vdots\\
-\sigma & \cdots & \gamma & \cdots & 0\\
\vdots & & \vdots & \ddots & \vdots\\
0 & \cdots & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} a_1\\ \vdots\\ a_k\\ \vdots\\ a_n\end{pmatrix}
= \begin{pmatrix} a_1^{(1)}\\ \vdots\\ 0\\ \vdots\\ a_n\end{pmatrix},
\quad\text{if}\quad
\gamma = \frac{a_1}{\sqrt{|a_1|^2+|a_k|^2}}\,,\;\;
\sigma = \frac{a_k}{\sqrt{|a_1|^2+|a_k|^2}}\,. \qquad (3.3.20)$$
Orthogonality (→ Def. 6.2.2) of G1k (a1 , ak ) is verified immediately. Again, we have two options for an
annihilating rotation, see Ex. 3.3.14. It will always be possible to choose one that avoids cancellation [?,
Sect. 5.1.8], see Code 3.3.21 for details.
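Code 3.3.21 (planerot) is used in the listings below; a minimal sketch of what such a routine can look like is given here (interface chosen to match the calls planerot(tmp, G, xDummy); for brevity it omits the cancellation-avoiding sign choice just mentioned):

#include <Eigen/Dense>
using Eigen::Vector2d; using Eigen::Matrix2d;

// Plane (Givens) rotation: G orthogonal with G*a = [ ||a||_2, 0 ]^T,
// cf. (3.3.20); the rotated vector is returned in x
void planerot(const Vector2d &a, Matrix2d &G, Vector2d &x) {
  if (a(1) != 0.0) {
    const double r = a.norm();       // sqrt(|a_1|^2 + |a_k|^2)
    G << a(0) / r, a(1) / r,         // [  gamma, sigma ]
        -a(1) / r, a(0) / r;         // [ -sigma, gamma ]
    x << r, 0.0;
  } else {                           // nothing to annihilate
    G.setIdentity();
    x = a;
  }
}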
So far, we know how to annihilate a single component of a vector by means of a Givens rotation that
targets that component and some other (the first in (3.3.20)). However, for the sake of QR-decomposition
we aim to map all components to zero except for the first.
☞ This can be achieved by n − 1 successive Givens rotations, see also Code 3.3.23
$$\begin{pmatrix} a_1\\ a_2\\ a_3\\ a_4\\ \vdots\\ a_n\end{pmatrix}
\xrightarrow{\;G_{12}(a_1,a_2)\;}
\begin{pmatrix} a_1^{(1)}\\ 0\\ a_3\\ a_4\\ \vdots\\ a_n\end{pmatrix}
\xrightarrow{\;G_{13}(a_1^{(1)},a_3)\;}
\begin{pmatrix} a_1^{(2)}\\ 0\\ 0\\ a_4\\ \vdots\\ a_n\end{pmatrix}
\xrightarrow{\;G_{14}(a_1^{(2)},a_4)\;}
\cdots
\xrightarrow{\;G_{1n}(a_1^{(n-2)},a_n)\;}
\begin{pmatrix} a_1^{(n-1)}\\ 0\\ \vdots\\ \vdots\\ 0\end{pmatrix}. \qquad (3.3.22)$$
C++11 code 3.3.23: Rotating a vector onto the x₁-axis by successive Givens transformations
// Orthogonal transformation of a (column) vector into a multiple of
// the first unit vector by successive Givens transformations
void givenscoltrf(const VectorXd& aIn, MatrixXd& Q, VectorXd& aOut) {
  unsigned int n = aIn.size();
  // Assemble the rotations in a dense matrix.
  // For (more efficient) alternatives see Rem. 3.3.25
  Q.setIdentity();
  Matrix2d G; Vector2d tmp, xDummy;
  aOut = aIn;
  for (int j = 1; j < n; ++j) {
    tmp(0) = aOut(0); tmp(1) = aOut(j);
    planerot(tmp, G, xDummy); // see Code 3.3.21
    // select 1st and j-th element of aOut and use the Map function
    // to prevent copying; equivalent to aOut([1,j]) in MATLAB
    Map<VectorXd, 0, InnerStride<>> aOutMap(aOut.data(), 2, InnerStride<>(j));
    aOutMap = G * aOutMap;
    // select 1st and j-th column of Q (Q(:,[1,j]) in MATLAB)
    Map<MatrixXd, 0, OuterStride<>> QMap(Q.data(), n, 2, OuterStride<>(j * n));
    QMap = QMap * G.transpose();
  }
}
Armed with these compound Givens rotations we can proceed as in the case of Householder reflections
to accomplish the orthogonal transformation of a full-rank matrix to upper triangular form, see
  for (int i = 0; i < n - 1; ++i) {
    for (int j = n - 1; j > i; --j) {
      tmp(0) = R(j - 1, i); tmp(1) = R(j, i);
      planerot(tmp, G, xDummy); // see Code 3.3.21
      R.block(j - 1, 0, 2, n) = G * R.block(j - 1, 0, 2, n);
      Q.block(0, j - 1, n, 2) = Q.block(0, j - 1, n, 2) * G.transpose();
    }
  }
}
The matrices for the orthogonal transformation are never built in codes!
The transformations are stored in a compressed format.
which means
ρ = 1  ⇒  γ = 0, σ = 1 ;
|ρ| < 1  ⇒  σ = 2ρ, γ = √(1 − σ²) ;
|ρ| > 1  ⇒  γ = 2/ρ, σ = √(1 − γ²) .
Then store Gij (a, b) as triple (i, j, ρ). The parameter ρ forgets the sign of the matrix Gij , so the signs of
the corresponding rows in the transformed matrix R have to be changed accordingly. The rationale behind
the above convention is to curb the impact of roundoff errors.
The advantage of Givens rotations is their selectivity, which can be exploited for banded matrices, see
Section 2.7.6, Def. 2.7.55.
Example: Orthogonal transformation of an n × n tridiagonal matrix to upper triangular form, that is, the
annihilation of the sub-diagonal, by means of successive Givens rotations:
(Sketch: an n × n tridiagonal matrix; the rotation G₁₂ annihilates the first subdiagonal entry and creates a
new non-zero entry above the superdiagonal, and the subsequent rotations G₂₃, . . . , G_{n−1,n} remove the
remaining subdiagonal entries in the same way.)
∗ =̂ entry set to zero by a Givens rotation, ∗ =̂ new non-zero entry (“fill-in” → Def. 2.7.47).
This is a manifestation of a more general result, see Def. 2.7.55 for notations:
A total of only n Givens rotations is required, involving an asymptotic total computational effort of O(nm)
for an m × n-matrix.
In numerical linear algebra orthogonal transformation methods usually give rise to reliable algorithms,
thanks to the norm-preserving property of orthogonal transformations.
We are interested in the sensitivity of F, that is, the impact of relative errors in the data vector x on the
output vector y := F(x).
We conclude, that unitary/orthogonal transformations do not involve any amplification of relative errors in
the data vectors.
Of course, this also applies to the “solution” of square linear systems with orthogonal coefficient matrix
Q ∈ R n,n , which, by Def. 6.2.2, boils down to multiplication of the right hand side vector with QH .
Gaussian elimination as presented in § 2.3.3 converts a matrix to upper triangular form by elementary
row transformations. These add a scalar multiple of one row of the matrix to another row and amount to
left-multiplication with matrices of the form T(µ) below.
However, these transformations can lead to a massive amplification of relative errors, which, by virtue of
Ex. 2.2.7 can be linked to large condition numbers of T.
This accounts for fact that the computation of LU-decompositions by means of Gaussian elimination might
not be stable, see Ex. 2.4.5.
Study in 2D: the elementary elimination matrices of Gaussian elimination,
$$T(\mu) = \begin{pmatrix} 1 & 0\\ \mu & 1\end{pmatrix},$$
have condition numbers that grow without bound as |µ| increases (see the plot).
The perfect conditioning of orthogonal transformations prevents the destructive build-up of roundoff errors.
E IGEN offers several classes dedicated to computing QR-type decompositions of matrices, for instance
HouseholderQR. Internally the QR-decomposition is stored in compressed format as explained in Rem. 3.3.25.
Its computation is triggered by the constructor.
Note that the method householderQ returns the Q-factor in compressed format → Rem. 3.3.25. Assignment
to a matrix will convert it into a (dense) matrix format, see Line 8; only then is the actual computation
of the matrix entries performed. It can also be multiplied with another matrix of suitable size, which is
used in Line 19 to extract the Q-factor Q₀ ∈ R^{m,n} of the economical QR-decomposition (3.3.3.1).
The matrix returned by the method matrixQR() gives access to a matrix storing the QR-factors in
compressed form. Its upper triangular part provides R, see Line 20.
A close inspection of the algorithm for the computation of QR-decompositions of A ∈ R m,n by successive
Householder reflections (→ § 3.3.15) reveals that n transformations costing ∼ mn operations each are
required.
Runtime measurements obtained with Code 3.4.13. Platform: ✦ ubuntu 14.04 LTS.
The QR-decomposition introduced in Section 3.3.3, Thm. 3.3.9, paves the way for the practical algorithmic
realization of the “equivalent orthonormal transformation to upper triangular form”-idea from Section 3.3.1.
We consider the full-rank linear least squares problem Eq. (3.1.31): Given A ∈ R m,n , m ≥ n, rank(A) =
n,
$$\|A\mathbf{x}-\mathbf{b}\|_2 = \big\|Q(R\mathbf{x} - Q^H\mathbf{b})\big\|_2 = \big\|R\mathbf{x} - \tilde{\mathbf{b}}\big\|_2\,, \qquad \tilde{\mathbf{b}} := Q^H\mathbf{b}\,.$$
2 2
e
b1
..
.
R0
x1
..
kAx − b k2 → min ⇔ . −
→ min .
xn
0 ..
.
e
bm
2
−1 0
..
e .
b1
.. 0
x=
. , with residual r = Q e
.
R0 bn + 1
e
bn ..
.
e
bm
q
Note: by Thm. 3.3.5 the norm of the residual is readily available: krk2 = eb2n+1 + · · · + e
b2m .
C++-code 3.3.38: QR-based solver for full-rank linear least squares problem (3.1.31)
// Solution of linear least squares problem (3.1.31) by means of QR-decomposition
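// The body of this listing is given here only as a reconstruction sketch
// (function name and use of HouseholderQR are assumptions):
VectorXd qrlsqsolve(const MatrixXd& A, const VectorXd& b) {
  const unsigned n = A.cols();
  // QR-decomposition of A, stored in compressed format (→ Rem. 3.3.25)
  Eigen::HouseholderQR<MatrixXd> qr(A);
  // transformed right-hand side b~ = Q^T b
  const VectorXd btilde = qr.householderQ().transpose() * b;
  // backward substitution with the upper triangular n x n block R_0
  const MatrixXd R0 = qr.matrixQR().topRows(n);
  return R0.triangularView<Eigen::Upper>().solve(btilde.head(n));
}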
C++11 code 3.3.39: E IGEN's built-in QR-based linear least squares solver
// Solving a full-rank least squares problem ‖Ax − b‖₂ → min in E IGEN
double lsqsolve_eigen(const MatrixXd& A, const VectorXd& b,
                      VectorXd& x) {
  x = A.householderQr().solve(b);
  return ((A * x - b).norm());
}
Applying the QR-based algorithm for full-rank linear least squares problems in the case m = n, that
is, to a square linear system of equations Ax = b with a regular coefficient matrix , will compute the
solution x = A−1 b. In a sense, the QR-decomposition offers an alternative to Gaussian elimination/LU-
decomposition discussed in § 2.3.30.
The steps for solving a linear system of equations Ax = b by means of QR-decomposition are as follows:
① QR-decomposition A = QR, computational cost ⅔n³ + O(n²) (about twice as expensive as LU-decomposition without pivoting);
② orthogonal transformation z = Qᴴb, computational cost 4n² + O(n) (in the case of compact storage of reflections/rotations);
③ backward substitution, solve Rx = z, computational cost ½n(n + 1).
Benefit: we can utterly dispense with any kind of pivoting:
✌ Computing the generalized QR-decomposition A = QR by means of Householder reflections or Givens rotations is numerically stable for any A ∈ C^{m,n}.
✌ For any regular system matrix an LSE can be solved in a stable manner by means of
QR-decomposition + orthogonal transformation + backward substitution.
Drawback: QR-decomposition can hardly ever avoid massive fill-in (→ Def. 2.7.47) also in situations,
where LU-factorization greatly benefits from Thm. 2.7.58.
From Rem. 3.3.26, Thm. 3.3.27, we know that the particular situation in which QR-decomposition can
avoid fill-in (→ Def. 2.7.47) is the case of banded matrices, see Def. 2.7.55. For banded n × n linear
systems of equations with small fixed bandwidth bw(A) ≤ O(1) we incur an
➣ asymptotic computational effort: O(n) for n → ∞
The following code uses a QR-decomposition computed by means of selective Givens rotations (→ § 3.3.19)
to solve a tridiagonal linear system of equations Ax = b with
$$A = \begin{pmatrix}
d_1 & c_1 & 0 & \cdots & 0\\
e_1 & d_2 & c_2 & & \vdots\\
0 & e_2 & d_3 & c_3 & \\
\vdots & \ddots & \ddots & \ddots & c_{n-1}\\
0 & \cdots & 0 & e_{n-1} & d_n
\end{pmatrix}.$$
The matrix is passed in the form of three vectors e, c, d giving the entries in the non-zero bands.
Aiming to confirm the claim of superior stability of QR-based approaches (→ Rem. 3.3.40, § 3.3.28) we
revisit Wilkinson’s counterexample from Ex. 2.4.5 for which Gaussian elimination with partial pivoting does
not yield an acceptable solution.
$$(A)_{i,j} := \begin{cases}
1 & \text{for } i = j\,,\\
-1 & \text{for } i > j,\ j < n\,,\\
0 & \text{for } i < j,\ j < n\,,\\
1 & \text{for } j = n\,.
\end{cases}$$
Fig. 105 plots the relative residual norms versus the matrix size n for Gaussian elimination and for
QR-decomposition: QR-decomposition produces a perfect solution, whereas Gaussian elimination with
partial pivoting fails for larger n.
Let us summarize the pros and cons of orthogonal transformation techniques for linear least squares
problems:
Use orthogonal transformation methods for least squares problems (3.1.38) whenever A ∈
R^{m,n} is dense and n is small.
SVD/QR-factorization cannot exploit sparsity:
Use normal equations in the expanded form (3.2.8)/(3.2.9), when A ∈ R m,n sparse (→
Notion 2.7.1) and m, n big.
In § 2.6.13 we faced the task of solving a square linear system of equations Ãx = b efficiently, whose
coefficient matrix à was a (rank-1) perturbation of A, for which an LU-decomposition was available.
Lemma 2.6.22 showed a way to reuse the information contained in the LU-decomposition.
A similar task can be posed, if the QR-decomposition of a matrix A ∈ R^{m,n}, m ≥ n, has already been
computed and we then have to solve a full-rank linear least squares problem ‖Ãx − b‖₂ → min with
à ∈ R^{m,n} a “slight” perturbation of A. If we aim to use orthogonalization techniques it would be desirable
to compute the QR-decomposition of à with recourse to the QR-decomposition of A.
For A ∈ R^{m,n}, m ≥ n, rank(A) = n, we consider the rank-1 modification, cf. Eq. (2.6.17),
A ⟶ Ã := A + uv⊤ ,  u ∈ R^m ,  v ∈ R^n .   (3.3.45)
We assume that still rank(Ã) = n.
Given the (unique) full QR-decomposition A = QR, Q ∈ R^{m,m} orthogonal, R ∈ R^{m,n} upper triangular,
according to Thm. 3.3.9, the goal is to find an efficient algorithm that yields the (unique) full QR-decomposition
of Ã: Ã = Q̃R̃.
Successive Givens rotations, ending with G_{n−2,n−1} and G_{n−1,n}, annihilate the remaining entries
below the diagonal and leave an upper triangular matrix =: R̃.   (3.3.46)
We need n Givens rotations acting on matrix rows of length n: ➣ computational effort O(n²).
$$A + \mathbf{u}\mathbf{v}^H = \widetilde Q\widetilde R
\qquad\text{with}\qquad
\widetilde Q := Q\,Q_1^H\,G_{n-1,n}^H\cdots G_{12}^H\,.$$
➣ Asymptotic total computational effort O(mn) for m, n → ∞.
For large n this is much cheaper than the cost O(n²m) for computing the QR-decomposition of à from
scratch.
k ↦ n + 1,  i ↦ i − 1 for i = k + 1, . . . , n + 1, realized by a permutation matrix P ∈ R^{n+1,n+1} that
moves the k-th column to the last position and shifts the columns k + 1, . . . , n + 1 one position to the left.
This effects the following transformation of Ã:
$$\widetilde A \;\longrightarrow\; A_1 = \widetilde A P = [\mathbf{a}_1, \dots, \mathbf{a}_n, \mathbf{v}] = Q\,\big[\,R \;\; Q^\top\mathbf{v}\,\big]\,,$$
that is, the appended column contributes the extra column Q⊤v.
Case m > n + 1:
① If m > n + 1 there is an orthogonal transformation Q₁ ∈ R^{m,m}, for instance realized by m − n − 1
Givens rotations or a single Householder reflection, such that the last m − n − 1 components of Q₁Q⊤v
vanish. Then Q₁Q⊤A₁ has non-zero entries only in its first n + 1 rows: an upper triangular block bordered
by a full last column, with m − n − 1 zero rows below.
(Sketch: successive Givens rotations, each acting on a pair of rows, restore the upper triangular form;
the sketch highlights the target rows of the Givens rotations and the new entries ≠ 0.)
➣ Computational effort for this step: O((n − k)²).
We are given a matrix A ∈ R m,n of which a full QR-decomposition (→ Thm. 3.3.9) A = QR, Q ∈ R m,m
orthogonal, R ∈ R m,n upper triangular, is already available, maybe only in encoded form (→ Rem. 3.3.25).
We add another row to the matrix A in arbitrary position k ∈ {1, . . . , m}
$$A \in \mathbb{R}^{m,n} \;\mapsto\; \widetilde A :=
\begin{pmatrix}
a_{1,\cdot}\\ \vdots\\ a_{k-1,\cdot}\\ \mathbf{v}^\top\\ a_{k,\cdot}\\ \vdots\\ a_{m,\cdot}
\end{pmatrix},
\qquad \mathbf{v} \in \mathbb{R}^n\,. \qquad (3.3.48)$$
With a permutation P that moves the new row to the bottom:
$$P\widetilde A = \begin{pmatrix} A\\ \mathbf{v}^\top\end{pmatrix}
\quad\Rightarrow\quad
\begin{pmatrix} Q^H & 0\\ 0 & 1\end{pmatrix} P\widetilde A
= \begin{pmatrix} R\\ \mathbf{v}^\top\end{pmatrix}.$$
Case m = n
Step ②: Restore upper triangular form through Givens rotations (→ § 3.3.19)
Successively target bottom row and rows from the top to turn leftmost entries of bottom row into zeros.
Successive Givens rotations G_{1,m}, G_{2,m}, . . . , G_{m−2,m}, G_{m−1,m} combine the bottom row with
rows 1, 2, . . . , m − 1 and annihilate its leftmost entries one after the other, until an upper triangular
matrix =: R̃ is obtained.   (3.3.49)
Beside the QR-decomposition of a matrix A ∈ R^{m,n} there are other factorizations based on orthogonal
transformations. The most important among them is the singular value decomposition (SVD), which can
be used to tackle linear least squares problems and many other optimization problems beyond, see [?].
Theorem 3.4.1. Singular value decomposition → [?, Thm. 9.6], [?, Thm. 11.1]
For any A ∈ K^{m,n} there are unitary matrices U ∈ K^{m,m}, V ∈ K^{n,n} and a (generalized) diagonal(∗)
matrix Σ = diag(σ₁, . . . , σ_p) ∈ R^{m,n}, p := min{m, n}, σ₁ ≥ σ₂ ≥ · · · ≥ σ_p ≥ 0, such that
A = UΣVᴴ .
➤ ∃x ∈ K^n, y ∈ K^m, ‖x‖₂ = ‖y‖₂ = 1:  Ax = σy,  σ = ‖A‖₂,
where we used the definition of the matrix 2-norm, see Def. 1.5.76. By Gram-Schmidt orthogonalization
or a similar procedure we can extend the single unit vectors x and y to orthonormal bases of K^n and K^m,
respectively: ∃Ṽ ∈ K^{n,n−1}, Ũ ∈ K^{m,m−1} such that
V = [x Ṽ] ∈ K^{n,n} ,  U = [y Ũ] ∈ K^{m,m}  are unitary.
$$U^H A V = \begin{pmatrix} \mathbf{y}^H\\ \widetilde U^H\end{pmatrix} A\,\begin{pmatrix}\mathbf{x} & \widetilde V\end{pmatrix}
= \begin{pmatrix} \mathbf{y}^H A\mathbf{x} & \mathbf{y}^H A\widetilde V\\ \widetilde U^H A\mathbf{x} & \widetilde U^H A\widetilde V\end{pmatrix}
= \begin{pmatrix} \sigma & \mathbf{w}^H\\ 0 & B\end{pmatrix} =: A_1\,.$$
we conclude
$$\|A_1\|_2^2 = \sup_{0\neq\mathbf{z}\in\mathbb{K}^n}\frac{\|A_1\mathbf{z}\|_2^2}{\|\mathbf{z}\|_2^2}
\;\geq\; \frac{\big\|A_1\binom{\sigma}{\mathbf{w}}\big\|_2^2}{\big\|\binom{\sigma}{\mathbf{w}}\big\|_2^2}
\;\geq\; \frac{(\sigma^2 + \mathbf{w}^H\mathbf{w})^2}{\sigma^2 + \mathbf{w}^H\mathbf{w}}
= \sigma^2 + \mathbf{w}^H\mathbf{w}\,. \qquad (3.4.2)$$
$$\sigma^2 = \|A\|_2^2 = \big\|U^H A V\big\|_2^2 = \|A_1\|_2^2
\;\overset{(3.4.2)}{\geq}\; \sigma^2 + \|\mathbf{w}\|_2^2
\;\Rightarrow\; \mathbf{w} = 0\,,
\qquad\text{hence}\qquad
A_1 = \begin{pmatrix} \sigma & 0\\ 0 & B\end{pmatrix}.$$
The decomposition A = UΣVH of Thm. 3.4.1 is called singular value decomposition (SVD) of A.
The diagonal entries σi of Σ are the singular values of A.
As in the case of the QR-decomposition, compare (3.3.3.1) and (3.3.3.1), we can also drop the bottom
zero rows of Σ and the corresponding columns of U in the case of m > n. Thus we end up with an
“economical” singular value decomposition of A ∈ K m,n :
with true diagonal matrices Σ, whose diagonals contain the singular values of A.
Visualization of the economical SVD for m > n:  A = UΣVᴴ with U ∈ K^{m,n}, Σ ∈ K^{n,n} diagonal, Vᴴ ∈ K^{n,n}.
Lemma 3.4.5.
The squares σi2 of the non-zero singular values of A are the non-zero eigenvalues of AH A, AAH
with associated eigenvectors (V):,1 , . . . , (V):,p , (U):,1 , . . . , (U):,p , respectively.
Proof. AAᴴ and AᴴA are similar (→ Lemma 9.1.6) to diagonal matrices with non-zero diagonal entries
σᵢ² (σᵢ ≠ 0), e.g.,
Remark 3.4.6 (SVD and additive rank-1 decomposition → [?, Cor. 11.2], [?, Thm. 9.8])
Recall from linear algebra: rank-1 matrices are tensor products of vectors
because rank(A) = 1 means that Ax = µ(x)u for some u ∈ K m and linear form x 7→ µ(x). By the
Riesz representation theorem the latter can be written as µ(x) = vH x.
$$A = U\Sigma V^H = \sum_{j=1}^{p} \sigma_j\, (U)_{:,j} (V)_{:,j}^H\,. \qquad (3.4.8)$$
The SVD from Def. 3.4.3 is not (necessarily) unique, but the singular values are.
Proof. Proof by contradiction: assume that A has two singular value decompositions
$$A = U_1\Sigma_1V_1^H = U_2\Sigma_2V_2^H
\;\Rightarrow\;
U_1\underbrace{\Sigma_1\Sigma_1^H}_{=\operatorname{diag}(\sigma_1^2,\dots,\sigma_m^2)}U_1^H
= AA^H
= U_2\underbrace{\Sigma_2\Sigma_2^H}_{=\operatorname{diag}(\sigma_1^2,\dots,\sigma_m^2)}U_2^H\,.$$
Two similar diagonal matrices with non-increasing diagonal entries are equal! ✷
(3.4.12)
The E IGEN class JacobiSVD is constructed from a matrix data type; it computes the SVD of its argument
during construction and offers the access methods matrixU(), singularValues(), and matrixV()
to request the SVD-factors and singular values.
The second argument in the constructor of JacobiSVD determines, whether the methods matrixU()
and matrixV() return the factor for the full SVD of Def. 3.4.3 or of the economical (thin) SVD (3.4.4):
Eigen::ComputeFull* will select the full versions, whereas Eigen::ComputeThin* picks the
economical versions → documentation.
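A minimal usage sketch (the random test matrix and the chosen flags are just an example):

#include <Eigen/Dense>
#include <iostream>
using Eigen::MatrixXd; using Eigen::VectorXd;

int main() {
  const MatrixXd A = MatrixXd::Random(5, 3);
  // economical (thin) SVD: U is 5x3, V is 3x3
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const MatrixXd U = svd.matrixU(), V = svd.matrixV();
  const VectorXd sv = svd.singularValues(); // sorted in decreasing order
  std::cout << "largest singular value = " << sv(0) << std::endl
            << "reconstruction error = "
            << (A - U * sv.asDiagonal() * V.transpose()).norm() << std::endl;
  return 0;
}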
Internally, the computation of the SVD is done by a sophisticated algorithm, for which key steps rely on
orthogonal/unitary transformations. Also there we reap the benefit of the exceptional stability brought
about by norm-preserving transformations → § 3.3.28.
According to E IGEN’s documentation the SVD of a general dense matrix involves the following asymptotic
complexity:
Based on Lemma 3.4.11, the SVD is the main tool for the stable computation of the rank of a matrix (→
Def. 2.2.3)
However, theory as reflected in Lemma 3.4.11 entails identifying zero singular values, which must rely
on a threshold condition in a numerical code, recall Rem. 1.5.36. Given the SVD A = UΣVH , Σ =
diag(σ1 , . . . , σmin{m,n} ), of a matrix A ∈ K m,n , A 6= 0 and a tolerance tol > 0, we define the numerical
rank
r := ♯ σi : |σi | ≥ tol max{|σj |} . (3.4.16)
j
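A minimal sketch of (3.4.16) in code; this mirrors, but is not identical to, what E IGEN's built-in rank() member (see below) does:

#include <Eigen/Dense>
using Eigen::MatrixXd;

// numerical rank according to (3.4.16); assumes A != 0
int numrank(const MatrixXd &A, double tol) {
  Eigen::JacobiSVD<MatrixXd> svd(A);      // singular values only
  const auto &sv = svd.singularValues();  // sorted decreasingly
  int r = 0;
  while (r < sv.size() && sv(r) >= tol * sv(0)) ++r;
  return r;
}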
E IGEN offers an equivalent built-in method rank() for objects representing singular value decomposi-
tions:
“Computing” a subspace of R k amounts to making available a (stable) basis of that subspace, ideally an
orthonormal basis.
Lemma 3.4.11 taught us how to glean orthonormal bases of N (A) and R(A) from the SVD of a matrix
A. This immediately gives a numerical method and its implementation is given in the next two codes.
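A minimal sketch of such routines, combining the full SVD with the numerical rank (3.4.16) (function names are ad hoc):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// ONB of the (numerical) kernel N(A): trailing columns of V
MatrixXd nullspaceONB(const MatrixXd &A, double tol = 1e-12) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeFullV);
  const auto &sv = svd.singularValues();
  int r = 0;                                    // numerical rank (3.4.16)
  while (r < sv.size() && sv(r) >= tol * sv(0)) ++r;
  return svd.matrixV().rightCols(A.cols() - r);
}

// ONB of the range R(A): leading r columns of U
MatrixXd rangeONB(const MatrixXd &A, double tol = 1e-12) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeFullU);
  const auto &sv = svd.singularValues();
  int r = 0;
  while (r < sv.size() && sv(r) >= tol * sv(0)) ++r;
  return svd.matrixU().leftCols(r);
}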
In a similar fashion as explained for the QR-decomposition in Section 3.3.4, the singular value decomposition
(SVD, → Def. 3.4.3) can be used to transform general linear least squares problems (3.1.38) into a simpler
form. In the case of the SVD-based orthogonal transformation method this simpler form involves merely a
diagonal matrix.
Here we consider the most general setting: A ∈ K^{m,n}, rank(A) = r ≤ min{m, n}, cf. (3.1.38). In
particular, we drop the assumption of full rank of A. This means that the minimum norm condition (ii) in
the definition (3.1.38) of a linear least squares problem may be required for singling out a unique solution.
We recall the SVD of A ∈ R^{m,n}:
$$A = \underbrace{[\,U_1\;\;U_2\,]}_{\in\mathbb{R}^{m,m}}\;
\underbrace{\begin{pmatrix}\Sigma_r & 0\\ 0 & 0\end{pmatrix}}_{\in\mathbb{R}^{m,n}}\;
\underbrace{\begin{pmatrix} V_1^H\\ V_2^H\end{pmatrix}}_{\in\mathbb{R}^{n,n}}\,,
\qquad \Sigma_r = \operatorname{diag}(\sigma_1,\dots,\sigma_r)\,. \qquad (3.4.22)$$
$$\|A\mathbf{x}-\mathbf{b}\|_2
= \left\|[\,U_1\;\;U_2\,]\begin{pmatrix}\Sigma_r & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix} V_1^H\\ V_2^H\end{pmatrix}\mathbf{x} - \mathbf{b}\right\|_2
= \left\|\begin{pmatrix}\Sigma_r V_1^H\mathbf{x} - U_1^H\mathbf{b}\\ -U_2^H\mathbf{b}\end{pmatrix}\right\|_2 \qquad (3.4.23)$$
To fix a unique solution in the case r < n we appeal to the minimal norm condition in (3.1.38): by the
considerations of § 3.1.34, the solution x of (3.4.24) is unique up to contributions from N(A) = R(V₂).
Since V is unitary, the minimal norm solution is obtained by setting contributions from R(V₂) to zero,
which amounts to choosing x ∈ R(V₁). This converts (3.4.24) into
$$\Sigma_r\underbrace{V_1^H V_1}_{=I}\,\mathbf{z} = U_1^H\mathbf{b}
\quad\Rightarrow\quad
\mathbf{z} = \Sigma_r^{-1}U_1^H\mathbf{b}\,,
\qquad\text{where } \mathbf{x} = V_1\mathbf{z}\,.$$
In a practical implementation, as in Code 3.4.17, we have to resort to the numerical rank from (3.4.16):
where we have assumed that the singular values σj are sorted according to decreasing modulus.
// (only the tail of the original listing survived; the signature and
//  first lines below are a reconstruction sketch)
VectorXd lsqsvd_eigen(const MatrixXd& A, const VectorXd& b) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  return svd.solve(b);
}
Remark 3.4.29 (Pseudoinverse and SVD → [?, Ch. 12], [?, Sect. 4.7])
From Thm. 3.1.37 we could conclude a general formula for the Moore-Penrose pseudoinverse of any matrix
A ∈ R m,n . Now, the solution formula (3.4.26) directly yields a concrete incarnation of the pseudoinverse
A+ .
Theorem 3.4.30. Pseudoinverse and SVD
If A ∈ K^{m,n} has the SVD A = UΣVᴴ partitioned as in (3.4.22), then its Moore-Penrose
pseudoinverse (→ Thm. 3.1.37) is given by A† = V₁Σ_r⁻¹U₁ᴴ.
For the general least squares problem (3.1.38) we have seen the use of the SVD for its numerical solution in
Section 3.4.3. There the SVD was a powerful tool for solving a minimization problem for a 2-norm. In many
other contexts the SVD is also a key component in numerical optimization.
We consider the following problem of finding the extrema of quadratic forms on the Euclidean unit sphere
{x ∈ K^n : ‖x‖₂ = 1}:
given A ∈ K^{m,n}, m ≥ n, find x ∈ K^n, ‖x‖₂ = 1, ‖Ax‖₂ → min .   (3.4.31)
Use that multiplication with orthogonal/unitary matrices preserves the 2-norm (→ Thm. 3.3.5) and resort
to the singular value decomposition A = UΣVH (→ Def. 3.4.3):
$$\min_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2^2
= \min_{\|\mathbf{x}\|_2=1}\big\|U\Sigma V^H\mathbf{x}\big\|_2^2
= \min_{\|V^H\mathbf{x}\|_2=1}\big\|U\Sigma (V^H\mathbf{x})\big\|_2^2
= \min_{\|\mathbf{y}\|_2=1}\|\Sigma\mathbf{y}\|_2^2\,.$$
Since the singular values are assumed to be sorted as σ1 ≥ σ2 ≥ · · · ≥ σn , the minimum with value σn2
is attained for VH x = y = en ⇒ minimizer x = Ven = (V):,n .
By similar arguments we can solve the corresponding norm constrained maximization problem
given A ∈ K^{m,n}, m ≥ n, find x ∈ K^n, ‖x‖₂ = 1, ‖Ax‖₂ → max ,
and obtain the solution based on the SVD A = UΣVᴴ of A:
$$\sigma_1 = \max_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2\,, \qquad (V)_{:,1} = \operatorname*{argmax}_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2\,. \qquad (3.4.33)$$
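Both extremal problems thus reduce to reading off the extreme singular values and the corresponding columns of V; a minimal sketch:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// minimizer and maximizer of ||Ax||_2 on the Euclidean unit sphere
void extremaOnSphere(const MatrixXd &A, VectorXd &xmin, VectorXd &xmax) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeFullV);
  const MatrixXd &V = svd.matrixV();
  xmin = V.col(V.cols() - 1);  // attains sigma_n = min ||Ax||_2, cf. (3.4.31)
  xmax = V.col(0);             // attains sigma_1 = max ||Ax||_2, cf. (3.4.33)
}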
Recall: The Euclidean matrix norm (2-norm) of the matrix A (→ Def. 1.5.76) is defined as the maximum
in (3.4.33). Thus we have proved the following theorem:
If A ∈ K^{m,n} has singular values σ₁ ≥ σ₂ ≥ · · · ≥ σ_{min{m,n}}, then its Euclidean matrix norm is
given by ‖A‖₂ = σ₁(A).
If m = n and A is regular/invertible, then its 2-norm condition number is cond₂(A) = σ₁/σₙ.
For an important application from computational geometry, this example studies the power and versatility
of orthogonal transformations in the context of (generalized) least squares minimization problems.
From school recall the Hesse normal form of a hyperplane H (an affine subspace of dimension d − 1) in R^d:
H = { x ∈ R^d : c + n⊤x = 0 } ,  ‖n‖₂ = 1 ,   (3.4.36)
where n is the unit normal to H and |c| gives the distance of H from 0. The Hesse normal form is
convenient for computing the distance of points from H, because the
Euclidean distance of y ∈ R^d from the plane is  dist(H, y) = |c + n⊤y| .   (3.4.37)
Goal: given the points y1 , . . . , ym , m > d, find H ↔ {c ∈ R, n ∈ R d , k nk2 = 1}, such that
m m
∑ dist(H, y j ) 2
= ∑ |c + n⊤ y j |2 → min . (3.4.38)
j =1 j =1
Note that (3.4.38) is not a linear least squares problem due to the constraint k nk2 = 1. However, it turns
out to be a minimization problem with almost the structure of (3.4.31):
$$(3.4.38) \quad\Longleftrightarrow\quad \Bigg\|\underbrace{\begin{bmatrix}1 & y_{1,1} & \cdots & y_{1,d}\\ 1 & y_{2,1} & \cdots & y_{2,d}\\ \vdots & \vdots & & \vdots\\ 1 & y_{m,1} & \cdots & y_{m,d}\end{bmatrix}}_{=:A}\begin{bmatrix}c\\ n_1\\ \vdots\\ n_d\end{bmatrix}\Bigg\|_2 \;\to\;\min \quad\text{under constraint } \|n\|_2 = 1\,.$$
Step ➊: To convert the minimization problem into the form (3.4.31) we start with a QR-decomposition
(→ Section 3.3.3)
$$A := \begin{bmatrix}1 & y_{1,1} & \cdots & y_{1,d}\\ 1 & y_{2,1} & \cdots & y_{2,d}\\ \vdots & \vdots & & \vdots\\ 1 & y_{m,1} & \cdots & y_{m,d}\end{bmatrix} = QR\,,\qquad R := \begin{bmatrix} r_{11} & r_{12} & \cdots & \cdots & r_{1,d+1}\\ 0 & r_{22} & \cdots & \cdots & r_{2,d+1}\\ \vdots & & \ddots & & \vdots\\ 0 & \cdots & & 0 & r_{d+1,d+1}\\ 0 & \cdots & \cdots & \cdots & 0\\ \vdots & & & & \vdots\\ 0 & \cdots & \cdots & \cdots & 0\end{bmatrix}\in\mathbb{R}^{m,d+1}\,.$$
Since multiplication with the orthogonal matrix Q leaves the 2-norm unchanged,
$$\|Ax\|_2 \to \min \quad\Longleftrightarrow\quad \Big\|\, R\,[c,\,n_1,\dots,n_d]^\top \Big\|_2 \to \min\,. \qquad (3.4.39)$$
Note: Since $r_{11} = \|(A)_{:,1}\|_2 = \sqrt{m}\neq 0$, the first component of the minimizer is determined by the others: $c = -r_{11}^{-1}\sum_{j=1}^{d} r_{1,j+1}\, n_j$.
This algorithm is implemented in the following code, making heavy use of E IGEN’s block access operations
and the built in QR-decomposition and SVD factorization.
Code 3.4.41 solves the general problem: for A ∈ K^{m,n} find n ∈ R^d, c ∈ R^{n−d} such that
$$\Big\|\,A\begin{bmatrix}c\\ n\end{bmatrix}\Big\|_2 \to \min \quad\text{with constraint } \|n\|_2 = 1\,. \qquad (3.4.42)$$
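Since Code 3.4.41 itself is not reproduced above, here is a compact Eigen sketch of the underlying two-step algorithm (QR first, then SVD of the trailing block); the function and variable names are our own, and we assume m > n and an invertible leading block R11:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Minimize ||A*[c;n]||_2 under ||n||_2 = 1, with n the last d unknowns, cf. (3.4.42).
void clsqSketch(const MatrixXd& A, unsigned d, VectorXd& c, VectorXd& n) {
  const unsigned p = A.cols() - d;          // number of unconstrained unknowns
  // Step 1: QR-decomposition A = QR; only the R-factor is needed
  MatrixXd R = A.householderQr().matrixQR().triangularView<Eigen::Upper>();
  // Step 2: minimize ||R22*n||_2 over ||n||_2 = 1 -> last right singular vector
  Eigen::JacobiSVD<MatrixXd> svd(R.block(p, p, A.rows() - p, d), Eigen::ComputeFullV);
  n = svd.matrixV().col(d - 1);
  // Step 3: back substitution R11*c = -R12*n fixes the unconstrained unknowns
  c = -R.topLeftCorner(p, p).triangularView<Eigen::Upper>()
        .solve(R.block(0, p, p, d) * n);
}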
Matrix compression addresses the problem of approximating a given "generic" matrix (of a certain class) by means of a matrix whose "information content", that is, the number of reals needed to store it, is significantly lower than the information content of the original matrix.
Sparse matrices (→ Notion 2.7.1) are a prominent class of matrices with “low information content”. Un-
fortunately, they cannot approximate dense matrices very well. Another type of matrices that enjoy “low
information content”, also called data sparse, are low-rank matrices.
Lemma 3.4.44.
If A ∈ R m,n has rank p ≤ min{m, n} (→ Def. 2.2.3), then there exist U ∈ R m,p and V ∈ R n,p ,
such that A = UV⊤ .
None of the columns of U and V can vanish. Hence, in addition, we may assume that the columns of U are normalized: ‖(U):,j‖2 = 1, j = 1, . . . , p.
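The gain in "information content" is easy to see in code: a rank-p matrix is kept in factored form A = U V^T and never assembled. A small sketch (our own helper, not a lecture code):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Rank-p matrix stored as factors U (m x p), V (n x p): (m+n)*p numbers instead
// of m*n; a matrix-vector product costs only O((m+n)*p) operations.
VectorXd applyLowRank(const MatrixXd& U, const MatrixXd& V, const VectorXd& x) {
  return U * (V.transpose() * x);   // never form U*V^T explicitly
}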
Thus approximating a given matrix A ∈ R m,n with a rank- p matrix, p ≪ min{m, n}, can be regarded as
an instance of matrix compression. The approximation error with respect to some matrix norm k·k will be
minimal if we choose the best approximation
Here we explore low-rank best approximation of general matrices with respect to the Euclidean matrix
norm k·k2 induced by the 2-norm for vectors (→ Def. 1.5.76), and the Frobenius norm k·k F .
It should be obvious that ‖A‖F is invariant under orthogonal/unitary transformations of A. Thus the Frobenius norm of a matrix A with rank(A) = p can be expressed through its singular values σj:
$$\text{Frobenius norm and SVD:}\qquad \|A\|_F^2 = \sum_{j=1}^{p}\sigma_j^2\,. \qquad (3.4.47)$$
The next profound result links best approximation in Rk (m, n) and the singular value decomposition (→
Def. 3.4.3).
Let A = UΣV^H be the SVD of A ∈ K^{m,n} (→ Thm. 3.4.1). For 1 ≤ k ≤ rank(A) set U_k := [u_{:,1}, . . . , u_{:,k}] ∈ K^{m,k}, V_k := [v_{:,1}, . . . , v_{:,k}] ∈ K^{n,k}, Σ_k := diag(σ1, . . . , σk) ∈ K^{k,k}. Then, for ‖·‖ = ‖·‖F and ‖·‖ = ‖·‖2, holds true
$$\big\|A - U_k\Sigma_k V_k^H\big\| \;\le\; \|A - F\| \qquad \forall\, F\in\mathcal{R}_k(m,n)\,.$$
This theorem teaches that the rank-k matrix that is closest to A (rank-k best approximation) in both the Euclidean matrix norm and the Frobenius norm (→ Def. 3.4.46) can be obtained by truncating the rank-1 sum expansion (3.4.8) obtained from the SVD of A after k terms.
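In Eigen the truncation can be realized directly from the thin SVD; a minimal sketch (function name ours):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Rank-k best approximation A_k = U_k * Sigma_k * V_k^H, cf. Thm. 3.4.48.
MatrixXd lowRankBestApprox(const MatrixXd& A, unsigned k) {
  Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  return svd.matrixU().leftCols(k) *
         svd.singularValues().head(k).asDiagonal() *
         svd.matrixV().leftCols(k).transpose();
}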
$$\|A-B\|_2^2 \;\ge\; \|(A-B)x\|_2^2 = \|Ax\|_2^2 = \Big\|\sum_{j=1}^{k+1}\sigma_j\,(v_j^H x)\,u_j\Big\|_2^2 = \sum_{j=1}^{k+1}\sigma_j^2\,(v_j^H x)^2 \;\ge\; \sigma_{k+1}^2\,,$$
because $\sum_{j=1}^{k+1}(v_j^H x)^2 = \|x\|_2^2 = 1$.
➋ Find an ONB $\{z_1,\dots,z_{n-k}\}$ of $\mathcal{N}(B)$ and assemble it into a matrix $Z=[z_1\;\dots\;z_{n-k}]\in\mathbb{K}^{n,n-k}$:
$$\|A-B\|_F^2 \;\ge\; \|(A-B)Z\|_F^2 = \|AZ\|_F^2 = \sum_{i=1}^{n-k}\|Az_i\|_2^2 = \sum_{i=1}^{n-k}\sum_{j=1}^{r}\sigma_j^2\,(v_j^H z_i)^2\,. \qquad \Box$$
Since the matrix norms k·k2 and k·k F are invariant under multiplication with orthogonal (unitary) matrices,
we immediately obtain expressions for the norms of the best approximation error:
$$\big\|A - U_k\Sigma_k V_k^H\big\|_2 = \sigma_{k+1}\,, \qquad (3.4.49)$$
$$\big\|A - U_k\Sigma_k V_k^H\big\|_F^2 = \sum_{j=k+1}^{\min\{m,n\}}\sigma_j^2\,. \qquad (3.4.50)$$
This provides precise information about the best approximation error for rank-k matrices.
A rectangular greyscale image composed of m × n pixels (greyscale, BMP format) can be regarded as a
matrix A ∈ R m,n , aij ∈ {0, . . . , 255}, cf. Ex. 9.3.24. Thus low-rank approximation of the image matrix is
a way to compress the image.
Thm. 3.4.48 ➣ best rank-k approximation of the image: Ã = U_k Σ_k V_k^⊤.
Of course, the matrices U_k, V_k, and Σ_k are available from the economical (thin) SVD (3.4.4) of A. This is the idea behind the MATLAB code used to produce the figures below (this example is coded in MATLAB, because MathGL lacks image rendering capabilities).
(Figures: view of ETH Zurich main building (original image); compressed image, 40 singular values used; difference image |original − approximated|; singular values of ETH view (log-scale), with marker k = 40 (0.08 mem).)
Note that there are better and faster ways to compress images than SVD (JPEG, Wavelets, etc.)
(Fig. 106: stock prices over days in past, logarithmic scale; legend: DTE, EOAN, FME, FRE3, HEI, HEN3, IFX, LHA, LIN, MAN, MEO, MRK, MUV2, RWE, SAP, SDF, SIE, TKA, VOW3.)
Are there underlying governing trends? Are there a few vectors u1, . . . , up, p ≪ n, such that, approximately, all other data vectors ∈ Span{u1, . . . , up}?
(Fig. 107) Possible ("synthetic") measured data for two types of diodes; measurement errors and manufacturing tolerances taken into account by (Gaussian) random perturbations.
(Fig. 108: measured U-I characteristics for some diodes; Fig. 109: measured U-I characteristics for all diodes; axes: voltage U, current I.)
Ex. 3.4.54 and Ex. 3.4.55 present typical tasks that can be tackled by principal component analysis.
In Ex. 3.4.54: n =̂ number of stocks, m =̂ number of days for which stock prices are recorded.
✦ Extreme case: all stocks follow exactly one trend ↔ a_j ∈ Span{u} ∀ j = 1, . . . , n.
More generally, we look for a few vectors u1, . . . , up such that a_j ∈ Span{u1, . . . , up} ∀ j = 1, . . . , m. (3.4.56)
Why is the extreme case unlikely? Small random fluctuations will be present in each stock price.
Why orthonormal? Trends should be as "independent as possible" (minimally correlated).
Now singular value decomposition (SVD) according to Def. 3.4.3 comes into play, because Lemma 3.4.11
tells us that it can supply an orthonormal basis of the image space of a matrix, cf. Code 3.4.21.
This already captures the case (3.4.56) and we see that the columns of U supply the trend vectors we are
looking for!
➊ no perturbations:
If there is a pronounced gap in the distribution of the singular values, which separates p large from min{m, n} − p relatively small singular values, this hints that R(A) has essentially dimension p. It depends on the application what one accepts as a "pronounced gap".
The j-th row of V (up to the p-th component) gives the weights with which the p identified trends
contribute to data set j.
(MATLAB excerpt: sv = diag(S(1:3,1:3)); print -depsc2 '../PICTURES/svdpca.eps')
Computed singular values: 3.1378, 1.8092, 0.1792.
Small third singular value ➣ the data points essentially lie in a 2D subspace (Fig. 110: 3D point cloud).
Example 3.4.60 (Principal component analysis for data classification → Ex. 3.4.55 cnt’d)
Sought: Number of different types of diodes in batch and reconstructed U - I characteristic for each type.
(Fig. 111: measured U-I characteristics for some diodes; Fig. 112: measured U-I characteristics for all diodes; axes: voltage U, current I.)
13 uvals = (0:1/(m-1):1);
14 D1 = (1+nm*randn(n,m)).*(i1(repmat(uvals,n,1)))+na*randn(n,m);
15 D2 = (1+nm*randn(n,m)).*(i2(repmat(uvals,n,1)))+na*randn(n,m);
16 A = ([D1;D2])'; A = A(1:size(A,1), randperm(size(A,2)));
(Fig. 113: distribution of singular values σ_i of the matrix vs. no. of singular value — two dominant singular values!)
21 f i g u r e (’name’,’singular values’);
22 sv = d i a g (S(1:2*n,1:2*n));
23 p l o t (1:2*n,sv,’r*’); g r i d on;
24 x l a b e l (’{\bf index i of singular value}’,’fontsize’,14);
25 y l a b e l (’{\bf singular value \sigma_i}’,’fontsize’,14);
26 title('{\bf singular values for diode measurement matrix}','fontsize',14);
27 p r i n t -depsc2 ’../PICTURES/diodepcasv.eps’;
28
29 f i g u r e (’name’,’trend vectors’);
30 p l o t (1:m,U(:,1:2),’+’);
31 x l a b e l (’{\bf voltage U}’,’fontsize’,14);
32 y l a b e l (’{\bf current I}’,’fontsize’,14);
33 title('{\bf principal components (trend vectors) for diode measurements}','fontsize',14);
34 legend('dominant principal component','second principal component','location','best');
35 p r i n t -depsc2 ’../PICTURES/diodepcau.eps’;
36
37 f i g u r e (’name’,’strength’);
38 p l o t (V(:,1),V(:,2),’mo’); g r i d on;
39 x l a b e l (’{\bf strength of singular component #1}’,’fontsize’,14);
40 y l a b e l (’{\bf strength of singular component #2}’,’fontsize’,14);
41 title('{\bf strengths of contributions of singular components}','fontsize',14);
42 p r i n t -depsc2 ’../PICTURES/diodepcav.eps’;
(Fig. 114: strengths of contributions of singular components, strength of singular component #2 vs. strength of singular component #1; Fig. 115: principal components (trend vectors) for diode measurements, dominant and second principal component, current I vs. voltage U.)
Observations:
✦ The first two columns of the V-matrix specify the strength of the contribution of the two leading principal components to each measurement.
➣ The points ((V)_{i,1}, (V)_{i,2}), i.e. the rows of (V)_{:,1:2}, which correspond to different diodes, are neatly clustered in R². To determine the type of diode i, we have to identify the cluster to which the point ((V)_{i,1}, (V)_{i,2}) belongs (→ cluster analysis, course "machine learning", next Rem. 3.4.65).
✦ The principal components themselves do not carry much useful information in this example.
Given m > 2 points x j ∈ R k , j = 1, . . . , m, in k-dimensional space, we ask what is the “longest” and
“shortest” diameter d+ and d− . This question can be stated rigorously in several different ways: here we
ask for directions for which the point cloud will have maximal/minimal variance, when projected onto that
direction:
$$d_+ := \operatorname*{argmax}_{\|v\|_2=1} Q(v)\,,\qquad d_- := \operatorname*{argmin}_{\|v\|_2=1} Q(v)\,,\qquad Q(v) := \sum_{j=1}^{m}\big|(x_j-c)^\top v\big|^2\,,\quad c = \frac{1}{m}\sum_{j=1}^{m}x_j\,. \qquad (3.4.64)$$
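A possible Eigen realization of (3.4.64) (a sketch with our own naming; the m points are assumed to be stored as the rows of X):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd; using Eigen::RowVectorXd;

// Principal axes of a point cloud: Q(v) = ||Y*v||_2^2 for the centered data
// matrix Y, so the extremal directions are right singular vectors of Y.
void principalAxes(const MatrixXd& X, VectorXd& dplus, VectorXd& dminus) {
  RowVectorXd c = X.colwise().mean();        // center of gravity
  MatrixXd Y = X.rowwise() - c;              // centered point coordinates
  Eigen::JacobiSVD<MatrixXd> svd(Y, Eigen::ComputeFullV);
  dplus  = svd.matrixV().col(0);             // direction of maximal variance
  dminus = svd.matrixV().col(X.cols() - 1);  // direction of minimal variance
}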
The directions d+, d− are called the principal axes of the point cloud, a term borrowed from mechanics and connected with the axes of inertia of an assembly of point masses. (Fig. 116: points with major and minor axis.)
The subsets {xi : i ∈ Il } are called the clusters. The points ml are their centers of gravity.
➋ Splitting of cluster by separation along its principal axis, see Rem. 3.4.63 and Code 3.4.70:
$$a_l := \operatorname*{argmax}_{\|v\|_2=1}\Big\{\sum_{i\in I_l}\big|(x_i - m_l)^\top v\big|^2\Big\} \qquad (3.4.69)$$
}
double sumd = mx.sum();              // sum of all squared distances
// Compute sum of squared distances within each cluster
VectorXd cds(C.cols()); cds.setZero();
for (int j = 0; j < idx.size(); ++j) // loop over all points
  cds(idx(j)) += mx(j);
return std::make_tuple(sumd, idx, cds);
}
void lloydmax(const MatrixXd &X, MatrixBase<Derived> &&C, VectorXi &idx,
              VectorXd &cds, const double tol = 0.0001) {
  lloydmax(X, C, idx, cds);
}
Example 3.4.73 (Principal component analysis for data analysis → Ex. 3.4.54 cnt’d)
A ∈ R m,n , m ≫ n:
Columns of A → series of measurements at different times/locations etc.
Rows of A → measured values corresponding to one time/location etc.
Goal: detect linear correlations
MATLAB excerpt generating the synthetic data (x is assumed to be defined analogously to y):
4 y = cos(pi*(1:m)'/m);
5 A = [];
6 for i = 1:n
7   A = [A, x.*rand(m,1),...
8        y+0.1*rand(m,1)];
9 end
(Figures: measurements 1–4 plotted over the sample index; distribution of the singular values of A over the no. of singular value.)
The measurements display linear correlation with two principal components.
(Fig. 119: cluster plot — weight of second singular component vs. weight of first singular component.)
(Fig. 120: stock prices over days in past, logarithmic scale; legend: DTE, EOAN, FME, FRE3, HEI, HEN3, IFX, LHA, LIN, MAN, MEO, MRK, MUV2, RWE, SAP, SDF, SIE, TKA, VOW3. Fig. 121: singular value vs. no. of singular value, logarithmic scale.)
We observe a pronounced decay of the singular values (≈ exponential decay, logarithmic scale in Fig. 121)
➣ a few trends (corresponding to a few of the largest singular values) govern the time series.
(Fig. 122: five most important stock price trends (normalized), columns U(:,1)–U(:,5), vs. days in past; Fig. 123: five most important stock price trends, weighted columns U*S(:,1)–U*S(:,5), vs. days in past.)
Columns of U (→ Fig. 122) in SVD A = UΣV⊤ provide trend vectors, cf. Ex. 3.4.54 & Ex. 3.4.73.
When weighted with the corresponding singular value, the importance of a trend contribution emerges,
see Fig. 123
(Fig. 124: trends in BMW stock, 1.1.2008 – 29.10.2010, relative strength vs. no. of singular vector; Fig. 125: trends in Daimler stock, 1.1.2008 – 29.10.2010, relative strength vs. no. of singular vector.)
Stocks of companies from the same sector of the economy should display similar contributions of major
trend vectors, because their prices can be expected to be more closely correlated than stock prices in
general.
Data obtained from Yahoo Finance:
#!/bin/csh
foreach i (ADS ALV BAYN BEI BMW CBK DAI DBK DB1 LHA DPW DTE EOAN FRE3 \
           FME HEI HEN3 IFX SDF LIN MAN MRK MEO MUV2 RWE SAP SIE TKA VOW3)
  wget -O "$i".csv "http://ichart.finance.yahoo.com/table.csv?s=$i.DE&a=00&b=1&c=2008&d=09&e=30&f=2010&g=d&ignore=.csv"
  sed -i -e 's/-/,/g' "$i".csv
end
46 f o r j= s i z e (A,1):-1:2
47 zidx = f i n d (A(j-1,:) == 0);
48 A(j-1,zidx) = A(j,zidx);
49 end
50 f o r j=2: s i z e (A,1)
51 zidx = f i n d (A(j,:) == 0);
52 A(j,zidx) = A(j-1,zidx);
53 end
54
55 f i g u r e (’name’,’DAX’);
71 f i g u r e (’name’,’trend vectors’);
72 p l o t (U(:,1:5));
73 x l a b e l (’days in past’,’fontsize’,14);
74 t i t l e (’Five most important stock price trends (normalized)’);
75 legend('U(:,1)','U(:,2)','U(:,3)','U(:,4)','U(:,5)','location','south');
76 p r i n t -depsc2 ’../../PICTURES/stocktrendsn.eps’;
77
78 f i g u r e (’name’,’trend vectors’);
79 p l o t (U(:,1:5)*S(1:5,1:5));
80 x l a b e l (’days in past’,’fontsize’,14);
81 t i t l e (’Five most important stock price trends’);
82 legend('U*S(:,1)','U*S(:,2)','U*S(:,3)','U*S(:,4)','U*S(:,5)','location','south');
83 p r i n t -depsc2 ’../../PICTURES/stocktrends.eps’;
84
In the examples of Section 3.0.1 we generally considered overdetermined linear systems of equations
Ax = b, for which only the right hand side vector b was affected by measurement errors. However, also
the entries of the coefficient matrix A may have been obtained by measurement. This is the case, for
instance, in the nodal analysis of electric circuits → Ex. 2.1.3. Then, it may be legitimate to seek a “better”
matrix based on information contained in the whole linear system. This is the gist of the total least squares
approach.
☞ least squares problem “turned upside down”: now we are allowed to tamper with system matrix and
right hand side vector!
$$\widehat{b}\in\mathcal{R}(\widehat{A}) \;\;(3.5.1)\quad\Rightarrow\quad \operatorname{rank}\big([\widehat{A}\;\widehat{b}]\big) = n \quad\Rightarrow\quad [\widehat{A}\;\widehat{b}] = \operatorname*{argmin}_{\operatorname{rank}(X)=n}\big\|[A\;b]-X\big\|_F\,.$$
☞ [Â b̂] is the rank-n best approximation of [A b]!
We face the problem to compute the best rank-n approximation of the given matrix [A b], a problem already treated in Section 3.4.4.2: Thm. 3.4.48 tells us how to use the SVD of [A b]:
$$[A\;b] = U\Sigma V^\top = \sum_{j=1}^{n+1}\sigma_j\,(U)_{:,j}(V)_{:,j}^\top \;\overset{\text{Thm. 3.4.48}}{\Longrightarrow}\; [\widehat{A}\;\widehat{b}] = \sum_{j=1}^{n}\sigma_j\,(U)_{:,j}(V)_{:,j}^\top\,. \qquad (3.5.3)$$
Since V is orthogonal,
$$[\widehat{A}\;\widehat{b}]\,(V)_{:,n+1} = \widehat{A}\,(V)_{1:n,n+1} + \widehat{b}\,(V)_{n+1,n+1} = 0\,. \qquad (3.5.4)$$
(3.5.4) also provides the solution x of Âx = b̂:
$$x := -\widehat{A}^{-1}\widehat{b} = -(V)_{1:n,n+1}\big/(V)_{n+1,n+1}\,. \qquad (3.5.5)$$
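A minimal Eigen sketch of formula (3.5.5) (the function name is ours; it assumes (V)_{n+1,n+1} ≠ 0):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Total least squares: solve Ahat*x = bhat with [Ahat bhat] the rank-n best
// approximation of [A b], obtained from the SVD of the extended matrix, cf. (3.5.5).
VectorXd lsqtotal(const MatrixXd& A, const VectorXd& b) {
  const unsigned n = A.cols();
  MatrixXd Ab(A.rows(), n + 1);
  Ab << A, b;                          // extended matrix [A b]
  Eigen::JacobiSVD<MatrixXd> svd(Ab, Eigen::ComputeThinV);
  VectorXd v = svd.matrixV().col(n);   // right singular vector for sigma_{n+1}
  return -v.head(n) / v(n);
}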
In the examples of Section 3.0.1 we expected all components of the right hand side vectors to be possibly
affected by measurement errors. However, it might happen that some data are very reliable and in this
case we would like the corresponding equation to be satisfied exactly.
Linear constraint
Here the constraint matrix C collects all the coefficients of those p equations that are to be satisfied exactly,
and the vector d the corresponding components of the right hand side vector. Conversely, the m equations
of the (overdetermined) LSE Ax = b cannot be satisfied and are treated in a least squares sense.
Recall important technique from multidimensional calculus for tackling constrained minimization problems:
Lagrange multipliers, see [?, Sect. 7.9].
L as defined in (3.6.4) is called a Lagrange function. The simple heuristics behind Lagrange multipliers is
the observation:
The solution of the constrained minimization problem (3.6.3) corresponds to a saddle point of the Lagrange function (→ Fig. 126: saddle surface over the state x / multiplier m plane), that is, both the derivative with respect to x and with respect to m vanish there.
In a saddle point the Lagrange function is “flat”, that is, all its partial derivatives have to vanish there.
Necessary (and sufficient) conditions for the solution x of (3.6.3)
(For a similar technique employing multi-dimensional calculus see Rem. 3.1.14)
$$\frac{\partial L}{\partial x}(x,m) = A^\top(Ax-b) + C^\top m \overset{!}{=} 0\,, \qquad (3.6.6a)$$
$$\frac{\partial L}{\partial m}(x,m) = Cx - d \overset{!}{=} 0\,. \qquad (3.6.6b)$$
$$\begin{bmatrix} A^\top A & C^\top\\ C & 0\end{bmatrix}\begin{bmatrix} x\\ m\end{bmatrix} = \begin{bmatrix} A^\top b\\ d\end{bmatrix} \qquad (3.6.7)\qquad \text{augmented normal equations (matrix saddle point problem)}$$
As we know, a direct elimination solution algorithm for (3.6.7) amounts to finding an LU-decomposition of
the coefficient matrix. Here we opt for its symmetric variant, the Cholesky decomposition, see Section 2.8.
The same caveats as those discussed for the regular normal equations in Rem. 3.2.3, Ex. 3.2.4, and
Rem. 3.2.6, apply to the direct use of the augmented normal equations (3.6.7):
1. their condition number can be much bigger than that of the matrix A,
2. forming A⊤ A may be vulnerable to roundoff,
3. the matrix A⊤ A may not be sparse, though A is.
As in Rem. 3.2.7 also in the case of the augmented normal equations (3.6.7) switching to an extended
version by introducing the residual r = Ax − b as a new unknown is a remedy, cf. (3.2.8). This leads to
the following linear system of equations.
$$\begin{bmatrix} -I & A & 0\\ A^\top & 0 & C^\top\\ 0 & C & 0\end{bmatrix}\begin{bmatrix} r\\ x\\ m\end{bmatrix} = \begin{bmatrix} b\\ 0\\ d\end{bmatrix} \qquad (3.6.9)\qquad \hat{=}\ \text{extended augmented normal equations.}$$
Idea: Identify the subspace in which the solution can vary without violating the constraint.
Since C has full rank, this subspace agrees with the nullspace/kernel of C.
From Lemma 3.4.11 and Ex. 3.4.19 we have learned that the SVD can be used to compute (an orthonormal
basis of) the nullspace N (C). The suggests the following method for solving the constrained linear least
squares problem (3.6.1).
➀ Compute an orthonormal basis of N (C) using SVD (→ Lemma 3.4.11, (3.4.22)):
$$C = U\,[\Sigma\;\;0]\begin{bmatrix}V_1^\top\\ V_2^\top\end{bmatrix},\quad U\in\mathbb{R}^{p,p}\,,\ \Sigma\in\mathbb{R}^{p,p}\,,\ V_1\in\mathbb{R}^{n,p}\,,\ V_2\in\mathbb{R}^{n,n-p} \quad\Rightarrow\quad \mathcal{N}(C) = \mathcal{R}(V_2)\,,$$
and the particular solution of the constraint equation
$$x_0 := V_1\,\Sigma^{-1}\,U^\top d\,.$$
➁ Insert the representation x = x0 + V2 y, y ∈ R^{n−p}, into (3.6.1). This yields a standard linear least squares problem with coefficient matrix A V2 ∈ R^{m,n−p} and right hand side vector b − A x0 ∈ R^m.
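The two steps ➀ and ➁ can be condensed into a short Eigen sketch (our own naming; C is assumed to have full rank p and A*V2 full column rank):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Constrained least squares: minimize ||A*x-b||_2 subject to C*x = d,
// via an ONB of N(C) obtained from the SVD of C.
VectorXd clsqNullspace(const MatrixXd& A, const VectorXd& b,
                       const MatrixXd& C, const VectorXd& d) {
  const unsigned p = C.rows(), n = C.cols();
  Eigen::JacobiSVD<MatrixXd> svd(C, Eigen::ComputeThinU | Eigen::ComputeFullV);
  MatrixXd V1 = svd.matrixV().leftCols(p);       // spans R(C^T)
  MatrixXd V2 = svd.matrixV().rightCols(n - p);  // ONB of N(C)
  VectorXd x0 = V1 * (svd.singularValues().cwiseInverse().asDiagonal() *
                      (svd.matrixU().transpose() * d));  // particular solution
  VectorXd y = (A * V2).colPivHouseholderQr().solve(b - A * x0);
  return x0 + V2 * y;                            // x = x0 + V2*y
}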
Learning Outcomes
After having studied the contents of this chapter you should be able to
• give a rigorous definition of the least squares solution of an (overdetermined) linear system of equa-
tions,
• state the (extended) normal equations for any overdetermined linear system of equations,
• tell uniqueness and existence of solutions of the normal equations,
• define (economical) QR-decomposition and SVD of a matrix,
• explain the use of QR-decomposition and, in particular, Givens rotations, for solving (overdeter-
mined) linear systems of equations (in least squares sense),
• use SVD to solve (constrained) optimization and low-rank best approximation problems
• formulate the augmented (extended) normal equations for a linearly constrained least squares prob-
lem.
Chapter 4
Filtering Algorithms
This chapter continues the theme of numerical linear algebra, also covered in Chapter 1, 2, 10. We will
come across very special linear transformations (↔ matrices) and related algorithms. Surprisingly, these
form the basis of a host of very important numerical methods for signal processing.
X = X(t) =̂ time-continuous signal, 0 ≤ t ≤ T.
"Sampling": x_j = X(j∆t), j = 0, . . . , n − 1, n ∈ N, n∆t ≤ T, with ∆t > 0 =̂ time between samples.
(Fig. 127: sampled values x0, x1, x2, . . . , x_{n−2}, x_{n−1} of X(t) at times t0, t1, t2, . . . , t_{n−2}, t_{n−1}.)
As already indicated by the indexing the sampled values can be arranged in a vector x = [x0, . . . , x_{n−1}]^⊤ ∈ R^n.
Note that in this chapter, as is customary in signal processing, we adopt a C++-style indexing from 0: the
components of a vector with length n carry indices ∈ {0, . . . , n − 1}.
As an idealization one sometimes considers a signal of infinite duration X = X (t), −∞ < t < ∞. In
this case sampling yields a bi-infinite time-discrete signal, represented by a sequence ( xk )k∈Z . If this
sequence has a finite number of non-zero terms only, then we write (0, . . . , xℓ , xℓ+1, . . . , xn−1, xn , 0, . . .).
Contents
4.1 Discrete Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
4.2 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
4.2.1 Discrete Convolution via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
4.2.2 Frequency filtering via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
4.2.3 Real DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Now we study a finite linear time-invariant causal channel (filter), which is a widely used model for digital communication channels, e.g. in wireless communication theory. Mathematically speaking, a (discrete) channel/filter is a mapping F : ℓ^∞(Z) → ℓ^∞(Z) from the vector space ℓ^∞(Z) of bounded input sequences {x_j}_{j∈Z} to bounded output sequences {y_j}_{j∈Z}.
(Fig. 128: input signal (x_k) over time → channel/filter → output signal (y_k) over time.)
$$\text{Channel/filter:}\qquad F:\ \ell^\infty(\mathbb{Z})\to\ell^\infty(\mathbb{Z})\,,\qquad \big(y_j\big)_{j\in\mathbb{Z}} = F\big((x_j)_{j\in\mathbb{Z}}\big)\,. \qquad (4.1.2)$$
For the description of filters we rely on special input signals, analogous to the description of a linear
mapping R n 7→ R m through a matrix, that is, its action on unit vectors.
Visualization of the (finite!) impulse response (. . . , 0, h0, . . . , h_{n−1}, 0, . . .) of a (causal, see Def. 4.1.11 below) channel/filter: Fig. 130 shows the values h0, h1, h2, . . . , h_{n−2}, h_{n−1}.
➣ The impulse response of a finite filter can be described by a vector h of finite length n.
It should not matter when exactly a signal is fed into the channel. To express this intuition more rigorously we introduce the time shift operator for signals: for m ∈ Z
$$S_m:\ \ell^\infty(\mathbb{Z})\to\ell^\infty(\mathbb{Z})\,,\qquad S_m\big((x_j)_{j\in\mathbb{Z}}\big) = \big(x_{j-m}\big)_{j\in\mathbb{Z}}\,. \qquad (4.1.6)$$
A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called time-invariant, if shifting the input in time leads to the same
output shifted in time by the same amount; it commutes with the time shift operator from (4.1.6):
$$\forall\,(x_j)_{j\in\mathbb{Z}}\in\ell^\infty(\mathbb{Z})\,,\ \forall\, m\in\mathbb{Z}:\quad F\big(S_m((x_j)_{j\in\mathbb{Z}})\big) = S_m\big(F((x_j)_{j\in\mathbb{Z}})\big)\,. \qquad (4.1.8)$$
Of course, a signal should not trigger an output before it arrives at the filter; output may depend only on
past and present inputs, not on the future.
A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called causal (or physical, or nonanticipative), if the output does not
start before the input
$$\forall\, M\in\mathbb{N}:\quad (x_j)_{j\in\mathbb{Z}}\in\ell^\infty(\mathbb{Z})\,,\ x_j = 0\ \forall\, j\le M \quad\Rightarrow\quad F\big((x_j)_{j\in\mathbb{Z}}\big)_k = 0\ \forall\, k\le M\,. \qquad (4.1.12)$$
Acronym: LT-FIR =̂ finite (→ Def. 4.1.4), linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), and causal (→ Def. 4.1.11) filter F : ℓ^∞(Z) → ℓ^∞(Z).
The impulse response of a finite and causal filter is a sequence of the form (. . . , 0, h0, h1, . . . , h_{n−1}, 0, . . .), n ∈ N. Such an impulse response is depicted in Fig. 130.
Let (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N, be the impulse response (→ 4.1.3) of a finite (→ Def. 4.1.4),
linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), and causal (→ Def. 4.1.11) filter (LT-FIR) F : ℓ∞ (Z ) →
ℓ ∞ (Z ):
$$F\big((\delta_{j,0})_{j\in\mathbb{Z}}\big) = (\dots,0,h_0,h_1,\dots,h_{n-1},0,\dots)\,.$$
A finite input signal can be decomposed into shifted unit impulses,
$$(x_j)_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\,(\delta_{j,k})_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\, S_k\big((\delta_{j,0})_{j\in\mathbb{Z}}\big)\,,$$
where S_k is the time-shift operator from (4.1.6). Applying the filter on both sides of this equation and using linearity leads to the general formula for the output signal (y_j)_{j∈Z}:
$$\begin{bmatrix} y_0\\ y_1\\ \vdots\\ \vdots\\ y_{m+n-2}\end{bmatrix} = x_0\begin{bmatrix} h_0\\ h_1\\ \vdots\\ h_{n-1}\\ 0\\ \vdots\\ 0\end{bmatrix} + x_1\begin{bmatrix} 0\\ h_0\\ h_1\\ \vdots\\ h_{n-1}\\ \vdots\\ 0\end{bmatrix} + x_2\begin{bmatrix} 0\\ 0\\ h_0\\ \vdots\\ \vdots\\ h_{n-1}\\ \vdots\end{bmatrix} + \cdots + x_{m-1}\begin{bmatrix} 0\\ \vdots\\ 0\\ h_0\\ h_1\\ \vdots\\ h_{n-1}\end{bmatrix}\,.$$
Thus, in compact notation we can write the non-zero components of the output signal (y_j)_{j∈Z} as
$$y_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,,\quad k = 0,\dots,m+n-2 \qquad (h_j := 0\ \text{for}\ j<0\ \text{and}\ j\ge n)\,. \qquad (4.1.14)$$
(The channel is causal and finite!)
The output (. . . , 0, y0 , y1 , y2 , . . .) of a finite, time-invariant, linear, and causal channel for finite length
input x = (. . . , 0, x0, . . . , xn−1, 0, . . .) ∈ ℓ∞ (Z ) is
a superposition of x j -weighted j∆t time-shifted impulse responses.
The following diagrams give a visual display of considerations of § 4.1.13, namely of the superposition
of impulse responses for a particular finite, time-invariant, linear, and causal filter (LT-FIR), and an input
signal of duration 3∆t, ∆t =
ˆ time between samples.
(Fig. 131: input signal x, values x_i vs. index i of sampling instance t_i; Fig. 132: impulse response h, values h_i vs. index i of sampling instance t_i; Figs. 133–136: the x_j-weighted, time-shifted copies of the impulse response; Figs. 137–138: their superposition, i.e. the output signal.)
We consider a finite (→ Def. 4.1.4), linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), and causal (→
Def. 4.1.11) filter (LT-FIR) with impulse response (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N. From (4.1.14)
we learn that the output signal has at most m + n − 1 non-zero components.
Therefore, if we know that all input signals are of the form (. . . , x0, x1, . . . , x_{m−1}, 0, . . .), we can model them as vectors x = [x0, . . . , x_{m−1}]^⊤ ∈ R^m, cf. § 4.0.1, and the filter can be viewed as a linear mapping F : R^m → R^{m+n−1}, which takes us to the realm of linear algebra.
Thus, for the filter we have a matrix representation of (4.1.14). Writing y = [y0, . . . , y_{2n−2}]^⊤ ∈ R^{2n−1} for the vector of the output signal, we find in the case m = n
$$\begin{bmatrix} y_0\\ \vdots\\ \vdots\\ \vdots\\ y_{2n-2}\end{bmatrix} = \begin{bmatrix} h_0 & 0 & \cdots & 0\\ h_1 & h_0 & \ddots & \vdots\\ \vdots & h_1 & \ddots & 0\\ h_{n-1} & \vdots & \ddots & h_0\\ 0 & h_{n-1} & & h_1\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & h_{n-1}\end{bmatrix}\begin{bmatrix} x_0\\ \vdots\\ x_{n-1}\end{bmatrix}\,. \qquad (4.1.18)$$
“Surprisingly” the bilinear operation (4.1.14) that takes two input vectors and produces an output vector
with double the number of entries (−1) also governs the multiplication of polynomials:
$$p(z) = \sum_{k=0}^{n-1} a_k z^k\,,\quad q(z) = \sum_{k=0}^{n-1} b_k z^k \quad\Longrightarrow\quad (pq)(z) = \sum_{k=0}^{2n-2}\underbrace{\Big(\sum_{j=0}^{k} a_j\, b_{k-j}\Big)}_{=:c_k} z^k\,. \qquad (4.1.20)$$
➣ the coefficients of the product polynomial can be obtained through an operation similar to finite, time-
invariant, linear, and causal (LT-FIR) filtering!
Both in (4.1.14) and (4.1.20) we recognize the same pattern of a particular bi-linear combination of
• discrete signals in § 4.1.13,
• polynomial coefficient sequences in Ex. 4.1.19.
Definition 4.1.22. Discrete convolution
Given x = [x0, . . . , x_{n−1}]^⊤ ∈ K^n, h = [h0, . . . , h_{n−1}]^⊤ ∈ K^n, their discrete convolution is the vector y ∈ K^{2n−1} with components
$$y_k = \sum_{j=0}^{n-1} h_{k-j}\,x_j\,,\quad k = 0,\dots,2n-2 \qquad (h_j := 0\ \text{for}\ j<0)\,. \qquad (4.1.23)$$
Setting x_j := 0 for j < 0 and j ≥ n, we can rewrite
$$y_k = \sum_{j=0}^{n-1} h_{k-j}\,x_j = \sum_{l=0}^{n-1} h_l\,x_{k-l}\,,\quad k = 0,\dots,2n-2\,,\qquad\text{that is,}\quad h*x = x*h$$
(the LT-FIR filter with impulse response h0, . . . , h_{n−1} applied to the signal x yields the same output as the LT-FIR filter with impulse response x0, . . . , x_{n−1} applied to h).
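A direct Eigen implementation of (4.1.23) (a sketch, with our own function name; cost O(n²)):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Discrete convolution y = h * x of two n-vectors, y in R^{2n-1}, cf. (4.1.23).
VectorXd dconv(const VectorXd& h, const VectorXd& x) {
  const int n = h.size();
  VectorXd y = VectorXd::Zero(2 * n - 1);
  for (int k = 0; k < 2 * n - 1; ++k)
    for (int j = 0; j < n; ++j)
      if (k - j >= 0 && k - j < n) y(k) += h(k - j) * x(j);  // h_j := 0 outside
  return y;
}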
The formula (4.1.23) for the discrete convolution also occurs in a context completely detached from signal
processing.
Consider two polynomials in t of degree n − 1, n ∈ N,
$$p(t) = \sum_{j=0}^{n-1} a_j t^j\,,\qquad q(t) = \sum_{j=0}^{n-1} b_j t^j\,,\qquad a_j, b_j\in\mathbb{K}\,.$$
Let us introduce dummy coefficients for p(t) and q(t), a_j, b_j, j = n, . . . , 2n − 2, all set to 0. This can be easily done in a computer code by resizing the coefficient vectors of p and q and filling the new entries with zeros.
Moreover, this provides another proof for the commutativity of discrete convolution.
The notion of a discrete convolution of Def. 4.1.22 naturally extends to sequences ∈ ℓ∞ (N0 ), that is,
bounded mappings N0 7→ K: the (discrete) convolution of two sequences ( x j ) j∈N0 , (y j ) j∈N0 is the
sequence (z j ) j∈N0 defined by
$$z_k := \sum_{j=0}^{k} x_{k-j}\,y_j = \sum_{j=0}^{k} x_j\,y_{k-j}\,,\qquad k\in\mathbb{N}_0\,.$$
In this context recall the product formula for power series, the Cauchy product, which can be viewed as a multiplication rule for "infinite polynomials" = power series.
An n-periodic signal (n ∈ N) is a sequence x j j∈Z satisfying x j+n = x j ∀ j ∈ Z .
➣ An n-periodic signal (x_j)_{j∈Z} is uniquely determined by the values x0, . . . , x_{n−1} and can be associated with a vector x = [x0, . . . , x_{n−1}]^⊤ ∈ R^n.
Whenever the input signal of a finite, time-invariant filter is n-periodic, so will be the output signal. Thus,
in the n-periodic setting, a causal, linear, and time-invariant filter (LT-FIR) will give rise to a linear mapping
R n 7→ R n according to
n −1
yk = ∑ pk− j x j for some p0 , . . . , pn−1 ∈ R , pk := pk−n for all k ∈ Z . (4.1.30)
j =0
The following special variant of a discrete convolution operation is motivated by the preceding Rem. 4.1.29.
The discrete periodic convolution of two n-periodic sequences (p_k)_{k∈Z}, (x_k)_{k∈Z} yields the n-periodic sequence
$$(y_k) := (p_k) *_n (x_k)\,,\qquad y_k := \sum_{j=0}^{n-1} p_{k-j}\,x_j = \sum_{j=0}^{n-1} x_{k-j}\,p_j\,,\quad k\in\mathbb{Z}\,.$$
Since n-periodic sequences can be identified with vectors in K n (see above), we can also introduce the
discrete periodic convolution of vectors:
Beyond signal processing discrete periodic convolutions occur in many mathematical models:
An engineering problem (sketch: pipe cross-section, heated on one part of the boundary, cooled on the rest):
✦ cylindrical pipe,
✦ heated on part Γ_H of its perimeter (→ prescribed heat flux),
✦ cooled on remaining perimeter Γ_K (→ constant heat flux).
Task: compute local heat fluxes.
Modeling (discretization):
• approximation by a regular n-polygon, edges Γ_j,
• isotropic radiation of each edge Γ_j (power I_j),
• radiative heat flow Γ_j → Γ_i: P_{ji} := (α_{ij}/π) I_j, with opening angle α_{ij} = π γ_{|i−j|}, 1 ≤ i, j ≤ n,
• power balance:
$$\underbrace{\sum_{i=1,\,i\neq j}^{n} P_{ji}}_{=I_j} - \sum_{i=1,\,i\neq j}^{n} P_{ij} = Q_j\,. \qquad (4.1.35)$$
Q_j =̂ heat flux through Γ_j, satisfies
$$Q_j := \int_{2\pi(j-1)/n}^{2\pi j/n} q(\varphi)\,\mathrm{d}\varphi\,,\qquad q(\varphi) := \begin{cases}\text{local heating}\,, & \text{if }\varphi\in\Gamma_H\,,\\[2pt] -\dfrac{1}{|\Gamma_K|}\displaystyle\int_{\Gamma_H} q(\varphi)\,\mathrm{d}\varphi\ \text{(const.)}\,, & \text{if }\varphi\in\Gamma_K\,.\end{cases}$$
$$(4.1.35)\quad\Rightarrow\quad \text{LSE:}\qquad I_j - \sum_{i=1,\,i\neq j}^{n}\frac{\alpha_{ij}}{\pi}\, I_i = Q_j\,,\quad j = 1,\dots,n\,.$$
e.g. n = 8:
$$\begin{bmatrix} 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1\\ -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2\\ -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3\\ -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4\\ -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3\\ -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2\\ -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1\\ -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1\end{bmatrix}\begin{bmatrix} I_1\\ I_2\\ I_3\\ I_4\\ I_5\\ I_6\\ I_7\\ I_8\end{bmatrix} = \begin{bmatrix} Q_1\\ Q_2\\ Q_3\\ Q_4\\ Q_5\\ Q_6\\ Q_7\\ Q_8\end{bmatrix}\,. \qquad (4.1.36)$$
This is a linear system of equations with symmetric, singular, and (by Lemma 9.1.5, ∑ γi ≤ 1) positive
semidefinite (→ Def. 1.1.8) system matrix.
Note that the matrices from (4.1.31) and (4.1.36) have the same structure!
Also observe that the LSE from (4.1.36) can be written by means of the discrete periodic convolution (→
Def. 4.1.33) of vectors y = (1, −γ1 , −γ2 , −γ3 , −γ4 , −γ3 , −γ2 , −γ1 ), x = ( I1 , . . . , I8 )
(4.1.36) ↔ y ∗8 x = [ Q 1 , . . . , Q 8 ] ⊤ .
In Ex. 4.1.34 we have already seen a coefficient matrix of a special form, which is common enough to
warrant giving it a particular name:
✎ Notation: We write circul(p) ∈ K n,n for the circulant matrix generated by the periodic sequence/vector
p = [ p0 , . . . , p n −1 ] ⊤ ∈ K n
☞ Circulant matrix has constant (main, sub- and super-) diagonals (for which indices j − i = const.).
Write Z((uk )) ∈ K n,n for the circulant matrix generated by the n-periodic sequence (uk )k∈Z . Denote by
y := (y0 , . . . , yn−1 )⊤ , x = ( x0 , . . . , xn−1 )⊤ the vectors associated to n-periodic sequences.
Then the commutativity of the discrete periodic convolution (→ Def. 4.1.33) involves
circul(x)y = circul(y)x . (4.1.39)
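For illustration, a small Eigen helper (our own, not a lecture code) that assembles circul(p) entrywise:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Circulant matrix generated by p in R^n: (C)_{i,j} = p_{(i-j) mod n},
// so that C*x equals the discrete periodic convolution p *_n x.
MatrixXd circul(const VectorXd& p) {
  const int n = p.size();
  MatrixXd C(n, n);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      C(i, j) = p(((i - j) % n + n) % n);   // periodic index wrapping
  return C;
}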
Recall discrete convolution (→ Def. 4.1.22) of two vectors a = (a0 , . . . , an−1 )⊤ ∈ K n , b = (b0 , . . . , bn−1 )⊤ ∈
Kn .
n −1
zk := (a ∗ b)k = ∑ a j bk − j , k = 0, . . . , 2n − 2 .
j =0
(Fig. 139: periodic extension of the zero-padded vector a over the index range −n, 0, n, 2n, 3n, 4n.)
In the spirit of (4.1.18) we can switch to a matrix view of the reduction to periodic convolution:
$$\begin{bmatrix} z_0\\ \vdots\\ z_{2n-2}\end{bmatrix} = \underbrace{\operatorname{circul}\big([y_0,\,y_1,\dots,y_{n-1},\,0,\dots,0]^\top\big)}_{\text{a }(2n-1)\times(2n-1)\text{ circulant matrix!}}\;\begin{bmatrix} x_0\\ \vdots\\ x_{n-1}\\ 0\\ \vdots\\ 0\end{bmatrix}\,. \qquad (4.1.43)$$
Algorithms dealing with circulant matrices make use of their very special spectral properties. Full un-
derstanding requires familiarity with the theory of eigenvalues and eigenvectors of matrices from linear
algebra, see [?, Ch. 7], [?, Ch. 9].
(Experiment: eigenvalues and eigenvectors of circulant matrices with random entries generated by VectorXd::Random(n). Eigenvalue plots (legend: C1: real(ev), C1: imag(ev)) show little relationship between the (complex!) eigenvalues, as can be expected from random entries. Further plots: "Circulant matrix 1, eigenvector 5–8", "Circulant matrix 2, eigenvector 1–8", "random 256×256 circulant matrix, eigenvector 2, 3, 5, 8".)
An abstract result from linear algebra puts the surprising observation made in Exp. 4.2.1 in a wider context.
If A, B ∈ K n,n commute, that is, AB = BA, and A has n distinct eigenvalues, then the
eigenspaces of A and B coincide.
In Exp. 4.2.1 we saw that we get complex eigenvalues/eigenvectors for general circulant matrices. More
generally, in many cases real matrices can be diagonalized only in C, which is the ultimate reason for the
importance of complex numbers.
Complex numbers also allow an elegant handling of trigonometric functions: recall from analysis the unified treatment of trigonometric functions via the complex exponential function.
The field of complex numbers C is the natural framework for the analysis of linear, time-invariant
C! filters, and the development of algorithms for circulant matrices.
Now we verify by direct computations that circulant matrices all have a particular set of eigenvectors. This
will entail computing in C, cf. Rem. 4.2.15.
✎ notation: nth root of unity ωn := exp(−2πi/n ) = cos(2π/n) − i sin(2π/n ), n ∈ N
$$\sum_{k=0}^{n-1} q^k = \frac{1-q^n}{1-q}\qquad\forall\, q\in\mathbb{C}\setminus\{1\}\,,\ n\in\mathbb{N}\,. \qquad (4.2.9)$$
$$\Rightarrow\quad \sum_{k=0}^{n-1}\omega_n^{kj} = \frac{1-\omega_n^{nj}}{1-\omega_n^{j}} = \frac{1-\exp(-2\pi i j)}{1-\exp(-2\pi i j/n)} = 0\,,$$
because $\exp(-2\pi i j) = \omega_n^{nj} = (\omega_n^n)^j = 1$ for all $j\in\mathbb{Z}$.
! In expressions like ωnkl the term “kl ” will always designate an exponent and will never play
the role of a superscript.
Now we want to confirm the conjecture gleaned from Exp. 4.2.1 that vectors with powers of roots of unity
are eigenvectors for any circulant matrix. We do this by simple and straightforward computations:
Consider a circulant matrix C ∈ C^{n,n} (→ Def. 4.1.38) with c_{ij} = u_{i−j} for an n-periodic sequence (u_k)_{k∈Z}, u_k ∈ C.
We "guess" an eigenvector: $v_k := \big[\omega_n^{jk}\big]_{j=0}^{n-1}\in\mathbb{C}^n$, $k\in\{0,\dots,n-1\}$.
Since $(u_{j-l}\,\omega_n^{lk})_{l\in\mathbb{Z}}$ is n-periodic,
$$(Cv_k)_j = \sum_{l=0}^{n-1} u_{j-l}\,\omega_n^{lk} = \sum_{l=j-n+1}^{j} u_{j-l}\,\omega_n^{lk} = \sum_{l=0}^{n-1} u_l\,\omega_n^{(j-l)k} = \omega_n^{jk}\sum_{l=0}^{n-1} u_l\,\omega_n^{-lk} = \lambda_k\cdot\omega_n^{jk} = \lambda_k\cdot(v_k)_j\,, \qquad (4.2.10)$$
with $\lambda_k := \sum_{l=0}^{n-1} u_l\,\omega_n^{-lk}$.
The set {v0 , . . . , vn−1 } ⊂ C n provides the so-called orthogonal trigonometric basis of C n = eigen-
vector basis for circulant matrices
$$\{v_0,\dots,v_{n-1}\} = \left\{\begin{bmatrix}\omega_n^0\\ \omega_n^0\\ \vdots\\ \omega_n^0\end{bmatrix},\;\begin{bmatrix}\omega_n^0\\ \omega_n^1\\ \vdots\\ \omega_n^{n-1}\end{bmatrix},\;\begin{bmatrix}\omega_n^0\\ \omega_n^2\\ \vdots\\ \omega_n^{2(n-1)}\end{bmatrix},\;\dots,\;\begin{bmatrix}\omega_n^0\\ \omega_n^{n-1}\\ \vdots\\ \omega_n^{(n-1)^2}\end{bmatrix}\right\}\,. \qquad (4.2.11)$$
From (4.2.8) we can conclude orthogonality of the basis vectors by straightforward computations:
$$v_k := \big(\omega_n^{jk}\big)_{j=0}^{n-1}\in\mathbb{C}^n:\qquad v_k^H v_m = \sum_{j=0}^{n-1}\omega_n^{-jk}\,\omega_n^{jm} = \sum_{j=0}^{n-1}\omega_n^{(m-k)j} \overset{(4.2.8)}{=} 0\,,\quad\text{if } k\neq m\,. \qquad (4.2.12)$$
The matrix effecting the change of basis from the trigonometric basis to the standard basis is called the Fourier matrix
$$F_n = \begin{bmatrix}\omega_n^0 & \omega_n^0 & \cdots & \omega_n^0\\ \omega_n^0 & \omega_n^1 & \cdots & \omega_n^{n-1}\\ \omega_n^0 & \omega_n^2 & \cdots & \omega_n^{2(n-1)}\\ \vdots & \vdots & & \vdots\\ \omega_n^0 & \omega_n^{n-1} & \cdots & \omega_n^{(n-1)^2}\end{bmatrix} = \big[\omega_n^{lj}\big]_{l,j=0}^{n-1}\in\mathbb{C}^{n,n}\,. \qquad (4.2.13)$$
$$\big(F_nF_n^H\big)_{l,j} = \sum_{k=0}^{n-1}\omega_n^{(l-1)k}\,\overline{\omega_n^{(j-1)k}} = \sum_{k=0}^{n-1}\omega_n^{(l-1)k}\,\omega_n^{-(j-1)k} = \sum_{k=0}^{n-1}\omega_n^{k(l-j)}\,,\qquad 1\le l,j\le n\,.$$
For any circulant matrix C ∈ K^{n,n}, c_{ij} = u_{i−j}, with (u_k)_{k∈Z} an n-periodic sequence, holds true: C = F_n^{−1} diag(F_n u) F_n, where u := [u0, . . . , u_{n−1}]^⊤. (4.2.17)
Lemma 4.2.16, (4.2.17) ➣ multiplication with Fourier-matrix will be crucial operation in algorithms for
circulant matrices and discrete convolutions.
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj}\,,\qquad k = 0,\dots,n-1\,. \qquad (4.2.19)$$
Recall the convention also adopted for the discussion of the DFT: vector indexes range from 0 to n − 1!
From $F_n^{-1} = \frac{1}{n}\overline{F_n}$ (→ Lemma 4.2.14) we find the inverse discrete Fourier transform:
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,. \qquad (4.2.20)$$
#include <unsupported/Eigen/FFT>
Coding the formula from Def. 4.1.33 one would code discrete periodic convolution as follows:
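The original listing is not reproduced here; a straightforward sketch of such a direct implementation (with our own function name; cost O(n²)) could read:

#include <Eigen/Dense>
using Eigen::VectorXd;

// Discrete periodic convolution (Def. 4.1.33): z_k = sum_j u_{(k-j) mod n} * x_j.
VectorXd pconvBasic(const VectorXd& u, const VectorXd& x) {
  const int n = x.size();
  VectorXd z = VectorXd::Zero(n);
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      z(k) += u(((k - j) % n + n) % n) * x(j);  // periodic index wrapping
  return z;
}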
In § 4.1.37 we have seen that periodic convolution (→ Def. 4.1.33) amounts to multiplication with a cir-
culant matrix. In addition, (4.2.17) reduces multiplication with a circulant matrix to two multiplications with
the Fourier matrix Fn (= DFT) and (componentwise) scaling operations.
Summary:
$$\text{discrete periodic convolution}\quad z_k = \sum_{j=0}^{n-1} u_{k-j}\,x_j\ \ (\to\text{Def. 4.1.33})\,,\ k = 0,\dots,n-1$$
⇕
$$\text{multiplication with circulant matrix}\ (\to\text{Def. 4.1.38}):\quad z = Cx\,,\quad C := \big[u_{i-j}\big]_{i,j=1}^{n}\,.$$
Idea: (4.2.17) ➣ $z = F_n^{-1}\operatorname{diag}(F_n u)\,F_n x$, that is,
$$(u) *_n (x) := \sum_{j=0}^{n-1} u_{k-j}\,x_j = F_n^{-1}\Big[\big(F_n u\big)_j\big(F_n x\big)_j\Big]_{j=1}^{n}\,.$$
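This idea translates into a few lines of Eigen code based on the FFT module (a sketch with our own function name; by default Eigen's FFT returns the full spectrum for real input and scales the inverse transform):

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
using Eigen::VectorXd; using Eigen::VectorXcd;

// DFT-based discrete periodic convolution: z = F_n^{-1}((F_n u).*(F_n x)).
VectorXd pconvfft(const VectorXd& u, const VectorXd& x) {
  Eigen::FFT<double> fft;
  VectorXcd uh = fft.fwd(u), xh = fft.fwd(x);
  VectorXcd zh = uh.cwiseProduct(xh);
  VectorXcd z = fft.inv(zh);     // scaled inverse DFT
  return z.real();
}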
In Rem. 4.1.40 we learned that the discrete convolution of n-vectors (→ Def. 4.1.22) can be accomplished
by the periodic discrete convolution of 2n − 1-vectors (obtained by zero padding, see Rem. 4.1.40):
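A sketch of this reduction in Eigen (our own naming): zero-pad both vectors to length 2n−1 and reuse the FFT-based periodic convolution.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
using Eigen::VectorXd; using Eigen::VectorXcd;

// Discrete convolution (Def. 4.1.22) of two n-vectors via zero padding
// to length 2n-1 and DFT-based periodic convolution.
VectorXd fastconv(const VectorXd& h, const VectorXd& x) {
  const int n = h.size();
  VectorXd hp = VectorXd::Zero(2 * n - 1), xp = VectorXd::Zero(2 * n - 1);
  hp.head(n) = h; xp.head(n) = x;            // zero padding
  Eigen::FFT<double> fft;
  VectorXcd hh = fft.fwd(hp), xh = fft.fwd(xp);
  VectorXcd yh = hh.cwiseProduct(xh);
  VectorXcd y = fft.inv(yh);
  return y.real();
}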
The trigonometric basis vectors, when interpreted as time-periodic signals, represent harmonic oscilla-
tions. This is illustrated when plotting some vectors of the trigonometric basis (n = 16):
(Plots: Fourier-basis vectors for n = 16, j = 1, 7, 15; values vs. component index.)
Dominant coefficients of a signal after transformation to trigonometric basis indicate dominant fre-
quency components.
Terminology: coefficients of a signal w.r.t. trigonometric basis = signal in frequency domain, original
signal = time domain.
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,. \qquad (4.2.20)$$
Consider $y_k\in\mathbb{R}$ $\Rightarrow$ $c_k = \overline{c_{n-k}}$, because $\omega_n^{kj} = \overline{\omega_n^{(n-k)j}}$, and $n = 2m+1$:
$$n\,y_j = c_0 + \sum_{k=1}^{m} c_k\,\omega_n^{-kj} + \sum_{k=m+1}^{2m} c_k\,\omega_n^{-kj} = c_0 + \sum_{k=1}^{m}\Big(c_k\,\omega_n^{-kj} + c_{n-k}\,\omega_n^{(k-n)j}\Big) = c_0 + 2\sum_{k=1}^{m}\Big(\operatorname{Re}(c_k)\cos(2\pi kj/n) - \operatorname{Im}(c_k)\sin(2\pi kj/n)\Big)\,,$$
Extraction of characteristic frequencies from a distorted discrete periodic signal, generated by the following C++ code:
(Fig. 142: signal vs. sampling points (time); Fig. 143: power spectrum |c_k|² vs. coefficient index k.)
Fig. 143 was generated by the following M ATLAB code. A corresponding C++ code is also available
➺ GITLAB.
3 f i g u r e (’Name’,’power spectrum’);
4 bar (0:31,p(1:32),’r’);
5 s e t ( gca ,’fontsize’,14);
6 a x i s ([-1 32 0 max(p)+1]);
7 x l a b e l (’{\bf index k of Fourier coefficient}’,’Fontsize’,14);
8 y l a b e l (’{\bf |c_k|^2}’,’Fontsize’,14);
We observe that frequencies present in unperturbed signal become evident in frequency domain, whereas
it is hard to tell them from the time-domain signal.
The following C++ code processes actual web search data and performs a frequency analysis using DFT:
mgl::Figure trend;
trend.plot(x, "r");
trend.title("Google: 'Vorlesungsverzeichnis'");
trend.grid();
trend.xlabel("week (1.1.2004 - 31.12.2010)");
trend.ylabel("relative no. of searches");
trend.save("searchdata");

Eigen::FFT<double> fft;
VectorXcd c = fft.fwd(x);
VectorXd p = c.cwiseProduct(c.conjugate()).real()
              .segment(2, std::floor((n + 1.)/2));
Pronounced peaks in the power spectrum point to periodic structure of the data. Location of peaks tells
lengths of dominant periods.
Plots of real parts of trigonometric basis vectors (Fn ):,j (= columns of Fourier matrix), n = 16.
(Plots of the real parts of the trigonometric basis vectors (F_n)_{:,j}, n = 16, for various j, against the vector component index; Fig. 146.)
Slow oscillations/low frequencies ↔ j ≈ 1 and j ≈ n.
Fast oscillations/high frequencies ↔ j ≈ n/2.
Frequency filtering of real discrete periodic signals by suppressing certain “Fourier coefficients”.
VectorXcd clow = c;
// Set high frequency coefficients to zero, Fig. 146
for (int j = -k; j <= +k; ++j) clow(m + j) = 0;
// (Complementary) vector of high frequency coefficients
VectorXcd chigh = c - clow;
Noisy signal:
n = 256; y = exp(sin(2*pi*((0:n-1)’)/n)) + 0.5*sin(exp(1:n)’);
Frequency filtering by Code 4.2.33 with k = 120.
(Plots: signal, noisy signal, low pass filter output, and high pass filter output vs. time; power spectrum |c_k|² vs. no. of Fourier coefficient.)
Low pass filtering can be used for denoising, that is, the removal of high frequency perturbations of a
signal.
Frequency filtering is ubiquitous in sound processing. Here we demonstrate it in M ATLAB, which offers
tools for audio processing.
7 n = l e n g t h (y);
8 f p r i n t f (’Read wav File: %d samples, rate = %d/s, nbits = %d\n’,
n,Fs,nbits);
9 k = 1; s{k} = y; leg{k} = ’Sampled signal’;
10
11 c = fft(y);
12
13 f i g u r e (’name’,’sound signal’);
14 p l o t ((22000:44000)/Fs,s{1}(22000:44000),’r-’);
15 t i t l e (’samples sound signal’,’fontsize’,14);
16 x l a b e l (’{\bf time[s]}’,’fontsize’,14);
17 y l a b e l (’{\bf sound pressure}’,’fontsize’,14);
18 g r i d on;
19
22 f i g u r e (’name’,’sound frequencies’);
23 p l o t (1:n, abs (c).^2,’m-’);
24 t i t l e (’power spectrum of sound signal’,’fontsize’,14);
25 x l a b e l (’{\bf index k of Fourier coefficient}’,’fontsize’,14);
26 y l a b e l (’{\bf |c_k|^2}’,’fontsize’,14);
27 g r i d on;
28
31 f i g u r e (’name’,’sound frequencies’);
32 p l o t (1:3000, abs (c(1:3000)).^2,’b-’);
33 t i t l e (’low frequency power spectrum’,’fontsize’,14);
34 x l a b e l (’{\bf index k of Fourier coefficient}’,’fontsize’,14);
35 y l a b e l (’{\bf |c_k|^2}’,’fontsize’,14);
36 g r i d on;
37
39
40 f o r m=[1000,3000,5000]
41
53 k = k+1;
54 s{k} = r e a l (yf);
55 leg{k} = sprintf('cut-off = %d',m);
56 end
57
(Fig. 147: sampled sound signal, sound pressure vs. time [s]; Fig. 148: power spectrum of sound signal, |c_k|² vs. index k of Fourier coefficient; further plots: low frequency power spectrum and filtered sound signals.)
The power spectrum of a signal $y\in\mathbb{C}^n$ is the vector $\big(|c_j|^2\big)_{j=0}^{n-1}$, where $c = F_n y$ is the discrete Fourier transform of y.
Every time-discrete signal obtained from sampling a time-dependent physical quantity will yield a real vector. Of course, a real vector contains only half the information compared to a complex vector of the same length. We aim to exploit this for a more efficient implementation of the DFT.
Task: Efficient implementation of DFT (Def. 4.2.18) (c0 , . . . , cn−1 ) for real coefficients (y0 , . . . , yn−1 )⊤ ∈
R n , n = 2m, m ∈ N.
If $y_j\in\mathbb{R}$ in the DFT formula (4.2.19), we obtain redundant output: since $\omega_n^{(n-k)j} = \overline{\omega_n^{kj}}$, $k = 0,\dots,n-1$,
$$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} = \overline{\sum_{j=0}^{n-1} y_j\,\omega_n^{(n-k)j}} = \overline{c_{n-k}}\,,\qquad k = 1,\dots,n-1\,.$$
With $h_k$ the DFT of the complex vector $\big(y_{2j}+i\,y_{2j+1}\big)_{j=0}^{m-1}$ we get
$$\overline{h_{m-k}} = \overline{\sum_{j=0}^{m-1}\big(y_{2j}+i\,y_{2j+1}\big)\omega_m^{j(m-k)}} = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} - i\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,, \qquad (4.2.39)$$
$$\Rightarrow\quad \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} = \tfrac{1}{2}\big(h_k + \overline{h_{m-k}}\big)\,,\qquad \sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk} = -\tfrac{1}{2}\,i\,\big(h_k - \overline{h_{m-k}}\big)\,.$$
Eigen::FFT<double> fft;
VectorXcd d = fft.fwd(yc), h(m + 1);
h << d, d(0);

c.resize(n);
// Step II: implementation of (4.2.41)
for (unsigned k = 0; k < m; ++k) {
  c(k) = (h(k) + std::conj(h(m-k)))/2. -
         i/2.*std::exp(-2.*k/n*M_PI*i)*(h(k) - std::conj(h(m-k)));
}
c(m) = std::real(h(0)) - std::imag(h(0));
for (unsigned k = m+1; k < n; ++k) c(k) = std::conj(c(n-k));
}
In this section we study the frequency decomposition of matrices, exploiting the natural analogy between vectors (one-dimensional signals) and matrices (two-dimensional signals).
Let a matrix C ∈ C m,n be given as a linear combination of these basis matrices with coefficients y j1 ,j2 ∈ C,
0 ≤ j1 < m, 0 ≤ j2 < n:
$$C = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(F_m)_{:,j_1}\big((F_n)_{:,j_2}\big)^\top\,. \qquad (4.2.45)$$
Then the entries of C can be computed by two nested discrete Fourier transforms:
$$(C)_{k_1,k_2} = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,\omega_m^{j_1 k_1}\,\omega_n^{j_2 k_2} = \sum_{j_1=0}^{m-1}\omega_m^{j_1 k_1}\Big(\sum_{j_2=0}^{n-1}\omega_n^{j_2 k_2}\,y_{j_1,j_2}\Big)\,,\qquad 0\le k_1<m\,,\ 0\le k_2<n\,.$$
The coefficients can also be regarded as entries of a matrix Y ∈ C^{m,n}. Thus we can rewrite the above expressions: for all $0\le k_1<m$, $0\le k_2<n$
$$(C)_{k_1,k_2} = \sum_{j_1=0}^{m-1}\big(F_n\,(Y)_{j_1,:}^\top\big)_{k_2}\,\omega_m^{j_1 k_1} \qquad\Longleftrightarrow\qquad C = F_m\big(F_n Y^\top\big)^\top = F_m\,Y\,F_n\,. \qquad (4.2.46)$$
This formula defines the two-dimensional discrete Fourier transform of the matrix Y ∈ C^{m,n}. By Lemma 4.2.14 we immediately get the inversion formula:
$$C = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(F_m)_{:,j_1}\big((F_n)_{:,j_2}\big)^\top \quad\Rightarrow\quad Y = F_m^{-1}\,C\,F_n^{-1} = \tfrac{1}{mn}\,\overline{F_m}\,C\,\overline{F_n}\,. \qquad (4.2.47)$$
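Using only one-dimensional transforms from Eigen's FFT module, a two-dimensional DFT can be sketched as column transforms followed by row transforms (the function name is ours; this mirrors the role of Code 4.2.48, which is referenced but not reproduced here):

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
using Eigen::MatrixXcd; using Eigen::VectorXcd;

// Two-dimensional DFT C = F_m * Y * F_n, cf. (4.2.46): 1D DFTs of all
// columns (F_m * Y), then 1D DFTs of all rows (multiplication by F_n).
MatrixXcd fft2(const MatrixXcd& Y) {
  const int m = Y.rows(), n = Y.cols();
  MatrixXcd tmp(m, n), C(m, n);
  Eigen::FFT<double> fft;
  for (int j = 0; j < n; ++j) {          // transform all columns
    VectorXcd col = Y.col(j);
    VectorXcd ct = fft.fwd(col);
    tmp.col(j) = ct;
  }
  for (int i = 0; i < m; ++i) {          // transform all rows
    VectorXcd row = tmp.row(i).transpose();
    VectorXcd rt = fft.fwd(row);
    C.row(i) = rt.transpose();
  }
  return C;
}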
The corresponding bilinear operation on matrices,
$$\big(B(X,Y)\big)_{k,\ell} := \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,(Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n}\,,\qquad 0\le k<m\,,\ 0\le\ell<n\,, \qquad (4.2.52)$$
defines the two-dimensional discrete periodic convolution, cf. Def. 4.1.33.
template <typename EigenMatrix>
void pmconv(const EigenMatrix &X, const EigenMatrix &Y, EigenMatrix &Z) {
  using idx_t = typename EigenMatrix::Index;
  using val_t = typename EigenMatrix::Scalar;
  const idx_t n = X.cols(), m = X.rows();
  if ((m != Y.rows()) || (n != Y.cols()))
    throw std::runtime_error("pmconv: size mismatch");
  Z.resize(m, n);
  // Implementation of (4.2.52)
  auto idxwrap = [](const idx_t L, int i) {
    if (i >= L) i -= L; else if (i < 0) i += L;
    return i;
  };
  for (int i = 0; i < m; i++) for (int j = 0; j < n; j++) {
    val_t s = 0;
    for (int k = 0; k < m; k++) for (int l = 0; l < n; l++)
      s += X(k, l) * Y(idxwrap(m, i - k), idxwrap(n, j - l));
    Z(i, j) = s;
  }
}
The 2D discrete periodic convolution admits a diagonalization by switching to the trigonometric basis of
C m,n , analogous to (4.2.17), see Section 4.2.1.
In (4.2.52) set $Y = (F_m)_{:,r}\,(F_n)_{s,:}\in\mathbb{C}^{m,n}$, i.e. $(Y)_{i,j} = \omega_m^{ri}\,\omega_n^{sj}$, $0\le i<m$, $0\le j<n$:
$$\big(B(X,Y)\big)_{k,\ell} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,(Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,\omega_m^{r(k-i)}\,\omega_n^{s(\ell-j)} = \Big(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,\overline{\omega_m^{ri}}\,\overline{\omega_n^{sj}}\Big)\cdot\omega_m^{rk}\,\omega_n^{s\ell}\,.$$
$$B\big(X,\underbrace{(F_m)_{:,r}(F_n)_{s,:}}_{\text{“eigenvector”}}\big) = \underbrace{\Big(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(X)_{i,j}\,\overline{\omega_m^{ri}}\,\overline{\omega_n^{sj}}\Big)}_{\text{see Eq. (4.2.45)}}\,(F_m)_{:,r}(F_n)_{s,:}\,. \qquad (4.2.54)$$
Hence, the (complex conjugated) two-dimensional discrete Fourier transform of X according to (4.2.45) provides the eigenvalues of the linear mapping Y ↦ B(X, Y), X ∈ C^{m,n} fixed.
This suggests the following DFT-based algorithm for evaluating the periodic convolution of matrices:
➊ Compute Ŷ by inverse 2D DFT of Y, see Code 4.2.49
➋ Compute X̂ by 2D DFT of X, see Code 4.2.48.
➌ Component-wise multiplication of X̂ and Ŷ: Ẑ = X̂. ∗ Ŷ.
➍ Compute Z through inverse 2D DFT of Ẑ.
2D discrete convolutions are important for image processing. Let a Gray-scale pixel image be stored in
the matrix P ∈ R m,n , actually P ∈ {0, . . . , 255}m,n , see also Ex. 9.3.24.
Blurring = pixel values get replaced by weighted averages of near-by pixel values
(effect of distortion in optical transmission systems)
$$c_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,p_{l+k,j+q}\,,\qquad 0\le l<m\,,\ 0\le j<n\,,\quad L\in\{1,\dots,\min\{m,n\}\}\,. \qquad (4.2.57)$$
Does this ring a bell? Hidden in (4.2.57) is a 2D discrete periodic convolution, see Eq. (4.2.52). In light of
the algorithm implemented in Code 4.2.55 it is hardly surprising that DFT comes handy for reversing the
effect of the blurring!
Note that usually: L is small, s_{k,q} ≥ 0, and ∑_{k=−L}^{L} ∑_{q=−L}^{L} s_{k,q} = 1 (an averaging).
In the experiments we used L = 5 and the PSF s_{k,q} = 1/(1 + k² + q²).
MatrixXd C(m, n);
for (long l = 1; l <= m; ++l) {
  for (long j = 1; j <= n; ++j) {
    double s = 0;
    for (long k = 1; k <= (2*L+1); ++k) {
      for (long q = 1; q <= (2*L+1); ++q) {
        double kl = l + k - L - 1;
        if (kl < 1) kl += m;
        else if (kl > m) kl -= m;
        double jm = j + q - L - 1;
        if (jm < 1) jm += n;
        else if (jm > n) jm -= n;
        s += P(kl-1, jm-1)*S(k-1, q-1);
      }
    }
    C(l-1, j-1) = s;
  }
}
return C;
}
Now we revisit the considerations of § 4.2.43 and recall the derivation of (4.2.10) and Lemma 4.2.16.
$$\Big(B\big((\omega_m^{\nu k}\omega_n^{\mu q})_{k,q\in\mathbb{Z}}\big)\Big)_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu(l+k)}\,\omega_n^{\mu(j+q)} = \omega_m^{\nu l}\,\omega_n^{\mu j}\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}\,.$$
➣ $V_{\nu,\mu} := \big(\omega_m^{\nu k}\,\omega_n^{\mu q}\big)_{k,q\in\mathbb{Z}}$, $0\le\nu<m$, $0\le\mu<n$, are the eigenvectors of B:
$$B\,V_{\nu,\mu} = \lambda_{\nu,\mu}\,V_{\nu,\mu}\,,\qquad \text{eigenvalue}\quad \lambda_{\nu,\mu} = \underbrace{\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}}_{\text{2-dimensional DFT of the point spread function!}} \qquad (4.2.60)$$
Thus the inversion of the blurring operator boils down to componentwise scaling in the "Fourier domain"; see also Code 4.2.55 for the same idea.
Starting from Rem. 4.1.29 we mainly looked at time-discrete n-periodic signals, which can be mapped to
vectors ∈ R n . This led to discrete periodic convolution (→ Def. 4.1.33) and the discrete Fourier transform
(DFT) (→ Def. 4.2.18) as (bi-)linear mappings in C n .
In this section we are concerned with non-periodic signals of infinite duration as introduced in § 4.0.1.
Idea: Study the limit n → ∞ for the n-periodic setting and DFT.
Now we associate a point t_k ∈ [0, 1[ with each index k of the components of the transformed signal (c_k)_{k=0}^{n−1}:
$$k\in\{0,\dots,n-1\}\quad\longleftrightarrow\quad t_k := \frac{k}{n}\,. \qquad (4.2.63)$$
"Squeezing" a vector ∈ R^n into [0, 1[ (Fig. 153): we can read the values c_k as sampled values of a function defined on [0, 1[,
$$c_k \;\leftrightarrow\; c(t_k)\,,\qquad t_k = \frac{k}{n}\,,\quad k = 0,\dots,n-1\,.$$
This makes it possible to pass from a discrete finite signal to a continuous signal. The notation indicates that we read c_k as the value of a function c : [0, 1[ → C for the argument t_k.
Bi-infinite discrete signal, "concentrated around 0": $y_j = \dfrac{1}{1+j^2}$, $j\in\mathbb{Z}$.
We examine the DFT of the (2m+1)-periodic signal obtained by periodic extension of $(y_k)_{k=-m}^{m}$.
C++11 code 4.2.66: Plotting a periodically truncated signal and its DFT ➺ GITLAB
Now we pass to the limit m → ∞ (and keep the function perspective c_k = c(t_k)), which leads to

c(t) = ∑_{k∈Z} y_k exp(−2πıkt) .   (4.2.67)

Terminology: The series (= infinite sum) on the right-hand side of (4.2.67) is called a Fourier series.
The function c : [0, 1[ → C defined by (4.2.67) is called the Fourier transform of the sequence (y_k)_{k∈Z} (if the series converges).
Fourier transform = weighted sum of Fourier modes t ↦ exp(−2πıkt), k ∈ Z.
(Fig. 162: the Fourier transform as a superposition of individual Fourier modes, shown in separate panels.)
c(t) = ∑_{k∈Z} 1/(1 + k²) · exp(−2πıkt) = π/(e^π − e^{−π}) · ( e^{π−2πt} + e^{2πt−π} ) ∈ C^∞([0, 1]) .

Note that when considered as a 1-periodic function on R, this c(t) is merely continuous.
(4.2.71) ⇒ the Fourier series (4.2.67) converges uniformly [?, Def. 4.8.1]
⇒ c : [0, 1[ → C is continuous [?, Thm. 4.8.1].
Assuming sufficiently fast decay of the signal (yk )k∈Z for k → ∞ (→ Rem. 4.2.69), we can approximate
the Fourier series (4.2.67) by a Fourier sum
c(t) ≈ c_M(t) := ∑_{k=−M}^{M} y_k exp(−2πikt) ,   M ≫ 1 .   (4.2.73)
Task: Approximate evaluation of c(t) at N equidistant points t_j := j/N, j = 0, …, N − 1 (e.g., for plotting it).

c(t_j) = lim_{M→∞} ∑_{k=−M}^{M} y_k exp(−2πikt_j) ≈ ∑_{k=−M}^{M} y_k exp(−2πi kj/N) ,   (4.2.74)

for j = 0, …, N − 1.
C++11 code 4.2.75: DFT-based evaluation of Fourier sum at equidistant points ➺ GITLAB
2  #include "feval.hpp" // evaluate scalar function with a vector
3  // DFT based approximate evaluation of Fourier series
4  // signal is a handle to a function providing the yk
5  // M specifies truncation of series according to (4.2.73)
6  // N is the number of equidistant evaluation points for c in [0, 1[.
7  template <class Function>
8  VectorXcd foursum(const Function &signal, int M, int N) {
9    const int m = 2*M + 1; // length of the signal
10   // sample signal
11   VectorXd y = feval(signal, VectorXd::LinSpaced(m, -M, M));
12   // Ensure that there are more sampling points than terms in series
13   int l; if (m > N) { l = ceil(double(m)/N); N *= l; } else l = 1;
14   // Zero padding and wrapping of signal, see Code 4.2.33
15   VectorXd y_ext = VectorXd::Zero(N);
16   y_ext.head(M+1) = y.tail(M+1);
17   y_ext.tail(M) = y.head(M);
18   // Perform DFT and decimate output vector
19   Eigen::FFT<double> fft;
20   Eigen::VectorXcd k = fft.fwd(y_ext), c(N/l);
21   for (int i = 0; i < N/l; ++i) c(i) = k(i*l);
22   return c;
23 }
Infinite signal, satisfying the decay condition (4.2.71):  y_k = 1/(1 + k²) , see Ex. 4.2.65.
Monitored: approximation of the Fourier transform c(t) by Fourier sums c_m(t), see (4.2.73).
(Fig. 163: Fourier transform c(t) of (1/(1+k²))_{k∈Z}.  Fig. 164: Fourier sum approximations c_m(t) with 2m+1 terms, y_k = 1/(1+k²), for m = 2, 4, 8, 16, 32.)
Observation: Convergence of Fourier sums in “eyeball norm”; quantitative statements about convergence
can be deduced from Thm. 4.2.89.
y_j = (1/n) ∑_{k=0}^{n−1} c_k exp(2πi jk/n) ,   j = −m, …, m .   (4.2.77)

y_j = (1/n) ∑_{k=0}^{n−1} c(t_k) exp(2πij t_k) ,   j = −m, …, m .   (4.2.78)

Idea: the right-hand side of (4.2.78) = Riemann sum, cf. [?, Sect. 6.2]

y_j = ∫_0^1 c(t) exp(2πijt) dt .   (4.2.79)
The formula (4.2.79) allows us to recover the signal (y_k)_{k∈Z} from its Fourier transform c(t).
Terminology: y_j from (4.2.79) is called the j-th Fourier coefficient of the function c.
✎ Notation: ĉ_j := y_j with y_j defined by (4.2.79) ≙ j-th Fourier coefficient of c : [0, 1[ → C
Summary of a fundamental correspondence between a (continuous) function c : [0, 1[ → C and a (bi-infinite) sequence (ĉ_j)_{j∈Z}:

ĉ_j = ∫_0^1 c(t) exp(2πıjt) dt   (Fourier coefficients of c) ,
c(t) = ∑_{k∈Z} ĉ_k exp(−2πıkt)   (Fourier series / Fourier transform of the sequence) .
What happens to the Fourier transform of a bi-infinite signal, if it passes through a channel?
Consider a (bi-)infinite signal (x_k)_{k∈Z} sent through a finite (→ Def. 4.1.4), linear (→ Def. 4.1.9), time-invariant (→ Def. 4.1.7), causal (→ Def. 4.1.11) channel with impulse response (→ § 4.1.3) (…, 0, h_0, …, h_{n−1}, 0, …) (→ § 4.1.1).
c(t) = ∑_{k∈Z} y_k exp(−2πıkt) = ∑_{k∈Z} ∑_{j=0}^{n−1} h_j x_{k−j} exp(−2πıkt)
     [shift summation index k]  = ∑_{j=0}^{n−1} ∑_{k∈Z} h_j x_k exp(−2πıjt) exp(−2πıkt)   (4.2.82)
     = ( ∑_{j=0}^{n−1} h_j exp(−2πıjt) ) b(t) ,

where b(t) = ∑_{k∈Z} x_k exp(−2πıkt) is the Fourier transform of the input signal, and the factor in brackets is a trigonometric polynomial of degree n − 1.
Lemma 4.2.14 ➣ for the Fourier matrix F_n, see (4.2.13), (1/√n) F_n is unitary (→ Def. 6.2.2). By Thm. 3.3.5,

‖ (1/√n) F_n y ‖_2 = ‖y‖_2 .   (4.2.86)

Since the DFT boils down to multiplication with F_n (→ Def. 4.2.18), we conclude from (4.2.86), with c_k from (4.2.62),

(1/n) ∑_{k=0}^{n−1} |c_k|² = ∑_{j=−m}^{m} |y_j|² .   (4.2.87)
Now we adopt the function perspective again and associate c_k ↔ c(t_k). Then we pass to the limit m → ∞, appeal to Riemann summation (see above), and conclude

(4.2.87)  ⟹ (m → ∞)  ∫_0^1 |c(t)|² dt = ∑_{j∈Z} |y_j|² .   (4.2.88)
Recalling the concept of the L2 -norm of a function, see (5.2.67), the theorem can be stated as follows:
Thm. 4.2.89 ↔ The L2 -norm of a Fourier transform agrees with the Euclidean norm of the
corresponding signal.
Note: the Euclidean norm of a sequence is  ‖(y_k)_{k∈Z}‖_2² := ∑_{k∈Z} |y_k|² .
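A quick stand-alone numerical check of the discrete identity (4.2.86)/(4.2.87) using Eigen's FFT module (illustrative sketch only):

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

int main() {
  const int n = 64;
  Eigen::VectorXd y = Eigen::VectorXd::Random(n);
  Eigen::VectorXcd c;
  Eigen::FFT<double> fft;
  fft.fwd(c, y);                      // unscaled forward DFT, c = F_n y
  std::cout << "lhs = " << c.squaredNorm() / n
            << ", rhs = " << y.squaredNorm() << std::endl;  // should agree
  return 0;
}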
You might have been wondering why the reduction to DFTs received so much attention in Section 4.2.1. An explanation is given now.
Supplementary reading. [?, Sect. 8.7.3], [?, Sect. 53], [?, Sect. 10.9.2]
At first glance (at (4.2.19)): DFT in C n seems to require asymptotic computational effort of O(n2 ) (matrix×vector
multiplication with dense matrix).
(Figure: timings in MATLAB — runtime of the DFT plotted against the vector length n, up to n = 3000, on a logarithmic scale.)
For n = 2m the DFT (4.2.19) can be split into even- and odd-indexed terms:

c_k = ∑_{j=0}^{n−1} y_j e^{−2πı jk/n}
    = ∑_{j=0}^{m−1} y_{2j} e^{−2πı jk/m} + e^{−2πı k/n} · ∑_{j=0}^{m−1} y_{2j+1} e^{−2πı jk/m}   (4.3.4)
    =: c̃_k^{even} + e^{−2πı k/n} · c̃_k^{odd}     (note e^{−2πı jk/m} = ω_m^{jk}) .

Note: c̃_k^{even}, c̃_k^{odd} come from DFTs of length m!

with y_even := (y_0, y_2, …, y_{n−2})^⊤ ∈ C^m :   (c̃_k^{even})_{k=0}^{m−1} = F_m y_even ,
with y_odd  := (y_1, y_3, …, y_{n−1})^⊤ ∈ C^m :   (c̃_k^{odd})_{k=0}^{m−1} = F_m y_odd .

(4.3.4):  DFT of length 2m = 2× DFT of length m + 2m additions & multiplications
FFT-algorithm: apply (4.3.4) recursively; for n = 2^L the recursion terminates with 2^L DFTs of length 1.
In the recursive implementation (Code 4.3.5) each level of the recursion requires O(2^L) elementary operations, and there are L levels.
Asymptotic complexity of the FFT algorithm: O(n log₂ n) for n = 2^L. A recursive sketch is given below.
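A minimal recursive sketch implementing the splitting (4.3.4) for n = 2^L (illustrative only; the lecture's Code 4.3.5 may differ in details, and the name fft_recursive is chosen here for illustration):

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Recursive radix-2 FFT realizing (4.3.4); requires n = 2^L, complexity O(n log n).
std::vector<std::complex<double>>
fft_recursive(const std::vector<std::complex<double>> &y) {
  const std::size_t n = y.size();
  if (n == 1) return y;                  // DFT of length 1 is the identity
  const std::size_t m = n / 2;
  std::vector<std::complex<double>> yeven(m), yodd(m);
  for (std::size_t j = 0; j < m; ++j) {  // split into even/odd samples
    yeven[j] = y[2 * j];
    yodd[j] = y[2 * j + 1];
  }
  const auto ceven = fft_recursive(yeven);   // two DFTs of length m
  const auto codd = fft_recursive(yodd);
  std::vector<std::complex<double>> c(n);
  const double pi = std::acos(-1.0);
  for (std::size_t k = 0; k < n; ++k) {      // combine according to (4.3.4)
    const double arg = -2.0 * pi * double(k) / double(n);
    const std::complex<double> w(std::cos(arg), std::sin(arg));
    c[k] = ceven[k % m] + w * codd[k % m];
  }
  return c;
}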
For n = 2m, m ∈ N, as ω_n^{2j} = ω_m^j, the splitting (4.3.4) can be expressed as a matrix factorization:
P_m^{OE} F_n = [ F_m                                          F_m                                             ]
               [ F_m · diag(ω_n^0, ω_n^1, …, ω_n^{n/2−1})      F_m · diag(ω_n^{n/2}, ω_n^{n/2+1}, …, ω_n^{n−1}) ]

             = [ F_m   0   ] · [ I                                       I                                        ]
               [ 0     F_m ]   [ diag(ω_n^0, ω_n^1, …, ω_n^{n/2−1})      −diag(ω_n^0, ω_n^1, …, ω_n^{n/2−1})      ] ,

where P_m^{OE} denotes the permutation of the rows of F_n that puts the even-indexed rows first (and we used ω_n^{n/2+j} = −ω_n^j).
P_5^{OE} F_10 =   (ω := ω_10)
ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0
ω0 ω2 ω4 ω6 ω8 ω0 ω2 ω4 ω6 ω8
ω0 ω4 ω8 ω2 ω6 ω0 ω4 ω8 ω2 ω6
ω0 ω6 ω2 ω8 ω4 ω0 ω6 ω2 ω8 ω4
ω0 ω8 ω6 ω4 ω2 ω0 ω8 ω6 ω4 ω2
ω0 ω1 ω2 ω3 ω4 ω5 ω6 ω7 ω8 ω9
ω0 ω3 ω6 ω9 ω2 ω5 ω8 ω1 ω4 ω7
ω0 ω5 ω0 ω5 ω0 ω5 ω0 ω5 ω0 ω5
ω0 ω7 ω4 ω1 ω8 ω5 ω2 ω9 ω6 ω3
ω0 ω9 ω8 ω7 ω6 ω5 ω4 ω3 ω2 ω1
To compute an n-point DFT when n is composite (that is, when n = pq), the FFTW library decomposes the
problem using the Cooley-Tukey algorithm, which first computes p transforms of size q, and then computes
q transforms of size p. The decomposition is applied recursively to both the p- and q-point DFTs until the
problem can be solved using one of several machine-generated fixed-size "codelets." The codelets in turn
use several algorithms in combination, including a variation of Cooley-Tukey, a prime factor algorithm, and
a split-radix algorithm. The particular factorization of n is chosen heuristically.
The execution time for fft depends on the length of the transform. It is fastest for powers of two. It is
almost as fast for lengths that have only small prime factors. It is typically several times slower for
lengths that are prime or which have large prime factors → Ex. 4.3.12.
c_k = ∑_{j=0}^{n−1} y_j ω_n^{jk}  [j =: lp + m]  = ∑_{m=0}^{p−1} ∑_{l=0}^{q−1} y_{lp+m} e^{−2πı (lp+m)k/(pq)} = ∑_{m=0}^{p−1} ω_n^{mk} ∑_{l=0}^{q−1} y_{lp+m} ω_q^{l(k mod q)} .   (4.3.9)

Step I: perform p DFTs of length q:  z_{m,k} := ∑_{l=0}^{q−1} y_{lp+m} ω_q^{lk} ,  0 ≤ m < p, 0 ≤ k < q.
Step II: multiply by the twiddle factors ω_n^{mk} and perform q DFTs of length p (the original figure illustrates the p×q data layouts used in Step I and Step II).
When n ≠ 2^L, even the Cooley-Tukey algorithm of Rem. 4.3.8 will eventually lead to a DFT for a vector with prime length.
Quoted from the M ATLAB manual:
When n is a prime number, the FFTW library first decomposes an n-point problem into three (n − 1)-point
problems using Rader’s algorithm [?]. It then uses the Cooley-Tukey decomposition described above to
compute the (n − 1)-point DFTs.
For the Fourier matrix F_p = (f_{ij})_{i,j=1}^{p}, the lower-right block (f_{ij})_{i,j=2}^{p} becomes a circulant matrix after a suitable permutation P_{p,g} of its rows and columns (g a generator of the multiplicative group modulo p).
F_13 −→   (after permutation, ω := ω_13)
ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0
ω0 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1
ω0 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7
ω0 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10
ω0 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5
ω0 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9
ω0 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12 ω 11
ω0 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6 ω 12
ω0 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3 ω6
ω0 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8 ω3
ω0 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4 ω8
ω0 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2 ω4
ω0 ω4 ω8 ω3 ω6 ω 12 ω 11 ω9 ω5 ω 10 ω7 ω1 ω2
Then apply fast algorithms for multiplication with circulant matrices (= discrete periodic convolution, see § 4.1.37) to the lower-right (n − 1) × (n − 1) block of the permuted Fourier matrix. These fast algorithms rely on DFTs of length n − 1, see Code 4.2.25.
(← Section 4.2.1)
Asymptotic complexity of discrete periodic convolution, see Code 4.2.25:
Cost(pconvfft(u,x), u, x ∈ C^n) = O(n log n).
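A minimal sketch of the idea behind such a discrete periodic convolution routine (illustrative only, not necessarily identical to the lecture's Code 4.2.25): a circulant matrix is diagonalized by the Fourier basis, so the product reduces to three FFTs.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

// z_k = sum_j u_j x_{(k-j) mod n}, computed as z = ifft( fft(u) .* fft(x) )
Eigen::VectorXcd pconvfft(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x) {
  Eigen::FFT<double> fft;
  Eigen::VectorXcd uh, xh, z;
  fft.fwd(uh, u);
  fft.fwd(xh, x);
  Eigen::VectorXcd zh = uh.cwiseProduct(xh);
  fft.inv(z, zh);                   // Eigen's inv includes the 1/n scaling
  return z;
}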
From FFTW homepage: FFTW is a C subroutine library for computing the dis-
crete Fourier transform (DFT) in one or more dimensions, of arbitrary input size,
and of both real and complex data.
FFTW will perform well on most architectures without modification. Hence
the name, "FFTW," which stands for the somewhat whimsical title of “Fastest
Fourier Transform in the West.”
Supplementary reading. [?] offers a comprehensive presentation of the design and implemen-
tation of the FFTW library (version 3.x). This paper also conveys the many tricks it takes to achieve
satisfactory performance for DFTs of arbitrary length.
FFTW can be installed from source following the instructions from the installation page after downloading
the source code of FFTW 3.3.5 from the download page. Precompiled binaries for various linux distribu-
tions are available in their main package repositories:
Platform:
✦ Linux (Ubuntu 16.04 64bit)
✦ Intel(R) Core(TM) i7-4600U CPU @
2.10GHz
✦ L2 256KB, L3 4 MB, 8 GB DDR3 @ 1.60GHz
✦ Clang 3.8.0, -O3
For reasonably high input sizes the FFTW backend
gives, compared to E IGEN’s default backend (Kiss
FFT), a speedup of 2-4x.
Supplementary reading. [?, Sect. 55], see also [?] for an excellent presentation of various
Keeping in mind exp(2πix ) = cos(2πx ) + ı sin(2πx ) we may also consider the real/imaginary parts
of the Fourier basis vectors (Fn ):,j as bases of R n and define the corresponding basis transformation.
They can all be realized by means of fft with an asymptotic computational effort of O(n log n). These
transformations avoid the use of complex numbers.
Basis transform matrix (sine basis → standard basis):  S_n := (sin(jkπ/n))_{j,k=1}^{n−1} ∈ R^{n−1,n−1} .

Sine transform of y = [y_1, …, y_{n−1}]^⊤ ∈ R^{n−1} :   s_k = ∑_{j=1}^{n−1} y_j sin(πjk/n) ,  k = 1, …, n − 1 .   (4.4.2)
By elementary considerations we can devise a DFT-based algorithm for the sine transform (≙ S_n × vector).
Tool: "wrap around" to ỹ ∈ R^{2n} (ỹ "odd"):

ỹ_j = y_j  if j = 1, …, n − 1 ,    ỹ_j = 0  if j = 0, n ,    ỹ_j = −y_{2n−j}  if j = n + 1, …, 2n − 1 .
(Figure: the vector (y_j) on {1, …, n−1} and its odd periodic extension (ỹ_j) on {0, …, 2n−1}.)
Next we use sin(x) = (1/2ı)(exp(ıx) − exp(−ıx)) to identify the DFT of the wrapped-around vector as a sine transform:

(F_{2n} ỹ)_k  =(4.2.19)=  ∑_{j=1}^{2n−1} ỹ_j e^{−2πı kj/(2n)} = ∑_{j=1}^{n−1} y_j e^{−πı kj/n} − ∑_{j=n+1}^{2n−1} y_{2n−j} e^{−πı kj/n}
            = ∑_{j=1}^{n−1} y_j ( e^{−πı kj/n} − e^{πı kj/n} ) = −2ı (S_n y)_k ,   k = 1, …, n − 1 .
9   Eigen::VectorXcd ct;
10  Eigen::FFT<double> fft; // DFT helper class
11  fft.SetFlag(Eigen::FFT<double>::Flag::Unscaled);
12  fft.fwd(ct, yt);
13
The simple Code 4.4.3 relies on a DFT for vectors of length 2n, which may be a waste of computational
resources in some applications. A DFT of length n is sufficient as demonstrated by the following manipu-
lations.
Step ➀: transform the coefficients: set ỹ_0 := 0 and ỹ_j := sin(πj/n)·(y_j + y_{n−j}) + ½·(y_j − y_{n−j}), j = 1, …, n − 1 (cf. the code below).

Step ➁: real DFT (→ Section 4.2.3) of (ỹ_0, …, ỹ_{n−1}) ∈ R^n :   c_k := ∑_{j=0}^{n−1} ỹ_j e^{−2πı jk/n} .

Hence
Re{c_k} = ∑_{j=0}^{n−1} ỹ_j cos(2πjk/n) = ∑_{j=1}^{n−1} (y_j + y_{n−j}) sin(πj/n) cos(2πjk/n)
        = ∑_{j=1}^{n−1} 2 y_j sin(πj/n) cos(2πjk/n) = ∑_{j=1}^{n−1} y_j ( sin((2k+1)πj/n) − sin((2k−1)πj/n) )
        = s_{2k+1} − s_{2k−1} ,

Im{c_k} = ∑_{j=0}^{n−1} ỹ_j sin(−2πjk/n) = −∑_{j=1}^{n−1} ½ (y_j − y_{n−j}) sin(2πjk/n) = −∑_{j=1}^{n−1} y_j sin(2πjk/n)
        = −s_{2k} .

Step ➂: extraction of the s_k:
s_{2k+1}, k = 0, …, n/2 − 1  ➤ from the recursion s_{2k+1} − s_{2k−1} = Re{c_k}, started with s_1 = ∑_{j=1}^{n−1} y_j sin(πj/n) ,
s_{2k}, k = 1, …, n/2 − 1  ➤ s_{2k} = −Im{c_k} .
11  // Transform coefficients
12  Eigen::VectorXd yt(n);
13  yt(0) = 0;
14  yt.tail(n-1) = sinevals.array() * (y + y.reverse()).array() +
                   0.5 * (y - y.reverse()).array();
15
16  // FFT
17  Eigen::VectorXcd c;
18  Eigen::FFT<double> fft;
19  fft.fwd(c, yt);
20
21  s.resize(n);
22  s(0) = sinevals.dot(y);
23
Matrix X ∈ R^{n,n}  ↔  grid function {1, …, n}² → R.
(Figure: visualization of a grid function on an n × n tensor grid.)

The identification R^{n,n} ≅ R^{n²}, x_{ij} ∼ x̃_{(j−1)n+i} (row-wise numbering), gives a matrix representation T ∈ R^{n²,n²} of T:

T = [ C       c_y·I   0       ···     0     ]
    [ c_y·I   C       c_y·I           ⋮     ]
    [ 0       ⋱       ⋱       ⋱      0     ]
    [ ⋮               c_y·I   C       c_y·I ]
    [ 0       ···     ···     c_y·I   C     ]   ∈ R^{n²,n²} ,

C = [ c     c_x   0     ···   0   ]
    [ c_x   c     c_x         ⋮   ]
    [ 0     ⋱     ⋱     ⋱    0   ]
    [ ⋮           c_x   c     c_x ]
    [ 0     ···   ···   c_x   c   ]   ∈ R^{n,n} .

(The accompanying sketch shows the 5-point coupling stencil with weights c, c_x, c_y on the grid, numbered row-wise 1, 2, 3, …, n+1, n+2, ….)
The key observation is that the elements of the sine basis are eigenvectors of T:

(T(B_{kl}))_{ij} = c · sin(π/(n+1)·ki) sin(π/(n+1)·lj)
                 + c_y · sin(π/(n+1)·ki) ( sin(π/(n+1)·l(j−1)) + sin(π/(n+1)·l(j+1)) )
                 + c_x · sin(π/(n+1)·lj) ( sin(π/(n+1)·k(i−1)) + sin(π/(n+1)·k(i+1)) )
               = sin(π/(n+1)·ki) sin(π/(n+1)·lj) · ( c + 2 c_y cos(π/(n+1)·l) + 2 c_x cos(π/(n+1)·k) ) .

Hence B_{kl} is an eigenvector of T (or of T after row-wise numbering) and the corresponding eigenvalue is c + 2 c_y cos(π/(n+1)·l) + 2 c_x cos(π/(n+1)·k). Recall the very similar considerations for discrete (periodic) convolutions in 1D (→ § 4.2.6) and 2D (→ § 4.2.51).
The basis transform can be implemented efficiently based on the 1D sine transform:

X = ∑_{k=1}^{n} ∑_{l=1}^{n} y_{kl} B_{kl}   ⇒   x_{ij} = ∑_{k=1}^{n} sin(π/(n+1)·ki) ∑_{l=1}^{n} y_{kl} sin(π/(n+1)·lj) .

Hence nested sine transforms (→ Section 4.2.4) for the rows/columns of Y = (y_{kl})_{k,l=1}^{n}.
Here: implementation of sine transform (4.4.2) with “wrapping”-technique.
7   Eigen::VectorXcd c;
8   Eigen::FFT<double> fft;
9   std::complex<double> i(0, 1);
10
8   // Eigen's meshgrid
9   Eigen::MatrixXd I =
        Eigen::RowVectorXd::LinSpaced(n, 1, n).replicate(m, 1);
10  Eigen::MatrixXd J =
        Eigen::VectorXd::LinSpaced(m, 1, m).replicate(1, n);
11
12  // FFT
13  Eigen::MatrixXd X_;
14  sinetransform2d(B, X_);
15
16  // Translation
17  Eigen::MatrixXd T;
18  T = c + 2*cx*(M_PI/(n+1)*I).array().cos() +
19      2*cy*(M_PI/(m+1)*J).array().cos();
20  X_ = X_.cwiseQuotient(T);
21
22  sinetransform2d(X_, X);
23  X = 4*X/((m+1)*(n+1));
24 }
Thus the diagonalization of T via the 2D sine transform yields an efficient algorithm for solving the linear system of equations T(X) = B: computational cost O(n² log n).
In the experiment we test the gain in runtime obtained by using DFT-based algorithms for solving linear
systems of equations with coefficient matrix T induced by the operator T from (4.4.7)
MATLAB test code (excerpt):
A = gallery('poisson',n);
B = magic(n);
b = reshape(B,n*n,1);
tic; C = fftsolve(B,4,-1,-1); t1 = toc;

(Figure: runtime [s] versus n, comparing the FFT-based solver with MATLAB's backslash solver.)
Cosine transform of y = [y_0, …, y_{n−1}]^⊤ :

c_k = ∑_{j=0}^{n−1} y_j cos( k (2j+1)/(2n) π ) ,  k = 1, …, n − 1 ,   (4.4.13)
c_0 = (1/√2) ∑_{j=0}^{n−1} y_j .
6   Eigen::VectorXd y_(2*n);
7   y_.head(n) = y;
8   y_.tail(n) = y.reverse();
9
10  // FFT
11  Eigen::VectorXcd z;
12  Eigen::FFT<double> fft;
13  fft.fwd(z, y_);
14
Implementation of C_n^{−1} y ("wrapping" technique):
18  // FFT
19  Eigen::VectorXd z;
20  Eigen::FFT<double> fft;
21  fft.inv(z, c_2);
22
29  y = 2*y_.head(n);
30 }
This task reminds us of the parameter estimation problem from Ex. 3.0.5, which we tackled with least squares techniques. We employ similar ideas for the current problem.
(Figure: input signal (x_k) and output signal (y_k) plotted over time.)
If the yk were exact, we could retrieve h0 , . . . , hn−1 by examining only y0 , . . . , yn−1 and inverting the
discrete periodic convolution (→ Def. 4.1.33) using (4.2.17).
However, in case the yk are affected by measurements errors it is advisable to use all available yk for a
least squares estimate of the impulse response.
We can now formulate the least squares parameter identification problem: seek h = (h_0, …, h_{n−1})^⊤ ∈ R^n with

‖A h − y‖_2 → min ,   y := [y_0, …, y_{m−1}]^⊤ ,

where A ∈ R^{m,n} has entries (A)_{ij} = x_{i−j}:

A = [ x_0       x_{−1}    ···      ···    x_{1−n}  ]
    [ x_1       x_0       x_{−1}          ⋮       ]
    [ ⋮        x_1       x_0      ⋱      ⋮       ]
    [ ⋮                  ⋱        ⋱      x_{−1}   ]
    [ x_{n−1}             x_1      x_0             ]
    [ x_n       x_{n−1}            x_1             ]
    [ ⋮                            ⋮              ]
    [ x_{m−1}   ···       ···      x_{m−n}         ] .
➣ Linear least squares problem, → Chapter 3 with a coefficient matrix A that enjoys the property that
(A)ij = xi − j (constant entries of diagonals).
The coefficient matrix for the normal equations (→ Section 3.1.2, Thm. 3.1.10) corresponding to the above linear least squares problem is

M := A^H A ,   (M)_{ij} = ∑_{k=1}^{m} x_{k−i} x_{k−j} = z_{i−j}   due to the periodicity of (x_k)_{k∈Z} .
We consider a sequence of scalar random variables: (Yk )k∈Z , a so-called Markov chain. These can be
thought of as values for a random quantity sampled at equidistant points in time.
Assume: stationary (time-independent) correlation, that is, with (A, Ω, dP) denoting the underlying probability space,

E(Y_{i−j} Y_{i−k}) = ∫_Ω Y_{i−j}(ω) Y_{i−k}(ω) dP(ω) = u_{k−j}   ∀ i, j, k ∈ Z ,   u_i = u_{−i} .
Estimator:   x = argmin_{x ∈ R^n} E | Y_i − ∑_{j=1}^{n} x_j Y_{i−j} |²   (4.5.4)
By definition A is a so-called covariance matrix and, as such, has to be symmetric and positive definite
(→ Def. 1.1.8). Also note that
x^⊤ A x − 2 b^⊤ x = (x − x∗)^⊤ A (x − x∗) − (x∗)^⊤ A x∗ , with x∗ := A^{−1} b. Therefore x∗ is the unique minimizer of x^⊤ A x − 2 b^⊤ x. The problem is reduced to solving the linear system of equations A x = b (Yule-Walker equation, see below).
Matrices with constant diagonals occur frequently in mathematical models, see Ex. 4.5.1, ??. They gen-
eralize circulant matrices (→ Def. 4.1.38).
Note: “Information content” of a matrix M ∈ K m,n with constant diagonals, that is, (M)i,j = mi − j , is
m + n − 1 numbers ∈ K.
Definition 4.5.8. Toeplitz matrix
T = (t_{ij}) ∈ K^{m,n} is a Toeplitz matrix, if there is a vector u = [u_{−m+1}, …, u_{n−1}] ∈ K^{m+n−1} such that t_{ij} = u_{j−i}, 1 ≤ i ≤ m, 1 ≤ j ≤ n:

T = [ u_0       u_1     ···     ···      u_{n−1} ]
    [ u_{−1}    u_0     u_1              ⋮      ]
    [ ⋮        ⋱       ⋱       ⋱       ⋮      ]
    [ ⋮                ⋱       ⋱       u_1     ]
    [ u_{1−m}   ···     ···     u_{−1}   u_0     ] .
Given: T = (u_{j−i}) ∈ K^{m,n}, a Toeplitz matrix with generating vector u = [u_{−m+1}, …, u_{n−1}]^⊤ ∈ K^{m+n−1}, see Def. 4.5.8.
To motivate the approach we realize that we have already encountered Toeplitz matrices in the convolution of finite signals discussed in Rem. 4.1.17, see (4.1.18). The trick introduced in Rem. 4.1.40 was to extend the generating vector to

c_j = u_j  for j = −m + 1, …, n − 1 ,   c_j = 0  for j = n ,   + periodic extension.
From (4.5.9) it is clear how to implement the matrix×vector product for the Toeplitz matrix T: pad x with zeros and multiply with the extended circulant matrix C; the leading components of C·[x; 0] give Tx (a code sketch follows below).
Computational effort for computing Tx: O((n + m) log(m + n)) (FFT based, Section 4.3).
This is almost optimal in light of the data complexity O(m + n) of a Toeplitz matrix.
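A minimal sketch of this circulant-embedding matrix×vector product, reusing the pconvfft helper from above (the function name toepmatvec and the storage convention for u are chosen here for illustration):

#include <Eigen/Dense>

// y = T*x for the Toeplitz matrix T in K^{m,n} given by its generating vector
// u = [u_{-m+1},...,u_{n-1}] (length m+n-1, u_0 stored at position m-1), via
// embedding into a circulant matrix of size N = m+n; cost O((m+n) log(m+n)).
Eigen::VectorXcd toepmatvec(const Eigen::VectorXcd &u, long m, long n,
                            const Eigen::VectorXcd &x) {
  const long N = m + n;
  Eigen::VectorXcd c = Eigen::VectorXcd::Zero(N);        // first column of C
  for (long l = 0; l < m; ++l) c(l) = u(m - 1 - l);      // c_l = u_{-l}
  for (long k = 1; k < n; ++k) c(N - k) = u(m - 1 + k);  // c_{N-k} = u_k
  Eigen::VectorXcd xpad = Eigen::VectorXcd::Zero(N);
  xpad.head(n) = x;                                      // zero padding
  return pconvfft(c, xpad).head(m);                      // first m entries = T*x
}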
Note that the symmetry of a Toeplitz matrix is induced by the property u−k = uk of its generating vector.
Task: Find an efficient solution algorithm for the LSE Tx = b, b ∈ C n , the Yule-Walker problem
from 4.5.3.
Define:
✦ T_k := (u_{j−i})_{i,j=1}^{k} ∈ K^{k,k} (left upper block of T)  ➣ T_k is an s.p.d. Toeplitz matrix,
✦ x^k ∈ K^k :  T_k x^k = [b_1, …, b_k]^⊤  ⇔  x^k = T_k^{−1} b^k ,
✦ u^k := (u_1, …, u_k)^⊤ .
Thus we can block-partition the LSE T_{k+1} x^{k+1} = b^{k+1}:

T_{k+1} x^{k+1} =
[ T_k              (u_k, …, u_1)^⊤ ] [ x̃^{k+1}       ]   [ b̃^{k+1} ]
[ (u_k, …, u_1)    1                ] [ x^{k+1}_{k+1}  ] = [ b_{k+1}  ] ,   b̃^{k+1} := [b_1, …, b_k]^⊤ ,   (4.5.10)

where x̃^{k+1} ∈ K^k collects the first k components of x^{k+1} (here the diagonal entry is normalized, u_0 = 1).
Now recall block Gaussian elimination/block-LU decomposition from Rem. 2.3.14, Rem. 2.3.34. They teach us how to eliminate x̃^{k+1} and obtain an expression for x^{k+1}_{k+1}.
To state the formulas concisely, we introduce the reversing permutation P_k : K^k → K^k, (P_k x)_i := x_{k+1−i}, and the auxiliary vector y^k := T_k^{−1} P_k u^k. Block elimination yields

x̃^{k+1} = T_k^{−1}( b̃^{k+1} − x^{k+1}_{k+1} P_k u^k ) = x^k − x^{k+1}_{k+1} T_k^{−1} P_k u^k ,   (4.5.12)
x^{k+1}_{k+1} = b_{k+1} − P_k u^k · x̃^{k+1} = b_{k+1} − P_k u^k · x^k + x^{k+1}_{k+1} P_k u^k · T_k^{−1} P_k u^k ,

and hence

x^{k+1} = [ x̃^{k+1} ; x^{k+1}_{k+1} ]   with   x^{k+1}_{k+1} = (b_{k+1} − P_k u^k · x^k)/σ_k ,   x̃^{k+1} = x^k − x^{k+1}_{k+1} y^k ,   σ_k := 1 − P_k u^k · y^k .   (4.5.13)
Below: Levinson algorithm for the solution of the Yule-Walker problem Tx = b with an s.p.d. Toeplitz
matrix described by its generating vector u (recursive, un+1 not used!)
Linear recursion: Computational cost ∼ (n − k) on level k, k = 0, . . . , n − 1
➣ Asymptotic complexity O ( n2 )
12  Eigen::VectorXd xk, yk;
13  levinson(u.head(k), b.head(k), xk, yk);
14
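For reference, a self-contained (iterative rather than recursive) sketch of the recursion (4.5.13): it assumes, as above, an s.p.d. Toeplitz matrix with normalized diagonal u_0 = 1 described by u = [u_1, …, u_n]^⊤; the name levinson_it is chosen for illustration and this is not one of the lecture codes.

#include <Eigen/Dense>

// Solve T x = b for (T)_{ij} = u_{|i-j|}, u_0 = 1, u = [u_1,...,u_n]; O(n^2).
// Besides x^k the auxiliary vector y^k = T_k^{-1} P_k u^k is updated as well.
Eigen::VectorXd levinson_it(const Eigen::VectorXd &u, const Eigen::VectorXd &b) {
  const long n = b.size();
  Eigen::VectorXd x(1), y(1);
  x(0) = b(0);                 // x^1 = b_1
  y(0) = u(0);                 // y^1 = u_1
  for (long k = 1; k < n; ++k) {
    const Eigen::VectorXd uk = u.head(k);
    const double sigma = 1.0 - uk.reverse().dot(y);       // sigma_k
    const double xi = (b(k) - uk.reverse().dot(x)) / sigma; // new last entry of x
    Eigen::VectorXd xnew(k + 1), ynew(k + 1);
    xnew.head(k) = x - xi * y;                             // (4.5.13)
    xnew(k) = xi;
    const double alpha = (u(k) - uk.dot(y)) / sigma;       // update of y^k
    ynew(0) = alpha;
    ynew.tail(k) = y - alpha * y.reverse();
    x = xnew; y = ynew;
  }
  return x;
}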
FFT-based algorithms for solving Tx = b with asymptotic complexity O(n log³ n) exist [?] !
Supplementary reading. [?, Sect. 8.5]: Very detailed and elementary presentation, but the
discrete Fourier transform through trigonometric interpolation, which is not covered in this chapter.
Hardly addresses discrete convolution.
[?, Ch. IX] presents the topic from a mathematical point of view, stressing approximation and trigonometric interpolation. Good reference for algorithms for circulant and Toeplitz matrices.
[?, Ch. 10] also discusses the discrete Fourier transform with emphasis on interpolation and (least
squares) approximation. The presentation of signal processing differs from that of the course.
There is a vast number of books and survey papers dedicated to discrete Fourier transforms, see,
for instance, [?, ?]. Issues and technical details way beyond the scope of the course are discussed
in these monographs.
Contents
5.1 Abstract interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
5.2.1 Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
5.2.2 Polynomial Interpolation: Theory . . . . . . . . . . . . . . . . . . . . . . . . 366
5.2.3 Polynomial Interpolation: Algorithms . . . . . . . . . . . . . . . . . . . . . . 370
5.2.3.1 Multiple evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . 370
5.2.3.2 Single evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
5.2.3.3 Extrapolation to zero . . . . . . . . . . . . . . . . . . . . . . . . . . 378
5.2.3.4 Newton basis and divided differences . . . . . . . . . . . . . . . . 381
5.2.4 Polynomial Interpolation: Sensitivity . . . . . . . . . . . . . . . . . . . . . . 386
5.3 Shape preserving interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
5.3.1 Shape properties of functions and data . . . . . . . . . . . . . . . . . . . . . 390
5.3.2 Piecewise linear interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
5.4 Cubic Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
5.4.1 Definition and algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
5.4.2 Local monotonicity preserving Hermite interpolation . . . . . . . . . . . . . 399
5.5 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
5.5.1 Cubic spline interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
5.5.2 Structural properties of cubic spline interpolants . . . . . . . . . . . . . . . . 407
5.5.3 Shape Preserving Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . 410
5.6 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
5.6.1 Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.6.2 Reduction to Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . 419
5.6.3 Equidistant Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . 421
5.7 Least Squares Data Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
The task of (one-dimensional, scalar) data interpolation (point interpolation) can be described as follows: given data points (t_i, y_i), i = 0, …, n, with mutually different nodes t_i ∈ I ⊂ R and values y_i ∈ R, find a function f : I → R satisfying the interpolation conditions
f(t_i) = y_i ,  i = 0, …, n.
The function f we find is called the interpolant of the given data set {(t_i, y_i)}_{i=0}^{n}.
For ease of presentation we will usually assume that the nodes are ordered: t_0 < t_1 < ··· < t_n and [t_0, t_n] ⊂ I. However, algorithms must often not take sorted nodes for granted.
A natural generalization is data interpolation with vector-valued data values, seeking a function f : I →
R d , d ∈ N, such that, for given data points (ti , yi ), ti ∈ I mutually different, yi ∈ R d , it satisfies the
interpolation conditions f(ti ) = yi , i = 0, . . . , n.
In this case all methods available for scalar data can be applied component-wise.
An important application is curve reconstruction, that is, the interpolation of points y_0, …, y_n ∈ R² in the plane (Fig. 165). A particular aspect of this problem is that the nodes t_i also have to be found, usually from the location of the y_i in a preprocessing step.
In many applications (computer graphics, computer vision, numerical method for partial differential equa-
tions, remote sensing, geodesy, etc.) one has to reconstruct functions of several variables.
Significant additional challenges arise in a genuine multidimensional setting. A treatment is beyond the
scope of this course. However, the one-dimensional techniques presented in this chapter are relevant
even for multi-dimensional data interpolation, if the points xi ∈ R m are points of a finite lattice also called
tensor product grid.
For instance, for m = 2 this is the case, if
{x_i}_i = { [t_k, s_l]^⊤ ∈ R² : k ∈ {0, …, K}, l ∈ {0, …, L} } ,   (5.1.3)
(Fig. 166: interactive demo comparing different interpolants — piecewise linear, polynomial, spline, pchip — of the same data.)
Interpolants can have vastly different properties; the various methods to build interpolants and their different properties will become apparent.
Imagine that t, y correspond to the voltage U and current I measured for a 2-port non-linear circuit
element (like a diode). This element will be part of a circuit, which we want to simulate based on nodal
analysis as in Ex. 8.0.1. In order to solve the resulting non-linear system of equations F(u) = 0 for the
nodal potentials (collected in the vector u) by means of Newton’s method (→ Section 8.4) we need the
voltage-current relationship for the circuit element as a continuously differentiable function I = f (U ).
(∗) Meaning of attribute “accurate”: justification for interpolation. If measured values yi were affected by
considerable errors, one would not impose the interpolation conditions (??), but opt for data fitting (→
Section 5.7).
Rather, in the context of numerical methods, “function” should be read as “subroutine”, a piece of code that
can, for any x ∈ I , compute f ( x ) in finite time. Even this has to be qualified, because we can only pass
machine numbers x ∈ I ∩ M (→ § 1.5.12) and, of course, in most cases, f ( x ) will be an approximation.
In a C++ code a simple real valued function can be incarnated through a function object of a type as given
in Code 5.1.7, see also Section 0.2.3.
4  public:
5    // Constructor: expects information for specifying the function
6    Function( /* ... */ );
7    // Evaluation operator
8    double operator()(double t) const;
9  };
Of course, the basis functions b j should be “simple” in the sense that b j ( x ) can be computed efficiently for
every x ∈ I and every j = 0, . . . , m.
Note that the basis functions may depend on the nodes ti , but they must not depend on the values yi .
➙ The internal representation of f (in the data member section of the class Function from Code 5.1.7)
will then boil down to storing the coefficients/parameters c j , j = 0, . . . , m.
Note: The focus in this chapter will be on the special case that the data interpolants belong to a finite-
dimensional space of functions spanned by “simple” basis functions.
Recall: A linear function in 1D is a function of the form x 7→ a + bx, a, b ∈ R (polynomial of degree 1).
Piecewise linear interpolation  ➣  interpolating polygon.
(Fig. 168: the polygon through data points at the nodes t_0, …, t_4.)
What could be a convenient set of basis functions {b j }nj=0 for representing the piecewise linear interpolant
through n + 1 data points?
(Fig. 169: the "tent" basis functions b_0, b_1, …, b_n associated with the nodes t_0, …, t_n.)
Note: in Fig. 169 the basis functions have to be extended by zero outside the t-range where they are
drawn.
Explicit formulas for these basis functions can be given and bear out that they are really “simple”:
b_0(t) = { 1 − (t − t_0)/(t_1 − t_0)   for t_0 ≤ t < t_1 ,    0   for t ≥ t_1 } ,

b_j(t) = { 1 − (t_j − t)/(t_j − t_{j−1})   for t_{j−1} ≤ t < t_j ,
           1 − (t − t_j)/(t_{j+1} − t_j)   for t_j ≤ t < t_{j+1} ,
           0   elsewhere in [t_0, t_n] } ,   j = 1, …, n − 1 ,   (5.1.11)

b_n(t) = { 1 − (t_n − t)/(t_n − t_{n−1})   for t_{n−1} ≤ t < t_n ,    0   for t < t_{n−1} } .
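A minimal sketch of evaluating the piecewise linear interpolant ∑_j y_j b_j at a single point, assuming sorted nodes (the function name lerpeval is chosen here for illustration):

#include <algorithm>
#include <Eigen/Dense>

// Evaluate the piecewise linear interpolant of (t_i, y_i), i = 0..n, at x.
double lerpeval(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const long N = t.size();
  // locate the interval containing x by binary search: first node > x
  const double *up = std::upper_bound(t.data(), t.data() + N, x);
  const long j = std::max<long>(1, std::min<long>(N - 1, up - t.data()));
  // on [t_{j-1}, t_j] only b_{j-1} and b_j are nonzero, cf. (5.1.11)
  const double lambda = (x - t(j - 1)) / (t(j) - t(j - 1));
  return (1.0 - lambda) * y(j - 1) + lambda * y(j);
}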
We consider the setting for interpolation that the interpolant belongs to a finite-dimensional space V_m of functions spanned by basis functions b_0, …, b_m, see Rem. 5.1.6. Then the interpolation conditions imply the linear system of equations

∑_{j=0}^{m} c_j b_j(t_i) = y_i ,  i = 0, …, n   ⟺   A c = y ,  (A)_{ij} := b_j(t_i) .   (5.1.15)
The interpolation problem in Vm and the linear system (5.1.15) are really equivalent in the sense that
(unique) solvability of one implies (unique) solvability of the other.
If m = n and A from (5.1.15) is regular (→ Def. 2.2.1), then for any values y_j, j = 0, …, n, we can find coefficients c_j, j = 0, …, n, and from them build the interpolant according to (5.1.9):

f = ∑_{j=0}^{n} (A^{−1} y)_j b_j .   (5.1.16)
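A minimal sketch of this setup-then-evaluate pattern for an arbitrary set of basis functions (the basis is passed as a callable b(j, x); all names are illustrative):

#include <functional>
#include <Eigen/Dense>

// Setup phase: assemble (A)_{ij} = b_j(t_i) and solve A c = y, cf. (5.1.15)
Eigen::VectorXd interpCoeffs(const Eigen::VectorXd &t, const Eigen::VectorXd &y,
                             const std::function<double(int, double)> &b) {
  const int n = t.size() - 1;
  Eigen::MatrixXd A(n + 1, n + 1);
  for (int i = 0; i <= n; ++i)
    for (int j = 0; j <= n; ++j) A(i, j) = b(j, t(i));
  return A.lu().solve(y);
}

// Evaluation phase: f(x) = sum_j c_j b_j(x), cf. (5.1.16)
double interpEval(const Eigen::VectorXd &c,
                  const std::function<double(int, double)> &b, double x) {
  double s = 0.0;
  for (int j = 0; j < c.size(); ++j) s += c(j) * b(j, x);
  return s;
}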
For fixed nodes t_i the interpolation problem (5.1.14) defines a linear mapping I : R^{n+1} → V_n, y ↦ f.

An interpolation operator I : R^{n+1} → C^0([t_0, t_n]) for the given nodes t_0 < t_1 < ··· < t_n is called linear, if
I(αy + βz) = α I(y) + β I(z)   for all y, z ∈ R^{n+1}, α, β ∈ R .
✎ Notation: C^0([t_0, t_n]) ≙ vector space of continuous functions on [t_0, t_n]
If a constitutive relationship for a circuit element is needed in a C++ simulation code (→ Ex. 5.1.5), the
following data type could be used to represent it:
✦ Constructor: “setup phase”, e.g. building and solving linear system of equations (5.1.15)
✦ Evaluation operator, e.g., implemented as evaluation of linear combination (5.1.9)
Crucial issue:
computational effort for evaluation of interpolant at single point: O(1) or O(n) (or in
between)?
(Global) polynomial interpolation, that is, interpolation into spaces of functions spanned by polynomials
up to a certain degree, is the simplest interpolation scheme and of great importance as building block for
more complex algorithms.
5.2.1 Polynomials
P_k := { t ↦ α_k t^k + α_{k−1} t^{k−1} + ··· + α_1 t + α_0 ,  α_j ∈ R }   (5.2.1)
(α_k is the leading coefficient).
Obvious: Pk is a vector space, see [?, Sect. 4.2, Bsp. 4]. What is its dimension?
dim Pk = k + 1 and P k ⊂ C ∞ (R ).
Polynomials (of degree k) in monomial representation are stored as a vector of their coefficients a j , j =
0, . . . , k. A convention for the ordering has to be fixed. For instance, M ATLAB functions expect a monomial
representation through a vector of their monomial coefficients in descending order:
The following code gives an implementation based on vector data types of E IGEN. The function is vector-
ized in the sense that many evaluation points are processed in parallel.
Supplementary reading. This topic is also presented in [?, Sect. 8.2.1], [?, Sect. 8.1], [?,
Ch. 10].
Now we consider the interpolation problem introduced in Section 5.1 for the special case that the sought
interpolant belongs to the polynomial space Pk (with suitable degree k).
Given the simple nodes t_0, …, t_n, n ∈ N, −∞ < t_0 < t_1 < ··· < t_n < ∞ and the values y_0, …, y_n ∈ R, compute p ∈ P_n such that p(t_j) = y_j for j = 0, …, n.
Is this a well-defined problem? Obviously, it fits the framework developed in Rem. 5.1.6 and § 5.1.13,
because Pn is a finite-dimensional space of functions, for which we already know a basis, the monomi-
als. Thus, in principle, we could examine the matrix A from (5.1.15) to decide, whether the polynomial
interpolant exists and is unique. However, there is a shorter way.
Recall the Kronecker symbol  δ_{ij} = 1 if i = j, and δ_{ij} = 0 else.
From this relationship we infer that the Lagrange polynomials are linearly independent. Since there are
n + 1 = dim Pn different Lagrange polynomials, we conclude that they form a basis of Pn , which is a
cardinal basis for the node set {ti }in=0 .
Consider the equidistant nodes in [−1, 1]:   T := { t_j = −1 + (2/n) j ,  j = 0, …, n } .
(Fig. 170: the Lagrange polynomials L_0, L_2 and L_5 for these nodes, associated with the nodes t_0, t_2 and t_5, respectively.)
The Lagrange polynomial interpolant p for data points (ti , yi )in=0 allows a straightforward representation
with respect to the basis of Lagrange polynomials for the node set {ti }in=0 :
p(t) = ∑_{i=0}^{n} y_i L_i(t)   ⟺   p ∈ P_n and p(t_i) = y_i , i = 0, …, n .   (5.2.13)
Known from linear algebra: for a linear mapping T : V 7→ W between finite-dimensional vector spaces
with dim V = dim W holds the equivalence
T surjective ⇔ T bijective ⇔ T injective.
Applying this equivalence to evalT yields the assertion of the theorem
✷
Lagrangian polynomial interpolation leads to linear systems of equations also for the representation coefficients of the polynomial interpolant in the monomial basis, see § 5.1.13:

p(t_j) = y_j   ⟺   ∑_{i=0}^{n} a_i t_j^i = y_j ,  j = 0, …, n
        ⟺   solution of the (n + 1) × (n + 1) linear system V a = y with the (Vandermonde) matrix

V = [ 1   t_0   t_0²   ···   t_0^n ]
    [ 1   t_1   t_1²   ···   t_1^n ]
    [ 1   t_2   t_2²   ···   t_2^n ]
    [ ⋮   ⋮    ⋮     ⋱    ⋮     ]
    [ 1   t_n   t_n²   ···   t_n^n ]   ∈ R^{n+1,n+1} .   (5.2.18)
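A minimal sketch of monomial-basis interpolation via (5.2.18) (illustrative only; the well-known ill-conditioning of V for larger n is a caveat):

#include <Eigen/Dense>

// Monomial coefficients a of the interpolating polynomial: solve V a = y,
// with the Vandermonde matrix from (5.2.18); a(i) multiplies t^i.
Eigen::VectorXd polycoeffs(const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  const int n = t.size() - 1;
  Eigen::MatrixXd V = Eigen::MatrixXd::Ones(n + 1, n + 1);
  for (int j = 1; j <= n; ++j)
    V.col(j) = V.col(j - 1).cwiseProduct(t);   // column j holds the powers t_i^j
  return V.lu().solve(y);
}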
Remark 5.2.21 (Generalized polynomial interpolation → [?, Sect. 8.2.7], [?, Sect. 8.4])
The following generalization of Lagrange interpolation is possible: We still seek a polynomial interpolant,
but beside function values also prescribe derivatives up to a certain order for interpolating polynomial at
given nodes.
Convention: indicate occurrence of derivatives as interpolation conditions by multiple nodes.
Generalized polynomial interpolation problem
Given the (possibly multiple) nodes t_0, …, t_n, n ∈ N, −∞ < t_0 ≤ t_1 ≤ ··· ≤ t_n < ∞ and the values y_0, …, y_n ∈ R compute p ∈ P_n such that

(d^k/dt^k) p(t_j) = y_j   for k = 0, …, l_j and j = 0, …, n ,   (5.2.22)

where l_j := max{ i − i′ : t_j = t_i = t_{i′} , i, i′ = 0, …, n } is the multiplicity of the node t_j.
The most important case of generalized Lagrange interpolation is when all the multiplicities are equal to 2. It is called Hermite interpolation (or osculatory interpolation) and the generalized interpolation conditions read for nodes t_0 = t_1 < t_2 = t_3 < ··· < t_{n−1} = t_n (note the double nodes!) [?, Ex. 8.6]:
The generalized polynomial interpolation problem Eq. (5.2.22) admits a unique solution p ∈ Pn .
The generalized Lagrange polynomials for the nodes T = {t j }nj=0 ⊂ R (multiple nodes allowed)
are defined as Li := IT (ei +1 ), i = 0, . . . , n, where ei = (0, . . . , 0, 1, 0, . . . , 0)T ∈ R n+1 are the
unit vectors.
Note: The linear interpolation operator IT in this definition refers to generalized Lagrangian interpolation.
Its existence is guaranteed by Thm. 5.2.23.
T = { t_0 = 0, t_1 = 0, t_2 = 1, t_3 = 1 } .
(Figure "Cubic Hermite Polynomials": the plot shows the four unique generalized Lagrange polynomials p_0, p_1, p_2, p_3 of degree n = 3 for these nodes.)
More details are given in Section 5.4. For explicit formulas for the polynomials see (5.4.5).
Now we consider the algorithmic realization of Lagrange interpolation as introduced in Section 5.2.2. The setting is as follows:
When used in a numerical code, different demands can be made for a routine that implements Lagrange
interpolation. They determine, which algorithm is most suitable.
The member function eval(y,x) expects n data values in y and (any number of) evaluation points in x (↔ [x_1, …, x_N]^⊤) and returns the vector [p(x_1), …, p(x_N)]^⊤, where p is the Lagrange polynomial interpolant.
An implementation directly based on the evaluation of Lagrange polynomials (5.2.11) and (5.2.13) would
incur an asymptotic computational effort of O(n2 N ) for every single invocation of eval and large n, N .
By means of pre-calculations the asymptotic effort for eval can be reduced substantially. Writing the Lagrange polynomials as L_i(t) = λ_i ∏_{j≠i} (t − t_j) with

λ_i = 1 / ( (t_i − t_0) ··· (t_i − t_{i−1})(t_i − t_{i+1}) ··· (t_i − t_n) ) ,  i = 0, …, n ,

and using that the L_i sum up to 1, one obtains the barycentric interpolation formula

p(t) = ( ∑_{i=0}^{n} (λ_i/(t − t_i)) y_i ) / ( ∑_{i=0}^{n} λ_i/(t − t_i) ) ,

with the weights λ_i independent of the evaluation point t and of the data values y_i → precompute!

The following C++ class demonstrates the use of the barycentric interpolation formula for efficient multiple point evaluation of a Lagrange interpolation polynomial:
9   for (unsigned i = 0; i < N; ++i) {
10    nodeVec_t z = (x(i) * nodeVec_t::Ones(n) - t);
11
As an exception, the test for equality with zero in Line 13 is admissible here. If x(i) is almost equal to a node t_j, then the corresponding entry of the vector mu will be huge and the value of the barycentric sum will almost agree with p(i), the same value that is returned if the corresponding entry of z is exactly zero. Hence, it does not matter if the test returns false because of small perturbations of the values.
Task: Given a set of interpolation points (t j , y j ), j = 0, . . . , n, with pairwise different interpolation nodes
t j , perform a single point evaluation of the Lagrange polynomial interpolant p at x ∈ R.
We discuss the efficient implementation of the following function for n ≫ 1. It is meant for a single
evaluation of a Lagrange interpolant.
double eval( const Eigen::VectorXd &t, const Eigen::VectorXd &y,
double x);
The starting point is a recursion formula for partial Lagrange interpolants: for 0 ≤ k ≤ ℓ ≤ n define p_{k,ℓ} ∈ P_{ℓ−k} as the unique polynomial interpolating the data points (t_j, y_j), j = k, …, ℓ. Then

p_{k,ℓ}(x) = ( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k) ,   (5.2.34)

because the left and right hand sides represent polynomials of degree ℓ − k through the points (t_j, y_j), j = k, …, ℓ.
Thus the values of the partial Lagrange interpolants can be computed sequentially and their dependencies
can be expressed by the following so-called Aitken-Neville scheme:
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x ) (ANS)
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Here, the arrows indicate contributions to the convex linear combinations of (5.2.34). The computation
can advance from left to right, which is done in following C++ code.
The vector y contains the columns of the above triangular tableaux in turns from left to right.
Asymptotic complexity of ANipoleval in terms of the number of data points: O(n²) (two nested loops).
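A minimal sketch of such an Aitken-Neville single-point evaluation (an illustrative version, not necessarily identical to the lecture's Code 5.2.35):

#include <Eigen/Dense>

// Value of the Lagrange interpolant through (t_i, y_i) at x, computed via the
// Aitken-Neville scheme (ANS); y is copied and overwritten in place, cost O(n^2).
double ANipoleval(const Eigen::VectorXd &t, Eigen::VectorXd y, double x) {
  const int n = y.size() - 1;
  for (int i = 1; i <= n; ++i)
    for (int k = i - 1; k >= 0; --k)
      // p_{k,i}(x) from p_{k+1,i}(x) and p_{k,i-1}(x), cf. (5.2.34)
      y(k) = y(k + 1) + (y(k + 1) - y(k)) * (x - t(i)) / (t(i) - t(k));
  return y(0);
}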
The Aitken-Neville algorithm has another interesting feature, when we run through the Aitken-Neville
scheme from the top left corner:
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x )
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Thus, the values of partial polynomial interpolants at x can be computed before all data points are even
processed. This results in an “update-friendly” algorithm that can efficiently supply the point values p0,k ( x ),
k = 0, . . . , n, while being supplied with the data points (ti , yi ). It can be used for the efficient implemen-
tation of the following interpolator class:
1  #include <cmath>
2  #include <vector>
3  #include <Eigen/Dense>
4  #include <figure/figure.hpp>
5  // ----- include timer library
6  #include "timer.h"
7  // ----- includes for Interpolation functions
8  #include "ANipoleval.hpp"
9  #include "ipolyeval.hpp"
10 #include "intpolyval.hpp"
11 #include "intpolyval_lag.hpp"
12
13 /**
14  * Benchmarking 4 different interpolation attempts:
15  * - Aitken-Neville       - Barycentric formula
16  * - Polyfit + Polyval    - Lagrange polynomials
17  **/
18 int main() {
19   // function to interpolate
20   auto f = [](const Eigen::VectorXd &x) { return x.cwiseSqrt(); };
21
22   const unsigned min_deg = 3, max_deg = 200;
23
24   Eigen::VectorXd buffer;
25   std::vector<double> t1, t2, t3, t4, N;
26
27   // Number of repeats for each eval
28   const int repeats = 100;
29
30   // n = increasing polynomial degree
31   for (unsigned n = min_deg; n <= max_deg; n++) {
32
33     const Eigen::VectorXd t = Eigen::VectorXd::LinSpaced(n, 1, n),
34                           y = f(t);
35
36     // ANipoleval takes a double as argument
37     const double x = n * drand48(); // drand48 returns random double in [0, 1]
38     // all other functions take a vector as argument
39     const Eigen::VectorXd xv = n * Eigen::VectorXd::Random(1);
40
41     std::cout << "Degree = " << n << "\n";
42     Timer aitken, ipol, intpol, intpol_lag;
43
44     // do the same many times and choose the best result
45     // Aitken-Neville -----------------------
46     aitken.start();
47     for (unsigned i = 0; i < repeats; ++i) {
This uses functions given in Code 5.2.32, Code 5.2.35 and the function polyfit (with a clearly greater
computational effort !)
polyfit is the equivalent to M ATLAB’s built-in polyfit. The implementation can be found on GitLab.
C++-code 5.2.40: Polynomial evaluation using polyfit
Extrapolation is the same as interpolation but the evaluation point t is outside the interval
[inf j=0,...,n t j , sup j=0,...,n t j ]. In the sequel we assume t = 0, ti > 0.
Of course, Lagrangian interpolation can also be used for extrapolation. In this section we give a very
important application of this “Lagrangian extrapolation”.
Task: compute the limit limh→0 ψ(h) with prescribed accuracy, though the evaluation of the function
ψ = ψ(h) (maybe given in procedural form only) for very small arguments |h| ≪ 1 is difficult,
usually because of numerically instability (→ Section 1.5.5).
The extrapolation technique introduced below works well, if
In Ex. 1.5.45 we have already seen a situation, where we wanted to compute the limit of a function ψ(h) for
h → 0, but could not do it with sufficient accuracy. In this case ψ(h) was a one-sided difference quotient
with span h, meant to approximate f ′ ( x ) for a differentiable function f . The cause of numerical difficulties
was cancellation → § 1.5.43.
Now we will see how to dodge cancellation in difference quotients and how to use extrapolation to zero to
computes derivatives with high accuracy:
df/dx (x) ≈ ( f(x + h) − f(x − h) ) / (2h) .   (5.2.44)

A straightforward implementation fails due to cancellation in the numerator, see also Ex. 1.5.45.
  h       f(x) = arctan(x)      f(x) = √x             f(x) = exp(x)
          relative error        relative error        relative error
 2^-1     0.20786640808609      0.09340033543136      0.29744254140026
 2^-6     0.00773341103991      0.00352613693103      0.00785334954789
 2^-11    0.00024299312415      0.00011094838842      0.00024418036620
 2^-16    0.00000759482296      0.00000346787667      0.00000762943394
 2^-21    0.00000023712637      0.00000010812198      0.00000023835113
 2^-26    0.00000001020730      0.00000001923506      0.00000000429331
 2^-31    0.00000005960464      0.00000001202188      0.00000012467100
 2^-36    0.00000679016113      0.00000198842224      0.00000495453865
Recall the considerations elaborated in Ex. 1.5.45. Owing to the impact of roundoff errors amplified by
cancellation, h → 0 does not achieve arbitrarily high accuracy. Rather, we observe fewer correct digits for
very small h!
Extrapolation offers a numerically stable (→ Def. 1.5.85) alternative, because for a 2(n + 1)-times con-
tinuously differentiable function f : I ⊂ R 7→ R , x ∈ I we find that the symmetric difference quotient
behaves like a polynomial in h2 in the vicinity of h = 0. Consider Taylor sum of f in x with Lagrange
remainder term:

ψ(h) := ( f(x + h) − f(x − h) ) / (2h) ∼ f′(x) + ∑_{k=1}^{n} (1/(2k+1)!) f^{(2k+1)}(x) h^{2k} + (1/(2n+2)!) f^{(2n+2)}(ξ(x)) h^{2n+1} .
While the extrapolation table (→ § 5.2.36) is computed, more and more accurate approximations of f ′ ( x )
become available. Thus, the difference between the two last approximations can be used to gauge the
error of the current approximation, it provides an error indicator, which can be used to decide when the
level of extrapolation is sufficient, see Line 25.
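The extrapolation code referred to above is not reproduced here; the following is a minimal illustrative sketch of the same idea (extrapolation of the symmetric difference quotient (5.2.44) to h = 0 via the Aitken-Neville update (5.2.34), with a simple error indicator), not the lecture's listing:

#include <cmath>
#include <functional>
#include <vector>

// Approximate f'(x): extrapolate psi(h_i), h_i = h0*2^{-i}, to h = 0, using
// h_i^2 as interpolation nodes (psi expands in powers of h^2). Stop when the
// two best entries of the current extrapolation column agree.
double diffex(const std::function<double(double)> &f, double x, double h0,
              double rtol = 1e-11, int maxit = 12) {
  std::vector<double> s, t;                // extrapolation column and nodes h_i^2
  for (int i = 0; i < maxit; ++i) {
    const double h = h0 * std::pow(2.0, -i);
    t.push_back(h * h);
    s.push_back((f(x + h) - f(x - h)) / (2.0 * h));  // psi(h_i)
    for (int k = i - 1; k >= 0; --k)       // Aitken-Neville update at 0
      s[k] = s[k + 1] - (s[k + 1] - s[k]) * t[i] / (t[i] - t[k]);
    if (i > 0 && std::abs(s[0] - s[1]) < rtol * std::abs(s[0]))
      break;                               // error indicator small enough
  }
  return s[0];
}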
Supplementary reading. We also refer to [?, Sect. 8.2.4], [?, Sect. 8.2].
In § 5.2.33 we have seen a method to evaluate partial polynomial interpolants for a single or a few evalua-
tion points efficiently. Now we want to do this for many evaluation points that may not be known when we
receive information about the first interpolation points.
The challenge: Both addPoint() and the evaluation operator may be called many times and the imple-
mentation has to remain efficient under these circumstances.
Why not use the techniques from § 5.2.27? Drawback of the Lagrange basis or barycentric formula: adding
another data point affects all basis polynomials/all precomputed values!
The abstract considerations of § 5.1.13 still apply and we get a linear system of equations for the coeffi-
cients a j of the polynomial interpolant in Newton basis:
a j ∈ R: a0 N0 (t j ) + a1 N1 (t j ) + · · · + an Nn (t j ) = y j , j = 0, . . . , n .
a_0 = y_0 ,
a_1 = (y_1 − a_0)/(t_1 − t_0) = (y_1 − y_0)/(t_1 − t_0) ,
a_2 = ( y_2 − a_0 − (t_2 − t_0) a_1 ) / ( (t_2 − t_0)(t_2 − t_1) )
    = ( y_2 − y_0 − (t_2 − t_0)·(y_1 − y_0)/(t_1 − t_0) ) / ( (t_2 − t_0)(t_2 − t_1) )
    = ( (y_2 − y_0)/(t_2 − t_0) − (y_1 − y_0)/(t_1 − t_0) ) / (t_2 − t_1) ,
  ⋮
In order to reveal the pattern, we turn to a new interpretation of the coefficients a_j of the interpolating polynomials in Newton basis:

a_j is the leading coefficient of the interpolating polynomial p_{0,j} .
(the notation pℓ,m for partial polynomial interpolants through the data points (tℓ , yℓ ), . . . , (tm , ym ) was
introduced in Section 5.2.3.2, see (5.2.34))
➣ Recursion (5.2.34) implies a recursion for the leading coefficients a_{ℓ,m} of the interpolating polynomials p_{ℓ,m}, 0 ≤ ℓ ≤ m ≤ n:

a_{ℓ,m} = ( a_{ℓ+1,m} − a_{ℓ,m−1} ) / (t_m − t_ℓ) .   (5.2.51)
Hence, instead of using elimination for a triangular linear system, we find a simpler and more efficient algorithm using the so-called divided differences:

y[t_i] = y_i ,
y[t_i, …, t_{i+k}] = ( y[t_{i+1}, …, t_{i+k}] − y[t_i, …, t_{i+k−1}] ) / (t_{i+k} − t_i)   (recursion)   (5.2.52)
Recursive calculation by divided differences scheme, cf. Aitken-Neville scheme, Code 5.2.35:
t0 y [ t0 ]
> y [ t0 , t1 ]
t1 y [ t1 ] > y [ t0 , t1 , t2 ]
> y [ t1 , t2 ] > y [ t0 , t1 , t2 , t3 ] , (5.2.54)
t2 y [ t2 ] > y [ t1 , t2 , t3 ]
> y [ t2 , t3 ]
t3 y [ t3 ]
The elements can be computed from left to right, every “>” indicates the evaluation of the recursion
formula (5.2.52).
However, we can again resort to the idea of § 5.2.36 and traverse (5.2.54) along the diagonals from top to
bottom: If a new datum (tn+1 , yn+1 ) is added, it is enough to compute the n + 2 new terms
y [ t n + 1 ] , y [ t n , t n + 1 ] , . . . , y [ t0 , . . . , t n + 1 ] .
The following MATLAB code computes divided differences for data points (t_i, y_i), i = 0, …, n, in this fashion. It is implemented by recursion to elucidate the successive use of data points. The divided differences y[t_0], y[t_0, t_1], …, y[t_0, …, t_n] are accumulated in the vector y.
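A C++ sketch in the same recursive fashion (illustrative, not necessarily identical to the lecture's Code 5.2.55):

#include <Eigen/Dense>

// Recursive computation of divided differences: on return, y(j) holds
// y[t_0,...,t_j], the j-th Newton coefficient; only the first j+1 data points
// enter y(j). Cost O(n^2).
void divdiff(const Eigen::VectorXd &t, Eigen::VectorXd &y) {
  const int n = y.size() - 1;
  if (n <= 0) return;
  Eigen::VectorXd th = t.head(n), yh = y.head(n);
  divdiff(th, yh);            // transform the first n entries recursively
  y.head(n) = yh;
  // incorporate the last data point (t_n, y_n)
  for (int j = 0; j < n; ++j) y(n) = (y(n) - y(j)) / (t(n) - t(j));
}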
By derivation: the computed divided differences are the coefficients of the interpolating polynomial in Newton basis:

p(t) = a_0 + a_1 (t − t_0) + a_2 (t − t_0)(t − t_1) + ··· + a_n ∏_{j=0}^{n−1} (t − t_j)   (5.2.56)
a0 = y [ t0 ] , a1 = y [ t0 , t1 ] , a2 = y [ t0 , t1 , t2 ] , . . . .
Thus, Code 5.2.55 computes the coefficients a j , j = 0, . . . , n, of the polynomial interpolant with respect
to the Newton basis. It uses only the first j + 1 data points to find a j .
“Backward evaluation” of p(t) in the spirit of Horner’s scheme (→ Rem. 5.2.5, [?, Alg. 8.20]):
p ← an , p ← (t − tn −1 ) p + an −1 , p ← (t − tn −2 ) p + an −2 , ....
13  // evaluate
14  VectorXd ones = VectorXd::Ones(x.size());
15  p = coeffs(n) * ones;
16  for (int j = n - 1; j >= 0; --j) {
17    p = (x - t(j) * ones).cwiseProduct(p) + coeffs(j) * ones;
18  }
19 }
Computational effort:
✦ O(n2 ) for computation of divided differences (“setup phase”),
✦ O(n) for every single evaluation of p(t).
(both operations can be interleaved, see Code 5.2.47)
Implementation of a C++ class supporting the efficient update and evaluation of an interpolating polynomial
making use of
• presentation in Newton basis (5.2.49),
• computation of representation coefficients through divided difference scheme (5.2.54), see Code 5.2.55,
• evaluation by means of Horner scheme, see Code 5.2.58.
7   void PolyEval::divdiff() {
8     int n = t.size();
9     for (int j = 0; j < n-1; j++) y[n-1] = ((y[n-1] - y[j]) / (t[n-1] - t[j]));
10  }
11
If y_0, …, y_n are the values of a smooth function f in the points t_0, …, t_n, that is, y_j := f(t_j), then

y[t_i, …, t_{i+k}] = f^{(k)}(ξ) / k!

for a certain ξ ∈ [t_i, t_{i+k}], see [?, Thm. 8.21].
This section addresses a major shortcoming of polynomial interpolation in case the interpolation knots ti
are imposed, which is usually the case when given data points have to be interpolated, cf. Ex. 5.1.5.
Nodes: T := { t_j = −5 + (10/n) j }_{j=0}^{n} (equidistant in [−5, 5]), values y_j = 1/(1 + t_j²), j = 0, …, n.
(Figure: data points y_j and polynomial interpolant p(t).)
In Section 2.2.2 we introduced the concept of sensitivity to describe how perturbations of the data affect the
output for a problem map as defined in § 1.5.67. Concretely, in Section 2.2.2 we discussed the sensitivity
for linear systems of equations. Motivated by Ex. 5.2.63, we now examine the sensitivity of Lagrange
interpolation with respect to perturbations in the data values.
Thus, the (pointwise) sensitivity of polynomial interpolation will tell us to what extent perturbations in the y-
data will affect the values of the interpolating function somewhere else. In the case of high sensitivity small
perturbations in the data can cause big variations in some function values, which is clearly undesirable.
Necessary for studying sensitivity of polynomial interpolation in quantitative terms are norms (→ Def. 1.5.70)
on the vector space of continuous functions C( I ), I ⊂ R . The following norms are the most relevant:
In § 5.1.13 we have learned that (polynomial) interpolation gives rise to a linear problem map, see
Def. 5.1.17. For this class of problem maps the investigation of sensitivity has to study operator norms, a
generalization of matrix norms (→ Def. 1.5.76).
Let L : X → Y be a linear problem map between two normed spaces, the data space X (with norm k·k X )
and the result space Y (with norm k·kY ). Thanks to linearity, perturbations of the result y := L(x) for the
input x ∈ X can be expressed as follows:
L(x + δx) = L(x) + L(δx) = y + L(δx) .
Hence, the sensitivity (in terms of propagation of absolute errors) can be measured by the operator norm
‖L‖_{X→Y} := sup_{δx ∈ X∖{0}} ‖L(δx)‖_Y / ‖δx‖_X .   (5.2.70)
This can be read as the “matrix norm of L”, cf. Def. 1.5.76.
It seems challenging to compute the operator norm (5.2.70) for L = IT (IT the Lagrange interpolation
operator for node set T ⊂ I ), X = R n+1 (equipped with a vector norm), and Y = C( I ) (endowed with a
norm from § 5.2.65). The next lemma will provide surprisingly simple concrete formulas.
‖I_T‖_{∞→∞} := sup_{y ∈ R^{n+1}∖{0}} ‖I_T(y)‖_{L^∞(I)} / ‖y‖_∞ = ‖ ∑_{i=0}^{n} |L_i| ‖_{L^∞(I)} ,   (5.2.72)

‖I_T‖_{2→2} := sup_{y ∈ R^{n+1}∖{0}} ‖I_T(y)‖_{L^2(I)} / ‖y‖_2 ≤ ( ∑_{i=0}^{n} ‖L_i‖²_{L^2(I)} )^{1/2} .   (5.2.73)
‖I_T(y)‖_{L^∞(I)} = ‖ ∑_{j=0}^{n} y_j L_j ‖_{L^∞(I)} ≤ sup_{t∈I} ∑_{j=0}^{n} |y_j| |L_j(t)| ≤ ‖y‖_∞ ‖ ∑_{i=0}^{n} |L_i| ‖_{L^∞(I)} .   ✷

Terminology: Lebesgue constant of T:   λ_T := ‖ ∑_{i=0}^{n} |L_i| ‖_{L^∞(I)} = ‖I_T‖_{∞→∞}
Lebesgue constant for uniformly spaced nodes:  λ_T ≥ C e^{n/2} .
(Fig. 174: Lebesgue constant λ_T versus polynomial degree n for Chebychev nodes and for equidistant nodes, semi-logarithmic plot; the values for equidistant nodes grow dramatically faster.)
Note: In Code 5.2.75 the norm ‖L_i‖_{L∞(I)} can be computed only approximately, by taking the maximum modulus of function values in many sampling points.
In Ex. 5.1.5 we learned that interpolation is an important technique for obtaining a mathematical (and al-
gorithmic) description of a constitutive relationship from measured data.
If the interpolation operator is poorly conditioned, tiny measurement errors will lead to big (local) deviations
of the interpolant from its “true” form.
Since measurement errors are inevitable, poorly conditioned interpolation procedures are useless for de-
termining constitutive relationships from measurements.
When reconstructing a quantitative dependence of quantities from measurements, first principles from
physics often stipulate qualitative constraints, which translate into shape properties of the function f , e.g.,
when modelling the material law for a gas:
Fig. 175
The section is about “shape preservation”. In the previous example we have already seen a few properties
that constitute the “shape” of a function: sign, monotonicity and curvature. Now we have to identify
analogous properties of data sets in the form of sequences of interpolation points (t j , y j ), j = 0, . . . , n, t j
pairwise distinct.
Convex (concave) data:
Δ_j ≤ Δ_{j+1}  (≥) ,  j = 1, ..., n−1 ,   where  Δ_j := (y_j − y_{j−1}) / (t_j − t_{j−1}) ,  j = 1, ..., n .
(Fig. 176: convex data; Fig. 177: convex function.)
Shape preservation:  convex data −→ convex interpolant f .
More ambitious goal: local shape preserving interpolation, i.e., the above implications should hold on each subinterval I' = (t_i, t_{i+j}) for the data contained in it.
We perform Lagrange interpolation for the following positive and monotonic data:
  t_i : −1.0000  −0.6400  −0.3600  −0.1600  −0.0400  0.0000  0.0770  0.1918  0.3631  0.6187  1.0000
  y_i :  0.0000   0.0000   0.0039   0.1355   0.2871  0.3455  0.4639  0.6422  0.8678  1.0000  1.0000
created by taking points on the graph of
  f(t) = 0                              if t < −2/5 ,
  f(t) = ½ (1 + cos(π(t − 3/5)))        if −2/5 < t < 3/5 ,
  f(t) = 1                              otherwise.
(Figure: interpolating polynomial of degree 10 through the measurement points, together with the natural f.)
Observations:
• Oscillations at the endpoints of the interval (see Fig. 173)
• No locality
• No positivity
• No monotonicity
• No local conservation of the curvature
There is a very simple method of achieving perfect shape preservation by means of a linear (→ § 5.1.13)
interpolation operator into the space of continuous functions:
Then the piecewise linear interpolant s : [t0 , tn ] → R is defined as, cf. Ex. 5.1.10:
s(t) = ( (t_{i+1} − t) y_i + (t − t_i) y_{i+1} ) / (t_{i+1} − t_i)   for  t ∈ [t_i, t_{i+1}] .   (5.3.8)
(Fig. 178: piecewise linear interpolant through data points over the nodes t_0, t_1, t_2, t_3, t_4.)
Piecewise linear interpolation means simply “connect the data points in R 2 using straight lines”.
Obvious: linear interpolation is linear (as mapping y 7→ s, see Def. 5.1.17) and local in the following
sense:
Equally obvious are the properties asserted in the following theorem. The local preservation of curvature is a straightforward consequence of Def. 5.3.4.
Bad news: none of these properties carries over to local polynomial interpolation of higher polynomial degree d > 1.
From Thm. 5.2.14 we know that a parabola (polynomial of degree 2) is uniquely determined by 3 data
points. Thus, the idea is to form groups of three adjacent data points and interpolate each of these triplets
by a 2nd-degree polynomial (parabola).
Assume: n = 2m even
piecewise quadratic interpolant q : [min{ti }, max{ti }] 7→ R is defined by
(Fig. 179: nodes as in Exp. 5.3.7, piecewise linear interpolant and piecewise quadratic interpolant.)
No shape preservation for the piecewise quadratic interpolant.
However: the interpolant usually serves as input for other numerical methods, like Newton's method for solving non-linear systems of equations, see Section 8.4, which requires derivatives.
Aim: construct a local, shape-preserving (→ Section 5.3) (linear?) interpolation operator that fixes the shortcoming of piecewise linear interpolation by ensuring C¹-smoothness of the interpolant.
✎ notation: C1 ([ a, b]) =
ˆ space of continuously differentiable functions [ a, b] 7→ R.
Given data points (t j , y j ) ∈ R × R , j = 0, . . . , n, with pairwise distinct ordered nodes t j , and slopes
c j ∈ R, the piecewise cubic Hermite interpolant s : [t0 , tn ] → R is defined by the requirements
s|[ti−1,ti ] ∈ P3 , i = 1, . . . , n , s ( ti ) = y i , s ′ ( ti ) = c i , i = 0, . . . , n .
Piecewise cubic Hermite interpolants are continuously differentiable on their interval of definition.
Proof. The assertion of the corollary follows from the agreement of function values and first derivative
values on nodes shared by two intervals, on each of which the piecewise cubic Hermite interpolant is a
polynomial of degree 3.
✷
Locally, we can write a piecewise cubic Hermite interpolant as a linear combination of generalized cardinal basis functions with coefficients supplied by the data values y_j and the slopes c_j:
H_1(t) := φ( (t_i − t)/h_i ) ,   H_2(t) := φ( (t − t_{i−1})/h_i ) ,
H_3(t) := −h_i ψ( (t_i − t)/h_i ) ,   H_4(t) := h_i ψ( (t − t_{i−1})/h_i ) ,
h_i := t_i − t_{i−1} ,   φ(τ) := 3τ² − 2τ³ ,   ψ(τ) := τ³ − τ² .   (5.4.5)
(Fig. 180: the local basis polynomials H_1, H_2, H_3, H_4 on [0, 1].)
By tedious, but straightforward computations using the chain rule we find the following values for Hk and
Hk′ at the endpoints of the interval [ti −1 , ti ].
        H(t_{i−1})   H(t_i)   H′(t_{i−1})   H′(t_i)
  H_1        1          0           0           0
  H_2        0          1           0           0
  H_3        0          0           1           0
  H_4        0          0           0           1
This amounts to a proof for (5.4.4) (why?).
The formula (5.4.4) is handy for the local evaluation of piecewise cubic Hermite interpolants. The function hermloceval in Code 5.4.6 performs the efficient evaluation (in multiple points) of the piecewise cubic polynomial s on [t_1, t_2] uniquely defined by the constraints s(t_1) = y_1, s(t_2) = y_2, s′(t_1) = c_1, s′(t_2) = c_2:
5. Data Interpolation and Data Fitting in 1D, 5.4. Cubic Hermite Interpolation 404
NumCSE, AT’15, Prof. Ralf Hiptmair c SAM, ETH Zurich, 2015
                   double c1, double c2) {
  const double h = t2 - t1, a1 = y2 - y1, a2 = a1 - h*c1, a3 = h*c2 - a1 - a2;
  t = ((t.array() - t1)/h).matrix();
  return (y1 + (a1 + (a2 + a3*t.array())*(t.array() - 1))*t.array()).matrix();
}
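A hypothetical invocation might look as follows; the full signature of hermloceval is an assumption here, since only the tail of its declaration is reproduced above:

// assumed signature:
// Eigen::VectorXd hermloceval(Eigen::VectorXd t, double t1, double t2,
//                             double y1, double y2, double c1, double c2);
Eigen::VectorXd tau = Eigen::VectorXd::LinSpaced(100, 0.0, 1.0); // evaluation points in [t1, t2]
Eigen::VectorXd s = hermloceval(tau, 0.0, 1.0, 1.0, 2.0, 0.0, -1.0); // y1=1, y2=2, c1=0, c2=-1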
However, the data for an interpolation problem (→ Section 5.1) are merely the interpolation points (t_j, y_j), j = 0, ..., n, but not the slopes of the interpolant at the nodes. Thus, in order to define an interpolation operator into the space of piecewise cubic Hermite functions, we have to supply a mapping R^{n+1} × R^{n+1} → R^{n+1} computing the slopes c_j from the data points.
Since this mapping should be local it is natural to rely on (weighted) averages of the local slopes ∆ j (→
Def. 5.3.4) of the data, for instance
c_i =  Δ_1   for i = 0 ,
       (t_{i+1} − t_i)/(t_{i+1} − t_{i−1}) · Δ_i + (t_i − t_{i−1})/(t_{i+1} − t_{i−1}) · Δ_{i+1}   if 1 ≤ i < n ,
       Δ_n   for i = n ,
with  Δ_j := (y_j − y_{j−1}) / (t_j − t_{j−1}) ,  j = 1, ..., n .   (5.4.8)
“Local” means, that, if the values y j are non-zero for only a few adjacent data points with indices j =
k, . . . , k + m, m ∈ N small, then the Hermite interpolant s is supported on [tk−ℓ , tk+m+ℓ ] for small ℓ ∈ N
independent of k and m.
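For illustration, the slope choice (5.4.8) can be transcribed directly into code (a sketch under the stated indexing conventions, not library code):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Slopes c_i from the weighted averages (5.4.8) of the data slopes Delta_j.
VectorXd slopesWeightedAvg(const VectorXd& t, const VectorXd& y) {
  const int n = t.size() - 1;
  VectorXd delta(n), c(n + 1);
  for (int j = 1; j <= n; ++j)
    delta(j - 1) = (y(j) - y(j - 1)) / (t(j) - t(j - 1));   // Delta_j
  c(0) = delta(0);                                          // c_0 = Delta_1
  c(n) = delta(n - 1);                                      // c_n = Delta_n
  for (int i = 1; i < n; ++i) {
    const double w = t(i + 1) - t(i - 1);
    c(i) = (t(i + 1) - t(i)) / w * delta(i - 1) + (t(i) - t(i - 1)) / w * delta(i);
  }
  return c;
}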
Data points:
✦ 11 equispaced nodes
t j = −1 + 0.2 j, j = 0, . . . , 10.
f ( x ) := sin(5x ) e x .
Fig. 181
Invocation:
auto f = [](double x) { return sin(5*x)*exp(x); };
Eigen::VectorXd t = Eigen::VectorXd::LinSpaced(11, -1, 1); // 11 equispaced nodes
hermintp(f, t);
From Ex. 5.4.9 we learn that, if the slopes are chosen according to Eq. (5.4.8), then the resulting Hermite interpolation does not preserve monotonicity.
Consider the situation sketched on the right ✄: the red circles (•) represent data points, the blue line (—) the piecewise linear interpolant → Section 5.3.2.
From the discussion of Fig. 182 and Fig. 183 it is clear that local monotonicity preservation entails that the
local slopes ci of a cubic Hermite interpolant (→ Def. 5.4.1) have to fulfill
c_i =  0                                   if sgn(Δ_i) ≠ sgn(Δ_{i+1}) ,
       some “average” of Δ_i, Δ_{i+1}      otherwise ,
  i = 1, ..., n−1 .   (5.4.12)
✎ notation: sign function  sgn(ξ) = 1 if ξ > 0 ,  0 if ξ = 0 ,  −1 if ξ < 0 .
A slope selection rule that enforces (5.4.12) is called a limiter.
Of course, testing for equality with zero does not make sense for data that may be affected by measure-
ment or roundoff errors. Thus, the “average” in (5.4.12) must be close to zero already when either ∆i ≈ 0
A suitable “average” is a weighted harmonic mean
c_i = ( w_a/Δ_i + w_b/Δ_{i+1} )^{−1} ,
because the harmonic mean acts as a “smoothed min(·,·)-function”: if Δ_i → 0 or Δ_{i+1} → 0, then also c_i → 0. (Fig. 184: level lines of the harmonic mean of a and b for w_a = w_b = 1/2.)
A good choice of the weights is:
w_a = (2h_{i+1} + h_i) / (3(h_{i+1} + h_i)) ,   w_b = (h_{i+1} + 2h_i) / (3(h_{i+1} + h_i)) .
This yields the following local slopes, unless (5.4.12) enforces c_i = 0:
c_0 = Δ_1 ,   c_n = Δ_n ,
c_i = 3(h_{i+1} + h_i) / ( (2h_{i+1} + h_i)/Δ_i + (2h_i + h_{i+1})/Δ_{i+1} )   for i ∈ {1, ..., n−1} with sgn(Δ_i) = sgn(Δ_{i+1}) ,   h_i := t_i − t_{i−1} .   (5.4.14)
Piecewise cubic Hermite interpolation with local slopes chosen according to (5.4.12) and (5.4.14) is available through the MATLAB function v = pchip(t,y,x);, where t passes the interpolation nodes, y the corresponding data values, and x is a vector of evaluation points, see doc pchip for details.
(Figure: data points from Exp. 5.3.7 and the piecewise cubic interpolant s(t); plot created with the MATLAB function call v = pchip(t,y,x); — t: data nodes t_j, x: evaluation points x_i, v: vector of values s(x_i).)
Note that the mapping y := [y_0, ..., y_n]^⊤ ↦ c := [c_0, ..., c_n]^⊤ defined by (5.4.12) and (5.4.14) is not linear.
➣ The “pchip interpolation operator” does not provide a linear mapping from the data space R^{n+1} into C¹([t_0, t_n]) (in the sense of Def. 5.1.17).
In fact, the non-linearity of the piecewise cubic Hermite interpolation operator is necessary even for merely global monotonicity preservation:
If, for a fixed node set {t_j}_{j=0}^n, n ≥ 2, an interpolation scheme I : R^{n+1} → C¹(I) is linear as a mapping from data values to continuous functions on the interval covered by the nodes (→ Def. 5.1.17), and monotonicity preserving, then I(y)′(t_j) = 0 for all y ∈ R^{n+1} and j = 1, ..., n−1.
Of course, an interpolant that is flat in all data points, as stipulated by Thm. 5.4.17 for a linear, monotonicity preserving, C¹-smooth interpolation scheme, does not make much sense.
At least, the piecewise cubic Hermite interpolation operator is local (in the sense discussed in § 5.4.7).
The cubic Hermite interpolation polynomial with slopes as in Eq. (5.4.14) provides a local
monotonicity-preserving C1 -interpolant.
Proof. See F. Fritsch and R. Carlson, Monotone piecewise cubic interpolation, SIAM J. Numer. Anal., 17 (1980), pp. 238–246.
✷
The next code demonstrates the calculation of the slopes c_i in MATLAB's pchip (details in [?]):
delta = (y.tail(n - 1) - y.head(n - 1)).cwiseQuotient(h); // linear slopes
c = VectorXd::Zero(n);
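The complete slope selection can be sketched as follows; this is only a transcription of the formulas (5.4.12) and (5.4.14), not MATLAB's actual pchip routine (which, for instance, treats the end slopes differently):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Slopes for monotonicity-preserving cubic Hermite interpolation:
// limited weighted harmonic mean as in (5.4.12), (5.4.14).
VectorXd pchipSlopes(const VectorXd& t, const VectorXd& y) {
  const int n = t.size() - 1;
  VectorXd h(n), delta(n), c = VectorXd::Zero(n + 1);
  for (int i = 1; i <= n; ++i) {
    h(i - 1) = t(i) - t(i - 1);                      // h_i
    delta(i - 1) = (y(i) - y(i - 1)) / h(i - 1);     // Delta_i
  }
  for (int i = 1; i < n; ++i) {
    if (delta(i - 1) * delta(i) > 0) {               // sgn(Delta_i) == sgn(Delta_{i+1})
      const double w1 = 2 * h(i) + h(i - 1), w2 = 2 * h(i - 1) + h(i);
      c(i) = 3 * (h(i) + h(i - 1)) / (w1 / delta(i - 1) + w2 / delta(i));
    }                                                 // otherwise the limiter keeps c_i = 0
  }
  c(0) = delta(0); c(n) = delta(n - 1);               // end slopes as stated in (5.4.14)
  return c;
}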
5.5 Splines
Piecewise cubic Hermite interpolation as presented in Section 5.4 entailed determining reconstruction slopes c_i. Now we learn about a way to do piecewise polynomial interpolation that results in C^k-interpolants, k > 0, and dispenses with auxiliary slopes. The idea is to obtain the missing conditions implicitly from extra continuity conditions.
Obviously, spline spaces are mapped onto each other by differentiation & integration:
s ∈ S_{d,M}   ⇒   s′ ∈ S_{d−1,M}   ∧   t ↦ ∫_a^t s(τ) dτ ∈ S_{d+1,M} .
The dimension of a spline space can be found by a (heuristic) counting argument: we count the number of “degrees of freedom” (d.o.f.) possessed by an M-piecewise polynomial of degree d, and subtract the number of linear constraints implicitly contained in Def. 5.5.1:
dim S_{d,M} = n · dim P_d − #{C^{d−1} continuity constraints} = n · (d+1) − (n−1) · d = n + d .
dim S_{d,M} = n + d .
We already know the special case of interpolation in S1,M , when the interpolation nodes are the knots of
M, because this boils down to simple piecewise linear interpolation, see Section 5.3.2.
Supplementary reading. More details in [?, XIII, 46], [?, Sect. 8.6.1].
Cognitive psychology teaches us that the human eye perceives C²-functions as “smooth”, while it can still spot the abrupt change of curvature at the possible discontinuities of the second derivative of a cubic Hermite interpolant (→ Def. 5.4.1).
For this reason the simplest spline functions featuring C2 -smoothness are of great importance in computer
aided design (CAD). They are the cubic splines, M-piecewise polynomials of degree 3 contained in S3,M
(→ Def. 5.5.1).
In this section we study cubic spline interpolation (related to cubic Hermite interpolation, Section 5.4)
Task: Given a mesh M := {t0 < t1 < · · · < tn }, n ∈ N, “find” a cubic spline s ∈ S3,M that complies
with the interpolation conditions
s(t j ) = y j , j = 0, . . . , n . (5.5.4)
≙ interpolation at knots!
From dimensional considerations it is clear that the interpolation conditions will fail to fix the interpolating
cubic spline uniquely:
“two conditions are missing” ➣ interpolation problem is not yet well defined!
We opt for a linear interpolation scheme (→ Def. 5.1.17) into the spline space S3,M . As explained in
§ 5.1.13, this will lead to an equivalent linear system of equations for expansion coefficients with respect
to a suitable basis.
We reuse the local representation of a cubic spline through the cubic Hermite cardinal basis polynomials from (5.4.5), cf. (5.4.4):
s|_{[t_{j−1},t_j]}(t) = s(t_{j−1}) · (1 − 3τ² + 2τ³) + s(t_j) · (3τ² − 2τ³)
                       + h_j s′(t_{j−1}) · (τ − 2τ² + τ³) + h_j s′(t_j) · (−τ² + τ³) ,   (5.5.6)
with τ := (t − t_{j−1})/h_j and h_j := t_j − t_{j−1}.
Once these slopes are known, the efficient local evaluation of a cubic spline function can be done as for a
cubic Hermite interpolant, see Section 5.4.1, Code 5.4.6.
Note: if s(t j ), s′ (t j ), j = 0, . . . , n, are fixed, then the representation Eq. (5.5.6) already guarantees
s ∈ C1 ([t0 , tn ]), cf. the discussion for cubic Hermite interpolation, Section 5.4.
However, the slopes s′(t_j) are not part of the data; they have to be determined from the requirement s ∈ C²([t_0, t_n]).
From s ∈ C2 ([t0 , tn ]) we obtain n − 1 continuity constraints for s′′ (t) at the internal nodes
s′′|[t j−1,t j ] (t j ) = s′′|[t j ,t j+1] (t j ) , j = 1, . . . , n − 1 . (5.5.7)
with
b_i := 1/h_{i+1} ,   a_i := 2/h_i + 2/h_{i+1} ,   i = 0, 1, ..., n−1 ,   [ b_i, a_i > 0 ,  a_i = 2(b_i + b_{i−1}) ] .
➙ two additional constraints are required, as already noted in § 5.5.3.
To saturate the remaining two degrees of freedom the following three approaches are popular:
Then the first and last column can be removed from the system matrix of (5.5.10). Their products with
c0 and cn , respectively, have to be subtracted from the right hand side of (5.5.10).
(2/h_1) c_0 + (1/h_1) c_1 = 3 (y_1 − y_0)/h_1² ,   (1/h_n) c_{n−1} + (2/h_n) c_n = 3 (y_n − y_{n−1})/h_n² .
Combining these two extra equations with (5.5.10), we arrive at a linear system of equations with
tridiagonal s.p.d. (→ Def. 1.1.8, Lemma 2.8.12) system matrix and unknowns c0 , . . . , cn . Due to
Thm. 2.7.58 it can be solved with an asymptotic computational effort of O(n).
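The O(n) solution of such tridiagonal systems is achieved, e.g., by the Thomas algorithm (tridiagonal Gaussian elimination without pivoting, adequate for the s.p.d. matrices arising here); a generic sketch, not tied to the particular right-hand side of (5.5.10):

#include <Eigen/Dense>
using Eigen::VectorXd;

// Solve a tridiagonal system with subdiagonal lo (length n-1), diagonal di (length n),
// superdiagonal up (length n-1) and right-hand side rhs in O(n) operations.
VectorXd thomasSolve(VectorXd lo, VectorXd di, VectorXd up, VectorXd rhs) {
  const int n = di.size();
  for (int i = 1; i < n; ++i) {          // forward elimination
    const double m = lo(i - 1) / di(i - 1);
    di(i) -= m * up(i - 1);
    rhs(i) -= m * rhs(i - 1);
  }
  VectorXd x(n);
  x(n - 1) = rhs(n - 1) / di(n - 1);
  for (int i = n - 2; i >= 0; --i)       // back substitution
    x(i) = (rhs(i) - up(i) * x(i + 1)) / di(i);
  return x;
}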
➂ Periodic cubic spline interpolation: s′ (t0 ) = s′ (tn ) (➣ c0 = cn ), s′′ (t0 ) = s′′ (tn )
This removes one unknown and adds another equations so that we end up with an n × n-linear
system with s.p.d. (→ Def. 1.1.8) system matrix
A :=  [ a_1    b_1    0     ···    0      b_0
        b_1    a_2    b_2                 0
        0      ⋱      ⋱      ⋱            ⋮
        ⋮             ⋱      ⋱      ⋱     0
        0                    ⋱   a_{n−1}  b_{n−1}
        b_0    0      ···    0   b_{n−1}  a_0 ] ,
with  b_i := 1/h_{i+1} ,  a_i := 2/h_i + 2/h_{i+1} ,  i = 0, 1, ..., n−1 .
This linear system can be solved with rank-1 modification techniques (see § 2.6.13, Lemma 2.6.22) plus tridiagonal elimination: asymptotic computational effort O(n).
MATLAB provides many tools for computing and dealing with splines.
(5.5.14) Extremal properties of natural cubic spline interpolants → [?, Sect. 8.6.1, Property 8.2]
The functional E_bd(f) := ½ ∫_a^b |f″(t)|² dt models the elastic bending energy of a rod whose shape is described by the graph of f (soundness check: zero bending energy for a straight rod). We will show that cubic spline interpolants have minimal bending energy among all C²-smooth interpolating functions.
The natural cubic spline interpolant s minimizes the elastic curvature energy among all interpolating functions in C²([a, b]), that is, E_bd(s) ≤ E_bd(f) for every f ∈ C²([a, b]) with f(t_j) = y_j, j = 0, ..., n.
We show that any small perturbation of s such that the perturbed spline still satisfies the interpolation
conditions leads to an increase in elastic energy.
E_bd(s + k) = ½ ∫_a^b |s″ + k″|² dt   (5.5.16)
            = E_bd(s) + ∫_a^b s″(t) k″(t) dt + ½ ∫_a^b |k″|² dt ,
where the middle term is denoted by I and the last term is ≥ 0.
Scrutiny of I: split into interval contributions, integrate by parts twice, and use s^{(4)} ≡ 0:
I = Σ_{j=1}^{n} ∫_{t_{j−1}}^{t_j} s″(t) k″(t) dt
  = − Σ_{j=1}^{n} ( s‴(t_j^−) k(t_j) − s‴(t_{j−1}^+) k(t_{j−1}) ) + s″(t_n) k′(t_n) − s″(t_0) k′(t_0) = 0 ,
since k(t_j) = 0 for all j (the perturbed function still interpolates) and s″(t_0) = s″(t_n) = 0 for the natural spline.
In light of (5.5.16): no perturbation compatible with the interpolation conditions can make the bending energy of s decrease!
§ 5.5.14: (Natural) cubic spline interpolant provides C2 -curve of minimal elastic bending energy that travels
through prescribed points.
⇕
Nature: a thin elastic rod fixed at certain points attains a shape that minimizes its potential bending energy (virtual work principle of statics).
c_0 := (y_1 − y_0)/(t_1 − t_0) ,   c_n := (y_n − y_{n−1})/(t_n − t_{n−1}) .
(Figure: resulting cubic spline interpolant s(t).)
Remember:
• Lagrange polynomials satisfying (5.2.11) provide cardinal interpolants for polynomial interpolation
→ § 5.2.10. As is clear from Fig. 170, they do not display any decay away from their “base node”.
Rather, they grow strongly. Hence, there is no locality in global polynomial interpolation.
• Tent functions (→ Fig. 169) are the cardinal basis functions for piecewise linear interpolation, see
Ex. 5.1.10. Hence, this scheme is perfectly local, see (5.3.9).
Given a grid M := {t_0 < t_1 < ··· < t_n}, the i-th natural cardinal spline L_i ∈ S_{3,M} is defined by L_i(t_j) = δ_{ij}, j = 0, ..., n (together with the natural end conditions), so that the natural spline interpolant is
s(t) = Σ_{j=0}^{n} y_j L_j(t) .
(Figures: a cardinal cubic spline function (left); its values at the midpoints of the intervals on a semi-logarithmic scale (right).)
Exponential decay of the cardinal splines ➞ cubic spline interpolation is weakly local.
According to Rem. 5.5.18, cubic spline interpolation is neither monotonicity preserving nor curvature preserving. Necessarily so, because it is a linear interpolation scheme, see Thm. 5.4.17.
This section presents a non-linear quadratic spline (→ Def. 5.5.1, C1 -functions) based interpolation
scheme that manages to preserve both monotonicity and curvature of data even in a local sense, cf.
Section 5.3.
Sought:
✦ extended knot set M ⊂ [t0 , tn ] (→ Def. 5.5.1),
✦ an interpolating quadratic spline function s ∈ S2,M , s(ti ) = yi , i = 0, . . . , n
that preserves the “shape” of the data in the sense of § 5.3.2.
Notice that here M ≠ {t_j}_{j=0}^{n}: s interpolates the data in the points t_i but is piecewise polynomial with respect to M! The interpolation nodes will usually not belong to M.
Recall Eq. (5.4.12) and Eq. (5.4.14): we fix the slopes c_i in the nodes using the harmonic mean of the data slopes Δ_j; the final interpolant will be tangent to the corresponding line segments in the points (t_i, y_i). If (t_i, y_i) is a local maximum or minimum of the data, c_i is set to zero (→ § 5.4.11):
Limiter:  c_i := 2 / ( Δ_i^{−1} + Δ_{i+1}^{−1} )   if sign(Δ_i) = sign(Δ_{i+1}) ,   c_i := 0   otherwise ,   i = 1, ..., n−1 ,
c_0 := 2Δ_1 − c_1 ,   c_n := 2Δ_n − c_{n−1} ,
where Δ_j = (y_j − y_{j−1}) / (t_j − t_{j−1}).
(Figures: slopes according to the limited harmonic mean formula for three typical data configurations.)
Rule: Let T_i be the unique straight line through (t_i, y_i) with slope c_i (— in the figure ✄).
☞ If the intersection of T_{i−1} and T_i is non-empty and has a t-coordinate in ]t_{i−1}, t_i], then p_i := t-coordinate of T_{i−1} ∩ T_i;
☞ otherwise p_i := ½(t_{i−1} + t_i).
(Fig. 187: the construction on [t_{i−1}, t_i] with the auxiliary points ½(p_i + t_{i−1}), p_i, ½(p_i + t_i).)
These points will be used to build the knot set for the final quadratic spline:
M′ = { t_0 < ½(t_0 + p_1) < ½(p_1 + t_1) < ½(t_1 + p_2) < ··· < ½(t_{n−1} + p_n) < ½(p_n + t_n) < t_n } .
On M′ we construct a piecewise linear auxiliary spline l with l(t_i) = y_i, l′(t_i) = c_i:
In each interval ( ½(p_j + t_j), ½(t_j + p_{j+1}) ), l corresponds to the segment of slope c_j passing through the data point (t_j, y_j).
In each interval ( ½(t_j + p_{j+1}), ½(p_{j+1} + t_{j+1}) ), l corresponds to the segment connecting the previous ones, see Fig. 187.
l “inherits” local monotonicity and curvature from the data.
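The rule above for the points p_i translates into a few lines of code (a sketch; the tolerance used to detect parallel tangents is an arbitrary choice):

#include <Eigen/Dense>
#include <cmath>
using Eigen::VectorXd;

// Intermediate points p_i: intersection of the tangents T_{i-1}, T_i if it
// lies in ]t_{i-1}, t_i], midpoint of the interval otherwise.
VectorXd tangentIntersections(const VectorXd& t, const VectorXd& y, const VectorXd& c) {
  const int n = t.size() - 1;
  VectorXd p(n);                                    // p_1, ..., p_n stored as p(0..n-1)
  for (int i = 1; i <= n; ++i) {
    double pi = 0.5 * (t(i - 1) + t(i));            // default: midpoint
    const double dc = c(i - 1) - c(i);
    if (std::abs(dc) > 1e-14) {                     // tangents not parallel
      const double s = (y(i) - y(i - 1) + c(i - 1) * t(i - 1) - c(i) * t(i)) / dc;
      if (s > t(i - 1) && s <= t(i)) pi = s;        // accept only if inside ]t_{i-1}, t_i]
    }
    p(i - 1) = pi;
  }
  return p;
}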
Example 5.5.24 (Auxiliary construction for shape preserving quadratic spline interpolation)
(Fig. 188: local slopes c_i, i = 0, ..., n.  Fig. 189: linear auxiliary spline l.)
Lemma 5.5.25.
If g is a linear spline through the three points
(a, y_a) ,  (½(a + b), w) ,  (b, y_b)   with  a < b ,  y_a, y_b, w ∈ R ,
then the parabola
p(t) := y_a (1 − τ)² + 2wτ(1 − τ) + y_b τ² ,   τ := (t − a)/(b − a)
(the quadratic with Bézier control points (a, y_a), (½(a+b), w), (b, y_b)) satisfies
1. p(a) = y_a ,  p(b) = y_b ,  p′(a) = g′(a) ,  p′(b) = g′(b),
2. g monotonically increasing / decreasing ⇒ p monotonically increasing / decreasing,
3. g convex / concave ⇒ p convex / concave.
The proof boils down to discussing many cases, as indicated in the following plots:
(Fig. 190–192: linear spline g and parabola p on [a, b] with midpoint ½(a + b) for different configurations of y_a, w, y_b.)
Lemma 5.5.25 implies that the final quadratic spline that passes through the points (t j , y j ) with slopes c j
can be built locally as the parabola p using the linear spline l that plays the role of g in the lemma.
(Fig. 193: the interpolating shape-preserving quadratic spline.)
We examine the shape-preserving quadratic spline that interpolates the data values y_j = 0 for j ≠ i and some y_i ≠ 0, i ∈ {0, ..., n}, on an equidistant node set.
(Fig. 194–196: data and slopes, linear auxiliary spline l, and quadratic spline for this single-spike data set.)
Data from [?]:
  t_i : 0    1    2    3    4    5    6    7   8   9   10  11  12
  y_i : 0   0.3  0.5  0.2  0.6  1.2  1.3   1   1   1    1   0  −1
(Fig. 200–202: the corresponding plots for this data set.)
We assume the period T > 0 to be known and ti ∈ [0, T [ for all interpolation nodes ti , i = 0, . . . , n.
Task: Given data points (t_i, y_i), y_i ∈ K, t_i ∈ [0, T[, find a T-periodic function f : R → K (the interpolant), f(t + T) = f(t) ∀ t ∈ R, that satisfies the interpolation conditions
f(t_i) = y_i ,  i = 0, ..., n .   (5.6.2)
The most fundamental periodic functions are derived from the trigonometric functions sin and cos and
dilations of them (A dilation of a function t 7→ ψ(t) is a function of the form t 7→ ψ(ct) with some c > 0).
The terminology is natural after recalling expressions for trigonometric functions via complex exponentials
(“Euler’s formula”)
q(t) = α_0 + ½ Σ_{j=1}^{n} { (α_j − ıβ_j) e^{2πıjt} + (α_j + ıβ_j) e^{−2πıjt} }
     = α_0 + ½ Σ_{j=−n}^{−1} (α_{−j} + ıβ_{−j}) e^{2πıjt} + ½ Σ_{j=1}^{n} (α_j − ıβ_j) e^{2πıjt}
     = e^{−2πınt} Σ_{j=0}^{2n} γ_j e^{2πıjt} ,   with   γ_j =  ½(α_{n−j} + ıβ_{n−j})   for j = 0, ..., n−1 ,
                                                               α_0                     for j = n ,            (5.6.6)
                                                               ½(α_{j−n} − ıβ_{j−n})   for j = n+1, ..., 2n .
(After scaling) a trigonometric polynomial of degree 2n is a regular polynomial ∈ P2n (in C) re-
stricted to the unit circle S1 ⊂ C.
Corollary 5.6.8. Dimension of P^T_{2n}
The vector space P^T_{2n} has dimension dim P^T_{2n} = 2n + 1.
We observed that trigonometric polynomials are standard (complex) polynomials in disguise. Next we
can relate trigonometric interpolation to well-known standard Lagrangian interpolation discussed in Sec-
tion 5.2.2. In fact, we slightly extend the method, because now we admit complex interpolation nodes. All
results obtained earlier carry over to this setting.
The key tool is a smooth bijective mapping between I := [0, 1[ and S¹, given by
Φ_{S¹} : [0, 1[ → S¹ ,   Φ_{S¹}(t) := z = exp(−2πıt) .
(Fig. 203: the interval [0, 1[ is wrapped onto the unit circle S¹ in the complex plane.)
Here we deal with a non-affine pullback, but the definition is the same as the one given in (6.1.20) for an affine pullback:
(Φ_{S¹}^{−1})* : C⁰([0, 1[) → C⁰(S¹) ,   ((Φ_{S¹}^{−1})* f)(z) := f(Φ_{S¹}^{−1}(z)) ,  z ∈ S¹ .   (5.6.10)
All theoretical results and algorithms from polynomial interpolation carry over to trigonometric
interpolation
The next code finds the coefficients α j , β j ∈ R of a trigonometric interpolation polynomial in the real-valued
representation (5.6.5) for real-valued data y j ∈ R by simply solving the linear system of equations arising
from the interpolation conditions (5.6.2).
The asymptotic computational effort of this implementation is dominated by the cost for Gaussian elimina-
tion applied to a fully populated (dense) matrix, see Thm. 2.5.2: O(n3 ) for n → ∞.
Often time-series data for a time-periodic quantity are measured with a constant rhythm over the entire (known) period of duration T > 0, that is, t_j = jΔt, Δt = T/(n+1), j = 0, ..., n. In this case, the formulas for computing the coefficients of the interpolating trigonometric polynomial (→ Def. 5.6.3) become special versions of the discrete Fourier transform (DFT, see Def. 4.2.18) studied in Section 4.2. An efficient implementation can thus harness the speed of the FFT introduced in Section 4.3.
Now: 1-periodic setting, uniformly distributed interpolation nodes t_k = k/(2n+1), k = 0, ..., 2n.
(2n+1)×(2n+1) linear system of equations:
Σ_{j=0}^{2n} γ_j exp( 2πı jk/(2n+1) ) = (b)_k := exp( 2πı nk/(2n+1) ) y_k ,  k = 0, ..., 2n .   (5.6.13)
⇕
F_{2n+1} c = b ,  c = [γ_0, ..., γ_{2n}]^⊤   ⇒ (Lemma 4.2.14)   c = (1/(2n+1)) F_{2n+1} b ,   (5.6.14)
with F_{2n+1} ≙ (2n+1)×(2n+1) (conjugate) Fourier matrix, see (4.2.13).
Fast solution by means of FFT: O(n log n) asymptotic complexity, see Section 4.3
// Computes expansion coefficients of trigonometric polynomials (5.6.4) through
// the interpolation points (j/(2n+1), y_j), j = 0, ..., 2n.
// IN : y has to be a row vector of odd length, return values are column vectors
// OUT: vectors of expansion coefficients alpha_j, beta_j
//      with respect to the trigonometric basis from Def. 5.6.3
std::pair<VectorXd, VectorXd> trigipequid(const VectorXd& y) {
  using index_t = VectorXcd::Index;
  const index_t N = y.size(), n = (N - 1)/2;
  if (N % 2 != 1) throw "Number of points must be odd!";
  // prepare data for fft
  std::complex<double> M_I(0, 1); // imaginary unit
  // right hand side vector b from (5.6.14)
  VectorXcd b(N);
  for (index_t k = 0; k < N; ++k)
    b(k) = y(k) * std::exp(2*M_PI*M_I*(double(n)/N*k));
  Eigen::FFT<double> fft;   // DFT helper class
  VectorXcd c = fft.fwd(b); // means that "c = fft(b)"
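A hypothetical call of this routine, assuming it is completed so that it returns the coefficient vectors of (5.6.5) as a pair:

Eigen::VectorXd y = Eigen::VectorXd::Random(2*5 + 1); // 2n+1 = 11 data values
auto ab = trigipequid(y); // ab.first = alpha coefficients, ab.second = beta coefficients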
(Figure: tic-toc timings, runtime [s] versus n, for trigpolycoeff and trigipequid.)
function trigipequidtiming
% Runtime comparison between efficient (→ Code ??) and direct computation
% ...
figure; loglog(times(:,1),times(:,2),'b+',...
               times(:,1),times(:,3),'r*');
xlabel('{\bf n}','fontsize',14);
ylabel('{\bf runtime[s]}','fontsize',14);
legend('trigpolycoeff','trigipequid','location','best');

print -depsc2 '../PICTURES/trigipequidtiming.eps';
Same observation as in Ex. 4.3.12: massive gain in efficiency through relying on FFT.
Evaluation of the trigonometric interpolation polynomial at the equidistant points k/N, N > 2n, k = 0, ..., N−1:
q(k/N) = e^{−2πınk/N} Σ_{j=0}^{2n} γ_j exp( 2πı kj/N ) ,  k = 0, ..., N−1 ,   (by (5.6.6))
q(k/N) = e^{−2πıkn/N} (v)_k   with   v = F_N c̃ ,   (5.6.19)
where c̃ ∈ C^N is obtained from c = [γ_0, ..., γ_{2n}]^⊤ by zero padding.
The next code merges the steps of computing the coefficients of the trigonometric interpolation polynomial in equidistant points and of its evaluation in another set of equidistant points.
C++11 code 5.6.21: Equidistant points: fast on-the-fly evaluation of trigonometric interpolation polynomial
// Evaluation of the trigonometric interpolation polynomial through (j/(2n+1), y_j), j = 0, ..., 2n,
// in the equidistant points k/N, k = 0, ..., N-1
// IN : y = vector of values to be interpolated
//      q (COMPLEX!) will be used to save the return values
void trigpolyvalequid(const VectorXd y, const int M, VectorXd& q) {
  const int N = y.size();
  if (N % 2 == 0) {
    std::cerr << "Number of points must be odd!\n";
    return;
  }
  const int n = (N - 1) / 2;
  // computing coefficients gamma_j, see (5.6.14)
  VectorXcd a, b;
  trigipequid(y, a, b);
  // ...
  // zero padding
  VectorXcd ch(M); ch << gamma, VectorXcd::Zero(M - (2*n + 1));
As remarked in Ex. 5.1.5, the basic assumption underlying the reconstruction of the functional dependence of two quantities by means of interpolation is that of accurate data. In case of data uncertainty or measurement errors the exact satisfaction of interpolation conditions ceases to make sense, and we are better off reconstructing a fitting function that is merely “close to the data” in a sense to be made precise next.
The task of (multidimensional, vector-valued) least squares data fitting can be described as follows:
The function f is called the (best) least squares fit for the data in S.
Consider a special variant of the general least squares data fitting problem: The set S of admissible
continuous functions is now chosen as a finite-dimensional vector space Vn ⊂ C0 ( D ), dim Vn = n ∈ N,
cf. the discussion in § 5.1.13 for interpolation.
The best least squares fit f ∈ Vn can be represented by a finite linear combination of the basis
functions b j :
f(t) = Σ_{j=1}^{n} x_j b_j(t) ,   x_j ∈ R .   (5.7.4)
V_n = W × ··· × W   (d factors) ,   (5.7.5)
dim V_n = d · dim W .
The vector-valued functions obtained by multiplying each basis function of W with a unit vector e_i, i = 1, ..., d, form a basis of V_n (e_i ≙ i-th unit vector).
We adopt the setting of § 5.7.3 of an n-dimensional space V_n of admissible functions with basis {b_1, ..., b_n}. Then the least squares data fitting problem can be recast as follows.
Given:
✦ data points (ti , yi ) ∈ R k × R d , i = 1, . . . , m
✦ basis functions b j : D ⊂ R k 7→ R, j = 1, . . . , n, n < m
(x_1, ..., x_n) = argmin_{z_j ∈ R^d} Σ_{i=1}^{m} ‖ Σ_{j=1}^{n} z_j b_j(t_i) − y_i ‖_2² .   (5.7.8)
Special cases:
• If Vn is a product space according to (5.7.5) with basis (5.7.6), then (5.7.8) amounts to finding
vectors x j ∈ R d , j = 1, . . . , ℓ with
(x_1, ..., x_ℓ) = argmin_{z_j ∈ R^d} Σ_{i=1}^{m} ‖ Σ_{j=1}^{ℓ} z_j q_j(t_i) − y_i ‖_2² .   (5.7.10)
Example 5.7.11 (Linear parameter estimation = linear data fitting → Ex. 3.0.5, Ex. 3.1.5)
The linear parameter estimation/linear regression problem presented in Ex. 3.0.5 can be recast as a linear
data fitting problem with
Linear (least squares) data fitting leads to an overdetermined linear system of equations for which we seek
a least squares solution (→ Def. 3.1.3) as in Section 3.1.1. To see this rewrite
Σ_{i=1}^{m} ‖ Σ_{j=1}^{n} z_j b_j(t_i) − y_i ‖_2²  =  Σ_{i=1}^{m} Σ_{r=1}^{d} ( Σ_{j=1}^{n} b_j(t_i) (z_j)_r − (y_i)_r )² .
In the one-dimensional, scalar case (k = 1, d = 1) of (5.7.9) the related overdetermined linear system of equations is
[ b_1(t_1) ... b_n(t_1) ]         [ y_1 ]
[    ⋮           ⋮      ]   x  =  [  ⋮  ]  .   (5.7.15)
[ b_1(t_m) ... b_n(t_m) ]         [ y_m ]
Having reduced the linear least squares data fitting problem to finding the least squares solution of an
overdetermined linear system of equations, we can now apply theoretical results about least squares
solutions, for instance, Cor. 3.1.22. The key issue is, whether the coefficient matrix of (5.7.15) has full rank
n. Of course, this will depend on the location of the ti .
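In code, this reduction can be sketched as follows (a hypothetical helper, solving the overdetermined system (5.7.15) in the least squares sense with a QR decomposition from Eigen):

#include <Eigen/Dense>
#include <functional>
#include <vector>

// Scalar, one-dimensional linear least squares fit: coefficients x minimizing
// sum_i ( sum_j x_j b_j(t_i) - y_i )^2, cf. (5.7.15).
Eigen::VectorXd lsqfit(const Eigen::VectorXd& t, const Eigen::VectorXd& y,
                       const std::vector<std::function<double(double)>>& b) {
  const int m = t.size(), n = b.size();
  Eigen::MatrixXd A(m, n);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      A(i, j) = b[j](t(i));                    // coefficient matrix of (5.7.15)
  return A.colPivHouseholderQr().solve(y);     // least squares solution
}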
Lemma 5.7.16. Unique solvability of linear least squares fitting problem
The scalar one-dimensional linear least squares fitting problem (5.7.9) with dim Vn = n, Vn the
vector space of admissible functions, has a unique solution, if and only if there are ti1 , . . . , tin such
that
[ b_1(t_{i_1}) ... b_n(t_{i_1}) ]
[      ⋮              ⋮         ]  ∈ R^{n,n}   is invertible,   (5.7.17)
[ b_1(t_{i_n}) ... b_n(t_{i_n}) ]
Equivalent to (5.7.17) is the requirement that there is an n-subset of {t_1, ..., t_m} such that the corresponding interpolation problem for V_n has a unique solution for any data values y_i.
Special variant of scalar (d = 1), one-dimensional (k = 1) linear data fitting (→ § 5.7.7): we choose the space of admissible functions as polynomials of degree n − 1,
V_n = P_{n−1} ,  e.g. with basis b_j(t) = t^{j−1} (monomial basis, Section 5.2.1).
The coefficient matrix of (5.7.15) then has the entries t_i^{j−1} and, for m ≥ n, has full rank, because it contains invertible Vandermonde matrices (5.2.18), Rem. 5.2.17.
The next code demonstrates the computation of the fitting polynomial with respect to the monomial basis
of Pn−1 :
The function polyfit returns a vector [ x1 , x2 , . . . , xn ]⊤ describing the fitting polynomial according to the
convention
p(t) = x_1 t^{n−1} + x_2 t^{n−2} + ··· + x_{n−1} t + x_n .   (5.7.21)
///////////////////////////////////////////////////////////////////////////
/// Demonstration code for lecture "Numerical Methods for CSE" @ ETH Zurich
/// (C) 2016 SAM, D-MATH
/// Author(s): Xiaolin Guo, Julien Gacon
/// Repository: https://gitlab.math.ethz.ch/NumCSE/NumCSE/
/// Do not remove this header.
//////////////////////////////////////////////////////////////////////////

#include <Eigen/Dense>
#include <figure/figure.hpp>
#include <polyfit.hpp> // NCSE's polyfit (equivalent to Matlab's)
#include <polyval.hpp> // NCSE's polyval (equivalent to Matlab's)

// Comparison of polynomial interpolation and polynomial fitting
// ("Quick and dirty", see 5.2.3)
int main() {
  // use C++ lambda functions to define the Runge function f(x) = 1/(1+x^2)
  auto f = [](const Eigen::VectorXd& x) {
    return (1. / (1 + x.array() * x.array())).matrix();
  };

  const unsigned d = 10;            // polynomial degree
  Eigen::VectorXd tip(d + 1);       // d+1 nodes for interpolation
  for (unsigned i = 0; i <= d; ++i)
    tip(i) = -5 + i * 10. / d;

  Eigen::VectorXd tft(3 * d + 1);   // 3d+1 nodes for polynomial fitting
  for (unsigned i = 0; i <= 3 * d; ++i)
    tft(i) = -5 + i * 10. / (3 * d);

  Eigen::VectorXd ftip = f(Eigen::VectorXd::Ones(2));
  Eigen::VectorXd pip = polyfit(tip, f(tip), d), // interpolating polynomial (deg = d)
                  pft = polyfit(tft, f(tft), d); // fitting polynomial (deg = d)

  Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(1000, -5, 5);
  mgl::Figure fig;
  fig.plot(x, f(x), "g|").label("Function f");
  fig.plot(x, polyval(pip, x), "b").label("Interpolating polynomial");
  fig.plot(x, polyval(pft, x), "r").label("Fitting polynomial");
  fig.plot(tip, f(tip), " b*");
  fig.save("interpfit");

  return 0;
}
(Fig. 205: data from the function f(t) = 1/(1 + t²) on [−5, 5]; the function f, the interpolating polynomial, and the fitting polynomial.)
Learning outcomes
• know the details of cubic Hermite interpolation and how to ensure that it is monotonicity preserving.
• know what splines are and how cubic spline interpolation with different endpoint constraints works.
Chapter 6
Approximation of Functions in 1D
‖f − f̃‖ is small for some norm ‖·‖ on the space C⁰(D) of (piecewise) continuous functions, for instance
✦ the supremum norm ‖g‖_∞ := ‖g‖_{L∞(D)} := max_{x∈D} |g(x)|, see (5.2.66).
Below we consider only the case n = d = 1: approximation of scalar-valued functions defined on an interval. The techniques can be applied componentwise in order to cope with the case of vector-valued functions (d > 1).
A faster alternative is the advance approximation of the function U ↦ I(U) based on a few computed values I(U_i), i = 0, ..., n, followed by the fast evaluation of the approximant U ↦ Ĩ(U) during actual circuit simulations. This is an example of model reduction by approximation of functions: a complex subsystem in a mathematical model is replaced by a surrogate function.
In this example we also encounter a typical situation: we have nothing at our disposal but, possibly expensive, point evaluations of the function U ↦ I(U) (U ↦ I(U) in “procedural form”, see Rem. 5.1.6). The number of evaluations of I(U) will largely determine the cost of building Ĩ.
This application displays a fundamental difference compared to the reconstruction of constitutive relation-
ships from a priori measurements → Ex. 5.1.5: Now we are free to choose the number and location of the
data points, because we can simply evaluate the function U 7→ I (U ) for any U and as often as needed.
C++11 code 6.0.4: Class describing a 2-port circuit element for circuit simulation
class CircuitElement {
 private:
  // internal data describing U -> I~(U)
 public:
  // Constructor taking some parameters and building I~
  CircuitElement(const Parameters& P);
  // Point evaluation operators for I~ and d/dU I~
  double I(double U) const;
  double dIdU(double U) const;
};
We define an abstract concept for the sake of clarity: When in this chapter we talk about an “approximation
scheme” (in 1D) we refer to a mapping A : X 7→ V , where X and V are spaces of functions I 7→ K,
I ⊂ R an interval.
Examples are
• X = Ck ( I ), the spaces of functions I 7→ K that are k times continuously differentiable, k ∈ N.
• V = P_m(I), the space of polynomials of degree ≤ m, see Section 5.2.1
• V = Sd,M , the space of splines of degree d on the knot set M ⊂ I , see Def. 5.5.1.
In Chapter 5 we discussed ways to construct functions whose graph runs through given data points, see Section 5.1. We can hope that the interpolant will approximate the function, if the data points are also located on the graph of that function. Thus every interpolation scheme, see § 5.1.4, spawns a corresponding approximation scheme:
f : I ⊂ R → K   —sampling→   (t_i, y_i := f(t_i))_{i=0}^{m}   —interpolation→   f̃ := I_T y   ( f̃(t_i) = y_i ) .
In this chapter we will mainly study approximation by interpolation relying on the interpolation schemes
(→ § 5.1.4) introduced in Section 5.2, Section 5.4, and Section 5.5.
There is additional freedom compared to data interpolation: we can choose the interpolation nodes in a smart way in order to obtain an accurate interpolant f̃.
Approximation and interpolation (→ Chapter 5) are key components of many numerical methods, like for
integration, differentiation and computation of the solutions of differential equations, as well as for computer
graphics and generation of smooth curves and surfaces.
Contents
6.1 Approximation by Global Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 434
6.1.1 Polynomial approximation: Theory . . . . . . . . . . . . . . . . . . . . . . . 435
6.1.2 Error estimates for polynomial interpolation . . . . . . . . . . . . . . . . . . 441
6.1.3 Chebychev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
6.1.3.1 Motivation and definition . . . . . . . . . . . . . . . . . . . . . . . 451
6.1.3.2 Chebychev interpolation error estimates . . . . . . . . . . . . . . . 456
6.1.3.3 Chebychev interpolation: computational aspects . . . . . . . . . . 461
6.2 Mean Square Best Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.2.1 Abstract theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.2.1.1 Mean square norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.2.1.2 Normal equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
6.2.1.3 Orthonormal bases . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
6.2.2 Polynomial mean square best approximation . . . . . . . . . . . . . . . . . . 470
The space Pk of polynomials of degree ≤ k has been introduced in Section 5.2.1. For reasons listed in
§ 5.2.3 polynomials are the most important theoretical and practical tool for the approximation of functions.
The next example presents an important case of approximation by polynomials.
The local approximation of sufficiently smooth functions by polynomials is a key idea in calculus, which
manifests itself in the importance of approximation by Taylor polynomials: For f ∈ Ck ( I ), k ∈ N, I ⊂ R
an interval, we approximate
f(t) ≈ T_k(t) := Σ_{j=0}^{k} ( f^{(j)}(t_0) / j! ) (t − t_0)^j ,   for some t_0 ∈ I .
✎ Notation: f (k) =
ˆ k-th derivative of function f : I ⊂ R → K
f(t) − T_k(t) = ∫_{t_0}^{t} f^{(k+1)}(τ) (t − τ)^k / k!  dτ   (6.1.2a)
             = f^{(k+1)}(ξ) (t − t_0)^{k+1} / (k+1)! ,   ξ = ξ(t, t_0) ∈ ] min(t, t_0), max(t, t_0) [ ,   (6.1.2b)
which shows that for f ∈ Ck+1 ( I ) the Taylor polynomial Tk is pointwise close to f ∈ Ck+1 ( I ), if the
interval I is small and f (k+1) is bounded pointwise.
Approximation by Taylor polynomials is easy and direct but inefficient: a polynomial of lower degree often
gives the same accuracy. Moreover, when f is available only in procedural form as double f(double),
(approximations of) higher order derivatives are difficult to obtain.
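As a simple illustration of (6.1.2b) (a standard example, not taken from the lecture): for f(t) = e^t and t_0 = 0 the remainder formula yields
\[
\Bigl| e^{t} - \sum_{j=0}^{k} \frac{t^{j}}{j!} \Bigr| \;=\; e^{\xi}\,\frac{|t|^{k+1}}{(k+1)!} \;\le\; e^{|t|}\,\frac{|t|^{k+1}}{(k+1)!} ,
\]
so on every fixed bounded interval the Taylor polynomials converge uniformly, and faster than any algebraic rate, as k → ∞.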
Obviously, for every interval I ⊂ R , the spaces of polynomials are nested in the following sense:
P_0 ⊂ P_1 ⊂ ··· ⊂ P_m ⊂ P_{m+1} ⊂ ··· ⊂ C^∞(I) ,   (6.1.4)
With this family of nested spaces of polynomials at our disposal, it is natural to study associated families
of approximation schemes, one for each degree, mapping into Pm , m ∈ N0 .
Sloppily speaking, according to (6.1.2b) the Taylor polynomials from Ex. 6.1.1 provide uniform (→ § 6.0.1)
approximation of a smooth function f in (small) intervals, provided that its derivatives do not blow up “too
fast” (We do not want to make this precise here).
The question is, whether polynomials still offer uniform approximation on arbitrary bounded closed inter-
vals and for functions that are merely continuous, but not any smoother. The answer is YES and this
profound result is known as the Weierstrass Approximation Theorem. Here we give an extended version
with a concrete formula due to Bernstein, see [?, Section 6.2].
✎ Notation: g(k) =
ˆ k-th derivative of a function g : I ⊂ R → K.
Plots of the Bernstein polynomials B_j^n of degree n = 7, j = 0, ..., 7 ✄ (Fig. 207).
Σ_{j=0}^{n} B_j^n(t) ≡ 1 ,   (6.1.9)
0 ≤ B_j^n(t) ≤ 1   ∀ 0 ≤ t ≤ 1 .   (6.1.10)
(Fig. 208: Bernstein polynomials B_j^n for various degrees n and indices j.)
✁ Since (d/dt) B_j^n(t) = B_j^n(t) ( j/t − (n−j)/(1−t) ), B_j^n has its unique local maximum in [0, 1] at the site t_max := j/n. As n → ∞ the Bernstein polynomials become more and more concentrated around the maximum.
Proof. (of Thm. 6.1.6, first part) Fix t ∈ [0, 1]. Using the notations from (6.1.7) and the identity (6.1.9) we find
f(t) − p_n(t) = Σ_{j=0}^{n} ( f(t) − f(j/n) ) B_j^n(t) .   (6.1.11)
As we see from Fig. 208, for large n the bulk of the sum is contributed by the Bernstein polynomials with index j such that j/n ≈ t, because for every δ > 0
Σ_{|j/n−t|>δ} B_j^n(t) ≤ (1/δ²) Σ_{|j/n−t|>δ} (j/n − t)² B_j^n(t) ≤ (1/δ²) Σ_{j=0}^{n} (j/n − t)² B_j^n(t) =(∗) n t(1−t)/(δ² n²) ≤ 1/(4nδ²) .
Here Σ_{|j/n−t|>δ} means summation over j ∈ N₀ with summation indices confined to the set { j : |j/n − t| > δ }. The identity (∗) can be established by direct but tedious computations.
Hence
|f(t) − p_n(t)| ≤ Σ_{|j/n−t|>δ} |f(t) − f(j/n)| B_j^n(t) + Σ_{|j/n−t|≤δ} |f(t) − f(j/n)| B_j^n(t) ,
where the first sum can be estimated with the help of the bound 1/(4nδ²) derived above. Since f is uniformly continuous on [0, 1], given ε > 0 we can choose δ > 0 independently of t such that |f(s) − f(t)| < ε if |s − t| < δ. Then, if we choose n > (εδ²)^{−1}, we can bound
The following plots display the sequences of the polynomials pn for n = 2, . . . , 25.
(Fig. 209, Fig. 210: Bernstein approximants on [0, 1] for n = 2, ..., 25, together with the function f; left f = f_1, right f = f_2.)
We see that the Bernstein approximants “slowly” edge closer and closer to f . Apparently it takes a very
large degree to get really close to f .
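The Bernstein approximant used in these plots can be evaluated with a few lines of code; the explicit form B_j^n(t) = (n choose j) t^j (1−t)^{n−j} of the Bernstein polynomials is assumed here (it is consistent with (6.1.9) and (6.1.10)):

#include <Eigen/Dense>
#include <cmath>
#include <functional>
using Eigen::VectorXd;

// Evaluate the Bernstein approximant p_n(t) = sum_j f(j/n) B_j^n(t) at the points t in [0,1].
VectorXd bernsteinApprox(const std::function<double(double)>& f, int n, const VectorXd& t) {
  VectorXd binom(n + 1);                 // binomial coefficients C(n,j)
  binom(0) = 1.0;
  for (int j = 1; j <= n; ++j) binom(j) = binom(j - 1) * (n - j + 1) / j;
  VectorXd p = VectorXd::Zero(t.size());
  for (int k = 0; k < t.size(); ++k)
    for (int j = 0; j <= n; ++j)
      p(k) += f(double(j) / n) * binom(j) *
              std::pow(t(k), j) * std::pow(1.0 - t(k), n - j);
  return p;
}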
Now we introduce a concept needed to gauge how close an approximation scheme gets to the best
possible performance.
dist_{‖·‖}(f, P_k) := inf_{p ∈ P_k} ‖f − p‖ .
The notation distk·k is motivated by the notation of “distance” as distance to the nearest point in a set.
For the L2 -norm k·k2 and the supremum norm k·k ∞ the best approximation error is well defined for
C = C 0 ( I ).
The polynomial realizing best approximation w.r.t. k·k may neither be unique nor computable with reason-
able effort. Often one is content with rather sharp upper bounds like those asserted in the next theorem,
due to Jackson [?, Thm. 13.3.7].
If f ∈ Cr ([−1, 1]) (r times continuously differentiable), r ∈ N, then, for any polynomial degree
n ≥ r,
inf_{p ∈ P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r ( (n − r)! / n! ) ‖f^{(r)}‖_{L∞([−1,1])} .
with C(r ) dependent on r, but independent of f and, in particular, the polynomial degree n. Using the
Landau symbol from Def. 1.4.5 we can rewrite the statement of (6.1.17) in asymptotic form
What if a polynomial approximation scheme is defined only on a special interval, say [−1, 1]? Then, by the following trick, it can be transferred to any interval [a, b] ⊂ R.
Assume that an interval [a, b] ⊂ R, a < b, and a polynomial approximation scheme Â : C⁰([−1, 1]) → P_n are given. Based on the affine linear mapping
Φ : [−1, 1] → [a, b] ,   Φ(t̂) := a + ½(t̂ + 1)(b − a) = ½(1 − t̂)a + ½(t̂ + 1)b ,   −1 ≤ t̂ ≤ 1 ,   (6.1.19)
(Fig. 211: the reference interval [−1, 1] with coordinate t̂ is mapped onto [a, b] with coordinate t.)
We add the important observations that affine pullbacks are linear and bijective, they are isomorphisms of
the involved vector spaces of functions (what is the inverse?).
If Φ∗ : C0 ([ a, b]) → C0 ([−1, 1]) is an affine pullback according to (6.1.19) and (6.1.20), then
Φ∗ : Pn → Pn is a bijective linear mapping for any n ∈ N0 .
Proof. This is a consequence of the fact that translations and dilations take polynomials to polynomials of
the same degree: for monomials we find
The lemma tells us that the spaces of polynomials of some maximal degree are invariant under affine
pullback. Thus, we can define a polynomial approximation scheme A on C0 ([ a, b]) by
A : C⁰([a, b]) → P_n ,   A := (Φ*)^{−1} ∘ Â ∘ Φ* ,   (6.1.22)
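The construction (6.1.22) is easy to mimic in code if both schemes are represented simply as maps taking a function to a function (an illustrative assumption, not the lecture's data structures):

#include <functional>

using Fn = std::function<double(double)>;   // scalar function of one variable
using Scheme = std::function<Fn(const Fn&)>; // approximation scheme: function -> approximant

// Build an approximation scheme on [a,b] from a reference scheme Ahat on [-1,1]
// according to (6.1.22): A(f) = (Phi^*)^{-1}( Ahat( Phi^* f ) ).
Scheme transplant(const Scheme& Ahat, double a, double b) {
  return [=](const Fn& f) -> Fn {
    // pullback Phi^* f : [-1,1] -> K,  (Phi^* f)(that) = f(Phi(that))
    Fn fhat = [=](double that) { return f(a + 0.5 * (that + 1.0) * (b - a)); };
    Fn phat = Ahat(fhat);               // approximate on the reference interval
    // push forward: (A f)(t) = phat( Phi^{-1}(t) ),  Phi^{-1}(t) = 2(t-a)/(b-a) - 1
    return [=](double t) { return phat(2.0 * (t - a) / (b - a) - 1.0); };
  };
}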
Thm. 6.1.15 targets only the special interval [−1, 1]. What does it imply for polynomial best approximation
on a general interval [ a, b]? To answer this question we apply techniques from Rem. 6.1.18, in particular
the pullback (6.1.20).
We first have to study the change of norms of functions under the action of affine pullbacks:
Proof. The first estimate should be evident, and the second is a consequence of the transformation formula for integrals [?, Satz 6.1.5]:
((b − a)/2) ∫_{−1}^{1} (Φ*f)(t̂) dt̂ = ∫_a^b f(t) dt .   (6.1.26)
Thus, for the norms of the approximation errors of polynomial approximation schemes defined by affine transformation as in (6.1.22), we find for all f ∈ C⁰([a, b]):
‖f − A f‖_{L∞([a,b])} = ‖Φ*f − Â(Φ*f)‖_{L∞([−1,1])} ,
‖f − A f‖_{L²([a,b])} = ( (b − a)/2 )^{1/2} ‖Φ*f − Â(Φ*f)‖_{L²([−1,1])} .   (6.1.27)
Equipped with approximation error estimates for Â, we can infer corresponding estimates for A.
The bounds for approximation errors often involve norms of derivatives as in Thm. 6.1.15. Hence, it is
important to understand the interplay of pullback and differentiation: By the 1D chain rule
(d/dt̂)(Φ*f)(t̂) = (df/dt)(Φ(t̂)) · (dΦ/dt̂)(t̂) = (df/dt)(Φ(t̂)) · ½(b − a) ,
which implies a simple scaling rule for derivatives of arbitrary order r ∈ N₀:
(Φ*f)^{(r)} = ( (b − a)/2 )^r Φ*( f^{(r)} ) .   (6.1.28)
Together with Lemma 6.1.24 this gives
‖(Φ*f)^{(r)}‖_{L∞([−1,1])} = ( (b − a)/2 )^r ‖f^{(r)}‖_{L∞([a,b])} ,   f ∈ C^r([a, b]), r ∈ N₀ .   (6.1.29)
The estimate (6.1.28) together with Thm. 6.1.15 paves the way for bounding the polynomial best approxi-
mation error on arbitrary intervals [ a, b], a, b ∈ R . Based on the affine mapping Φ : [−1, 1] → [ a, b] from
(6.1.19) and writing Φ∗ for the pullback according to (6.1.20) we can chain estimates. If f ∈ Cr ([ a, b])
and n ≥ r, then
inf_{p∈P_n} ‖f − p‖_{L∞([a,b])} =(∗) inf_{p∈P_n} ‖Φ*f − p‖_{L∞([−1,1])}
  ≤ (Thm. 6.1.15)  (1 + π²/2)^r ( (n − r)!/n! ) ‖(Φ*f)^{(r)}‖_{L∞([−1,1])}
  = ((6.1.28))  (1 + π²/2)^r ( (n − r)!/n! ) ( (b − a)/2 )^r ‖f^{(r)}‖_{L∞([a,b])} .
In step (∗) we used the result of Lemma 6.1.21 that Φ∗ p ∈ Pn for all p ∈ Pn . Invoking the arguments
that gave us (6.1.17), we end up with the simpler bound
inf_{p∈P_n} ‖f − p‖_{L∞([a,b])} ≤ C(r) ( (b − a)/n )^r ‖f^{(r)}‖_{L∞([a,b])} .   (6.1.31)
Observe that the length of the interval enters the bound in r-th power.
Already Thm. 6.1.15 considered the size of the best approximation error in P_n as a function of the polynomial degree n. In the same vein, we may study a family of Lagrange interpolation schemes {L_{T_n}}_{n∈N₀} on I ⊂ R induced by a family of node sets {T_n}_{n∈N₀}, T_n ⊂ I, according to Def. 6.1.32.
An example for such a family of node sets on I := [a, b] are the equidistant or equispaced nodes
T_n := { t_j^{(n)} := a + (b − a) j/n : j = 0, ..., n } ⊂ I .   (6.1.34)
For families of Lagrange interpolation schemes {L_{T_n}}_{n∈N₀} we can shift the focus onto estimating the asymptotic behavior of the norm of the interpolation error for n → ∞.
In the numerical experiment the norms of the interpolation errors can be computed only approximately as follows:
• L∞-norm: approximated by sampling on a grid of meshsize π/1000.
• L²-norm: numerical quadrature (→ Chapter 7) with the trapezoidal rule (7.4.4) on a grid of meshsize π/1000.
(Fig. 212: approximate error norms ‖f − L_{T_n} f‖_*, * = 2, ∞, plotted against the polynomial degree n, semi-logarithmic scale.)
In the previous experiment we observed a clearly visible regular behavior of k f − LTn f k as we increased
the polynomial degree n. The prediction of the decay law for k f − LTn f k for n → ∞ is one goal in the
study of interpolation errors.
Often this goal can be achieved even if a rigorous quantitative bound for a norm of the interpolation error remains elusive. In other words, in many cases no bound for ‖f − L_{T_n} f‖ can be given, but its decay for increasing n can be described precisely.
Now we introduce some important terminology for the qualitative description of the behavior of ‖f − L_{T_n} f‖ as a function of the polynomial degree n. We assume a bound of the form
‖f − L_{T_n} f‖ ≤ T(n) .   (6.1.37)
Writing T(n) for this bound of the norm of the interpolation error, we distinguish the following types of asymptotic behavior:
• Algebraic convergence: T(n) ≤ C n^{−p} with rate p > 0,
• Exponential convergence: T(n) ≤ C q^n with 0 ≤ q < 1,
with C > 0 independent of n. The bounds are assumed to be sharp in the sense that no bounds with larger rate p (for algebraic convergence) or smaller q (for exponential convergence) can be found.
Convergence behavior of norms of the interpolation error is often expressed by means of the Landau O-notation, cf. Def. 1.4.5:
Algebraic convergence: ‖f − I_T f‖ = O(n^{−p}) ,   Exponential convergence: ‖f − I_T f‖ = O(q^n) ,   for n → ∞ (“asymptotic!”).
Apply linear regression from Ex. 3.1.5 for data points (log ni , log ǫi ) ➣ least squares estimate for rate p.
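Such a least squares estimate of the rate p can be computed, e.g., as follows (a sketch; the model log ε ≈ c − p log n is fitted with a QR-based least squares solve):

#include <Eigen/Dense>
#include <cmath>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Estimate the algebraic convergence rate p from error samples eps_i measured
// for polynomial degrees n_i: fit log(eps) ~ c - p*log(n) in the least squares sense.
double estimateRate(const VectorXd& n, const VectorXd& eps) {
  const int m = n.size();
  MatrixXd A(m, 2);
  VectorXd b(m);
  for (int i = 0; i < m; ++i) {
    A(i, 0) = 1.0;                 // constant term c
    A(i, 1) = std::log(n(i));      // coefficient multiplying -p
    b(i) = std::log(eps(i));
  }
  VectorXd x = A.colPivHouseholderQr().solve(b);   // least squares fit
  return -x(1);                                    // estimated rate p
}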
C++11 code 6.1.42: Computing the interpolation error for Runge’s example
(Figure: the Runge function 1/(1 + x²) and its interpolating polynomial on [−5, 5].)
Observation: strong oscillations of I_T f near the endpoints of the interval, which seem to cause
‖f − L_{T_n} f‖_{L∞(]−5,5[)} → ∞   for n → ∞ .
Though polynomials possess great power to approximate functions, see Thm. 6.1.15 and Thm. 6.1.6, here
polynomial interpolants fail completely. Approximation theorists even discovered the following “negative
result”:
Given a sequence of meshes of increasing size {T_n}_{n=1}^{∞}, T_n = {t_0^{(n)}, ..., t_n^{(n)}} ⊂ [a, b], a ≤ t_0^{(n)} < t_1^{(n)} < ··· < t_n^{(n)} ≤ b, there exists a continuous function f such that the sequence of interpolating polynomials (L_{T_n} f)_{n=1}^{∞} does not converge to f uniformly as n → ∞.
Now we aim to establish bounds for the supremum norm of the interpolation error of Lagrangian interpo-
lation similar to the result of Thm. 6.1.15.
Theorem 6.1.44. Representation of interpolation error [?, Thm. 8.22], [?, Thm. 37.4]
We consider f ∈ Cn+1 ( I ) and the Lagrangian interpolation approximation scheme (→ Def. 6.1.32)
for a node set T := {t0 , . . . , tn } ⊂ I . Then,
for every t ∈ I there exists a τ_t ∈ ] min{t, t_0, ..., t_n}, max{t, t_0, ..., t_n} [ such that
f(t) − L_T(f)(t) = ( f^{(n+1)}(τ_t) / (n+1)! ) · ∏_{j=0}^{n} (t − t_j) .   (6.1.45)
Proof. Write w_T(t) := ∏_{j=0}^{n} (t − t_j) ∈ P_{n+1} and fix t ∈ I ∖ T.
(with m := n+1)  ⇒  ∃ τ_t ∈ I :  φ^{(n+1)}(τ_t) = f^{(n+1)}(τ_t) − c (n+1)! = 0 .
This fixes the value of c = f^{(n+1)}(τ_t)/(n+1)!, and by (6.1.46) this amounts to the assertion of the theorem.
✷
For f ∈ C^{n+1}(I) let I_T ∈ P_n stand for the unique Lagrange interpolant (→ Thm. 5.2.14) of f in the node set T := {t_0, ..., t_n} ⊂ I. Then for all t ∈ I the interpolation error is
f(t) − I_T(f)(t) = ∫_0^1 ∫_0^{τ_1} ··· ∫_0^{τ_{n−1}} ∫_0^{τ_n} f^{(n+1)}(. . .) dτ dτ_n ··· dτ_1 · ∏_{j=0}^{n} (t − t_j) .
Proof. By induction on n, use (5.2.34) and the fundamental theorem of calculus [?, Sect. 3.1].
✷
A result analogous top Lemma 6.1.48 holds also for general polynomial interpolation with multiple nodes
as defined in (5.2.22).
Lemma 6.1.48 provides an exact formula (6.1.45) for the interpolation error. From it we can derive esti-
mates for the supremum norm of the interpolation error on the interval I as follows:
➊ first bound the derivative factor via |f^{(n+1)}(τ_t)| ≤ \|f^{(n+1)}\|_{L^∞(I)} (the resulting bound no longer depends on the unknown intermediate point τ_t),
➋ then increase the right hand side further by switching to the maximum (in modulus) w.r.t. t (the resulting bound no longer depends on t!),
➌ and, finally, take the maximum w.r.t. t on the left of ≤.
This yields the following interpolation error estimate for degree-n Lagrange interpolation on the node set \{t_0, ..., t_n\}:

Thm. 6.1.44 \;⇒\; \|f - L_T f\|_{L^∞(I)} \le \frac{\|f^{(n+1)}\|_{L^∞(I)}}{(n+1)!} \max_{t∈I} |(t - t_0) \cdots (t - t_n)| .   (6.1.50)
The estimate (6.1.50) hinges on bounds for (higher) derivatives of the interpoland f , which, essentially,
should belong to Cn+1 ( I ). The same can be said about the estimate of Thm. 6.1.15.
This reflects a general truth about estimates of norms of the interpolation error:
Now we are in a position to give a theoretical explanation for the exponential convergence observed for polynomial interpolation of f(t) = sin(t) on equidistant nodes: by Lemma 6.1.48 and (6.1.50),

\|f^{(k)}\|_{L^∞(I)} \le 1 \;\; ∀ k ∈ \mathbb{N}_0 \quad⇒\quad \|f - p\|_{L^∞(I)} \le \frac{1}{(1+n)!} \max_{t∈I} \big| (t - 0)(t - \tfrac{π}{n})(t - \tfrac{2π}{n}) \cdots (t - π) \big| \le \frac{1}{n+1} \Big( \frac{π}{n} \Big)^{n+1} .

➙ Uniform asymptotic (even more than) exponential convergence of the interpolation polynomials (independently of the set of nodes T. In fact, \|f - p\|_{L^∞(I)} decays even faster than exponentially!)
How can the blow-up of the interpolation error observed in Ex. 6.1.41 be reconciled with Lemma 6.1.48? For f(t) = \frac{1}{1+t^2} one can only conclude |f^{(n)}(t)| = 2^n n! \cdot O(|t|^{-2-n}) for n → ∞.
➙ Possible blow-up of the error bound from Thm. 6.1.44: it may tend to ∞ for n → ∞.
Thm. 6.1.44 gives error estimates for the L^∞-norm. What about other norms? From Lemma 6.1.48, using the Cauchy–Schwarz inequality

\Big| \int_a^b f(t)\, g(t)\, dt \Big|^2 \le \int_a^b |f(t)|^2\, dt \; \int_a^b |g(t)|^2\, dt \quad ∀ f, g ∈ C^0([a, b]) ,   (6.1.55)

we obtain

\|f - L_T(f)\|_{L^2(I)}^2 = \int_I \Big( \int_0^1\!\!\int_0^{\tau_1}\!\!\cdots\!\!\int_0^{\tau_{n-1}}\!\!\int_0^{\tau_n} f^{(n+1)}(\dots)\, d\tau\, d\tau_n \cdots d\tau_1 \cdot \prod_{j=0}^{n} (t - t_j) \Big)^2 dt \qquad [\, |t - t_j| \le |I| \,]

\le |I|^{2n+2} \int_I \underbrace{\operatorname{vol}^{(n+1)}(S_{n+1})}_{= 1/(n+1)!} \int_{S_{n+1}} |f^{(n+1)}(\dots)|^2\, d\tau\; dt

= \frac{|I|^{2n+2}}{(n+1)!} \int_I \int_I \underbrace{\operatorname{vol}^{(n)}(C_{t,\tau})}_{\le 2^{(n-1)/2}/n!} |f^{(n+1)}(\tau)|^2\, d\tau\, dt .

The Lebesgue constant

λ_T := \|I_T\|_{∞→∞} := \sup_{y ∈ \mathbb{R}^{n+1} \setminus \{0\}} \frac{\|I_T(y)\|_{L^∞(I)}}{\|y\|_∞}

establishes an important connection between the norms of the interpolation error and of the best approximation error.
We first observe that the polynomial approximation scheme LT induced by IT preserves polynomials of
degree ≤ n:
LT p = IT [ p(t)] t∈T = p ∀ p ∈ Pn . (6.1.58)
Thus, by the triangle inequality, for a generic norm on C^0(I) and \|L_T\| designating the associated operator norm of the linear mapping L_T, cf. (5.2.70),

\|f - L_T f\| \overset{(6.1.58)}{=} \|(f - p) - L_T(f - p)\| \le (1 + \|L_T\|)\, \|f - p\| \quad ∀ p ∈ \mathcal{P}_n .

Note that for \|\cdot\| = \|\cdot\|_{L^∞(I)}, since \|[f(t)]_{t∈T}\|_∞ ≤ \|f\|_{L^∞(I)}, we can estimate the operator norm by λ_T, cf. (5.2.70), and obtain

\|f - L_T f\|_{L^∞(I)} \le (1 + λ_T) \inf_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)} \quad ∀ f ∈ C^0(I) .   (6.1.61)
Hence, if a bound for λT is available, the best approximation error estimate of Thm. 6.1.15 immediately
yields interpolation error estimates.
Exponential convergence can often be observed for families of Lagrangian approximation schemes, when
they are applied to an analytic interpoland.
The mathematical area of complex analysis (→ course in the BSc program CSE) studies analytic functions. Analyticity gives access to powerful tools provided by complex analysis. One of these tools is the residue theorem.
• Note that the integral \int_γ in Thm. 6.1.64 is a path integral in the complex plane (“contour integral”): If the path of integration γ is described by a parameterization τ ∈ J \mapsto γ(τ) ∈ \mathbb{C}, J ⊂ \mathbb{R}, then

\int_γ f(z)\, dz := \int_J f(γ(τ)) \cdot \dot γ(τ)\, dτ ,   (6.1.65)

where \dot γ designates the derivative of γ with respect to the parameter, and · indicates multiplication in \mathbb{C}. For contour integrals we have the estimate

\Big| \int_γ f(z)\, dz \Big| \le |γ| \max_{z∈γ} |f(z)| .   (6.1.66)
• Π often stands for the set of poles of f , that is, points where “ f attains the value ∞”.
The residue theorem is very useful, because there are simple formulas for \operatorname{res}_p f: let g and h be complex valued functions that are both analytic in a neighborhood of p ∈ \mathbb{C}, and satisfy h(p) = 0, h'(p) ≠ 0. Then

\operatorname{res}_p \frac{g}{h} = \frac{g(p)}{h'(p)} .
Assumption 6.1.68. Analyticity of interpoland
We assume that the interpoland f : I → \mathbb{C} can be extended to a function f : D ⊂ \mathbb{C} → \mathbb{C}, which is analytic (→ Def. 6.1.63) on the open set D ⊂ \mathbb{C} with [a, b] ⊂ D.

[Fig. 215: a domain D in the complex plane containing [a, b] and the nodes t_0, t_1, ..., together with a closed integration contour γ ⊂ D winding around [a, b].]
Key is the following representation of the Lagrange polynomials (5.2.11) for the node set T = \{t_0, ..., t_n\}:

L_j(t) = \prod_{k=0, k≠j}^{n} \frac{t - t_k}{t_j - t_k} = \frac{w(t)}{(t - t_j) \prod_{k=0, k≠j}^{n} (t_j - t_k)} = \frac{w(t)}{(t - t_j)\, w'(t_j)} ,   (6.1.69)

where w(t) = (t - t_0) \cdots (t - t_n) ∈ \mathcal{P}_{n+1}.
Consider the following parameter dependent function g_t, whose set of poles in D is Π = \{t, t_0, ..., t_n\}:

g_t(z) := \frac{f(z)}{(z - t)\, w(z)} , \quad z ∈ \mathbb{C} \setminus Π ,\; t ∈ [a, b] \setminus \{t_0, ..., t_n\} .

Apply the residue theorem Thm. 6.1.64 to g_t and a closed path of integration γ ⊂ D winding once around [a, b], such that its interior is simply connected, see the magenta curve in Fig. 215:

\frac{1}{2πı} \int_γ g_t(z)\, dz \;\overset{\text{Lemma 6.1.67}}{=}\; \operatorname{res}_t g_t + \sum_{j=0}^{n} \operatorname{res}_{t_j} g_t = \frac{f(t)}{w(t)} + \sum_{j=0}^{n} \frac{f(t_j)}{(t_j - t)\, w'(t_j)} ,

so that

f(t) = - \sum_{j=0}^{n} f(t_j)\, \underbrace{\frac{w(t)}{(t_j - t)\, w'(t_j)}}_{=\,-\text{Lagrange polynomial!}} \;+\; \underbrace{\frac{w(t)}{2πı} \int_γ g_t(z)\, dz}_{\text{interpolation error!}} ,   (6.1.70)

where the first sum is the polynomial interpolant of f.
This is another representation formula for the interpolation error, an alternative to that of Thm. 6.1.44 and
Lemma 6.1.48. We conclude that for all t ∈ [ a, b]
In a concrete setting, in order to exploit the estimate (6.1.71) to study the n-dependence of the supremum norm of the interpolation error, we need to know
• an upper bound for |w(t)| for a ≤ t ≤ b,
• a lower bound for |w(z)|, z ∈ γ, for a suitable path of integration γ ⊂ D,
• a lower bound for the distance of the path γ and the interval [a, b] in the complex plane.
The subset of \mathbb{C} on which a function f given by a formula is analytic can often be determined without computing derivatives, using the following consequence of the chain rule: if f is analytic on D ⊂ \mathbb{C} and g is analytic on U ⊂ \mathbb{C}, then the composition f ∘ g is analytic on \{z ∈ U : g(z) ∈ D\}.
As pointed out in § 6.0.6, when we build approximation schemes from interpolation schemes, we have
the extra freedom to choose the sampling points (= interpolation nodes). Now, based on the insight into
the structure of the interpolation error gained from Thm. 6.1.44, we seek to choose “optimal” sampling
points. They will give rise to the so-called Chebychev polynomial approximation schemes, also known as
Chebychev interpolation.
Setting:
✦ Without loss of generality (→ Rem. 6.1.18): I = [−1, 1],
✦ interpoland f : I → R at least continuous, f ∈ C0 ( I ),
✦ set of interpolation nodes T := {−1 ≤ t0 < t1 < · · · < tn−1 < tn ≤ 1}, n ∈ N.
Recall Thm. 6.1.44:

\|f - L_T f\|_{L^∞(I)} \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_{L^∞(I)} \|w\|_{L^∞(I)} , \qquad w(t) := (t - t_0) \cdots (t - t_n) .

This suggests choosing the interpolation nodes such that \|w\|_{L^∞(I)} becomes as small as possible.
Are there polynomials satisfying these requirements? If so, do they allow a simple characterization?
Proof. Just use the trigonometric identity cos(n + 1) x = 2 cos nx cos x − cos(n − 1) x with cos x = t.
✷
The theorem implies:
• T_n ∈ \mathcal{P}_n,
• their leading coefficients are equal to 2^{n-1},
• the T_n are linearly independent,
• \{T_j\}_{j=0}^{n} is a basis of \mathcal{P}_n = \operatorname{Span}\{T_0, ..., T_n\}, n ∈ \mathbb{N}_0.
See Code 6.1.79 for algorithmic use of the 3-term recursion (6.1.78).
[Fig. 217, 218: graphs of the Chebychev polynomials T_n(t) on [-1, 1]; left: small degrees (n = 0, 1, ...), right: n = 5, ..., 9.]
C++11 code 6.1.80: Plotting Chebychev polynomials, see Fig. 217, 218
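The body of this listing is omitted in this excerpt (the original uses the MathGL Figure class for the actual plotting). A minimal sketch that merely tabulates the Chebychev polynomials via the 3-term recursion (6.1.78), T_{k+1}(t) = 2t T_k(t) - T_{k-1}(t), might look as follows; names and the choice of sample points are illustrative only.

#include <Eigen/Dense>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::RowVectorXd;

int main() {
  const int n = 9;                                  // maximal degree
  const int N = 11;                                 // number of sample points (coarse, for printing)
  RowVectorXd t = RowVectorXd::LinSpaced(N, -1.0, 1.0);
  MatrixXd T(n + 1, N);                             // row k <-> values of T_k on the grid
  T.row(0).setOnes();                               // T_0 = 1
  T.row(1) = t;                                     // T_1(t) = t
  for (int k = 1; k < n; ++k)
    T.row(k + 1) = 2.0 * t.cwiseProduct(T.row(k)) - T.row(k - 1);  // 3-term recursion
  std::cout << T << std::endl;                      // each row could now be plotted
}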
From Def. 6.1.76 we conclude that T_n attains the values ±1 in its extrema with alternating signs, thus matching our heuristic demands:

|T_n(t)| = 1 \;⇔\; ∃ k ∈ \{0, ..., n\}: \; t = \cos\frac{kπ}{n} , \qquad \|T_n\|_{L^∞([-1,1])} = 1 .   (6.1.81)
What is still open is the validity of the heuristics guiding the choice of the optimal nodes. The next funda-
mental theorem will demonstrate that, after scaling, the Tn really supply polynomials on [−1, 1] with fixed
leading coefficient and minimal supremum norm.
Theorem 6.1.82. Minimax property of the Chebychev polynomials [?, Section 7.1.4.], [?,
Thm. 32.2]
The polynomials T_n from Def. 6.1.76 minimize the supremum norm in the following sense:

\|T_n\|_{L^∞([-1,1])} = \inf\big\{ \|p\|_{L^∞([-1,1])} : p ∈ \mathcal{P}_n ,\; p(t) = 2^{n-1} t^n + \dots \big\} ,\; n ∈ \mathbb{N} .

The zeros of T_n are

t_k = \cos\Big( \frac{2k+1}{2n} π \Big) ,\; k = 0, ..., n-1 .   (6.1.84)
When we use Chebychev nodes for polynomial interpolation we call the resulting Lagrangian approximation scheme Chebychev interpolation. On the interval [-1, 1] it is characterized by:
• “optimal” interpolation nodes T = \big\{ \cos\big( \frac{2k+1}{2(n+1)} π \big) ,\; k = 0, ..., n \big\},
• w(t) = (t - t_0) \cdots (t - t_n) = 2^{-n} T_{n+1}(t), with leading coefficient 1, \|w\|_{L^∞(I)} = 2^{-n}.
Then, by Thm. 6.1.44, we immediately get an interpolation error estimate for Chebychev interpolation of f ∈ C^{n+1}([-1, 1]):

\|f - I_T(f)\|_{L^∞([-1,1])} \le \frac{2^{-n}}{(n+1)!} \|f^{(n+1)}\|_{L^∞([-1,1])} .   (6.1.85)
Following the recipe of Rem. 6.1.18 Chebychev interpolation on an arbitrary interval [ a, b] can immediately
be defined. The same polynomial Lagrangian approximation scheme is obtained by transforming the
Chebychev nodes (6.1.84) from [−1, 1] to [ a, b] using the unique affine transformation (6.1.19):
The Chebychev nodes in the interval I = [a, b] are

t_k := a + \tfrac{1}{2}(b - a) \Big( \cos\Big( \frac{2k+1}{2(n+1)} π \Big) + 1 \Big) ,\; k = 0, ..., n .   (6.1.87)

With the transformation formula for the integrals and \frac{d^n \hat f}{d\hat t^n}(\hat t) = (\tfrac{1}{2}|I|)^n \frac{d^n f}{dt^n}(t) we obtain

\|f - I_T(f)\|_{L^∞(I)} = \|\hat f - I_{\hat T}(\hat f)\|_{L^∞([-1,1])} \le \frac{2^{-n}}{(n+1)!} \Big\| \frac{d^{n+1}\hat f}{d\hat t^{n+1}} \Big\|_{L^∞([-1,1])} \le \frac{2^{-2n-1}}{(n+1)!} |I|^{n+1} \|f^{(n+1)}\|_{L^∞(I)} .   (6.1.88)
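As a small illustration of (6.1.87), a possible helper generating the Chebychev nodes on an arbitrary interval [a, b] is sketched below; the function name chebnodes is not taken from the lecture codes.

#include <Eigen/Dense>
#include <cmath>

// Returns the n+1 Chebychev nodes (6.1.87) for polynomial degree n on [a,b]
Eigen::VectorXd chebnodes(int n, double a, double b) {
  const double PI = std::acos(-1.0);
  Eigen::VectorXd t(n + 1);
  for (int k = 0; k <= n; ++k)
    t(k) = a + 0.5 * (b - a) * (std::cos((2.0 * k + 1.0) / (2.0 * (n + 1)) * PI) + 1.0);
  return t;
}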
We consider Runge's function f(t) = \frac{1}{1+t^2}, see Ex. 6.1.41, and compare polynomial interpolation based on uniformly spaced nodes and Chebychev nodes in terms of the behavior of the interpolants.

[Fig. 221: f and its interpolating polynomial in equidistant nodes on [-5, 5]; Fig. 222: f and its Chebychev interpolation polynomial on [-5, 5].]
We saw in Rem. 5.2.74 that the Lebesgue constant λT that measures the sensitivity of a polynomial
interpolation scheme, blows up exponentially with increasing number of equispaced interpolation nodes.
In stark contrast λT grows only logarithmically in the number of Chebychev nodes.
One can show the bound

λ_T \le \frac{2}{π} \log(1 + n) + 1 .   (6.1.91)

[Fig.: measured Lebesgue constant λ_T for Chebychev nodes, based on approximate evaluation of (5.2.72) by sampling, plotted against the polynomial degree n.]
Combining (6.1.61),

\|f - L_T f\|_{L^∞(I)} \le (1 + λ_T) \inf_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)} \quad ∀ f ∈ C^0(I) ,   (6.1.61)

and the bound for the polynomial best approximation error from Thm. 6.1.15,

\inf_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞([-1,1])} \le (1 + π^2/2)^r \frac{(n-r)!}{n!} \|f^{(r)}\|_{L^∞([-1,1])} ,

we end up with a bound for the supremum norm of the interpolation error in the case of Chebychev interpolation on [-1, 1]:

\|f - L_T f\|_{L^∞([-1,1])} \le \big( 2/π \log(1+n) + 2 \big) (1 + π^2/2)^r \frac{(n-r)!}{n!} \|f^{(r)}\|_{L^∞([-1,1])} .   (6.1.92)
Now we empirically investigate the behavior of norms of the interpolation error for Chebychev interpolation
and functions with different (smoothness) properties as we increase the number of interpolation nodes.
➀ f(t) = \frac{1}{1+t^2}, I = [-5, 5] (Runge's function, cf. Ex. 6.1.41).

[Fig. 223: f and its Chebychev interpolation polynomial (left); L^∞- and L²-norms of the interpolation error versus the polynomial degree n (right, semi-logarithmic scale).]

Observation: p_n → f ,  \|f - I_n f\|_{L^∞([-5,5])} ≈ 0.8^n .
➁ f (t) = max{1 − |t|, 0}, I = [−2, 2], n = 10 nodes (plot on the left).
Now f ∈ C^0(I) but f ∉ C^1(I).

[Fig. 224: f and its Chebychev interpolation polynomial; Fig. 225, 226: L^∞- and L²-norms of the interpolation error versus the polynomial degree n (semi-logarithmic and doubly logarithmic scale).]

Observations:
• no exponential convergence,
• algebraic convergence (?)
➂ f(t) = \begin{cases} \tfrac{1}{2}(1 + \cos πt) , & |t| < 1 , \\ 0 , & 1 ≤ |t| ≤ 2 , \end{cases} \qquad I = [-2, 2], n = 10 (plot on the left).

[Fig.: f and its Chebychev interpolation polynomial (left); L^∞- and L²-norms of the interpolation error versus the polynomial degree n (right, doubly logarithmic scale).]
✦ for analytic f ∈ C^∞ (→ Def. 6.1.63) the approximation error of the Chebychev interpolant seems to decay to zero exponentially in the polynomial degree n.

Assuming that the interpoland f possesses an analytic extension to a complex neighborhood D of [-1, 1], we now apply the theory of § 6.1.62 to bound the supremum norm of the Chebychev interpolation error of f on [-1, 1].

To convert the estimate (6.1.71), as obtained in § 6.1.62, into a more concrete estimate, we have to study the behavior of

w_n(t) = (t - t_0)(t - t_1) \cdots (t - t_n) , \quad t_k = \cos\Big( \frac{2k+1}{2n+2} π \Big) ,\; k = 0, ..., n ,

where the t_k are the Chebychev nodes according to (6.1.87). They are the zeros of the Chebychev polynomial (→ Def. 6.1.76) of degree n + 1. Since w has leading coefficient 1, we conclude w = 2^{-n} T_{n+1}.
Thus, we see that γ is an ellipse with foci ±1, large axis \tfrac{1}{2}(ρ + ρ^{-1}) > 1 and small axis \tfrac{1}{2}(ρ - ρ^{-1}) > 0.

[Fig. 229: the elliptic contours γ in the complex plane for ρ = 1, 1.2, 1.4, 1.6, 1.8, 2.]

Appealing to geometric evidence, we find dist(γ, [-1, 1]) = \tfrac{1}{2}(ρ + ρ^{-1}) - 1, which gives another term in (6.1.71).

The rationale for choosing this particular integration contour is that the cos in its definition nicely cancels the arccos in the formula for the Chebychev polynomials. This lets us compute (s := n + 1)
for all 0 ≤ θ ≤ 2π, which provides a lower bound for |w_n| on γ. Plugging all these estimates into (6.1.71) we arrive at

\|f - L_T f\|_{L^∞([-1,1])} \le \frac{2|γ|}{π (ρ^{n+1} - 1)(ρ + ρ^{-1} - 2)} \cdot \max_{z∈γ} |f(z)| .   (6.1.98)

Note that instead of the nodal polynomial w we have inserted T_{n+1} into (6.1.71), of which w is a simple multiple. The factor will cancel.
[Fig. 230: L^∞- and L²-norms of the Chebychev interpolation error versus the polynomial degree n.]

(Faster) exponential convergence than on the interval I = ]-5, 5[:  \|f - I_n f\|_{L^2([-1,1])} ≈ 0.42^n .
Explanation, cf. Rem. 6.1.96: for I = [−1, 1] the poles ±i of f are farther away relative to the size of the
interval than for I = [−5, 5].
We recover the point value p(x) as the point value of another polynomial of degree n - 1 with known Chebychev expansion:

p(x) = \sum_{j=0}^{n-1} \tilde α_j T_j(x) \quad\text{with}\quad \tilde α_j = \begin{cases} α_j + 2xα_{j+1} , & \text{if } j = n-1 , \\ α_j - α_{j+2} , & \text{if } j = n-2 , \\ α_j & \text{else.} \end{cases}   (6.1.103)
C++11 code 6.1.105: Clenshaw algorithm for evaluation of Chebychev expansion (6.1.101)

// Clenshaw algorithm for evaluating p = \sum_{j=1}^{n+1} a_j T_{j-1}
// at the points passed in the vector x
// IN : a = (α_j), coefficients for p = \sum_{j=1}^{n+1} α_j T_{j-1}
//      x = (many) evaluation points
// OUT: values p(x_j) for all j
VectorXd clenshaw(const VectorXd& a, const VectorXd& x) {
  const int n = a.size() - 1;    // degree of polynomial
  MatrixXd d(n + 1, x.size());   // temporary storage for intermediate values
  for (int c = 0; c < x.size(); ++c) d.col(c) = a;
  for (int j = n - 1; j > 0; --j) {
    d.row(j) += 2 * x.transpose().cwiseProduct(d.row(j + 1));  // see (6.1.103)
    d.row(j - 1) -= d.row(j + 1);
  }
  return d.row(0) + x.transpose().cwiseProduct(d.row(1));
}
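A possible invocation of clenshaw() could look as follows; the coefficient values are arbitrary and serve only as an example.

#include <Eigen/Dense>
#include <iostream>
// assumes the function clenshaw() from Code 6.1.105 above
int main() {
  Eigen::VectorXd a(3), x(3);
  a << 1.0, 0.5, 0.25;   // arbitrary coefficients alpha_0, alpha_1, alpha_2
  x << -1.0, 0.0, 1.0;   // evaluation points
  std::cout << clenshaw(a, x).transpose() << std::endl;   // values of p at the points
}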
Chebychev interpolation is a linear interpolation scheme, see § 5.1.13. Thus, the expansion α j in (6.1.101)
can be computed by solving a linear system of equations of the form (5.1.15). However, for Chebychev
interpolation this linear system can be cast into a very special form, which paves the way for its fast direct
solution:
Task: Efficiently compute the Chebychev expansion coefficients α_j in (6.1.101) from the interpolation conditions

p(t_k) = f(t_k) ,\; k = 0, ..., n , \quad\text{for the Chebychev nodes}\quad t_k := \cos\Big( \frac{2k+1}{2(n+1)} π \Big) .   (6.1.107)
Trick: transformation of p into a 1-periodic function, which turns out to be a Fourier sum (= finite Fourier series):

q(s) := p(\cos 2πs) \overset{\text{Def. 6.1.76}}{=} \sum_{j=0}^{n} α_j T_j(\cos 2πs) = \sum_{j=0}^{n} α_j \cos(2πjs)
      = \sum_{j=0}^{n} \tfrac{1}{2} α_j \big( \exp(2πıjs) + \exp(-2πıjs) \big) \qquad [\text{ by } \cos z = \tfrac{1}{2}(e^{ız} + e^{-ız}) ]
      = \sum_{j=-n}^{n+1} β_j \exp(-2πıjs) , \quad\text{with}\quad β_j := \begin{cases} 0 , & \text{for } j = n+1 , \\ \tfrac{1}{2} α_j , & \text{for } j = 1, ..., n , \\ α_0 , & \text{for } j = 0 , \\ \tfrac{1}{2} α_{-j} , & \text{for } j = -n, ..., -1 . \end{cases}   (6.1.108)
Transformed interpolation conditions (6.1.107) for q:

t = \cos 2πs \;\overset{(6.1.107)}{\Longrightarrow}\; q\Big( \frac{2k+1}{4(n+1)} \Big) = y_k := f(t_k) ,\; k = 0, ..., n .   (6.1.109)

This is an interpolation problem for equidistant points on the unit circle as we have seen them in Section 5.6.3. Since q(s) = q(1 - s), (6.1.109) also enforces

q\Big( 1 - \frac{2k+1}{4(n+1)} \Big) = y_k ,\; k = 0, ..., n .

[Fig. 231, 232: the resulting 2(n+1) interpolation points for q in [0, 1], located symmetrically with respect to s = 1/2.]
Trigonometric interpolation at equidistant points can be done very efficiently by means of FFT-based al-
gorithms, see Code 5.6.15. We can also apply these for the computation of Chebychev expansion coeffi-
cients.
q\Big( \frac{k}{2(n+1)} + \frac{1}{4(n+1)} \Big) = \sum_{j=-n}^{n+1} β_j \exp\Big( -\frac{2πıj}{4(n+1)} \Big) \exp\Big( -\frac{2πı}{2(n+1)} kj \Big) = z_k

\Updownarrow

\sum_{j=0}^{2n+1} β_{j-n} \exp\Big( -\frac{2πı(j-n)}{4(n+1)} \Big) \underbrace{\exp\Big( -\frac{2πı}{2(n+1)} kj \Big)}_{= ω_{2(n+1)}^{kj} !} = \exp\Big( -πı \frac{nk}{n+1} \Big) z_k ,\; k = 0, ..., 2n+1 .

\Updownarrow

F_{2(n+1)}\, \mathbf{c} = \mathbf{b} \quad\text{with}\quad \mathbf{c} = \Big[ β_{j-n} \exp\Big( -\tfrac{2πı(j-n)}{4(n+1)} \Big) \Big]_{j=0}^{2n+1} ,\; \mathbf{b} = \Big[ \exp\Big( -πı \tfrac{nk}{n+1} \Big) z_k \Big]_{k=0}^{2n+1} .   (6.1.111)
Computers use approximation by sums of Chebychev polynomials in the computation of functions like
log, exp, sin, cos, . . .. The evaluation by means of Clenshaw algorithm according to Code 6.1.105 is more
efficient and stable than the approximation by Taylor polynomials.
There is a particular family of norms for which the best approximant of a function f in a finite dimensional
function space VN , that is, the element of VN that is closest to f with respect to that particular norm can
actually be computed. It turns out that this computation boils down to solving a kind of least squares
problem, similar to the least squares problems in K n discussed in Chapter 3.
Concerning mean square best approximation it is useful to learn an abstract framework first into which the
concrete examples can be fit later.
Mean square norms generalize the Euclidean norm on K n , see [?, Sect. 4.4]. In a sense, they endow a
vector space with a geometry and give a meaning to concepts like “orthogonality”.
Let V be a vector space over the field K. A mapping b : V × V → K is called an inner product on
V , if it satisfies
(i) b is linear in the first argument: b(αv + βw, u) = αb(v, u) + βb(w, u) for all α, β ∈ K,
u, v, w ∈ V ,
(ii) b is (anti-)symmetric: b(v, w) = b(w, v) ( = ˆ complex conjugation),
(iii) b is positive definite: v 6= 0 ⇔ b(v, v) > 0.
b is a semi-inner product, if it still complies with (i) and (ii), but is only positive semi-definite:
b(v, v) ≥ 0 for all v ∈ V .
✎ notation: usually we write (·, ·)V for an inner product on the vector space V .
Let V be a vector space equipped with a (semi-)inner product (·, ·)V . Any two elements v and w of
V are called orthogonal, if (v, w)V = 0. We write v ⊥ w.
If (·, ·)V is a (semi-)inner product (→ Def. 6.2.1) on the vector space V , then
q
kvkV := (v, v)V
defines a (semi-)norm (→ Def. 1.5.70) on V , the mean square (semi-)norm/ inner product
(semi-)norm induced by (·, ·)V .
✦ The Euclidean norm on K n induced by the dot product (Euclidean inner product).
n
(x, y)Kn := ∑ (x ) j (y) j [“Mathematical indexing” !] x, y ∈ K n .
j =1
From § 3.1.8 we know that in Euclidean space K n the best approximation of vector x ∈ K n in a subspace
V ⊂ K n is unique and given by the orthogonal projection of x onto V . Now we generalize this to vector
spaces equipped with inner products.
X =̂ a vector space over \mathbb{K} = \mathbb{R}, equipped with a mean square semi-norm \|\cdot\|_X induced by a semi-inner product (\cdot,\cdot)_X, see Thm. 6.2.3.
It can be an infinite dimensional function space, e.g., X = C0 ([ a, b]).
Assumption 6.2.6.
The semi-inner product (·, ·) X is a genuine inner product (→ Def. 6.2.1) on V , that is, it is positive
definite: (v, v) X > 0 ∀v ∈ V \ {0}.
Now we give a formula for the element q of V , which is nearest to a given element f of X with respect to
the norm k·k X . This is a genuine generalization of Thm. 3.1.10.
Theorem 6.2.7. Mean square norm best approximation through normal equations
k f − qk X = inf k f − p k X .
p ∈V
Proof. (inspired by Rem. 3.1.14) We first show that M is s.p.d. (→ Def. 1.1.8). Symmetry is clear from the
definition and the symmetry of (·, ·) X . That M is even positive definite follows from
x^H M x = \sum_{k=1}^{N} \sum_{j=1}^{N} ξ_k \overline{ξ_j} \big( b_k, b_j \big)_X = \Big\| \sum_{j=1}^{N} ξ_j b_j \Big\|_X^2 > 0 ,   (6.2.9)

if x := [ξ_j]_{j=1}^{N} ≠ 0 ⇔ \sum_{j=1}^{N} ξ_j b_j ≠ 0, since \|\cdot\|_X is a norm on V by Ass. 6.2.6.

Now, writing c := [γ_j]_{j=1}^{N} ∈ \mathbb{K}^N, b := \big[ (f, b_j)_X \big]_{j=1}^{N} ∈ \mathbb{K}^N, and using the basis representation q = \sum_{j=1}^{N} γ_j b_j ,
we find
Since M is s.p.d., the unique solution of grad Φ(c) = Mc − b = 0 yields the unique global minimizer of
Φ; the Hessian 2M is s.p.d. everywhere!
✷
( f − q, p) X = 0 ∀ p ∈ V ⇔ f −q ⊥ V .
The message of Cor. 6.2.12:
In Section 3.1.1 we introduced the concept of least squares solutions of overdetermined linear systems
of equations Ax = b, A ∈ R^{m,n}, m > n, see Def. 3.1.3. Thm. 3.1.10 taught that the normal equations A^⊤A x = A^⊤b give the least squares solution, if rank(A) = n.
In fact, Thm. 3.1.10 and the above Thm. 6.2.7 agree if X = K n (Euclidean space) and V = Span{a1 , . . . , an },
where a j ∈ R m are the columns of A and N = n.
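The following sketch illustrates Thm. 6.2.7 in a concrete situation: mean square best approximation in a polynomial space with respect to the discrete L²-inner product (6.2.22), by assembling and solving the normal equations. The monomial basis and all function names are chosen for illustration only (the monomial basis is convenient here, though not numerically optimal).

#include <Eigen/Dense>
#include <cmath>
#include <functional>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// coefficients c of the best approximant q = sum_j c_j t^(j-1), j = 1,...,N,
// in the sense of the discrete L2-inner product on the points t_0,...,t_n
VectorXd bestApproxCoeffs(const std::function<double(double)>& f,
                          const VectorXd& t, int N) {
  const int n = t.size();
  MatrixXd B(n, N);                     // B(i,j) = b_{j+1}(t_i) = t_i^j
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < N; ++j) B(i, j) = std::pow(t(i), j);
  VectorXd fv = t.unaryExpr(f);         // point values of f
  MatrixXd M = B.transpose() * B;       // Gram matrix M_kj = (b_j, b_k)_X
  VectorXd b = B.transpose() * fv;      // right hand side (f, b_k)_X
  return M.llt().solve(b);              // M is s.p.d. -> Cholesky solve
}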
In the setting of Section 6.2.1.2 we may ask: Which choice of basis B = {b1 , . . . , b N } of
V ⊂ X renders
the normal equations (6.2.8) particularly simple? Answer: A basis B, for which bk , b j X = δkj (δkj the
Kronecker symbol), because this will imply M = I for the coefficient matrix of the normal equations.
q = \sum_{j=1}^{N} \big( f, b_j \big)_X \, b_j .   (6.2.16)
From Section 1.5.1 we already know how to compute orthonormal bases: The algorithm from § 1.5.1 can
be run in the framework of any vector space V endowed with an inner product (·, ·)V and induced mean
square norm k·kV .
Its core steps are orthogonal projection onto the span of the already computed basis functions, followed by normalization b_j ← b_j / \|b_j\|_V, and it guarantees

Span\{b_1, ..., b_ℓ\} = Span\{p_1, ..., p_ℓ\} \quad\text{for all } ℓ ∈ \{1, ..., k\} .
This suggests the following alternative approach to the computation of the mean square best approximant
q in V of f ∈ X :
➊ Orthonormalize a basis {b1 , . . . , b N } of V , N := dim V , using Gram-Schmidt algorithm (6.2.18).
➋ Compute q according to (6.2.16).
Number of inner products to be evaluated: O( N 2 ) for N → ∞.
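A minimal sketch of this procedure for "functions" represented by their values in sample points, with the discrete L²-inner product (6.2.22) playing the role of (·,·)_V; all names are illustrative and no safeguard against (nearly) linearly dependent inputs is included.

#include <Eigen/Dense>
#include <cmath>
#include <vector>

using Eigen::VectorXd;

// inner product (f,g)_T = sum_j f(t_j) g(t_j) for sampled functions
double ip(const VectorXd& f, const VectorXd& g) { return f.dot(g); }

// Orthonormalize the sampled functions in B w.r.t. ip(); O(N^2) inner products
std::vector<VectorXd> gramSchmidt(std::vector<VectorXd> B) {
  for (std::size_t j = 0; j < B.size(); ++j) {
    for (std::size_t l = 0; l < j; ++l)      // subtract projections onto b_0,...,b_{j-1}
      B[j] -= ip(B[j], B[l]) * B[l];
    B[j] /= std::sqrt(ip(B[j], B[j]));       // normalization step b_j <- b_j/||b_j||
  }
  return B;
}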
To match the abstract framework of Section 6.2.1 we need to find (semi-)inner products on C0 ([ a, b]) that
supply positive definite inner products on Pm . The following options are commonly considered:
✦ On any interval [ a, b] we can use the L2 ([ a, b])-inner product (·, ·) L2 ([ a,b)] , defined in (6.2.5).
✦ Given a positive weight function w : [ a, b] → R, w(t) > 0 for all t ∈ [ a, b], we can consider the
weighted L2 -inner product on the interval [ a, b]
Z b
( f , g) w,[ a,b] := w(τ ) f (τ ) g (τ ) dτ . (6.2.21)
a
✦ For n ≥ m and n + 1 distinct points collected in the set T := {t0 , t1 , . . . , tn } ⊂ [ a, b] we can use
the discrete L2 -inner product
n
( f , g)T := ∑ f (t j ) g(t j ) . (6.2.22)
j =0
that is, multiplication with the independent variable can be shifted to the other function inside the inner
product.
✎ notation: Note that we have to plug a function into the slots of the inner products; this is indicated by
the notation {t 7→ . . .}.
The ideas of Section 6.2.1.3 that center around the use of orthonormal bases can also be applied to
polynomials.
The sequence of orthonormal polynomials from Def. 6.2.25 is unique up to signs, supplies an (·, ·) X -
orthonormal basis (→ Def. 6.2.14) of Pm , and satisfies
Proof. Comparing Def. 6.2.14 and (6.2.26) the ONB-property of {r0 , . . . , rm } is immediate. Then (6.2.26)
follows from dimensional considerations.
\mathcal{P}_{k-1} ⊂ \mathcal{P}_k has co-dimension 1, so that there is a unit “vector” in \mathcal{P}_k which is orthogonal to \mathcal{P}_{k-1} and unique up to sign.
✷
Hence s_k(t) := t \cdot r_k(t) is a polynomial of degree k + 1 with leading coefficient ≠ 0, that is s_k ∈ \mathcal{P}_{k+1} \setminus \mathcal{P}_k. Therefore, r_{k+1} can be obtained by orthogonally projecting s_k onto \mathcal{P}_k plus normalization, cf. Lines 4-5 of Algorithm (6.2.18):

r_{k+1} = ± \frac{\tilde r_{k+1}}{\|\tilde r_{k+1}\|_X} , \qquad \tilde r_{k+1} = s_k - \sum_{j=0}^{k} \big( s_k, r_j \big)_X r_j .   (6.2.30)

The sum in (6.2.30) collapses to two terms! In fact, since (r_k, q)_X = 0 for all q ∈ \mathcal{P}_{k-1}, by Ass. 6.2.24

\big( s_k, r_j \big)_X \overset{(6.2.23)}{=} \big( \{t \mapsto t r_k(t)\}, r_j \big)_X = \big( r_k, \{t \mapsto t r_j(t)\} \big)_X = 0 , \quad\text{if } j < k - 1 ,

because in this case \{t \mapsto t r_j(t)\} ∈ \mathcal{P}_{k-1}. As a consequence (6.2.30) reduces to the 3-term recursion

r_{k+1} = ± \frac{\tilde r_{k+1}}{\|\tilde r_{k+1}\|_X} , \qquad \tilde r_{k+1} = s_k - (\{t \mapsto t r_k(t)\}, r_k)_X \, r_k - (\{t \mapsto t r_k(t)\}, r_{k-1})_X \, r_{k-1} , \quad k = 1, ..., m-1 .   (6.2.31)
The 3-term recursion (6.2.31) can be recast in various ways. Forgoing normalization the next theorem
presents one of them.
Proof. (by rather straightforward induction) We first confirm, thanks to the definition of α1 ,
For the induction step we assume that the assertion is true for p0 , . . . , pk and observe that for pk+1
according to (6.2.33) we have
This amounts to the assertion of orthogonality for k + 1. Above, several inner products vanish because of the induction hypothesis!
✷
It is a natural question what is the unique sequence of L2 ([−1, 1])-orthonormal polynomials. Their rather
simple characterization will be discussed in the sequel.
Legendre polynomials

The Legendre polynomials P_n can be defined by the 3-term recursion

P_{n+1}(t) := \frac{2n+1}{n+1} t P_n(t) - \frac{n}{n+1} P_{n-1}(t) , \qquad P_0 := 1 ,\; P_1(t) := t .   (7.3.33)

[Fig. 234: the Legendre polynomials P_n on [-1, 1] for n = 0, ..., 5.]
Since they involve integrals, weighted L²-inner products (6.2.21) are not computationally accessible, unless one resorts to approximation, see Chapter 7 for the corresponding theory and techniques.
Therefore, given a point set T := {t0 , t1 , . . . , tn }, we focus on the associated discrete L2 -inner product
n
( f , g) X := ( f , g) T := ∑ f (t j ) g(t j ) , f , g ∈ C0 ([ a, b]) , (6.2.22)
j =0
The polynomials pk generated by the 3-term recursion (6.2.33) from Thm. 6.2.32 are then called discrete
orthogonal polynomials. The following C++ code computes the recursion coefficients αk and β k , k =
1, . . . , n − 1.
C++11 code 6.2.39: Computation of weights in 3-term recursion for discrete orthogonal poly-
nomials
2 // Computation of coefficients α, β from 6.2.32
3 // IN : t = points in the definition of the discrete L2 -inner product
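Only the header comments of the listing are reproduced above. The following sketch shows one possible implementation, assuming the common recursion convention p_{k+1}(t) = (t - α_{k+1}) p_k(t) - β_{k+1} p_{k-1}(t) with p_0 = 1, p_{-1} = 0, which may differ in indexing from (6.2.33); the function name coeffortho is illustrative.

#include <Eigen/Dense>

using Eigen::VectorXd;

// recursion coefficients for polynomials orthogonal w.r.t. the discrete
// L2-inner product (6.2.22) on the points t
void coeffortho(const VectorXd& t, int n, VectorXd& alpha, VectorXd& beta) {
  const int m = t.size();
  alpha.resize(n); beta.resize(n);
  VectorXd p_old = VectorXd::Zero(m);      // p_{-1} sampled on the points
  VectorXd p = VectorXd::Ones(m);          // p_0
  for (int k = 0; k < n; ++k) {
    const double np2 = p.squaredNorm();    // (p_k, p_k)_T
    alpha(k) = t.cwiseProduct(p).dot(p) / np2;
    beta(k) = (k == 0) ? np2 : np2 / p_old.squaredNorm();
    // for k = 0 the beta-term is irrelevant, since p_{-1} = 0
    VectorXd p_new = (t.array() - alpha(k)).matrix().cwiseProduct(p) - beta(k) * p_old;
    p_old = p; p = p_new;
  }
}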
Given a point set T := {t0 , t1 , . . . , tn } ⊂ [ a, b], and a function f : [ a, b] → K, we may seek to ap-
proximate f by its polynomial best approximant with respect to the discrete L2 -norm k·kT induced by the
discrete L2 -inner product (6.2.22).
q_k := \operatorname*{argmin}_{p ∈ \mathcal{P}_k} \|f - p\|_T ,\; k ∈ \{0, ..., n\} ,
The stable and efficient computation of fitting polynomials can rely on combining Thm. 6.2.32 with Cor. 6.2.15:
We use equidistant points T := \{ t_k = -1 + k \tfrac{2}{m} ,\; k = 0, ..., m \} ⊂ [-1, 1], m ∈ \mathbb{N}, to compute fitting polynomials (→ Def. 6.2.41) for two different functions. We monitor the L²-norm and L^∞-norm of the approximation error, both norms approximated by sampling in ξ_j = -1 + \tfrac{j}{500}, j = 0, ..., 1000.
➀ f (t) = (1 + (5t)2 )−1 , I = [−1, 1] → Ex. 6.1.41, analytic in complex neighborhood of [−1, 1]:
[Fig. 235: f and its fitting polynomials of degrees n = 0, 2, 4, 6, 8, 10 (equidistant points); Fig. 236: L^∞- and L²-norms of the fitting error versus the polynomial degree n for m = 50, 100, 200, 400.]
➣ We observe exponential convergence (→ Def. 6.1.38) in the polynomial degree n.
[Fig. 237: the second test function and its fitting polynomials; Fig. 238: error norms versus the polynomial degree n (doubly logarithmic scale).]

➣ We observe only algebraic convergence (→ Def. 6.1.38) in the polynomial degree n (for n ≪ m!).
[Fig. 239: error norms versus the polynomial degree n (doubly logarithmic scale), confirming algebraic convergence.]
q ∈ \operatorname*{argmin}_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)} .
The results of Section 6.2.1 cannot be applied because the supremum norm is not induced by an inner
product on Pn .
Theory provides us with surprisingly precise necessary and sufficient conditions to be satisfied by the
polynomial L∞ ([ a, b])-best approximant q.
q = \operatorname*{argmin}_{p ∈ \mathcal{P}_n} \|f - p\|_{L^∞(I)}

if and only if there exist n + 2 points a ≤ ξ_0 < ξ_1 < \dots < ξ_{n+1} ≤ b at which the error f - q attains its maximum modulus \|f - q\|_{L^∞([a,b])} with alternating signs (such points are called alternants).
The widely used iterative algorithm (Remez algorithm) for finding an L∞ -best approximant is motivated
by the alternation theorem. The idea is to determine successively better approximations of the set of
alternants: A(0) → A(1) → . . ., ♯A(l ) = n + 2.
Key is the observation that, due to the alternation theorem, the polynomial L∞ ([ a, b])-best approximant q
will satisfy (one of the) interpolation conditions
➁ Given approximate alternants A^{(l)} := \{ξ_0^{(l)} < ξ_1^{(l)} < \dots < ξ_n^{(l)} < ξ_{n+1}^{(l)}\} ⊂ [a, b], determine q ∈ \mathcal{P}_n and a deviation δ ∈ \mathbb{R} satisfying the extended interpolation conditions

q(ξ_k^{(l)}) + (-1)^k δ = f(ξ_k^{(l)}) ,\; k = 0, ..., n+1 .   (6.3.6)

After choosing a basis for \mathcal{P}_n, this is an (n+2) × (n+2) linear system of equations, cf. § 5.1.13.

➂ Choose A^{(l+1)} as the set of extremal points of f - q, truncated in case more than n + 2 of these exist. These extrema can be located approximately by sampling on a fine grid covering [a, b]. If the derivative of f ∈ C^1([a, b]) is available, too, then search for zeros of (f - q)' using the secant method from § 8.3.22.

➃ If \|f - q\|_{L^∞([a,b])} ≤ TOL · \|d\|_{L^∞([a,b])}, STOP; else GOTO ➁. (TOL is a prescribed relative tolerance.)
MATLAB code 6.3.7: Remez algorithm for uniform polynomial approximation on an interval
1  function c = remes(f,f1,a,b,d,tol)
2  % f is a handle to the function, f1 to its derivative
3  % d = polynomial degree (positive integer)
4  % a,b = interval boundaries
5  % returns coefficients of polynomial in monomial basis
6  % (MATLAB convention, see Rem. 5.2.4).
7
18 maxit = 10;
19 % Main iteration loop of Remez algorithm
20 for k=1:maxit
21   % Interpolation at d+2 points xe with deviations +-delta
22   % Algorithm uses monomial basis, which is not optimal
23   V=vander(xe); A=[V(:,2:d+2), (-1).^[0:d+1]']; % LSE
24   c=A\fxe; % Solve for coefficients of polynomial q
25   c1=[d:-1:1]'.*c(1:d); % Monomial coefficients of derivative q'
26
27   % Find initial guesses for the inner extremes by sampling; track sign
28   % changes of the derivative of the approximation error
29   deltab = (polyval(c1,xtab) - f1tab);
30   s=[deltab(1:n-1)].*[deltab(2:n)];
31   ind=find(s<0); xx0=xtab(ind); % approximate zeros of e'
32   nx = length(ind); % number of approximate zeros
33   % Too few extrema; bail out
34   if (nx < d), error('Too few extrema'); end
35
We examine the convergence of the Remez algorithm from Code 6.3.7 for two different functions:
[Fig. 241, 242: decay of the approximation error over the steps of the Remez algorithm for polynomial degrees n = 3, 5, 7, 9, 11 (semi-logarithmic scale), for the two test functions.]
Convergence in both cases; faster convergence observed for smooth function, for which machine precision
is reached after a few steps.
f ∈ C 0 (R ) , f ( t + 1) = f ( t ) ∀ t ∈ R .
The natural space for approximating generic periodic functions is a space of trigonometric polyno-
mials with the same period.
Remember from Def. 5.6.3: the space of 1-periodic trigonometric polynomials of degree 2n, n ∈ \mathbb{N}, is

\mathcal{P}_{2n}^T = \operatorname{Span}\big\{ t \mapsto 1, t \mapsto \sin(2πt), t \mapsto \cos(2πt), t \mapsto \sin(4πt), t \mapsto \cos(4πt), ..., t \mapsto \sin(2πnt), t \mapsto \cos(2πnt) \big\}   (6.4.1a)
= \operatorname{Span}\{ t \mapsto \exp(2πıkt) : k = -n, ..., n \} .   (6.4.1b)

Terminology: \mathcal{P}_{2n}^T =̂ space of trigonometric polynomials of degree 2n.
From Section 5.6 remember a few more facts about trigonometric polynomials and trigonometric interpo-
lation:
✦ Cor. 5.6.8: dimension of the space of trigonometric polynomials: \dim \mathcal{P}_{2n}^T = 2n + 1.
✦ Trigonometric interpolation can be reduced to polynomial interpolation on the unit circle S^1 ⊂ \mathbb{C} in the complex plane, see (5.6.7). This gives existence & uniqueness of the trigonometric interpolant q satisfying (6.4.3) and (6.4.4).
✦ Code 5.6.15: efficient FFT-based algorithms for trigonometric interpolation in the equidistant nodes t_k = \frac{k}{2n+1}, k = 0, ..., 2n.
The relationship of trigonometric interpolation and polynomial interpolation on the unit circle suggests a
uniform distribution of nodes for general trigonometric interpolation.
✎ notation: trigonometric interpolation operator in the 2n + 1 equidistant nodes t_k = \frac{k}{2n+1}, k = 0, ..., 2n:

T_n : C^0([0,1[) → \mathcal{P}_{2n}^T , \qquad T_n(f)(t_k) = f(t_k) \;\; ∀ k ∈ \{0, ..., 2n\} .   (6.4.6)

We are interested in the behavior of the interpolation error f - T_n(f) for functions f : [0,1[ → \mathbb{C} with different smoothness properties. To begin with we perform an empiric study.
Now we study the asymptotic behavior of the error of equidistant trigonometric interpolation as n → ∞ in
a numerical experiment for functions with different smoothness properties.
[Fig. 244, 245: L^∞- and L²-norms of the trigonometric interpolation error versus n for the three test functions #1, #2, #3 (doubly logarithmic scale).]
We conclude that in this experiment higher smoothness of f leads to faster convergence of the trigonometric interpolant.

Of course, the smooth trigonometric interpolants of the step function fail to converge in L^∞-norm in Exp. 6.4.7.
Moreover, they will not even converge “visually” to the step function, which becomes manifest by a closer
inspection of the interpolants.
[Fig.: the step function f and its trigonometric interpolants p for n = 16 (left) and n = 128 (right).]
Observation: overshooting in neighborhood of discontinuity: Gibbs phenomenon
(6.4.10) Aliasing
We study the action of the trigonometric interpolation operator Tn from (6.4.6) on individual Fourier modes
µk (t) := exp(−2πkıt), t ∈ R, k ∈ Z. Due to the 1-periodicity of t 7→ exp(2πıt) we find for every node
t_j := \frac{j}{2n+1}, j = 0, ..., 2n:

μ_k(t_j) = \exp\big( -2πık \tfrac{j}{2n+1} \big) = \exp\big( -2πı (k - ℓ(2n+1)) \tfrac{j}{2n+1} \big) = μ_{k - ℓ(2n+1)}(t_j) \quad ∀ ℓ ∈ \mathbb{Z} .

When sampled on the node set T_n := \{t_0, ..., t_{2n}\}, all the Fourier modes μ_{k - ℓ(2n+1)}, ℓ ∈ \mathbb{Z}, yield the same values. Thus trigonometric interpolation cannot distinguish them! This phenomenon is called aliasing.
Aliasing demonstrated for f (t) = sin(2π · 19t) = Im(exp(2πı19t)) for different node sets.
[Fig.: f(t) = sin(2π·19t) and its trigonometric interpolants p for three different numbers of equidistant nodes, illustrating aliasing.]
T_n μ_k = μ_{\tilde k} , \quad \tilde k ∈ \{-n, ..., n\} ,\; k - \tilde k ∈ (2n+1)\mathbb{Z} \qquad [\, \tilde k := k \bmod (2n+1) \,] .   (6.4.11)

(\tilde n = n, \widetilde{n+1} = -n, \widetilde{-n-1} = n, \widetilde{2n} = -1, etc.)
Trigonometric interpolation by Tn folds all Fourier modes (“frequencies”) to the finite range {−n, . . . , n}.
From (6.4.11), by linearity of T_n, we obtain for f : [0,1[ → \mathbb{C} in Fourier series representation

f(t) = \sum_{j=-∞}^{∞} \hat f_j μ_j(t) \quad\Longrightarrow\quad T_n(f)(t) = \sum_{j=-n}^{n} γ_j μ_j(t) , \quad γ_j = \sum_{ℓ=-∞}^{∞} \hat f_{j + ℓ(2n+1)} .   (6.4.12)

We can read the trigonometric polynomial T_n f ∈ \mathcal{P}_{2n}^T as a Fourier series with non-zero coefficients only in the index range \{-n, ..., n\}. Thus, for the Fourier coefficients of the trigonometric interpolation error E(t) := f(t) - T_n f(t) we find from (6.4.12)

\hat E_j = \begin{cases} -\sum_{ℓ ∈ \mathbb{Z} \setminus \{0\}} \hat f_{j + ℓ(2n+1)} , & \text{if } j ∈ \{-n, ..., n\} , \\ \hat f_j , & \text{if } |j| > n , \end{cases} \qquad j ∈ \mathbb{Z} .   (6.4.13)
In order to estimate these norms of the trigonometric interpolation error we need quantitative information
about the decay of the Fourier coefficients fbj as | j| → ∞.
For 1-periodic c ∈ C^0(\mathbb{R}) with integrable derivative \dot c := \frac{d}{dt}c we find by integration by parts (the boundary terms cancel due to periodicity)

\hat c_j = \int_0^1 c(t) e^{2πıjt}\, dt = -\frac{1}{2πıj} \int_0^1 \dot c(t) e^{2πıjt}\, dt = (-2πıj)^{-1} \hat{\dot c}_j , \quad j ≠ 0 .

We can also arrive at this formula by (formal) term-wise differentiation of the Fourier series:

c(t) = \sum_{j=-∞}^{∞} \hat c_j e^{-2πıjt} \quad\Longrightarrow\quad \dot c(t) = \sum_{j=-∞}^{∞} \underbrace{(-2πıj)\,\hat c_j}_{= \hat{\dot c}_j} e^{-2πıjt} .   (6.4.18)
For the Fourier coefficients of the derivatives of a 1-periodic function f ∈ C^{k-1}(\mathbb{R}), k ∈ \mathbb{N}, with integrable k-th derivative f^{(k)}, it holds that

\widehat{\big( f^{(k)} \big)}_j = (-2πıj)^k \hat f_j , \quad j ∈ \mathbb{Z} .
The smoother a periodic function the faster the decay of its Fourier coefficients
The isometry property of Thm. 4.2.89 also yields for f ∈ C^{k-1}(\mathbb{R}) with f^{(k)} ∈ L^2(]0,1[) that

\big\| f^{(k)} \big\|_{L^2(]0,1[)}^2 = (2π)^{2k} \sum_{j=-∞}^{∞} j^{2k} |\hat f_j|^2 .   (6.4.24)
We can now combine the identity (6.4.24) with (6.4.15) and obtain an interpolation error estimate in
L2 (]0, 1[)-norm.
with c_k := 2 \sum_{ℓ=1}^{∞} (2ℓ - 1)^{-2k} < ∞.
Thm. 6.4.25 confirms algebraic convergence of the L2 -norm of the trigonometric interpolation error for
functions with limited smoothness. Higher rates can be expected for smoother functions, which we have
also found in cases #1 and #3 in Exp. 6.4.7.
In § 6.1.62 we saw that we can expect exponential decay of the maximum norm of polynomial interpolation
errors in the case of “very smooth” interpolands. To capture this property of functions we resorted to
the notion of analytic functions, as defined in Def. 6.1.63. Since trigonometric interpolation is closely
connected to polynomial interpolation (on the unit circle S1 , see Section 5.6.2), it is not surprising that
analyticity of interpolands will also involve exponential convergence of trigonometric interpolants. This
result will be established in this section.
In case #2 of Exp. 6.4.7 we already saw an instance of exponential convergence for an analytic interpoland. A more detailed study follows.
f(t) = \frac{1}{\sqrt{1 - α \sin(2πt)}} \quad\text{on } I = [0, 1] .   (6.4.28)

[Fig.: L^∞- and L²-norms of the trigonometric interpolation error versus n for different values of α (e.g. α = 0.5), semi-logarithmic scale.]
Lemma 6.4.22 asserts algebraic decay of the Fourier coefficients of functions with limited smoothness. As analytic 1-periodic functions are “infinitely smooth” — they always belong to C^∞(\mathbb{R}) — we expect a stronger result in this case. In fact, we can conclude exponential decay of the Fourier coefficients.

Proof. [Fig. 247: shifting the path of integration from the real axis to the line \mathbb{R} + ır in the z-plane.] With g_r(t) := f(t + ır) we compute

\widehat{(g_r)}_k = \int_0^1 f(t + ır) e^{-2πıkt}\, dt = \int_0^1 f(t) e^{-2πık(t - ır)}\, dt = e^{-2πrk} \hat f_k ,
Knowing exponential decay of the Fourier coefficients, the geometric sum formula can be used to extract
estimates from (6.4.15) and (6.4.16):
Lemma 6.4.32. Interpolation error estimates for exponentially decaying Fourier coefficients
This estimate can be combined with the result of Thm. 6.4.30 and gives the main result of this section:
The speed of exponential convergence clearly depends on the width η of the “strip of analyticity” S̄.
Similar to Chebychev interpolants, also trigonometric interpolants converge exponentially fast, if the inter-
poland f is 1-periodic analytic (→ Def. 6.1.63) in a strip around the real axis in C, see Thm. 6.4.33 for
details.
1 + α \sin(2πz) ∉ \mathbb{R}_0^- \;⇔\; \sin(2πz) = \sin(2πx)\cosh(2πy) + ı\cos(2πx)\sinh(2πy) ∉ \big]{-∞}, -1 - \tfrac{1}{α}\big] ,

so that the domain of analyticity of f is

\mathbb{C} \setminus \bigcup_{k ∈ \mathbb{Z}} \Big( \tfrac{2k+1}{4} + ı\big( \mathbb{R} \setminus ]{-ζ}, ζ[ \big) \Big) , \quad ζ ∈ \mathbb{R}^+ ,\; \cosh(2πζ) = 1 + \tfrac{1}{α} .
[Fig. 248, 249: the graph of cosh and the strip of analyticity of f in the complex plane.]

➣ f is analytic in the strip S := \{ z ∈ \mathbb{C} : -ζ < \operatorname{Im}(z) < ζ \}.
➣ As α decreases, the strip of analyticity becomes wider, since x \mapsto \cosh(x) is increasing for x > 0.
(6.5.1) Grid/mesh
The attribute “piecewise” refers to a partitioning of the interval on which we aim to approximate. In the case of data interpolation the natural choice was to use the intervals defined by the interpolation nodes. Yet we already saw exceptions in the case of shape-preserving interpolation by means of quadratic splines, see Section 5.5.3.

In the case of function approximation based on an interpolation scheme the additional freedom to choose the interpolation nodes suggests that those be decoupled from the partitioning.

Borrowing from the terminology for splines, cf. Def. 5.5.1, the underlying mesh for piecewise polynomial approximation on [a, b] is a partition M := \{a = x_0 < x_1 < \dots < x_m = b\}.

Terminology:
✦ x_j =̂ nodes of the mesh M,
✦ [x_{j-1}, x_j[ =̂ intervals/cells of the mesh,
✦ h_M := \max_j |x_j - x_{j-1}| =̂ mesh width,
✦ if x_j = a + jh =̂ equidistant (uniform) mesh with meshwidth h > 0.

[Fig.: a mesh with nodes x_0, ..., x_{14} on [a, b].]
We will see that most approximation schemes relying on piecewise polynomials are local in the sense that
finding the approximant on a cell of the mesh relies only on a fixed number of function evaluations in a
neighborhood of the cell.
Recall theory of polynomial interpolation → Section 5.2.2: n + 1 data points needed to fix interpolating
polynomial, see Thm. 5.2.14.
Obviously, IM depends on M, the local degrees n j , and the sets T j of local interpolation points (the latter
two are suppressed in notation).
then the piecewise polynomial Lagrange interpolant according to (6.5.5) is continuous on [ a, b]:
s ∈ C0 ([ a, b]).
\|f - I_M f\| \le C\, T(N) \quad\text{for } N → ∞ , \qquad\text{where } N := \sum_{j=1}^{m} (n_j + 1) .   (6.5.9)
But why do we choose this strange number N as parameter when investigating the approximation error?
Because, by Thm. 5.2.2, it agrees with the dimension of the space of discontinuous, piecewise polynomials
functions
{q : [ a, b] → R: q| Ij ∈ Pn j ∀ j = 1, . . . , m} !
This dimension tells us the number of real parameters we need to describe the interpolant s, that is, the
“information cost” of s. N is also proportional to the number of interpolation conditions, which agrees with
the number of f -evaluations needed to compute s (why only proportional in general?).
Compare Exp. 5.3.7: grid M := \{-5, -\tfrac{5}{2}, 0, \tfrac{5}{2}, 5\}, local interpolation nodes equidistant in I_j, endpoints included; piecewise linear and piecewise quadratic polynomial interpolants of f(t) = \arctan(t).

[Fig. 250: \arctan(t) together with its piecewise linear and piecewise quadratic interpolants on [-5, 5].]

✦ Sequence of (equidistant) meshes: M_i := \{-5 + j \cdot 2^{-i} \cdot 10\}_{j=0}^{2^i}, i = 1, ..., 6.
✦ Equidistant local interpolation nodes (endpoints of grid intervals included).

Monitored: interpolation error in (approximate) L^∞- and L²-norms, see (6.1.95), (6.1.94).
[Fig. 251, 252: L^∞- and L²-norms of the interpolation error versus the mesh width h for local polynomial degrees 1, ..., 6 (doubly logarithmic scale).]
(nearly linear error norm graphs in doubly logarithmic scale, see Rem. 6.1.40)
                  n      1       2       3       4       5       6
rate w.r.t. L²-norm   1.9957  2.9747  4.0256  4.8070  6.0013  5.2012
rate w.r.t. L∞-norm   1.9529  2.8989  3.9712  4.7057  5.9801  4.9228
➣ Higher polynomial degree provides faster algebraic decrease of interpolation error norms. Empiric
evidence for rates α = p + 1
Here: rates estimated by linear regression (→ Ex. 3.1.5) based on MATLAB’s polyfit and the interpo-
lation errors for meshwidth h ≤ 10 · 2−5. This was done in order to avoid erratic “preasymptotic”, that is,
for large meshwidth h, behavior of the error.
The bad rates for n = 6 are probably due to the impact of roundoff, because the norms of the interpolation
error had dropped below machine precision, see Fig. 251, 252.
The observations made in Ex. 6.5.10 are easily explained by applying the polynomial interpolation error
estimates of Section 6.1.2 locally on the mesh intervals [ x j−1, x j ], j = 1, . . . , m: for constant polynomial
degree n = n j , j = 1, . . . , m, we get
(6.1.50) \;⇒\; \|f - s\|_{L^∞([x_0,x_m])} \le \frac{h_M^{n+1}}{(n+1)!} \big\| f^{(n+1)} \big\|_{L^∞([x_0,x_m])} .   (6.5.12)
[Fig. 253, 254: L^∞- and L²-norms of the interpolation error versus the local polynomial degree for mesh widths h = 5, 2.5, 1.25, 0.625, 0.3125 (semi-logarithmic scale).]
In this example we deal with an analytic function, see Rem. 6.1.96. Though equidistant local interpolation nodes are used, cf. Ex. 6.1.41, the mesh intervals seem to be small enough that even in this case exponential convergence prevails.
See Section 5.4 for definition and algorithms for cubic Hermite interpolation of data points, with a focus
on shape preservation, however. If the derivative f ′ of the interpoland f is available (in procedural form),
then it can be used to fix local cubic polynomials by prescribing point values and derivative values in the
endpoints of grid intervals.
Definition 6.5.14. Piecewise cubic Hermite interpolant (with exact slopes) → Def. 5.4.1
Given f ∈ C^1([a, b]) and a mesh M := \{a = x_0 < x_1 < \dots < x_{m-1} < x_m = b\}, the piecewise cubic Hermite interpolant (with exact slopes) s : [a, b] → \mathbb{R} is defined as

s|_{[x_{j-1}, x_j]} ∈ \mathcal{P}_3 ,\; j = 1, ..., m , \qquad s(x_j) = f(x_j) ,\; s'(x_j) = f'(x_j) ,\; j = 0, ..., m .
Clearly, the piecewise cubic Hermite interpolant is continuously differentiable: s ∈ C1 ([ a, b]), cf. Cor. 5.4.2.
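For illustration, a minimal sketch of how such an interpolant can be evaluated at a single point, using the local cubic Hermite basis on each mesh interval; the function name and the simple interval search are illustrative, not a lecture code.

#include <Eigen/Dense>

// Evaluate the piecewise cubic Hermite interpolant of Def. 6.5.14 at x,
// given mesh nodes xs, values y = f(xs) and exact slopes c = f'(xs)
double hermiteEval(const Eigen::VectorXd& xs, const Eigen::VectorXd& y,
                   const Eigen::VectorXd& c, double x) {
  const int m = xs.size() - 1;                 // number of mesh intervals
  int j = 0;                                   // locate interval [x_j, x_{j+1}] containing x
  while (j < m - 1 && x > xs(j + 1)) ++j;
  const double h = xs(j + 1) - xs(j);
  const double s = (x - xs(j)) / h;            // local coordinate in [0,1]
  // cubic Hermite basis polynomials on the reference interval
  const double H1 = 1 - 3*s*s + 2*s*s*s, H2 = 3*s*s - 2*s*s*s;
  const double H3 = s - 2*s*s + s*s*s,   H4 = -s*s + s*s*s;
  return y(j)*H1 + y(j + 1)*H2 + h*(c(j)*H3 + c(j + 1)*H4);
}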
In this experiment we study the h-convergence of cubic Hermite interpolation (with exact slopes) for a smooth function: f(x) = \arctan(x) on the domain I = (-5, 5), with equidistant meshes.

[Fig. 255: sup-norm and L²-norm of the interpolation error versus the meshwidth h (doubly logarithmic scale).]
C++11 code 6.5.16: Hermite approximation and orders of convergence with exact slopes
The observation made in Exp. 6.5.15 matches the theoretical prediction of the rate of algebraic conver-
gence for cubic Hermite interpolation with exact slopes for a smooth function.
Let s be the cubic Hermite interpolant of f ∈ C4 ([ a, b]) on a mesh M := { a = x0 < x1 < . . . <
xm−1 < xm = b} according to Def. 6.5.14. Then
\|f - s\|_{L^∞([a,b])} \le \frac{1}{4!} h_M^4 \big\| f^{(4)} \big\|_{L^∞([a,b])} ,
In Section 5.4.2 we saw variants of cubic Hermite interpolation, for which the slopes c_j = s'(x_j) were computed from the values y_j in a preprocessing step. Now we study the use of such a scheme for approximation.

Piecewise cubic Hermite interpolation of f(x) = \arctan(x) on the domain I = (-5, 5), with an equidistant mesh T in I, see Exp. 6.5.15.

[Fig.: sup-norm and L²-norm of the interpolation error versus the meshwidth h (doubly logarithmic scale).]
We observe a lower rate of algebraic convergence compared to the use of exact slopes, due to the averaging (5.4.8). From the plot we deduce O(h³) asymptotic decay of the L²- and L^∞-norms of the approximation error for meshwidth h → 0.
88   c(n) = delta(n - 1);
89   for (unsigned i = 1; i < n; ++i) {
90     c(i) = (h(i)*delta(i - 1) + h(i - 1)*delta(i)) / (t(i + 1) - t(i - 1));
91   }
92   return c;
93 }
94
95 // Appends an Eigen::VectorXd to another Eigen::VectorXd
96 void append(VectorXd& x, const VectorXd& y) {
97   x.conservativeResize(x.size() + y.size());
98   x.tail(y.size()) = y;
99 }
Recall the concept and algorithms for cubic spline interpolation from Section 5.5.1. As an interpolation scheme it can also serve as the foundation for an approximation scheme according to § 6.0.6: the mesh will double as knot set, see Def. 5.5.1. Cubic spline interpolation is not local, as we saw in § 5.5.19. Nevertheless, cubic spline interpolants can be computed with an effort of O(m) as elaborated in § 5.5.5.
We take I = [-1, 1] and rely on an equidistant mesh (knot set) M := \{-1 + \tfrac{2}{n} j\}_{j=0}^{n}, n ∈ \mathbb{N} ➙ meshwidth h = 2/n.

We study h-convergence of complete (→ § 5.5.11) cubic spline interpolation, where the slopes at the endpoints of the interval are made to agree with the derivatives of the interpoland at these points. As interpolands we consider

f_1(t) = \frac{1}{1 + e^{-2t}} ∈ C^∞(I) , \qquad f_2(t) = \begin{cases} 0 , & \text{if } t < -\tfrac{2}{5} , \\ \tfrac{1}{2}\big( 1 + \cos(π(t - \tfrac{3}{5})) \big) , & \text{if } -\tfrac{2}{5} < t < \tfrac{3}{5} , \\ 1 & \text{otherwise,} \end{cases} \quad f_2 ∈ C^1(I) .
[Fig. 257, 258: L^∞- and L²-norms of the complete cubic spline interpolation error for f_1 and f_2 versus the meshwidth h (doubly logarithmic scale).]
We observe algebraic order of convergence in h with empiric rate approximately given by min{1 +
regularity of f , 4}.
We remark that there is the following theoretical result [?], [?, Rem. 9.2]:
f ∈ C^4([t_0, t_n]) \;⇒\; \|f - s\|_{L^∞([t_0,t_n])} \le \frac{5}{384} h^4 \big\| f^{(4)} \big\|_{L^∞([t_0,t_n])} .
13   // build rhs
14   Eigen::VectorXd rhs(n - 1);
15   for (long i = 0; i < n - 1; ++i) {
16     rhs(i) = 3 * ((y(i + 1) - y(i)) / (h(i)*h(i)) + (y(i + 2) - y(i + 1)) / (h(i + 1)*h(i + 1)));
17   }
18   // modify according to complete cubic spline
19   rhs(0) -= b(0) * c0;
20   rhs(n - 2) -= b(n - 1) * cn;
21
45
46   // plot interpolation
47   mgl::Figure fig;
48   fig.title("Spline interpolation " + plotname);
49   fig.plot(t, feval(f, t), " m*").label("Data points");
50   fig.plot(x, fv, "b").label("f");
51   fig.plot(x, v, "r").label("Cubic spline interpolant");
52   fig.legend(1, 0);
53   fig.save("interp_" + plotname);
54
55   // plot error
56   mgl::Figure err;
57   err.title("Spline approximation error " + plotname);
58   err.setlog(true, true);
59   err.plot(h, errL2, "r;").label("L^2 norm");
60   err.plot(h, errInf, "b;").label("L^\\infty norm");
61   err.legend(1, 0);
62   err.save("approx_" + plotname);
63 }
Numerical Quadrature
Supplementary reading. Numerical quadrature is covered in [?, VII] and [?, Ch. 10].
Z
Numerical quadrature deals with the approximate numerical evaluation of integrals f (x) dx for a given
Ω
(closed) integration domain Ω ⊂ R d . Thus, the underlying problem in the sense of § 1.5.67 is the
mapping
C0 (Ω ) → RR
I: , (7.0.1)
f 7→ Ω f (x) dx
If f is complex-valued or vector-valued, then so is the integral. The methods presented in this chapter can
immediately be generalized to this case by componentwise application.
General methods for numerical quadrature should rely only on finitely many point evaluations of
the integrand.
☞ Numerical quadrature methods are key building blocks for so-called variational methods for the nu-
merical treatment of partial differential equations. A prominent example is the finite element method.
[Fig.: the integral \int_a^b f(t)\, dt visualized as the area under the graph of f.]
In Ex. 2.1.3 we learned about the nodal analysis of electrical circuits. Its application to a non-linear circuit
will be discussed in Ex. 8.0.1, which will reveal that every computation of currents and voltages can be
rather time-consuming. In this example we consider a non-linear circuit in quasi-stationary operation
(capacities and inductances are ignored). Then the computation of branch currents and nodal voltages
entails solving a non-linear system of equations.
The goal is to compute the energy dissipated by the circuit, which is equal to the energy injected by the
voltage source. This energy can be obtained by integrating the power P(t) = U (t) I (t) over period [0, T ]:
Z T
Wtherm = U (t) I (t) dt , where I = I (U ) .
0
double I(double U) involves solving non-linear system of equations, see Ex. 8.0.1!
This is a typical example where “point evaluation” by solving the non-linear circuit equations is the only
way to gather information about the integrand.
Contents
7.1 Quadrature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
7.2 Polynomial Quadrature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
7.3 Gauss Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
7.4 Composite Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
7.5 Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Quadrature formulas realize the approximation of an integral through finitely many point evaluations of the
integrand.
\int_a^b f(t)\, dt ≈ Q_n(f) := \sum_{j=1}^{n} w_j^n f(c_j^n) .   (7.1.2)

Terminology: w_j^n ∈ \mathbb{R}: quadrature weights;  c_j^n ∈ [a, b]: quadrature nodes.
Obviously (7.1.2) is compatible with integrands f given in procedural form as double f(double t),
compare § 7.0.2.
A single invocation costs n point evaluations of the integrand plus n additions and multiplications.
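A minimal sketch of (7.1.2) with the integrand passed in procedural form; the function name quadform is illustrative.

#include <Eigen/Dense>
#include <functional>

// apply the n-point quadrature formula with weights w and nodes c to f
double quadform(const Eigen::VectorXd& w, const Eigen::VectorXd& c,
                const std::function<double(double)>& f) {
  double Q = 0.0;
  for (int j = 0; j < w.size(); ++j) Q += w(j) * f(c(j));   // n point evaluations
  return Q;
}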
In the setting of function approximation by polynomials we learned in Rem. 6.1.18 that an approximation
schemes for any interval could be obtained from an approximation scheme on a single reference interval
([−1, 1] in Rem. 6.1.18) by means of affine pullback, see (6.1.22). A similar affine transformation technique
makes it possible to derive quadrature formula for an arbitrary interval from a single quadrature formula on
a reference interval.
Given: a quadrature formula \big( \hat c_j, \hat w_j \big)_{j=1}^{n} on the reference interval [-1, 1].

The affine mapping Φ : [-1, 1] → [a, b], τ \mapsto t := Φ(τ) := \tfrac{1}{2}(1 - τ)a + \tfrac{1}{2}(τ + 1)b, then yields

\int_a^b f(t)\, dt ≈ \tfrac{1}{2}(b - a) \sum_{j=1}^{n} \hat w_j \hat f(\hat c_j) = \sum_{j=1}^{n} w_j f(c_j) \quad\text{with}\quad c_j = \tfrac{1}{2}(1 - \hat c_j)a + \tfrac{1}{2}(1 + \hat c_j)b ,\; w_j = \tfrac{1}{2}(b - a)\hat w_j .

[Fig. 262: the affine transformation Φ mapping the reference interval [-1, 1] onto [a, b].]

In words, the nodes are just mapped through the affine transformation c_j = Φ(\hat c_j), the weights are scaled by the ratio of lengths of [a, b] and [-1, 1].

A 1D quadrature formula on arbitrary intervals can be specified by providing its weights \hat w_j / nodes \hat c_j for the integration domain [-1, 1] (reference interval). The above transformation is then assumed.
Another common choice for the reference interval: [0, 1], pay attention!
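A minimal sketch of this transformation for a rule given on the reference interval [-1, 1]; the function name transformRule is illustrative.

#include <Eigen/Dense>

// map nodes/weights (chat, what) on [-1,1] to (c, w) on [a,b]:
// c_j = Phi(chat_j), w_j = (b-a)/2 * what_j
void transformRule(const Eigen::VectorXd& chat, const Eigen::VectorXd& what,
                   double a, double b, Eigen::VectorXd& c, Eigen::VectorXd& w) {
  c = (0.5 * (1.0 - chat.array()) * a + 0.5 * (1.0 + chat.array()) * b).matrix();
  w = 0.5 * (b - a) * what;
}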
In many codes families of quadrature rules are used to control the quadrature error. Usually, suitable
sequences of weights wnj and nodes cnj are precomputed and stored in tables up to sufficiently large
values of n. A possible interface could be the following:
struct QuadTab {
  template <typename VecType>
  static void getrule(int n, VecType &c, VecType &w,
                      double a = -1.0, double b = 1.0);
};
Calling the method getrule() fills the vectors c and w with the nodes and the weights for a desired
n-point quadrature on [ a, b] with [−1, 1] being the default reference interval. For VecType we may assume
the basic functionality of Eigen::VectorXd.
Every approximation scheme A : C0 ([ a, b]) → V , V a space of “simple functions” on [ a, b], see § 6.0.5,
gives rise to a method for numerical quadrature according to
\int_a^b f(t)\, dt ≈ Q_{\mathsf{A}}(f) := \int_a^b (\mathsf{A} f)(t)\, dt .   (7.1.8)
As explained in § 6.0.6 every interpolation scheme IT based on the node set T = {t0 , t1 , . . . , tn } ⊂ [ a, b]
(→ § 5.1.4) induces an approximation scheme, and, hence, also a quadrature scheme on [ a, b]:
\int_a^b f(t)\, dt ≈ \int_a^b I_T\big[ f(t_0), ..., f(t_n) \big]^⊤(t)\, dt .   (7.1.9)
Every linear interpolation operator IT according to Def. 5.1.17 spawns a quadrature formula (→
Def. 7.1.1) by (7.1.9).
Hence, we have arrived at an n + 1-point quadrature formula with nodes t j , whose weights are the
integrals of the cardinal interpolants for the interpolation scheme T .
✷
Summing up, we have found:
In general the quadrature formula (7.1.2) will only provide an approximate value for the integral. As in the case of function approximation by interpolation, Section 6.1.2, our focus will be on the asymptotic behavior of the quadrature error as a function of the number n of point evaluations of the integrand, for families of quadrature rules described by
✦ quadrature weights \big\{ w_j^n ,\; j = 1, ..., n \big\}_{n ∈ \mathbb{N}} and
✦ quadrature nodes \big\{ c_j^n ,\; j = 1, ..., n \big\}_{n ∈ \mathbb{N}} .
Bounds for the maximum norm of the approximation error of an approximation scheme directly translate
into estimates of the quadrature error of the induced quadrature scheme (7.1.8):
\Big| \int_a^b f(t)\, dt - Q_{\mathsf{A}}(f) \Big| \le \int_a^b |f(t) - \mathsf{A}(f)(t)|\, dt \le |b - a|\, \|f - \mathsf{A}(f)\|_{L^∞([a,b])} .   (7.1.14)
Hence, the various estimates derived in Section 6.1.2 and Section 6.1.3.2 give us quadrature error esti-
mates “for free”. More details will be given in the next section.
Now we specialize the general recipe of § 7.1.7 for approximation schemes based on global polynomials,
the Lagrange approximation scheme as introduced in Section 6.1, Def. 6.1.32.
The cardinal interpolants for Lagrange interpolation are the Lagrange polynomials (5.2.11)

L_i(t) := \prod_{j=0, j≠i}^{n-1} \frac{t - t_j}{t_i - t_j} ,\; i = 0, ..., n-1 , \qquad\overset{(5.2.13)}{\leadsto}\qquad p_{n-1}(t) = \sum_{i=0}^{n-1} f(t_i) L_i(t) ,

so that

\int_a^b p_{n-1}(t)\, dt = \sum_{i=0}^{n-1} f(t_i) \int_a^b L_i(t)\, dt \qquad\leadsto\qquad \text{nodes } c_i = t_{i-1} ,\;\; \text{weights } w_i := \int_a^b L_{i-1}(t)\, dt .   (7.2.2)
\int_a^b f(t)\, dt ≈ Q_{\mathrm{mp}}(f) = (b - a)\, f\big( \tfrac{1}{2}(a + b) \big) .

✁ The area under the graph of f is approximated by the area of a rectangle of height f(\tfrac{1}{2}(a+b)) (“midpoint rule”).

[Fig. 263: the midpoint rule visualized as the area of a rectangle.]
The n := m + 1-point Newton-Cotes formulas arise from Lagrange interpolation in equidistant nodes
(6.1.34) in the integration interval [ a, b]:
Equidistant quadrature nodes $t_j := a + hj$, $h := \dfrac{b-a}{m}$, $j = 0,\ldots,m$:
The weights for the interval [0, 1] can be found, e.g., by symbolic computation using MAPLE: the following
MAPLE function expects the polynomial degree as input argument.
• n = 2: Trapezoidal rule
> trapez := newtoncotes(1);
$$\widehat{Q}_{\mathrm{trp}}(f) := \tfrac12\big(f(0)+f(1)\big) \tag{7.2.5}$$
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; \frac{b-a}{2}\big(f(a)+f(b)\big)$$
Fig. 264
• n = 3: Simpson rule
> simpson := newtoncotes(2);
$$\frac{h}{6}\Big(f(0)+4f(\tfrac12)+f(1)\Big) \qquad\qquad \int_a^b f(t)\,\mathrm{d}t \;\approx\; \frac{b-a}{6}\Big(f(a)+4f\big(\tfrac{a+b}{2}\big)+f(b)\Big) \tag{7.2.6}$$
• n = 5: Milne rule
> milne := newtoncotes(4);
$$\frac{h}{90}\Big(7f(0)+32f(\tfrac14)+12f(\tfrac12)+32f(\tfrac34)+7f(1)\Big)$$
$$\frac{b-a}{90}\Big(7f(a)+32f\big(a+\tfrac{b-a}{4}\big)+12f\big(a+\tfrac{b-a}{2}\big)+32f\big(a+\tfrac{3(b-a)}{4}\big)+7f(b)\Big)$$
• n = 7: Weddle rule
> weddle := newtoncotes(6);
$$\frac{h}{840}\Big(41f(0)+216f(\tfrac16)+27f(\tfrac13)+272f(\tfrac12)+27f(\tfrac23)+216f(\tfrac56)+41f(1)\Big)$$
• n = 9:
$$\frac{h}{28350}\Big(989f(0)+5888f(\tfrac18)-928f(\tfrac14)+10496f(\tfrac38)-4540f(\tfrac12)+10496f(\tfrac58)-928f(\tfrac34)+5888f(\tfrac78)+989f(1)\Big)$$
! From Ex. 6.1.41 we know that the approximation error incurred by Lagrange interpolation in equidistant nodes can blow up even for analytic functions. This blow-up can also infect the quadrature error of Newton-Cotes formulas for large n, which renders them essentially useless. In addition they will be marred by large (in modulus) and negative weights, which compromises numerical stability (→ Def. 1.5.85).
The considerations of Section 6.1.3 confirmed the superiority of the “optimal” Chebychev nodes (6.1.84)
for globally polynomial Lagrange interpolation. This suggests that we use these nodes also for numerical
quadrature with weights given by (7.2.2). This yields the so-called Clenshaw-Curtis rules with the following
rather desirable property:
The weights wnj , j = 1, . . . , n, for every n-point Clenshaw-Curtis rule are positive.
The weights of any n-point Clenshaw-Curtis rule can be computed with a computational effort of O(n log n)
using FFT.
As a concrete application of § 7.1.13, (7.1.14) we use the L∞ -bound (6.1.50) for Lagrange interpolation
$$\|f-\mathsf{L}_{\mathcal{T}}f\|_{L^\infty(I)} \le \frac{\big\|f^{(n+1)}\big\|_{L^\infty(I)}}{(n+1)!}\,\max_{t\in I}\big|(t-t_0)\cdot\ldots\cdot(t-t_n)\big| \;. \tag{6.1.50}$$
Much sharper estimates for Clenshaw-Curtis rules (→ Rem. 7.2.7) can be inferred from the interpola-
tion error estimate (6.1.88) for Chebychev interpolation. For functions with limited smoothness algebraic
convergence of the quadrature error for Clenshaw-Curtis quadrature follows from (6.1.92). For integrands
that possess an analytic extension to the complex plane in a neighborhood of [ a, b], we can conclude
exponential convergence from (6.1.98).
Supplementary reading. Gauss quadrature is discussed in detail in [?, Ch. 40-41], [?,
Sect.10.3]
How to gauge the “quality” of an n-point quadrature formula Qn without testing it for specific integrands?
The next definition gives an answer.
that is, as the maximal degree +1 of polynomials for which the quadrature rule is guaranteed to be
exact.
First we note a simple consequence of the invariance of the polynomial space Pn under affine pullback,
see Lemma 6.1.21.
An affine transformation of a quadrature rule according to Rem. 7.1.4 does not change its order.
Further, by construction all polynomial n-point quadrature rules possess order at least n.
where Lk , k = 0, . . . , n − 1, is the k-th Lagrange polynomial (5.2.11) associated with the ordered
node set {t1 , t2 , . . . , tn }.
Proof. The conclusion of the theorem is a direct consequence of the facts that
By construction (7.2.2) polynomial n-point quadrature formulas (7.2.1) are exact for f ∈ P_{n−1} ⇒ an n-point polynomial quadrature formula has at least order n.
Thm. 7.3.5 provides a concrete formula for the quadrature weights, which guarantees order n for an n-point quadrature formula. Yet evaluating integrals of Lagrange polynomials may be cumbersome. Here we give a general recipe for finding the weights $w_j$ according to Thm. 7.3.5 without dealing with Lagrange polynomials.
From Def. 7.3.1 we immediately conclude the following procedure: if $p_0,\ldots,p_{n-1}$ is a basis of $\mathcal{P}_{n-1}$, then, thanks to the linearity of the integral and of quadrature formulas,
$$Q_n(p_j) = \int_a^b p_j(t)\,\mathrm{d}t \quad \forall j = 0,\ldots,n-1 \qquad\Longleftrightarrow\qquad Q_n \text{ has order} \ge n\;. \tag{7.3.7}$$
For instance, for the computation of quadrature weights, one may choose the monomial basis p j (t) = t j .
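As an illustration (own sketch, not a lecture code): given the nodes $c_1,\ldots,c_n$ on [a, b], the weights making the rule exact on $\mathcal{P}_{n-1}$ can be computed by solving the linear system obtained from (7.3.7) with the monomial basis.

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Weights w such that sum_k w_k c_k^j = (b^{j+1}-a^{j+1})/(j+1), j = 0,...,n-1
Eigen::VectorXd weightsFromNodes(const Eigen::VectorXd &c, double a, double b) {
  const int n = static_cast<int>(c.size());
  Eigen::MatrixXd V(n, n);  // V(j,k) = c_k^j (transposed Vandermonde matrix)
  Eigen::VectorXd m(n);     // moments of the monomials
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) V(j, k) = std::pow(c(k), j);
    m(j) = (std::pow(b, j + 1) - std::pow(a, j + 1)) / (j + 1);
  }
  return V.fullPivLu().solve(m);  // beware: ill-conditioned for large n
}

int main() {
  Eigen::VectorXd c(3);
  c << 0.0, 0.5, 1.0;  // equidistant nodes on [0,1]
  // prints approximately 1/6, 2/3, 1/6: the Simpson rule (7.2.6)
  std::cout << weightsFromNodes(c, 0.0, 1.0).transpose() << std::endl;
  return 0;
}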
From the order rule for polynomial quadrature rules we immediately conclude the orders of simple representatives.
n   rule                        order
1   midpoint rule               2
2   trapezoidal rule (7.2.5)    2
3   Simpson rule (7.2.6)        4
4   3/8-rule                    4
5   Milne rule                  6
The orders for even n surpass the predictions of Thm. 7.3.5 by 1, which can be verified by straightforward computations; following Def. 7.3.1 check the exactness of the quadrature rule on [0, 1] (this is sufficient → Cor. 7.3.4) for the monomials $\{t\mapsto t^k\}$, $k = 0,\ldots,q-1$, which form a basis of $\mathcal{P}_{q-1}$, where q is the order that is to be confirmed: essentially one has to show
$$Q(\{t\mapsto t^k\}) = \sum_{j=1}^{n} w_j\,c_j^k = \frac{1}{k+1}\,,\qquad k=0,\ldots,q-1\,, \tag{7.3.10}$$
where $Q \;\hat{=}\;$ quadrature rule on $[0,1]$ given by (7.1.2).
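The exactness conditions (7.3.10) can also be checked numerically; a small sketch (own code) applied to the Simpson rule on [0, 1]:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Largest k such that sum_j w_j c_j^l = 1/(l+1) holds (up to roundoff) for all
// l < k, for a rule given by nodes c and weights w on [0,1]; the order of an
// n-point rule cannot exceed 2n, so the loop stops there at the latest.
int quadOrder(const Eigen::VectorXd &c, const Eigen::VectorXd &w) {
  int k = 0;
  while (k <= 2 * c.size()) {
    const double err =
        std::abs((w.array() * c.array().pow(k)).sum() - 1.0 / (k + 1));
    if (err > 1e-12) break;
    ++k;
  }
  return k;
}

int main() {
  Eigen::VectorXd c(3), w(3);
  c << 0.0, 0.5, 1.0;              // Simpson rule (7.2.6) on [0,1]
  w << 1.0 / 6, 4.0 / 6, 1.0 / 6;
  std::cout << "order = " << quadOrder(c, w) << std::endl;  // prints 4
  return 0;
}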
For the Simpson rule (7.2.6) we can also confirm order 4 with symbolic calculations in MAPLE:
$$\mathrm{err} := \frac{1}{90}\,D^{(4)}(f)(0)\,h^5 + O\big(h^6\big)$$
$$q(t) := (t-c_1)^2\cdot\ldots\cdot(t-c_n)^2 \in \mathcal{P}_{2n}\;.$$
Heuristics: A quadrature formula has order m ∈ N already, if it is exact for m polynomials ∈ Pm−1 that
form a basis of Pm−1 (recall Thm. 5.2.2).
An n-point quadrature formula has 2n “degrees of freedom” (n node positions, n weights).
⇓
It might be possible to achieve order $2n = \dim\mathcal{P}_{2n-1}$
(“No. of equations = No. of unknowns”)
Necessary & sufficient conditions for order 4, cf. (7.3.8), integrate the functions of the monomial basis of
P3 exactly:
$$Q_n(p) = \int_a^b p(t)\,\mathrm{d}t \quad \forall p\in\mathcal{P}_3 \qquad\Longleftrightarrow\qquad Q_n(\{t\mapsto t^q\}) = \frac{1}{q+1}\big(b^{q+1}-a^{q+1}\big)\,,\quad q=0,1,2,3\;.$$
4 equations for weights $w_j$ and nodes $c_j$, $j=1,2$ ($a=-1$, $b=1$), cf. Rem. 7.3.6:
$$\int_{-1}^1 1\,\mathrm{d}t = 2 = w_1 + w_2\,,\qquad \int_{-1}^1 t\,\mathrm{d}t = 0 = c_1w_1 + c_2w_2\,,$$
$$\int_{-1}^1 t^2\,\mathrm{d}t = \tfrac23 = c_1^2w_1 + c_2^2w_2\,,\qquad \int_{-1}^1 t^3\,\mathrm{d}t = 0 = c_1^3w_1 + c_2^3w_2\;. \tag{7.3.14}$$
➣ weights & nodes: $\big\{\,w_1 = w_2 = 1\,,\; c_1 = \tfrac13\sqrt3\,,\; c_2 = -\tfrac13\sqrt3\,\big\}$
quadrature formula (order 4):
$$\int_{-1}^1 f(x)\,\mathrm{d}x \;\approx\; f\Big(\tfrac{1}{\sqrt3}\Big) + f\Big(-\tfrac{1}{\sqrt3}\Big) \tag{7.3.15}$$
First we search for necessary conditions that have to be met by the nodes, if an n-point quadrature rule
has order 2n.
$$\int_{-1}^1 \underbrace{q(t)\,\bar P_n(t)}_{\in\mathcal{P}_{2n-1}}\,\mathrm{d}t \;\overset{(7.3.17)}{=}\; \sum_{j=1}^{n} w_j^n\,q(c_j^n)\,\underbrace{\bar P_n(c_j^n)}_{=0} = 0 \;.$$
$$\Rightarrow\quad L^2(]-1,1[)\text{-orthogonality:}\qquad \int_{-1}^1 q(t)\,\bar P_n(t)\,\mathrm{d}t = 0 \quad \forall q\in\mathcal{P}_{n-1}\;. \tag{7.3.18}$$
Hence, A is regular and the coefficients α j are uniquely determined. Thus there is only one n-point
quadrature rule of order n.
The nodes of an n-point quadrature formula of order 2n, if it exists, must coincide with the unique zeros
of the polynomials P̄n ∈ Pn \ {0} satisfying (7.3.18).
Recall: $(f,g)\mapsto \int_a^b f(t)g(t)\,\mathrm{d}t$ is an inner product on $C^0([a,b])$, the $L^2$-inner product, see Rem. 6.2.20, [?, Sect. 4.4, Ex. 2], [?, Ex. 6.5].
➣ As we have seen in Section 6.2.2, abstract techniques for vector spaces with inner product can be
applied to polynomials, for instance Gram-Schmidt orthogonalization, cf. § 6.2.17, [?, Thm. 4.8], [?,
Alg. 6.1].
Now carry out the abstract Gram-Schmidt orthogonalization according to Algorithm (6.2.18) and recall
Thm. 6.2.19: in a vector space V with inner product (·, ·)V orthogonal vectors q0 , q1 , . . . spanning the
same subspaces as the linearly independent vectors v0 , v1 , . . . are constructed recursively via
$$q_{n+1} := v_{n+1} - \sum_{k=0}^{n} \frac{(v_{n+1},q_k)_V}{(q_k,q_k)_V}\,q_k\,,\qquad q_0 := v_0\;. \tag{7.3.20}$$
Note: P̄n has leading coefficient = 1 ⇒ P̄n uniquely defined (up to sign) by (7.3.21).
The considerations so far only reveal necessary conditions on the nodes of an n-point quadrature rule of
order 2n:
They do by no means confirm the existence of such rules, but offer a clear hint on how to construct them:
Proof. Conclude from the orthogonality of the $\bar P_n$ that $\{\bar P_k\}_{k=0}^{n}$ is a basis of $\mathcal{P}_n$ and
$$\int_{-1}^1 h(t)\,\bar P_n(t)\,\mathrm{d}t = 0 \quad \forall h\in\mathcal{P}_{n-1}\;. \tag{7.3.23}$$
Recall division of polynomials with remainder (Euclid’s algorithm → Course “Diskrete Mathematik”): for
any p ∈ P2n−1
$$p(t) = h(t)\,\bar P_n(t) + r(t)\,,\quad \text{for some } h\in\mathcal{P}_{n-1}\,,\; r\in\mathcal{P}_{n-1}\;. \tag{7.3.24}$$
$$\int_{-1}^1 p(t)\,\mathrm{d}t = \underbrace{\int_{-1}^1 h(t)\,\bar P_n(t)\,\mathrm{d}t}_{=0 \text{ by } (7.3.23)} + \int_{-1}^1 r(t)\,\mathrm{d}t \;\overset{(*)}{=}\; \sum_{j=1}^{m} w_j^n\,r(c_j^n)\,, \tag{7.3.25}$$
(∗): by choice of weights according to Rem. 7.3.6 $Q_n$ is exact for polynomials of degree ≤ n − 1!
$$\sum_{j=1}^{m} w_j^n\,p(c_j^n) \;\overset{(7.3.24)}{=}\; \sum_{j=1}^{m} w_j^n\,h(c_j^n)\,\underbrace{\bar P_n(c_j^n)}_{=0} + \sum_{j=1}^{m} w_j^n\,r(c_j^n) \;\overset{(7.3.25)}{=}\; \int_{-1}^1 p(t)\,\mathrm{d}t\;.$$
The family of polynomials { P̄n }n∈N0 are so-called orthogonal polynomials w.r.t. the L2 (] − 1, 1[)-inner
product, see Def. 6.2.25. We have made use of orthogonal polynomials already in Section 6.2.2. L2 ([−1, 1])-
orthogonal polynomials play a key role in analysis.
Legendre polynomials
The $L^2(]-1,1[)$-orthogonal polynomials are those already discussed in Rem. 6.2.34:
Definition 7.3.27. Legendre polynomials
The n-th Legendre polynomial $P_n$ is defined by
• $P_n \in \mathcal{P}_n$,
• $\int_{-1}^{1} P_n(t)\,q(t)\,\mathrm{d}t = 0$ for all $q\in\mathcal{P}_{n-1}$,
• $P_n(1) = 1$.
Fig. 265: Legendre polynomials $P_0,\ldots,P_5$ on $[-1,1]$.
Notice: the polynomials P̄n defined by (7.3.21) and the Legendre polynomials Pn of Def. 7.3.27 (merely)
differ by a constant factor!
Note: the above considerations, recall (7.3.18), show that the nodes of an n-point quadrature formula of
order 2n on [−1, 1] must agree with the zeros of L2 (] − 1, 1[)-orthogonal polynomials.
n-point quadrature formulas of order 2n are unique
We are not done yet: the zeros of P̄n from (7.3.21) may lie outside [−1, 1].
! In principle P̄n could also have less than n real zeros.
Fig. 266: zeros of the Legendre polynomials (= Gauss points) in $[-1,1]$, plotted for increasing number n of quadrature nodes.
Proof. (indirect) Assume that Pn has only m < n zeros ζ 1 , . . . , ζ m in ] − 1, 1[ at which it changes sign.
Define
$$q(t) := \prod_{j=1}^{m}(t-\zeta_j) \quad\Rightarrow\quad q\,P_n \ge 0 \;\text{ or }\; q\,P_n \le 0\;.$$
$$\Rightarrow\quad \int_{-1}^1 q(t)\,P_n(t)\,\mathrm{d}t \neq 0\;.$$
The n-point quadrature formulas whose nodes, the Gauss points, are given by the zeros of the n-th Legendre polynomial (→ Def. 7.3.27), and whose weights are chosen according to Thm. 7.3.5, are called Gauss-Legendre quadrature formulas.
Fig. 267
Proof. Writing $\xi_j^n$, $j=1,\ldots,n$, for the nodes (Gauss points) of the n-point Gauss-Legendre quadrature formula, $n\in\mathbb{N}$, we define
$$q_k(t) = \prod_{\substack{j=1\\ j\neq k}}^{n}\big(t-\xi_j^n\big)^2 \quad\Rightarrow\quad q_k\in\mathcal{P}_{2n-2}\;.$$
$$0 < \int_{-1}^1 q_k(t)\,\mathrm{d}t = w_k^n\,\underbrace{q_k(\xi_k^n)}_{>0}\,,$$
From Thm. 6.2.32 we learn the orthogonal polynomials satisfy the 3-term recursion (6.2.33), see also
(7.3.33). To keep this chapter self-contained we derive it independently for Legendre polynomials.
Note: the polynomials P̄n from (7.3.21) are uniquely characterized by the two properties (try a proof!)
➣ we get the same polynomials P̄n by another Gram-Schmidt orthogonalization procedure, cf. (7.3.20)
and § 6.2.29:
$$\bar P_{n+1}(t) = t\,\bar P_n(t) - \sum_{k=0}^{n} \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_k(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_k(\tau)^2\,\mathrm{d}\tau}\;\bar P_k(t)\;.$$
Since $\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_k(\tau)\,\mathrm{d}\tau = 0$ if $k+1 < n$:
$$\bar P_{n+1}(t) = t\,\bar P_n(t) - \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)^2\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_n(\tau)^2\,\mathrm{d}\tau}\,\bar P_n(t) - \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_{n-1}(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_{n-1}(\tau)^2\,\mathrm{d}\tau}\,\bar P_{n-1}(t)\;. \tag{7.3.32}$$
$$P_{n+1}(t) := \frac{2n+1}{n+1}\,t\,P_n(t) - \frac{n}{n+1}\,P_{n-1}(t)\,,\qquad P_0 := 1\,,\quad P_1(t) := t\;. \tag{7.3.33}$$
Reminder (→ Section 6.1.3.1): we have a similar 3-term recursion (6.1.78) for Chebychev polynomials.
Coincidence? Of course not, nothing in mathematics holds “by accident”. By Thm. 6.2.32 3-term recur-
sions are a distinguishing feature of so-called families of orthogonal polynomials, to which the Chebychev
polynomials belong as well, spawned by Gram-Schmidt orthogonalization with respect to a weighted L2 -
inner product, however, see [?, VI].
➤ Efficient and stable evaluation of Legendre polynomials by means of the 3-term recursion (7.3.33), cf. the analogous algorithm for Chebychev polynomials given in Code 6.1.79.
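A minimal sketch of such an evaluation based on (7.3.33) (own code; the corresponding listing in the lecture repository may differ in details):

#include <Eigen/Dense>

// Values of the Legendre polynomials P_0,...,P_n at the points in x, computed
// by the 3-term recursion (7.3.33); row k of the result holds the values of P_k.
Eigen::MatrixXd legendre(const unsigned n, const Eigen::VectorXd &x) {
  Eigen::MatrixXd L(n + 1, x.size());
  L.row(0).setOnes();                   // P_0 = 1
  if (n > 0) L.row(1) = x.transpose();  // P_1(t) = t
  for (unsigned k = 1; k < n; ++k)      // recursion (7.3.33)
    L.row(k + 1) =
        (2.0 * k + 1.0) / (k + 1.0) * x.transpose().cwiseProduct(L.row(k)) -
        static_cast<double>(k) / (k + 1.0) * L.row(k - 1);
  return L;
}

For instance, legendre(5, Eigen::VectorXd::LinSpaced(100, -1.0, 1.0)) reproduces the curves $P_0,\ldots,P_5$ of Fig. 265.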
There are several efficient ways to find the Gauss points. Here we discuss an intriguing connection with
an eigenvalue problem.
Compute nodes/weights of Gaussian quadrature by solving an eigenvalue problem!
(Golub-Welsch algorithm [?, Sect. 3.5.4], [?, Sect. 1])
In codes Gauss nodes and weights are usually retrieved from tables, cf. Rem. 7.1.6.
Justification: rewrite the 3-term recurrence (7.3.33) for the scaled Legendre polynomials $\widetilde P_n = \frac{1}{\sqrt{n+1/2}}\,P_n$:
$$t\,\widetilde P_n(t) = \underbrace{\frac{n}{\sqrt{4n^2-1}}}_{=:\beta_n}\,\widetilde P_{n-1}(t) + \underbrace{\frac{n+1}{\sqrt{4(n+1)^2-1}}}_{=:\beta_{n+1}}\,\widetilde P_{n+1}(t)\;. \tag{7.3.37}$$
The zeros of Pn can be obtained as the n real eigenvalues of the symmetric tridiagonal matrix
Jn ∈ R n,n !
This matrix Jn is initialized in ??–?? of Code 7.3.36. The computation of the weights in ?? of Code 7.3.36
is explained in [?, Sect. 3.5.4].
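For orientation, a compact sketch of the Golub-Welsch approach using Eigen's SelfAdjointEigenSolver is given below; it assumes the standard construction in which the nodes are the eigenvalues of $J_n$ and the weights are recovered as $w_j = 2\,(v_j)_1^2$ from the first components of the normalized eigenvectors (Code 7.3.36 in the lecture notes may differ in details).

#include <Eigen/Dense>
#include <cmath>
#include <utility>

// Nodes and weights of the n-point Gauss-Legendre rule on [-1,1] via the
// eigen-decomposition of the tridiagonal matrix J_n built from (7.3.37).
std::pair<Eigen::VectorXd, Eigen::VectorXd> gaussrule(const int n) {
  Eigen::MatrixXd J = Eigen::MatrixXd::Zero(n, n);
  for (int k = 1; k < n; ++k) {
    const double beta = k / std::sqrt(4.0 * k * k - 1.0);  // beta_k from (7.3.37)
    J(k, k - 1) = beta;
    J(k - 1, k) = beta;
  }
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(J);
  const Eigen::VectorXd nodes = es.eigenvalues();  // zeros of P_n
  const Eigen::VectorXd weights =
      (2.0 * es.eigenvectors().row(0).transpose().array().square()).matrix();
  return {nodes, weights};
}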
The positivity of the weights wnj for all n-point Gauss-Legendre and Clenshaw-Curtis quadrature rules has
important consequences.
Theorem 7.3.39. Quadrature error estimate for quadrature rules with positive weights
Proof. The proof runs parallel to the derivation of (6.1.61). Writing En ( f ) for the quadrature error, the left
hand side of (7.3.40), we find by the definition Def. 7.3.1 of the order of a quadrature rule
$$E_n(f) = E_n(f-p) \le \left|\int_a^b (f-p)(t)\,\mathrm{d}t\right| + \left|\sum_{j=1}^{n} w_j\,(f-p)(c_j)\right| \tag{7.3.41}$$
$$\le |b-a|\,\|f-p\|_{L^\infty([a,b])} + \sum_{j=1}^{n}|w_j|\,\|f-p\|_{L^\infty([a,b])}\;.$$
Appealing to Thm. 6.1.15 and Rem. 6.1.23, and (6.1.50), the dependence of the constants on the length
of the integration interval can be quantified for integrands with limited smoothness.
Please note the different estimates depending on whether the smoothness of f (as described by r) or the
order of the quadrature rule is the “limiting factor”.
We examine three families of global polynomial (→ Thm. 7.3.5) quadrature rules: Newton-Cotes formulas,
Gauss-Legendre rules, and Clenshaw-Curtis rules. We record the convergence of the quadrature errors
for the interval [0, 1] and two different functions
Fig. 268, Fig. 269: |quadrature error| vs. number of quadrature nodes for equidistant Newton-Cotes quadrature, Chebyshev (Clenshaw-Curtis) quadrature and Gauss quadrature; left (Fig. 268, lin-log scale): quadrature error for $f_1(t) := \frac{1}{1+(15t)^2}$ on $[0,1]$, right (Fig. 269, log-log scale): quadrature error for $f_2(t) := \sqrt{t}$ on $[0,1]$.
Asymptotic behavior of the quadrature error $\epsilon_n := \Big|\int_0^1 f(t)\,\mathrm{d}t - Q_n(f)\Big|$ for “n → ∞”:
➣ exponential convergence $\epsilon_n \approx O(q^n)$, $0 < q < 1$, for the $C^\infty$-integrand $f_1$ ❀ Newton-Cotes quadrature: q ≈ 0.61, Clenshaw-Curtis quadrature: q ≈ 0.40, Gauss-Legendre quadrature: q ≈ 0.27
➣ algebraic convergence $\epsilon_n \approx O(n^{-\alpha})$, $\alpha > 0$, for the integrand $f_2$ with a singularity at t = 0 ❀ Newton-Cotes quadrature: α ≈ 1.8, Clenshaw-Curtis quadrature: α ≈ 2.5, Gauss-Legendre quadrature: α ≈ 2.7
Ex. 7.3.45 teaches us that a lack of smoothness of the integrand can thwart exponential convergence and severely limits the rate of algebraic convergence of a global quadrature rule for n → ∞.
Here is an example:
Z b√
For a general but smooth f ∈ C∞ ([0, b]) compute t f (t) dt via a quadrature rule, e.g., n-point
0
Gauss-Legendre quadrature on [0, b]. Due to the presence of a square-root singularity at t = 0 the direct
application of n-point Gauss-Legendre quadrature will result in a rather slow algebraic convergence of the
quadrature error as n → ∞, see Ex. 7.3.45.
$$\text{substitution } s=\sqrt{t}:\qquad \int_0^b \sqrt{t}\,f(t)\,\mathrm{d}t = \int_0^{\sqrt b} 2s^2\,f(s^2)\,\mathrm{d}s\;. \tag{7.3.47}$$
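A sketch of how this is used in practice (own code; it relies on the gaussrule() helper sketched in the Golub-Welsch paragraph above): instead of the singular integrand $\sqrt{t}\,f(t)$ on [0, b], the smooth transformed integrand $2s^2 f(s^2)$ is fed to Gauss-Legendre quadrature on $[0,\sqrt{b}]$.

#include <cmath>

// n-point Gauss-Legendre approximation of the right-hand side of (7.3.47)
template <class Function>
double sqrtWeightedIntegral(Function &&f, const double b, const int n) {
  const auto rule = gaussrule(n);  // nodes/weights on [-1,1], see sketch above
  const double sb = std::sqrt(b);
  double I = 0.0;
  for (int j = 0; j < n; ++j) {
    // transform node and weight from [-1,1] to [0, sqrt(b)], cf. Rem. 7.1.4
    const double s = 0.5 * sb * (rule.first(j) + 1.0);
    const double w = 0.5 * sb * rule.second(j);
    I += w * 2.0 * s * s * f(s * s);  // transformed integrand 2 s^2 f(s^2)
  }
  return I;
}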
There is one blot on most n-asymptotic estimates obtained from Thm. 7.3.39: the bounds usually in-
volve quantities like norms of higher derivatives of the interpoland that are elusive in general, in particular
for integrands given only in procedural form, see § 7.0.2. Such unknown quantities are often hidden in
“generic constants C”. Can we extract useful information from estimates marred by the presence of such
constants?
For fixed integrand f let us assume sharp algebraic convergence (in n) with rate r ∈ N of the quadrature
error En ( f ) for a family of n-point quadrature rules:
$$E_n(f) = O(n^{-r}) \;\overset{\text{sharp}}{\Longrightarrow}\; E_n(f) \approx C\,n^{-r}\,, \tag{7.3.49}$$
with a “generic constant C > 0” independent of n.
In the case of algebraic convergence with rate $r\in\mathbb{R}$ a reduction of the quadrature error by a factor of ρ is bought by an increase of the number of quadrature points by a factor of $\rho^{1/r}$.
Now assume sharp exponential convergence (in n) of the quadrature error En ( f ) for a family of n-point
quadrature rules, 0 ≤ q < 1:
$$E_n(f) = O(q^n) \;\overset{\text{sharp}}{\Longrightarrow}\; E_n(f) \approx C\,q^n\,, \tag{7.3.51}$$
$$\frac{C\,q^{n_{\mathrm{old}}}}{C\,q^{n_{\mathrm{new}}}} \overset{!}{=} \rho \quad\Longleftrightarrow\quad n_{\mathrm{new}} - n_{\mathrm{old}} = -\frac{\log\rho}{\log q}\;.$$
In the case of exponential convergence (7.3.51) a fixed increase of the number of quadrature points by $-\log\rho/\log q$ results in a reduction of the quadrature error by a factor of ρ > 1.
In Chapter 6, Section 6.5.1 we studied approximation by piecewise polynomial interpolants. A similar idea underlies the so-called composite quadrature rules on an interval $[a,b]$. Analogously to piecewise polynomial techniques they start from a grid/mesh
$$\mathcal{M} := \{a = x_0 < x_1 < \cdots < x_m = b\}$$
and exploit the additivity of the integral:
$$\int_a^b f(t)\,\mathrm{d}t = \sum_{j=1}^{m}\int_{x_{j-1}}^{x_j} f(t)\,\mathrm{d}t\;. \tag{7.4.1}$$
On each mesh interval $[x_{j-1},x_j]$ we then use a local quadrature rule, which may be one of the polynomial quadrature formulas from Section 7.2.
Composite trapezoidal rule (➣ Fig. 270):
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; \tfrac12(x_1-x_0)\,f(a) + \sum_{j=1}^{m-1}\tfrac12(x_{j+1}-x_{j-1})\,f(x_j) + \tfrac12(x_m-x_{m-1})\,f(b)\;. \tag{7.4.4}$$
Composite Simpson rule (➣ Fig. 271):
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; \tfrac16(x_1-x_0)\,f(a) + \sum_{j=1}^{m-1}\tfrac16(x_{j+1}-x_{j-1})\,f(x_j) + \sum_{j=1}^{m}\tfrac23(x_j-x_{j-1})\,f\big(\tfrac12(x_j+x_{j-1})\big) + \tfrac16(x_m-x_{m-1})\,f(b)\;. \tag{7.4.5}$$
Formulas (7.4.4), (7.4.5) directly suggest efficient implementation with minimal number of f -evaluations.
// Composite trapezoidal rule (7.4.4) on N equal subintervals of [a, b]
// (head of the listing reconstructed; only the loop body survives here)
template <class Function>
double trapezoidal(Function &&f, const double a, const double b, const unsigned N) {
  double I = 0;
  const double h = (b - a) / N;  // subinterval length
  for (unsigned i = 0; i < N; ++i) {
    // rule: T = (b - a)/2 * (f(a) + f(b)),
    // apply on N intervals: [a + i*h, a + (i+1)*h], i=0..(N-1)
    I += h / 2 * (f(a + i * h) + f(a + (i + 1) * h));
  }
  return I;
}
// Composite Simpson rule (7.4.5) on N equal subintervals of [a, b]
// (head of the listing reconstructed; only the loop body survives here)
template <class Function>
double simpson(Function &&f, const double a, const double b, const unsigned N) {
  double I = 0;
  const double h = (b - a) / N;  // subinterval length
  for (unsigned i = 0; i < N; ++i) {
    // rule: S = (b - a)/6*( f(a) + 4*f(0.5*(a + b)) + f(b) )
    // apply on [a + i*h, a + (i+1)*h]
    I += h / 6 * (f(a + i * h) + 4 * f(a + (i + 0.5) * h) + f(a + (i + 1) * h));
  }
  return I;
}
In both cases the function object passed in f must provide an evaluation operator double operator()(double) const.
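A possible driver for the two routines above (using the reconstructed signatures) could look as follows; it also illustrates the limited convergence rates for the non-smooth integrand $f_2(t) = \sqrt{t}$.

#include <cmath>
#include <iostream>

int main() {
  auto f = [](double t) { return std::sqrt(t); };
  const double exact = 2.0 / 3.0;  // int_0^1 sqrt(t) dt
  for (unsigned N = 10; N <= 1000; N *= 10) {
    std::cout << "N = " << N << ": trapezoidal error = "
              << std::abs(trapezoidal(f, 0.0, 1.0, N) - exact)
              << ", Simpson error = "
              << std::abs(simpson(f, 0.0, 1.0, N) - exact) << std::endl;
  }
  return 0;
}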
Composite quadrature scheme based on local polynomial quadrature can usually be understood as “quadra-
ture by approximation schemes” as explained in § 7.1.7. The underlying approximation schemes belong
to the class of general local Lagrangian interpolation schemes introduced in Section 6.5.1.
In other words, many composite quadrature schemes arise from replacing the integrand by a piecewise
interpolating polynomial, see Fig. 270 and Fig. 271 and compare with Fig. 250.
To see the main rationale behind the use of composite quadrature rules recall Lemma 7.3.42: for a poly-
nomial quadrature rule (7.2.1) of order q with positive weights and f ∈ Cr ([ a, b]) the quadrature error
shrinks with the min{r, q} + 1-st power of the length |b − a| of the integration domain! Hence, applying
polynomial quadrature rules to small mesh intervals should lead to a small overall quadrature error.
Assume a composite quadrature rule Q on $[x_0,x_m] = [a,b]$, $b > a$, based on $n_j$-point local quadrature rules $Q_{n_j}^j$ with positive weights (e.g. local Gauss-Legendre quadrature rules or local Clenshaw-Curtis quadrature rules) and of fixed orders $q_j\in\mathbb{N}$ on each mesh interval $[x_{j-1},x_j]$. From Lemma 7.3.42 recall the estimate for $f\in C^r([x_{j-1},x_j])$
$$\left|\int_{x_{j-1}}^{x_j} f(t)\,\mathrm{d}t - Q_{n_j}^j(f)\right| \le C\,|x_j-x_{j-1}|^{\min\{r,q_j\}+1}\,\big\|f^{(\min\{r,q_j\})}\big\|_{L^\infty([x_{j-1},x_j])}\;. \tag{7.2.10}$$
For $f\in C^r([a,b])$, summing up these bounds we get for the global quadrature error
$$\left|\int_{x_0}^{x_m} f(t)\,\mathrm{d}t - Q(f)\right| \le C\sum_{j=1}^{m} h_j^{\min\{r,q_j\}+1}\,\big\|f^{(\min\{r,q_j\})}\big\|_{L^\infty([x_{j-1},x_j])}\,,
As with polynomial quadrature rules, we study the asymptotic behavior of the quadrature error for families
of composite quadrature rules as a function on the total number n of function evaluations.
As in the case of M-piecewise polynomial approximation of function (→ Section 6.5.1) families of com-
posite quadrature rules can be generated in two different ways:
(I) use a sequence of successively refined meshes $\big(\mathcal{M}_k = \{x_j^k\}_j\big)_{k\in\mathbb{N}}$ with $\sharp\mathcal{M}_k = m(k)+1$, $m(k)\to\infty$ for $k\to\infty$, combined with the same (transformed, → Rem. 7.1.4) local quadrature rule on all mesh intervals $[x_{j-1}^k,x_j^k]$. Examples are the composite trapezoidal rule and composite Simpson rule from Ex. 7.4.3 on sequences of equidistant meshes.
➣ h-convergence
(II) On a fixed mesh $\mathcal{M} = \{x_j\}_{j=0}^{m}$, on each cell use the same (transformed) local quadrature rule taken from a sequence of polynomial quadrature rules of increasing order.
➣ p-convergence
• trapezoidal rule (7.2.5) ➣ local order 2 (exact for linear functions, see Ex. 7.3.9),
• Simpson rule (7.2.6) ➣ local order 4 (exact for cubic polynomials, see Ex. 7.3.9)
on the equidistant mesh $\mathcal{M} := \{jh\}_{j=0}^{n}$, $h = 1/n$, $n\in\mathbb{N}$.
Fig. 272, Fig. 273: |quadrature error| vs. meshwidth (log-log scale) for the trapezoidal rule and the Simpson rule, with reference slopes $O(h^2)$ and $O(h^4)$; left (Fig. 272): numerical quadrature of $f_1(t) = \frac{1}{1+(5t)^2}$ on $[0,1]$, right (Fig. 273): numerical quadrature of $f_2(t) = \sqrt{t}$ on $[0,1]$.
Asymptotic behavior of the quadrature error $E(n) := \Big|\int_0^1 f(t)\,\mathrm{d}t - Q_n(f)\Big|$ for meshwidth “h → 0”:
• a family of composite quadrature rules based on a single local ℓ-point rule (with positive weights) of order q on a sequence of equidistant meshes $\big(\mathcal{M}_k = \{x_j^k\}_j\big)_{k\in\mathbb{N}}$,
• the family of Gauss-Legendre quadrature rules from Def. 7.3.29.
We study the asymptotic dependence of the quadrature error on the number n of function evaluations.
The quadrature errors EnGL ( f ) of the n-point Gauss-Legendre quadrature rules are given in Lemma 7.3.42,
(7.3.43):
Gauss-Legendre quadrature converges at least as fast as fixed-order composite quadrature on equidistant meshes.
Moreover, Gauss-Legendre quadrature “automatically detects” the smoothness of the integrand, and en-
joys fast exponential convergence for analytic integrands.
Use Gauss-Legendre quadrature instead of fixed-order composite quadrature on equidistant meshes.
Sometimes there are surprises: Now we will witness a convergence behavior of a composite quadrature
rule that is much better than predicted by the order of the local quadrature formula.
We consider the equidistant trapezoidal rule (order 2), see (7.4.4), Code 7.4.6
$$\int_a^b f(t)\,\mathrm{d}t \;\approx\; T_m(f) := h\Big(\tfrac12 f(a) + \sum_{k=1}^{m-1} f(kh) + \tfrac12 f(b)\Big)\,,\qquad h := \frac{b-a}{m}\;. \tag{7.4.17}$$
Fig. 274, Fig. 275: |quadrature error| vs. number of quadrature nodes for the parameter values a = 0.5, 0.9, 0.95, 0.99 (left: lin-log scale, right: log-log scale).
In this § we use I := [0, 1[ as a reference interval, cf. Exp. 7.4.16. We rely on similar techniques as in
Section 5.6, Section 5.6.2. Again, a key tool will be the bijective mapping, see Fig. 203,
If $f\in C^r(\mathbb{R})$ and 1-periodic, then $(\Phi_{S^1}^{-1})^* f \in C^r(S^1)$. Further, $\Phi_{S^1}$ maps equidistant nodes on $I := [0,1]$ to equispaced nodes on $S^1$, which are the roots of unity:
$$\Phi_{S^1}\big(\tfrac{j}{n}\big) = \exp\big(2\pi\imath\tfrac{j}{n}\big) \qquad \big[\,\exp\big(2\pi\imath\tfrac{j}{n}\big)^n = 1\,\big]\;. \tag{7.4.19}$$
Now consider an n-point polynomial quadrature rule on $S^1$ based on the set of equidistant nodes $\mathcal{Z} := \big\{z_j := \exp\big(2\pi\imath\tfrac{j-1}{n}\big),\ j=1,\ldots,n\big\}$ and defined as
$$Q_n^{S^1}(g) := \int_{S^1} \mathsf{L}_{\mathcal{Z}}\,g(\tau)\,\mathrm{d}S(\tau) = \sum_{j=1}^{n} w_j^{S^1}\,g(z_j)\,, \tag{7.4.20}$$
where $\mathsf{L}_{\mathcal{Z}}$ is the Lagrange interpolation operator (→ Def. 6.1.32). This means that the weights obey Thm. 7.3.5, where the definition (5.2.11) of Lagrange polynomials remains the same for complex nodes.
By sheer symmetry, all the weights have to be the same, which, since the rule will be at least of order 1, means
$$w_j^{S^1} = \frac{2\pi}{n}\,,\qquad j = 1,\ldots,n\;.$$
Moreover, the quadrature rule $Q_n^{S^1}$ will be of order n, see Def. 7.3.1, that is, it will integrate polynomials of degree ≤ n − 1 exactly.
By transformation (→ Rem. 7.1.4) and pullback (7.4.18), $Q_n^{S^1}$ induces a quadrature rule on $I := [0,1]$ by
$$Q_n^{I}(f) := \frac{1}{2\pi}\,Q_n^{S^1}\big((\Phi_{S^1}^{-1})^* f\big) = \frac{1}{2\pi}\sum_{j=1}^{n} w_j^{S^1}\,f\big(\Phi_{S^1}^{-1}(z_j)\big) = \sum_{j=1}^{n}\tfrac1n\,f\big(\tfrac{j-1}{n}\big)\;. \tag{7.4.21}$$
This is exactly the equidistant trapezoidal rule (7.4.17), if f is 1-periodic, f(0) = f(1): $Q_n^{I} = T_n$. Hence we arrive at the following estimate for the quadrature error
$$E_n(f) := \left|\int_0^1 f(t)\,\mathrm{d}t - T_n(f)\right| \le 2\pi\,\max_{z\in S^1}\Big|(\Phi_{S^1}^{-1})^* f(z) - \mathsf{L}_{\mathcal{Z}}(\Phi_{S^1}^{-1})^* f(z)\Big|\;.$$
Equivalently, one can show that $T_n$ integrates trigonometric polynomials up to degree 2n − 1 exactly: for $f(t) = e^{2\pi\imath kt}$,
$$\int_0^1 f(t)\,\mathrm{d}t = \begin{cases} 0\,, & \text{if } k\neq 0\,,\\ 1\,, & \text{if } k = 0\,,\end{cases}
\qquad
T_n(f) = \frac1n\sum_{l=0}^{n-1} e^{2\pi\imath\frac{lk}{n}} \overset{(4.2.8)}{=} \begin{cases} 0\,, & \text{if } k\notin n\mathbb{Z}\,,\\ 1\,, & \text{if } k\in n\mathbb{Z}\,.\end{cases}$$
Recall from Section 4.2.5: recovery of signal (yk )k∈Z from its Fourier transform c(t)
$$y_j = \int_0^1 c(t)\,\exp(2\pi i j t)\,\mathrm{d}t\;. \tag{4.2.79}$$
We distinguish
(I) a priori adaptive quadrature: the nodes are fixed before the evaluation of the quadrature formula, taking into account external information about f, and
(II) a posteriori adaptive quadrature: the node positions are chosen or improved based on infor-
mation gleaned during the computation inside a loop. It terminates when sufficient accuracy
has been reached.
In this section we will chiefly discuss a posteriori adaptive quadrature for composite quadrature rules (→
Section 7.4) based on a single local quadrature rule (and its transformation).
This example presents an extreme case. We consider the composite trapezoidal rule (7.4.4) on a mesh $\mathcal{M} := \{a = x_0 < x_1 < \cdots < x_m = b\}$ and the spike-like integrand $f(t) = \frac{1}{10^{-4}+t^2}$ on $[-1,1]$, see Fig. 276. ✄
Intuition: quadrature nodes should cluster around 0, whereas hardly any are needed close to the endpoints of the integration interval, where the function varies only slowly.
Fig. 276: graph of $f(t) = \frac{1}{10^{-4}+t^2}$.
A quantitative justification can appeal to (7.2.10) and the resulting bound for the local quadrature error (for $f\in C^2([a,b])$):
$$\left|\int_{x_{k-1}}^{x_k} f(t)\,\mathrm{d}t - \tfrac12 h_k\big(f(x_{k-1})+f(x_k)\big)\right| \le h_k^3\,\big\|f''\big\|_{L^\infty([x_{k-1},x_k])}\,,\qquad h_k := x_k - x_{k-1}\;. \tag{7.5.3}$$
The ultimate but elusive goal is to find a mesh with a minimal number of cells that just delivers a quadrature
error below a prescribed threshold. A more practical goal is to adjust the local meshwidths hk := xk − xk−1
in order to achieve a minimal sum of local error bounds. This leads to the constrained minimization
problem:
$$\sum_{k=1}^{m} h_k^3\,\big\|f''\big\|_{L^\infty([x_{k-1},x_k])} \;\to\; \min \qquad \text{s.t.}\qquad \sum_{k=1}^{m} h_k = b-a\;. \tag{7.5.5}$$
Lemma 7.5.6.
Let $f:\mathbb{R}_0^+\to\mathbb{R}_0^+$ be a convex function with $f(0)=0$ and $x>0$. Then the constrained minimization problem: seek $\zeta_1,\ldots,\zeta_m\in\mathbb{R}_0^+$ such that
$$\sum_{k=1}^{m} f(\zeta_k) \to \min \quad\text{and}\quad \sum_{k=1}^{m}\zeta_k = x\,, \tag{7.5.7}$$
has the solution $\zeta_1 = \zeta_2 = \cdots = \zeta_m = \frac{x}{m}$.
This means that we should strive for equal bounds $h_k^3\,\|f''\|_{L^\infty([x_{k-1},x_k])}$ for all mesh cells.
The mesh for a posteriori adaptive composite numerical quadrature should be chosen to achieve
equal contributions of all mesh intervals to the quadrature error
As indicated above, guided by the equidistribution principle, the improvement of the mesh will be done gradually in an iteration. The change of the mesh in each step is called mesh adaptation and there are two fundamentally different ways to do it:
(I) by moving nodes, keeping their total number, but making them cluster where mesh intervals should
be small, or
(II) by adding nodes, where mesh intervals should be small (mesh refinement).
Algorithms for a posteriori adaptive quadrature based on mesh refinement usually have the following
structure:
(1) ESTIMATE: based on available information compute an approximation for the quadrature error
on every mesh interval.
(2) CHECK TERMINATION: if the total error is sufficiently small → STOP
(3) MARK: single out mesh intervals with the largest or above average error contributions.
(4) REFINE: add node(s) inside the marked mesh intervals. GOTO (1)
We now see a concrete algorithm based on the two composite quadrature rules introduced in Ex. 7.4.3.
Idea: local error estimation by comparing local results of two quadrature formu-
las Q1 , Q2 of different order → local error estimates
❶ (Error estimation)
$$\mathrm{EST}_k := \underbrace{\tfrac{h_k}{6}\big(f(x_{k-1})+4f(p_k)+f(x_k)\big)}_{\text{Simpson rule}} - \underbrace{\tfrac{h_k}{4}\big(f(x_{k-1})+2f(p_k)+f(x_k)\big)}_{\text{trapezoidal rule on split mesh interval}}\;. \tag{7.5.11}$$
❷ (Check termination)
❷ (Check termination)
Simpson rule on $\mathcal{M}$ ⇒ intermediate approximation $I \approx \int_a^b f(t)\,\mathrm{d}t$
$$\text{If}\quad \sum_{k=1}^{m}\mathrm{EST}_k \le \mathrm{RTOL}\cdot I \quad (\mathrm{RTOL} := \text{prescribed relative tolerance}) \quad\Rightarrow\quad \text{STOP} \tag{7.5.12}$$
❸ (Marking)
Marked intervals: $\mathcal{S} := \big\{k\in\{1,\ldots,m\}:\ \mathrm{EST}_k \ge \eta\cdot\tfrac1m\sum_{j=1}^{m}\mathrm{EST}_j\big\}\,,\quad \eta\approx 0.9\;. \tag{7.5.13}$
❹ (Local mesh refinement)
new mesh: $\mathcal{M}^* := \mathcal{M}\cup\big\{p_k := \tfrac12(x_{k-1}+x_k):\ k\in\mathcal{S}\big\}\;. \tag{7.5.14}$
Then continue with step ❶ and mesh M ← M∗ .
• Arguments: f $\hat{=}$ handle to the function f, M $\hat{=}$ initial mesh, rtol $\hat{=}$ relative tolerance for termination, atol $\hat{=}$ absolute tolerance for termination, necessary in case the exact integral value = 0, which renders a relative tolerance meaningless.
• Line 20: the difference of the values obtained from the local composite trapezoidal rule (∼ Q1) and the local Simpson rule (∼ Q2) is used as an estimate for the local quadrature error.
• Line 22: estimate for global error by summing up moduli of local error contributions,
• Line 26: terminate, once the estimated total error is below the relative or absolute error threshold,
• Line 43 otherwise, add midpoints of mesh intervals with large error contributions according to
(7.5.14) to the mesh and continue.
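For orientation, the following is a minimal sketch of an adaptquad routine implementing steps ❶–❹ of § 7.5.10; it is not the lecture's Code 7.5.15 (whose line numbers are quoted above), but it follows the same structure, matches the calling convention of the driver below, and contains no safeguard against excessive refinement.

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

using Eigen::VectorXd;

// Adaptive composite quadrature of f on the mesh M (sorted node vector),
// cf. § 7.5.10; returns the Simpson approximation on the final mesh.
template <class Function>
double adaptquad(Function &&f, VectorXd M, const double rtol, const double atol) {
  while (true) {
    const unsigned m = M.size() - 1;  // number of mesh intervals
    VectorXd est_loc(m);              // local error estimates (7.5.11)
    double I = 0.0;                   // Simpson value on the current mesh
    for (unsigned k = 0; k < m; ++k) {
      const double h = M(k + 1) - M(k), p = 0.5 * (M(k) + M(k + 1));
      const double simp = h / 6.0 * (f(M(k)) + 4.0 * f(p) + f(M(k + 1)));
      const double trap = h / 4.0 * (f(M(k)) + 2.0 * f(p) + f(M(k + 1)));
      I += simp;
      est_loc(k) = std::abs(simp - trap);
    }
    const double est_tot = est_loc.sum();
    if (est_tot <= rtol * std::abs(I) || est_tot <= atol) return I;  // (7.5.12)
    // mark intervals with above-average error contribution, (7.5.13), eta = 0.9
    const double threshold = 0.9 * est_tot / m;
    std::vector<double> nodes(M.data(), M.data() + M.size());
    for (unsigned k = 0; k < m; ++k)
      if (est_loc(k) >= threshold)
        nodes.push_back(0.5 * (M(k) + M(k + 1)));  // add midpoint, (7.5.14)
    std::sort(nodes.begin(), nodes.end());
    M = Eigen::Map<VectorXd>(nodes.data(), nodes.size());
  }
}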
int main() {
  auto f = [](double x) { return std::exp(-x * x); };
  VectorXd M(4);
  M << -100, 0.1, 0.5, 100;
  std::cout << "Sqrt(Pi) - Int_{-100}^{100} exp(-x*x) dx = ";
  std::cout << adaptquad(f, M, 1e-10, 1e-12) - std::sqrt(M_PI) << "\n";
  return 0;
}
In Code 7.5.15 we use the higher order quadrature rule, the Simpson rule of order 4, to compute an ap-
proximate value for the integral. This is reasonable, because it would be foolish not to use this information
after we have collected it for the sake of error estimation.
Yet, according to our heuristics, what est_loc and est_tot give us are estimates for the error of the
second-order trapezoidal rule, which we do not use for the actual computations.
est_loc gives useful (for the sake of mesh refinement) information about the distribution of
the error of the Simpson rule, though it fails to capture its size.
In this numerical test we investigate whether the adaptive technique from § 7.5.10 produces an appropriate
distribution of integration nodes. We do this for different functions.
✦ approximate $\int_0^1 \exp(6\sin(2\pi t))\,\mathrm{d}t$, initial mesh $\mathcal{M}_0 = \{j/10\}_{j=0}^{10}$
Algorithm: adaptive quadrature, Code 7.5.15 with tolerances rtol = 10⁻⁶, abstol = 10⁻¹⁰
We monitor the distribution of quadrature points during the adaptive quadrature and the true and estimated quadrature errors. The “exact” value for the integral is computed by the composite Simpson rule on an equidistant mesh with 10⁷ intervals.
Fig. 277: distribution of the quadrature points (plotted over the quadrature levels), together with the integrand f; Fig. 278: exact and estimated quadrature errors vs. number of quadrature points (lin-log scale).
✦ approximate $\int_0^1 \min\{\exp(6\sin(2\pi t)),\,100\}\,\mathrm{d}t$, initial mesh as above
Fig. 279: distribution of the quadrature points (plotted over the quadrature levels), together with the integrand f; Fig. 280: exact and estimated quadrature errors vs. number of quadrature points (lin-log scale).
Observation:
• Adaptive quadrature locally decreases meshwidth where integrand features variations or kinks.
Learning Outcomes
✦ You should know what a quadrature formula is and the terminology connected with it,
✦ You should be able to transform quadrature formulas to arbitrary intervals.
✦ You should understand how interpolation and approximation schemes spawn quadrature formulas and how quadrature errors are connected to interpolation/approximation errors.
✦ You should remember the maximal and minimal order of polynomial quadrature rules.
✦ You should know the order of the n-point Gauss-Legendre quadrature rule.
✦ You should understand why Gauss-Legendre quadrature converges exponentially for integrands that can be extended analytically, and algebraically for integrands with limited smoothness.
✦ You should be able to apply regularizing transformations to integrals with non-smooth integrands.
✦ You should know about asymptotic convergence of the h-version of composite quadrature.
✦ You should know the principles of adaptive composite quadrature.
Non-linear systems naturally arise in mathematical models of electrical circuits, once non-linear circuit
elements are introduced. This generalizes Ex. 2.1.3, where the current-voltage relationship for all circuit
elements was the simple proportionality (2.1.5) (of the complex amplitudes U and I ).
As an example we consider the
U+
Schmitt trigger circuit ✄
Its key non-linear circuit element is the NPN bipolar R3 R4
R1
junction transistor:
collector ➀ ➃
Rb
➂
➄ ➁
Uout
base Uin
Re R2
Fig. 281
emitter
A transistor has three ports: emitter, collector, and base. Transistor models give the port currents as
functions of the applied voltages, for instance the Ebers-Moll model (large signal approximation):
$$I_C = I_S\Big(e^{\frac{U_{BE}}{U_T}} - e^{\frac{U_{BC}}{U_T}}\Big) - \frac{I_S}{\beta_R}\Big(e^{\frac{U_{BC}}{U_T}} - 1\Big) = I_C(U_{BE},U_{BC})\,,$$
$$I_B = \frac{I_S}{\beta_F}\Big(e^{\frac{U_{BE}}{U_T}} - 1\Big) + \frac{I_S}{\beta_R}\Big(e^{\frac{U_{BC}}{U_T}} - 1\Big) = I_B(U_{BE},U_{BC})\,, \tag{8.0.2}$$
$$I_E = I_S\Big(e^{\frac{U_{BE}}{U_T}} - e^{\frac{U_{BC}}{U_T}}\Big) + \frac{I_S}{\beta_F}\Big(e^{\frac{U_{BE}}{U_T}} - 1\Big) = I_E(U_{BE},U_{BC})\,.$$
IC , IB , IE : current in collector/base/emitter,
UBE , UBC : potential drop between base-emitter, base-collector.
The parameters have the following meanings: β F is the forward common emitter current gain (20 to 500),
β R is the reverse common emitter current gain (0 to 20), IS is the reverse saturation current (on the order
of 10−15 to 10−12 amperes), UT is the thermal voltage (approximately 26 mV at 300 K).
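For later numerical experiments the port currents (8.0.2) are conveniently coded as plain functions of the two voltage drops; the following sketch (own code) uses placeholder parameter values within the ranges quoted above.

#include <cmath>

// Ebers-Moll large-signal model (8.0.2); parameter values are placeholders.
struct Transistor {
  double betaF = 100.0;  // forward common emitter current gain
  double betaR = 5.0;    // reverse common emitter current gain
  double Is = 1e-13;     // reverse saturation current [A]
  double Ut = 0.026;     // thermal voltage [V]

  double Ic(double Ube, double Ubc) const {
    return Is * (std::exp(Ube / Ut) - std::exp(Ubc / Ut)) -
           Is / betaR * (std::exp(Ubc / Ut) - 1.0);
  }
  double Ib(double Ube, double Ubc) const {
    return Is / betaF * (std::exp(Ube / Ut) - 1.0) +
           Is / betaR * (std::exp(Ubc / Ut) - 1.0);
  }
  double Ie(double Ube, double Ubc) const {
    return Is * (std::exp(Ube / Ut) - std::exp(Ubc / Ut)) +
           Is / betaF * (std::exp(Ube / Ut) - 1.0);
  }
};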
The circuit of Fig. 281 has 5 nodes ➀–➄ with unknown nodal potentials. Kirchhoff's law (2.1.4) plus the constitutive relations gives an equation for each of them.
Non-linear system of equations from nodal analysis, static case (→ Ex. 2.1.3):
5 equations ↔ 5 unknowns $U_1, U_2, U_3, U_4, U_5$
A non-linear system of equations is a concept almost too abstract to be useful, because it covers an extremely wide variety of problems. Nevertheless in this chapter we will mainly look at “generic” methods for such systems. This means that every method discussed may take a good deal of fine-tuning before it will really perform satisfactorily for a given non-linear system of equations.
Here, D is the domain of definition of the function F, which cannot be evaluated for x 6∈ D.
In contrast to the situation for linear systems of equations (→ Thm. 2.2.4), the class of non-linear systems
is far too big to allow a general theory:
Contents
8.1 Iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
8.1.1 Speed of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
8.1.2 Termination criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
Gaussian elimination (→ Section 2.3) provides an algorithm that, if carried out in exact arithmetic (no
roundoff errors), computes the solution of a linear system of equations with a finite number of elementary
operations. However, linear systems of equations represent an exceptional case, because it is hardly ever
possible to solve general systems of non-linear equations using only finitely many elementary operations.
All methods for general non-linear systems of equations are iterative in the sense that they will usually
yield only approximate solutions whenever they terminate after finite time.
An iterative method for (approximately) solving the non-linear equation F(x) = 0 is an algorithm generating an arbitrarily long sequence $\big(x^{(k)}\big)_k$ of approximate solutions.
$x^{(k)} \;\hat{=}\;$ k-th iterate, $x^{(0)} \;\hat{=}\;$ initial guess.
Fig. 282: iterates $x^{(0)}, x^{(1)}, x^{(2)},\ldots$ produced by the iteration function Φ inside the domain D, approaching the solution $x^*$.
All the iterative methods discussed below fall in the class of (stationary) m-point, m ∈ N, iterative meth-
ods, for which the iterate x(k) depends on F and the m most recent iterates x(k−1) , . . . , x(k−m) , e.g.,
$$x^{(k)} = \Phi_F\big(x^{(k-1)},\ldots,x^{(k-m)}\big)\,, \tag{8.1.4}$$
where $\Phi_F \;\hat{=}\;$ iteration function for the m-point method.
When applying an iterative method to solve a non-linear system of equations F(x) = 0, the following
issues arise:
✦ Speed of convergence: How “fast” does $\|x^{(k)}-x^*\|$ ($\|\cdot\|$ a suitable norm on $\mathbb{R}^N$) decrease for increasing k?
An iterative method converges (for fixed initial guess(es)) :⇔ $x^{(k)}\xrightarrow{k\to\infty}x^*$ and $F(x^*)=0$.
A stationary m-point iterative method is consistent with the non-linear system of equations F(x) = 0
:⇔ $\Phi_F(x^*,\ldots,x^*) = x^* \;\Leftrightarrow\; F(x^*) = 0$.
For a consistent stationary iterative method we can study the error of the iterates $x^{(k)}$, defined as $e^{(k)} := x^{(k)} - x^*$.
Unfortunately, convergence may critically depend on the choice of initial guesses. The property defined
next weakens this dependence:
Fig. 283
Our goal: Given a non-linear system of equations, find iterative methods that converge (locally) to a
solution of F(x) = 0.
Two general questions: How to measure, describe, and predict the speed of convergence?
When to terminate the iteration?
Here and in the sequel, $\|\cdot\|$ designates a generic vector norm on $\mathbb{R}^n$, see Def. 1.5.70. Any occurring matrix norm is induced by this vector norm, see Def. 1.5.76.
It is important to be aware which statements depend on the choice of norm and which do not!
$$\exists\, 0 < L < 1:\qquad \big\|x^{(k+1)}-x^*\big\| \le L\,\big\|x^{(k)}-x^*\big\| \quad \forall k\in\mathbb{N}_0\;.$$
If dim V < ∞ all norms (→ Def. 1.5.70) on V are equivalent (→ Def. 8.1.11).
Often we will study the behavior of a consistent iterative method for a model problem in numerical experiments and measure the norms of the iteration errors $e^{(k)} := x^{(k)} - x^*$. How can we tell that the method enjoys linear convergence?
Norms of iteration errors ∼ straight line in a lin-log plot (Fig. 284):
$$\big\|e^{(k)}\big\| \le L^k\,\big\|e^{(0)}\big\|\;.$$
Let us abbreviate the error norm in step k by $\epsilon_k := \|x^{(k)}-x^*\|$. In the case of linear convergence (see Def. 8.1.9) assume (with 0 < L < 1)
$$\epsilon_{k+1}\approx L\,\epsilon_k \;\Rightarrow\; \log\epsilon_{k+1}\approx \log L + \log\epsilon_k \;\Rightarrow\; \log\epsilon_k \approx k\log L + \log\epsilon_0\;. \tag{8.1.14}$$
We conclude that log L < 0 determines the slope of the graph in the lin-log error chart.
Related: guessing time complexity O(nα ) of an algorithm from measurements, see § 1.4.9.
Note the green dots • in Fig. 284: Any “faster” convergence also qualifies as linear convergence in the strict
sense of the definition. However, whenever this term is used, we tacitly imply, that no “faster convergence”
prevails.
// Iterates of x^(k+1) = x^(k) + (cos(x^(k)) + 1)/sin(x^(k)); records errors and
// error quotients (head of the listing lost; signature reconstructed)
void fixedPointExperiment(double x, int N, Eigen::VectorXd &err, Eigen::VectorXd &rates) {
  Eigen::VectorXd y(N);  // stores the iterates x^(1),...,x^(N)
  for (int i = 0; i < N; ++i) {
    x = x + (std::cos(x) + 1) / std::sin(x);
    y(i) = x;
  }
  err.resize(N); rates.resize(N);
  err = y - Eigen::VectorXd::Constant(N, x);  // last iterate used in place of x*
  rates = err.bottomRows(N - 1).cwiseQuotient(err.topRows(N - 1));
}
$$x^{(k+1)} = x^{(k)} + \frac{\cos x^{(k)} + 1}{\sin x^{(k)}}\;.$$
In the C++ code (✄) x has to be initialized with the different values for $x_0$.
Fig. 285: iteration errors vs. index of iterate (lin-log scale), → Rem. 8.1.13.
There are notions of convergence that guarantee a much faster (asymptotic) decay of norm of the iteration error
than linear convergence from Def. 8.1.9.
Definition 8.1.17. Order of convergence → [?, Sect. 17.2], [?, Def. 5.14], [?, Def. 6.1]
Of course, the order p of convergence of an iterative method refers to the largest possible p in the def-
inition, that is, the error estimate will in general not hold, if p is replaced with p + ǫ for any ǫ > 0, cf.
Rem. 1.4.6.
✁ Qualitative error graphs for convergence of order p = 1.1, 1.2, 1.4, 1.7, 2: iteration error vs. index k of iterates (lin-log scale).
$$\epsilon_{k+1}\approx C\,\epsilon_k^{\,p} \;\Rightarrow\; \log\epsilon_{k+1} = \log C + p\log\epsilon_k \;\Rightarrow\; \log\epsilon_{k+1} = \log C\sum_{l=0}^{k} p^l + p^{k+1}\log\epsilon_0$$
$$\Rightarrow\quad \log\epsilon_{k+1} = -\frac{\log C}{p-1} + \Big(\frac{\log C}{p-1} + \log\epsilon_0\Big)\,p^{k+1}\;.$$
In this case, the error graph is a concave power curve (for sufficiently small $\epsilon_0$!).
How to guess the order of convergence (→ Def. 8.1.17) from tabulated error norms measured in a numer-
ical experiment?
➣ monitor the quotients $(\log\epsilon_{k+1}-\log\epsilon_k)/(\log\epsilon_k-\log\epsilon_{k-1})$ over several steps of the iteration.
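In a small sketch (own code) this amounts to:

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // error norms eps_k from some experiment (made-up sample values, roughly
  // consistent with quadratic convergence)
  const std::vector<double> eps = {1e-1, 5e-3, 2e-5, 5e-10, 3e-19};
  for (std::size_t k = 1; k + 1 < eps.size(); ++k) {
    const double p = (std::log(eps[k + 1]) - std::log(eps[k])) /
                     (std::log(eps[k]) - std::log(eps[k - 1]));
    std::cout << "estimated order after step " << k << ": " << p << std::endl;
  }
  return 0;
}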
From your analysis course [?, Bsp. 3.3.2(iii)] recall the famous iteration for computing $\sqrt a$, a > 0:
$$x^{(k+1)} = \frac12\Big(x^{(k)} + \frac{a}{x^{(k)}}\Big) \quad\Rightarrow\quad \big|x^{(k+1)}-\sqrt a\big| = \frac{1}{2x^{(k)}}\,\big|x^{(k)}-\sqrt a\big|^2\;. \tag{8.1.21}$$
By the arithmetic-geometric mean inequality (AGM) $\sqrt{ab}\le\frac12(a+b)$ we conclude: $x^{(k)} > \sqrt a$ for $k\ge 1$. Therefore the estimate from (8.1.21) means that the sequence from (8.1.21) converges with order 2 to $\sqrt a$.
Note: x (k+1) < x (k) for all k ≥ 2 ➣ ( x (k) )k∈N0 converges as a decreasing sequence that is bounded
from below (→ analysis course)
Note the doubling of the number of significant digits in each step ! [impact of roundoff !]
The doubling of the number of significant digits for the iterates holds true for any quadratically convergent
iteration:
Recall from Rem. 1.5.25 that the relative error (→ Def. 1.5.24) tells the number of significant digits. Indeed, denoting the relative error in step k by $\delta_k$, we have in the case of quadratic convergence:
$$x^{(k)} = x^*(1+\delta_k) \;\Rightarrow\; x^{(k)}-x^* = \delta_k x^*\,,$$
$$|x^*\delta_{k+1}| = |x^{(k+1)}-x^*| \le C\,|x^{(k)}-x^*|^2 = C\,|x^*\delta_k|^2 \;\Rightarrow\; |\delta_{k+1}| \le C\,|x^*|\,\delta_k^2\;. \tag{8.1.22}$$
As remarked above, usually (even without roundoff errors) an iteration will never arrive at an/the exact
solution x∗ after finitely many steps. Thus, we can only hope to compute an approximate solution by
accepting x(K ) as result for some K ∈ N0 . Termination criteria (stopping rules) are used to determine a
suitable value for K.
For the sake of efficiency ✄ stop iteration when iteration error is just “small enough”
(“small enough” depends on the concrete problem and user demands.)
(8.1.23) Classification of termination criteria (stopping rules) for iterative solvers for non-linear systems of equations
A termination criterion (stopping rule) is an algorithm deciding in each step of an iterative method whether to STOP or to CONTINUE.
A priori termination criterion: decision to stop based on information about F and $x^{(0)}$, made before starting the iteration. A posteriori termination criterion: besides $x^{(0)}$ and F, also current and past iterates are used to decide about termination.
A termination criterion for a convergent iteration is deemed reliable, if it lets the iteration CONTINUE, until
the iteration error e(k) := x(k) − x∗ , x∗ the limit value, satisfies certain conditions (usually imposed before
the start of the iteration).
Termination criteria are usually meant to ensure accuracy of the final iterate $x^{(K)}$ in the following sense:
$$\big\|x^{(K)}-x^*\big\| \le \tau_{\mathrm{abs}} \quad\text{or}\quad \big\|x^{(K)}-x^*\big\| \le \tau_{\mathrm{rel}}\,\|x^*\|\,, \tag{8.1.25}$$
with prescribed absolute and relative tolerances $\tau_{\mathrm{abs}},\tau_{\mathrm{rel}}>0$. It seems that the second criterion, asking that the relative (→ Def. 1.5.24) iteration error be below a prescribed threshold, alone would suffice, but the absolute tolerance should be checked if, by “accident”, $\|x^*\| = 0$ is possible. Otherwise, the iteration might fail to terminate at all.
➀ A priori termination: stop iteration after fixed number of steps (possibly depending on x(0) ).
(A priori =
ˆ without actually taking into account the computed iterates, see § 8.1.23)
Invoking additional properties of either the non-linear system of equations F(x) = 0 or the iteration it is sometimes possible to tell that for sure $\|x^{(k)}-x^*\| \le \tau$ for all k ≥ K, though this K may be (significantly) larger than the optimal termination index from (8.1.25), see Rem. 8.1.28.
➁ Residual based termination: STOP the convergent iteration $\{x^{(k)}\}_{k\in\mathbb{N}_0}$, when
$$\big\|F(x^{(k)})\big\| \le \tau\,,\qquad \tau \;\hat{=}\; \text{prescribed tolerance} > 0\;.$$
no guaranteed accuracy
Also for this criterion, we have no guarantee that (8.1.25) will be even remotely satisfied.
A special variant of correction based termination exploits that M is finite! (→ Section 1.5.3)
Remark 8.1.28 (A posteriori termination criterion for linearly convergent iterations → [?,
Lemma 5.17, 5.19])
Let us assume that we know that an iteration is linearly convergent (→ Def. 8.1.9) with rate of convergence 0 < L < 1:
The following simple manipulations give an a posteriori termination criterion (for linearly convergent itera-
tions with rate of convergence 0 < L < 1):
$$\big\|x^{(k)}-x^*\big\| \;\overset{\triangle\text{-inequ.}}{\le}\; \big\|x^{(k+1)}-x^{(k)}\big\| + \big\|x^{(k+1)}-x^*\big\| \le \big\|x^{(k+1)}-x^{(k)}\big\| + L\,\big\|x^{(k)}-x^*\big\|\;.$$
Iterates satisfy:
$$\big\|x^{(k+1)}-x^*\big\| \le \frac{L}{1-L}\,\big\|x^{(k+1)}-x^{(k)}\big\|\;. \tag{8.1.29}$$
This suggests that we take the right hand side of (8.1.29) as an a posteriori error bound and use it instead of the inaccessible $\|x^{(k+1)}-x^*\|$ for checking absolute and relative accuracy in (8.1.25). The resulting termination criterion will be reliable (→ § 8.1.23), since we will certainly have achieved the desired accuracy when we stop the iteration.
(Using $\widetilde L > L$ in (8.1.29) still yields a valid upper bound for $\|x^{(k)}-x^*\|$.)
$$x^{(k+1)} = x^{(k)} + \frac{\cos x^{(k)}+1}{\sin x^{(k)}} \quad\Rightarrow\quad x^{(k)}\to\pi \;\text{ for } x^{(0)} \text{ close to } \pi\;.$$
Observed rate of convergence: L = 1/2
Error and error bound for $x^{(0)} = 0.4$:
k | $|x^{(k)}-\pi|$ | $\frac{L}{1-L}\,|x^{(k)}-x^{(k-1)}|$ | slack of bound
Supplementary reading. The contents of this section are also treated in [?, Sect. 5.3], [?,
1-point stationary iterative methods, see (8.1.4), for F(x) = 0 are also called fixed point iterations.
iteration function $\Phi: U\subset\mathbb{R}^n \mapsto \mathbb{R}^n$, initial guess $x^{(0)}\in U$ ➣ iterates $(x^{(k)})_{k\in\mathbb{N}_0}$: $x^{(k+1)} := \Phi(x^{(k)})$ (a 1-point method, cf. (8.1.4)).
Note that the sequence of iterates need not be well defined: $x^{(k)}\notin U$ possible!
A fixed point iteration x(k+1) = Φ(x(k) ) is consistent with F(x) = 0, if, for x ∈ U ∩ D,
F (x) = 0 ⇔ Φ(x) = x .
This is an immediate consequence of the fact that, for a continuous function, limits and function evaluations commute [?, Sect. 4.1].
$$x^{(k+1)} := \Phi(x^{(k)})\;. \tag{8.2.2}$$
Note: there are many ways to transform F(x) = 0 into a fixed point form !
In this example we construct three different consistent fixed point iterations for a single scalar (n = 1) non-linear equation F(x) = 0. In numerical experiments we will see that they behave very differently.
$$F(x) = x\,e^x - 1\,,\qquad x\in[0,1]\;.$$
Different fixed point forms:
$$\Phi_1(x) = e^{-x}\,,\qquad \Phi_2(x) = \frac{1+x}{1+e^x}\,,\qquad \Phi_3(x) = x + 1 - x\,e^x\;.$$
(Plots: graph of F on [0,1]; graphs of $\Phi_1$, $\Phi_2$, $\Phi_3$ on [0,1].)
With the same intial guess x (0) = 0.5 for all three fixed point iterations we obtain the following iterates:
k    $x^{(k+1)} := \Phi_1(x^{(k)})$    $x^{(k+1)} := \Phi_2(x^{(k)})$    $x^{(k+1)} := \Phi_3(x^{(k)})$
0 0.500000000000000 0.500000000000000 0.500000000000000
1 0.606530659712633 0.566311003197218 0.675639364649936
2 0.545239211892605 0.567143165034862 0.347812678511202
3 0.579703094878068 0.567143290409781 0.855321409174107
4 0.560064627938902 0.567143290409784 -0.156505955383169
5 0.571172148977215 0.567143290409784 0.977326422747719
6 0.564862946980323 0.567143290409784 -0.619764251895580
7 0.568438047570066 0.567143290409784 0.713713087416146
8 0.566409452746921 0.567143290409784 0.256626649129847
9 0.567559634262242 0.567143290409784 0.924920676910549
10 0.566907212935471 0.567143290409784 -0.407422405542253
We can also tabulate the modulus of the iteration error and mark correct digits with red:
k    $|x_1^{(k+1)}-x^*|$    $|x_2^{(k+1)}-x^*|$    $|x_3^{(k+1)}-x^*|$
0 0.067143290409784 0.067143290409784 0.067143290409784
1 0.039387369302849 0.000832287212566 0.108496074240152
2 0.021904078517179 0.000000125374922 0.219330611898582
3 0.012559804468284 0.000000000000003 0.288178118764323
4 0.007078662470882 0.000000000000000 0.723649245792953
5 0.004028858567431 0.000000000000000 0.410183132337935
6 0.002280343429460 0.000000000000000 1.186907542305364
7 0.001294757160282 0.000000000000000 0.146569797006362
8 0.000733837662863 0.000000000000000 0.310516641279937
9 0.000416343852458 0.000000000000000 0.357777386500765
10 0.000236077474313 0.000000000000000 0.974565695952037
Observed: linear convergence of $x_1^{(k)}$, quadratic convergence of $x_2^{(k)}$, no convergence (erratic behavior of $x_3^{(k)}$); $x_i^{(0)} = 0.5$ in all cases.
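The tables above can be reproduced with a few lines of C++ (own sketch, not a lecture code):

#include <cmath>
#include <cstdio>

int main() {
  double x1 = 0.5, x2 = 0.5, x3 = 0.5;  // common initial guess x^(0) = 0.5
  for (int k = 0; k <= 10; ++k) {
    std::printf("%2d %18.15f %18.15f %18.15f\n", k, x1, x2, x3);
    x1 = std::exp(-x1);                      // Phi_1
    x2 = (1.0 + x2) / (1.0 + std::exp(x2));  // Phi_2
    x3 = x3 + 1.0 - x3 * std::exp(x3);       // Phi_3
  }
  return 0;
}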
In this section we will try to find easily verifiable conditions that ensure convergence (of a certain order) of
fixed point iterations. It will turn out that these conditions are surprisingly simple and general.
In Exp. 8.2.3 we observed vastly different behavior of different fixed point iterations for n = 1. Is it possible
to predict this from the shape of the graph of the iteration functions?
(Graphs of $\Phi_1$, $\Phi_2$, $\Phi_3$ on [0,1], cf. Exp. 8.2.3.)
angular bisector of the first/third quadrant, that is, to the point $(x^{(k+1)}, x^{(k+1)})$. Returning vertically to the abscissa gives $x^{(k+1)}$.
It seems that the slope of the iteration function Φ in the fixed point, that is, in the point where it intersects
the bisector of the first/third quadrant, is crucial.
Now we investigate rigorously, when a fixed point iteration will lead to a convergent iteration with a partic-
ular qualitative kind of convergence according to Def. 8.1.17.
A simple consideration: if $\Phi(x^*) = x^*$ (fixed point), then a fixed point iteration induced by a contractive mapping Φ satisfies
$$\big\|x^{(k+1)}-x^*\big\| = \big\|\Phi(x^{(k)})-\Phi(x^*)\big\| \le L\,\big\|x^{(k)}-x^*\big\|\,, \tag{8.2.7}$$
that is, the iteration converges (at least) linearly (→ Def. 8.1.9).
Remark 8.2.8 (Banach’s fixed point theorem → [?, Satz 6.5.2],[?, Satz 5.8])
then there is a unique fixed point x∗ ∈ D, Φ(x∗ ) = x∗ , which is the limit of the sequence of iterates
x(k+1) := Φ( x (k) ) for any x(0) ∈ D.
$$\big\|x^{(k)}-x^*\big\| \le \frac{L^k}{1-L}\,\big\|x^{(1)}-x^{(0)}\big\| \;\xrightarrow{k\to\infty}\; 0\;.$$
Lemma 8.2.10. Sufficient condition for local linear convergence of fixed point iteration →
[?, Thm. 17.2], [?, Cor. 5.12]
$$x^{(k+1)} := \Phi(x^{(k)})\,, \tag{8.2.2}$$
$$D\Phi(x) = \left[\frac{\partial\Phi_i}{\partial x_j}(x)\right]_{i,j=1}^{n} =
\begin{bmatrix}
\frac{\partial\Phi_1}{\partial x_1}(x) & \frac{\partial\Phi_1}{\partial x_2}(x) & \cdots & \cdots & \frac{\partial\Phi_1}{\partial x_n}(x)\\
\frac{\partial\Phi_2}{\partial x_1}(x) & & & & \frac{\partial\Phi_2}{\partial x_n}(x)\\
\vdots & & & & \vdots\\
\frac{\partial\Phi_n}{\partial x_1}(x) & \frac{\partial\Phi_n}{\partial x_2}(x) & \cdots & \cdots & \frac{\partial\Phi_n}{\partial x_n}(x)
\end{bmatrix}\;. \tag{8.2.11}$$
“Visualization” of the statement of Lemma 8.2.10 in Rem. 8.2.5: the iteration converges locally if Φ is flat in a neighborhood of $x^*$; it will diverge if Φ is steep there.
if $\|x^{(k)}-x^*\| < \delta$.
✷
Lemma 8.2.12. Sufficient condition for linear convergence of fixed point iteration
If Φ(x∗ ) = x∗ for some interior point x∗ ∈ U , then the fixed point iteration x(k+1) = Φ(x(k) )
converges to x∗ at least linearly with rate L.
We find that Φ is contractive on U with unique fixed point x∗ , to which x(k) converges linearly for k → ∞.
✷
By asymptotic rate of a linearly converging iteration we mean the contraction factor for the norm of the iteration error that we can expect, when we are already very close to the limit $x^*$.
If $0 < \|D\Phi(x^*)\| < 1$ and $x^{(k)}\approx x^*$, then the (worst) asymptotic rate of linear convergence is $L = \|D\Phi(x^*)\|$.
In this example we encounter the first genuine system of non-linear equations and apply Lemma 8.2.12 to
it.
What about higher order convergence (→ Def. 8.1.17, cf. Φ2 in Ex. 8.2.3)? Also in this case we should
study the derivatives of the iteration functions in the fixed point (limit point).
Here we used the Landau symbol O(·) to describe the local behavior of a remainder term in the vicinity of
x∗
Lemma 8.2.18. Higher order local convergence of fixed point iterations
Now, Lemma 8.2.12 and Lemma 8.2.18 permit us a precise prediction of the (asymptotic) convergence
we can expect from the different fixed point iterations studied in Exp. 8.2.3.
(Graphs of $\Phi_1$, $\Phi_2$, $\Phi_3$ on [0,1], cf. Exp. 8.2.3.)
$$\Phi_2'(x) = \frac{1-x\,e^x}{(1+e^x)^2} = 0\,,\quad \text{if } x\,e^x - 1 = 0 \quad\Rightarrow\quad \text{quadratic convergence!}$$
Since $x^*e^{x^*} - 1 = 0$, simple computations yield
We recall the considerations of Rem. 8.1.28 about a termination criterion for contractive fixed point iterations (= linearly convergent fixed point iterations → Def. 8.1.9), cf. (8.2.7), with contraction factor (= rate of convergence) 0 ≤ L < 1:
$$\big\|x^*-x^{(k)}\big\| \le \frac{L^{k-l}}{1-L}\,\big\|x^{(l+1)}-x^{(l)}\big\|\;. \tag{8.2.21}$$
$$\big\|x^*-x^{(k)}\big\| \le \frac{L^{k}}{1-L}\,\big\|x^{(1)}-x^{(0)}\big\| \quad (8.2.22) \qquad\qquad \big\|x^*-x^{(k)}\big\| \le \frac{L}{1-L}\,\big\|x^{(k)}-x^{(k-1)}\big\| \quad (8.2.23)$$
With the same arguments as in Rem. 8.1.28 we see that overestimating L, that is, using a value for L that
is larger than the true value, still gives reliable termination criteria.
However, whereas overestimating L in (8.2.23) will not lead to a severe deterioration of the bound, unless
L ≈ 1, using a pessimistic value for L in (8.2.22) will result in a bound way bigger than the true bound, if
k ≫ 1. Then the a priori termination criterion (8.2.22) will recommend termination many iterations after
the accuracy requirements have already been met. This will thwart the efficiency of the method.
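A sketch of how the a posteriori criterion (8.2.23) can be used inside a generic scalar fixed point loop (own code; L must be supplied as a — possibly conservative — upper bound of the contraction factor):

#include <cmath>

// Fixed point iteration x^(k+1) = Phi(x^(k)) terminated via (8.2.23);
// L < 1 is an upper bound for the contraction factor of Phi.
template <class Func>
double fixedPoint(Func &&Phi, double x, const double L, const double tol) {
  double x_new = Phi(x);
  // stop as soon as the bound L/(1-L)*|x^(k)-x^(k-1)| drops below tol
  while (L / (1.0 - L) * std::abs(x_new - x) > tol) {
    x = x_new;
    x_new = Phi(x);
  }
  return x_new;
}

// example: Phi_2 from Exp. 8.2.3 with L = 0.5,
// double x = fixedPoint([](double t) { return (1.0 + t) / (1.0 + std::exp(t)); },
//                       0.5, 0.5, 1e-12);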
Supplementary reading. [?, Ch. 3] is also devoted to this topic. The algorithm of “bisection”
discussed in the next subsection, is treated in [?, Sect. 5.5.1] and [?, Sect. 3.2].
Sought: x∗ ∈ I : F( x∗ ) = 0
8.3.1 Bisection
Idea: use ordering of real numbers & intermediate value theorem [?, Sect. 4.6]
[Fig. 288: graph of F with a sign change on the interval]
Find a sequence of intervals with geometrically decreasing lengths, in each of which F will change
sign.
Such a sequence can easily be found by testing the sign of F at the midpoint of the current interval, see
Code 8.3.2.
The following C++ code implements the bisection method for finding the zeros of a function passed through
the function handle F in the interval [ a, b] with absolute tolerance tol.
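A minimal sketch of such a bisection routine (the function name, the safeguard placement, and the exact line numbering are assumptions and need not match Code 8.3.2):

#include <cmath>
#include <stdexcept>

// Minimal bisection sketch: returns an approximate zero of F in [a,b],
// assuming a < b and a sign change of F on [a,b].
template <typename Func>
double bisect(Func &&F, double a, double b, double tol) {
  double fa = F(a);
  if (fa * F(b) > 0) throw std::runtime_error("no sign change in [a,b]");
  while (b - a > tol) {
    const double x = 0.5 * (a + b);
    if (!((a < x) && (x < b))) break;       // safeguard, cf. the remark on Line 13 below
    const double fx = F(x);
    if (fa * fx <= 0) { b = x; }            // zero lies in [a,x]
    else              { a = x; fa = fx; }   // zero lies in [x,b]
  }
  return 0.5 * (a + b);
}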
Line 13: the test ((a<x) && (x<b)) offers a safeguard against an infinite loop in case tol is below the resolution of the set M of machine numbers at the zero x^∗ (cf. the “M-based termination criterion”).
This is also an example for an algorithm that (in the case of tol = 0) uses the properties of machine arithmetic to define an a posteriori termination criterion, see Section 8.1.2. The iteration will terminate when, e.g., a +̃ ½(b − a) = a (+̃ is the floating point realization of addition), which, by Ass. 1.5.32, can only happen when

|½(b − a)| ≤ EPS · |a| .
Since the exact zero is located between a and b, this condition implies a relative error ≤ EPS of the
computed zero.
Advantages:
• “foolproof”, robust: will always terminate with a zero of requested accuracy,
• requires only point evaluations of F,
• works with any continuous function F, no derivatives needed.

Drawbacks:
• merely “linear-type” convergence(∗): |x^{(k)} − x^∗| ≤ 2^{−k}|b − a|,
• log₂(|b − a| / tol) steps necessary.
(∗): the convergence of a bisection algorithm is not linear in the sense of Def. 8.1.9, because the condition |x^{(k+1)} − x^∗| ≤ L|x^{(k)} − x^∗| might be violated at any step of the iteration.
It is straightforward to combine the bisection idea with more elaborate “model function methods” as they
will be discussed in the next section: Instead of stubbornly choosing the midpoint of the probing interval
[ a, b] (→ Code 8.3.2) as next iterate, one may use a refined guess for the location of a zero of F in [ a, b].
A method of this type is used by M ATLAB’s fzero function for root finding in 1D [?, Sect. 6.2.3].
≙ class of iterative methods for finding zeros of F: the iterate in step k + 1 is computed according to the following idea:
one-point methods: x^{(k+1)} = Φ_F(x^{(k)}), k ∈ N (e.g., fixed point iteration → Section 8.2),
multi-point methods: x^{(k+1)} = Φ_F(x^{(k)}, x^{(k−1)}, …, x^{(k−m)}), k ∈ N, m = 2, 3, ….
Supplementary reading. Newton’s method in 1D is discussed in [?, Sect. 18.1], [?, Sect. 5.5.2],
x^{(k+1)} := x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})} ,   (8.3.4)

which requires F′(x^{(k)}) ≠ 0.
In Ex. 8.1.20 we learned about the quadratically convergent fixed point iteration (8.1.21) for the approxi-
mate computation of the square root of a positive number. It can be derived as a Newton iteration (8.3.4)!
For F( x ) = x2 − a, a > 0, we find F′ ( x ) = 2x, and, thus, the Newton iteration for finding zeros of F
reads:
x^{(k+1)} = x^{(k)} − \frac{(x^{(k)})^2 − a}{2x^{(k)}} = \frac{1}{2}\left(x^{(k)} + \frac{a}{x^{(k)}}\right) ,
which is exactly (8.1.21). Thus, for this F Newton’s method converges globally with order p = 2.
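A minimal C++ sketch of this square-root iteration; the correction-based stopping rule is an assumption added for completeness:

#include <cmath>
#include <iostream>

// Newton iteration (8.3.4) for F(x) = x^2 - a:  x_{k+1} = (x_k + a/x_k)/2
double sqrt_newton(double a, double x0, double rtol = 1.0e-15) {
  double x = x0;
  double s;                       // Newton correction
  do {
    s = 0.5 * (x + a / x) - x;
    x += s;
  } while (std::abs(s) > rtol * std::abs(x));
  return x;
}

int main() {
  std::cout.precision(16);
  std::cout << sqrt_newton(2.0, 1.0) << " vs. " << std::sqrt(2.0) << std::endl;
}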
Newton iterations for two different scalar non-linear equations F(x) = 0 with the same solution sets:

F(x) = x e^x − 1 ⇒ F′(x) = e^x(1 + x) ⇒ x^{(k+1)} = x^{(k)} − \frac{x^{(k)} e^{x^{(k)}} − 1}{e^{x^{(k)}}(1 + x^{(k)})} = \frac{(x^{(k)})^2 + e^{−x^{(k)}}}{1 + x^{(k)}} ,
F(x) = x − e^{−x} ⇒ F′(x) = 1 + e^{−x} ⇒ x^{(k+1)} = x^{(k)} − \frac{x^{(k)} − e^{−x^{(k)}}}{1 + e^{−x^{(k)}}} = \frac{1 + x^{(k)}}{1 + e^{x^{(k)}}} .
Exp. 8.2.3 confirms quadratic convergence in both cases! (→ Def. 8.1.17)
Note that for the computation of its zeros, the function F in this example can be recast in different forms!
In fact, based on Lemma 8.2.18 it is straightforward to show local quadratic convergence of Newton’s method to a zero x^∗ of F, provided that F′(x^∗) ≠ 0:

Newton iteration (8.3.4) ≙ fixed point iteration (→ Section 8.2) with iteration function

Φ(x) = x − \frac{F(x)}{F′(x)} ⇒ Φ′(x) = \frac{F(x)\, F″(x)}{(F′(x))^2} ⇒ Φ′(x^∗) = 0 , if F(x^∗) = 0, F′(x^∗) ≠ 0 .
Thus from Lemma 8.2.18 we conclude the following result:
[Fig. 289: ladder circuit with resistors R_1, …, R_n, leak resistances R, and voltage source U]
How do we have to choose the leak resistance R in the linear circuit displayed in Fig. 289 in order to
achieve a prescribed potential at one of the nodes?
Using nodal analysis of the circuit introduced in Ex. 2.1.3, this problem can be formulated as: find x ∈ R ,
x := R−1 , such that
F(x) = 0 with F : R → R , x ↦ w^⊤(A + xI)^{−1} b − 1 ,   (8.3.10)
where A ∈ R n,n is a symmetric, tridiagonal, diagonally dominant matrix, w ∈ R n is a unit vector singling
out the node of interest, and b takes into account the exciting voltage U .
In order to apply Newton’s method to (8.3.10), we have to determine the derivative F′(x), and we do so by implicit differentiation [?, Sect. 7.8], first rewriting (u(x) ≙ vector of nodal potentials as a function of x = R^{−1})

F(x) = w^⊤ u(x) − 1 ,   (A + xI)u(x) = b .
Then we differentiate the linear system of equations defining u(x) on both sides with respect to x using the product rule (8.4.10):

\frac{d}{dx}\big[(A + xI)u(x)\big] = \frac{d}{dx} b  ⟹  (A + xI)u′(x) + u(x) = 0 .
x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})} = x^{(k)} + \frac{w^⊤ u(x^{(k)}) − 1}{w^⊤ (A + x^{(k)} I)^{−1} u(x^{(k)})} ,   (A + x^{(k)} I)\, u(x^{(k)}) = b .   (8.3.13)
In each step of the iteration we have to solve two linear systems of equations, which can be done with
asymptotic effort O(n) in this case, because A + x (k) I is tridiagonal.
Note that in a practical application one must demand x > 0 in addition, because the solution must provide a meaningful conductance (= inverse resistance).
Also note that bisection (→ Section 8.3.1) is a viable alternative to using Newton’s method in this case.
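A sketch of one step of the iteration (8.3.13) with Eigen; the sparse solver, the function name, and the storage of A as a sparse (tridiagonal) matrix are assumptions, not part of the original text:

#include <Eigen/Sparse>
#include <Eigen/SparseLU>

// One Newton step (8.3.13) for F(x) = w^T (A + xI)^{-1} b - 1.
// Both linear systems share the factorization of A + x*I.
double newton_step_leak(const Eigen::SparseMatrix<double> &A,
                        const Eigen::VectorXd &w, const Eigen::VectorXd &b,
                        double x) {
  const int n = A.rows();
  Eigen::SparseMatrix<double> I(n, n);
  I.setIdentity();
  Eigen::SparseMatrix<double> M = A + x * I;         // M = A + x*I
  M.makeCompressed();
  Eigen::SparseLU<Eigen::SparseMatrix<double>> lu(M);
  const Eigen::VectorXd u = lu.solve(b);             // (A + xI) u = b
  const Eigen::VectorXd v = lu.solve(u);             // (A + xI) v = u
  return x + (w.dot(u) - 1.0) / w.dot(v);            // update from (8.3.13)
}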
Useful, if a priori knowledge about the structure of F (e.g. about F being a rational function, see below) is
available. This is often the case, because many problems of 1D zero finding are posed for functions given
in analytic form with a few parameters.
This example demonstrates that non-polynomial model functions can offer excellent approximations of F. In this example the model function is chosen as a quotient of two linear functions, that is, from the simplest class of true rational functions.
Of course, that this function provides a good model function is merely “a matter of luck”, unless you have
some more information about F. Such information might be available from the application context.
\frac{a}{x^{(k)} + b} + c = F(x^{(k)}) ,   −\frac{a}{(x^{(k)} + b)^2} = F′(x^{(k)}) ,   \frac{2a}{(x^{(k)} + b)^3} = F″(x^{(k)}) .
x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})} · \frac{1}{1 − \frac{1}{2}\,\frac{F(x^{(k)})\, F″(x^{(k)})}{F′(x^{(k)})^2}} .

Halley’s iteration for F(x) = \frac{1}{(x + 1)^2} + \frac{1}{(x + 0.1)^2} − 1 , x > 0 , and x^{(0)} = 0:
k    x^{(k)}    F(x^{(k)})    x^{(k)} − x^{(k−1)}    x^{(k)} − x^∗
1 0.19865959351191 10.90706835180178 -0.19865959351191 -0.84754290138257
2 0.69096314049024 0.94813655914799 -0.49230354697833 -0.35523935440424
3 1.02335017694603 0.03670912956750 -0.33238703645579 -0.02285231794846
4 1.04604398836483 0.00024757037430 -0.02269381141880 -0.00015850652965
5 1.04620248685303 0.00000001255745 -0.00015849848821 -0.00000000804145
Compare with Newton method (8.3.4) for the same problem:
! Newton method converges more slowly, but also needs less effort per step (→ Section 8.3.3)
In the previous example Newton’s method performed rather poorly. Often its convergence can be boosted
by converting the non-linear equation to an equivalent one (that is, one with the same solutions) for another
function g, which is “closer to a linear function”:
Assume F ≈ F̂, where F̂ is invertible with an inverse F̂^{−1} that can be evaluated with little effort. Then set

g(x) := F̂^{−1}(F(x)) ≈ x .
Then apply Newton’s method to g( x ), using the formula for the derivative of the inverse of a function
\frac{d}{dy}\big(F̂^{−1}\big)(y) = \frac{1}{F̂′(F̂^{−1}(y))}  ⟹  g′(x) = \frac{1}{F̂′(g(x))} · F′(x) .
As in Ex. 8.3.14: F(x) = \frac{1}{(x + 1)^2} + \frac{1}{(x + 0.1)^2} − 1 , x > 0 :
[Figure: graphs of F(x) and g(x) on [0, 4]]

Observation: F(x) + 1 ≈ 2x^{−2} for x ≫ 1, and so g(x) := \frac{1}{\sqrt{F(x) + 1}} is “almost” linear for x ≫ 1.
Idea: instead of F(x) = 0 tackle g(x) = 1 with Newton’s method (8.3.4).
x^{(k+1)} = x^{(k)} − \frac{g(x^{(k)}) − 1}{g′(x^{(k)})} = x^{(k)} + \left(\frac{1}{\sqrt{F(x^{(k)}) + 1}} − 1\right)\frac{2\big(F(x^{(k)}) + 1\big)^{3/2}}{F′(x^{(k)})}
 = x^{(k)} + \frac{2\big(F(x^{(k)}) + 1\big)\big(1 − \sqrt{F(x^{(k)}) + 1}\big)}{F′(x^{(k)})} .
Convergence recorded for x (0) = 0:
For zero finding there is a wealth of iterative methods that offer a higher order of convergence. One class is discussed next.
Taking the cue from the iteration function of Newton’s method (8.3.4), we extend it by introducing an extra
function H :
new fixed point iteration:  Φ(x) = x − H(x)\,\frac{F(x)}{F′(x)}  with “proper” H : I → R .
Still, every zero of F is a fixed point of this Φ, that is, the fixed point iteration is still consistent (→ Def. 8.2.1).
Aim: find H such that the method is of p-th order. The main tool is Lemma 8.2.18, which tells us that ensuring Φ^{(ℓ)}(x^∗) = 0 for 1 ≤ ℓ ≤ p − 1 guarantees local convergence of order p.
Φ′(x^∗) = 1 − H(x^∗) ,   Φ″(x^∗) = H(x^∗)\,\frac{F″(x^∗)}{F′(x^∗)} − 2H′(x^∗) .   (8.3.17)

Lemma 8.2.18 ➢ necessary conditions for local convergence of order p:
p = 2 (quadratic convergence):  H(x^∗) = 1 ,
p = 3 (cubic convergence):  H(x^∗) = 1 ∧ H′(x^∗) = \frac{1}{2}\,\frac{F″(x^∗)}{F′(x^∗)} .
Trial expression: H ( x ) = G (1 − u′ ( x )) with “appropriate” G
fixed point iteration   x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})}{F′(x^{(k)})}\, G\!\left(\frac{F(x^{(k)})\, F″(x^{(k)})}{(F′(x^{(k)}))^2}\right) .   (8.3.18)
If F ∈ C²(I), F(x^∗) = 0, F′(x^∗) ≠ 0, and G ∈ C²(U) in a neighbourhood U of 0 with G(0) = 1, G′(0) = ½, then the fixed point iteration (8.3.18) converges locally cubically to x^∗.

Proof. We apply Lemma 8.2.18, which tells us that both derivatives from (8.3.17) have to vanish. Using the definition of H we find

H(x^∗) = G(0) ,   H′(x^∗) = −G′(0)\, u″(x^∗) = G′(0)\,\frac{F″(x^∗)}{F′(x^∗)} .
Plugging these expressions into (8.3.17) finishes the proof.
✷
• G(t) = \frac{1}{1 − \frac{1}{2}t}  ➡ Halley’s iteration (→ Ex. 8.3.14)
• G(t) = \frac{2}{1 + \sqrt{1 − 2t}}  ➡ Euler’s iteration
• G(t) = 1 + \frac{1}{2}t  ➡ quadratic inverse interpolation
Numerical experiment: F(x) = x e^x − 1, x^{(0)} = 5; errors e^{(k)} := x^{(k)} − x^∗:

k    Halley              Euler               Quad. Inv.
1    2.81548211105635    3.57571385244736    2.03843730027891
2    1.37597082614957    2.76924150041340    1.02137913293045
3    0.34002908011728    1.95675490333756    0.28835890388161
4    0.00951600547085    1.25252187565405    0.01497518178983
5    0.00000024995484    0.51609312477451    0.00000315361454
6                        0.14709716035310
7                        0.00109463314926
8                        0.00000000107549
Supplementary reading. The secant method is presented in [?, Sect. 18.2], [?, Sect. 5.5.3],
[Fig. 290: graph of F with the secant through (x^{(k−1)}, F(x^{(k−1)})) and (x^{(k)}, F(x^{(k)}))]

s(x) = F(x^{(k)}) + \frac{F(x^{(k)}) − F(x^{(k−1)})}{x^{(k)} − x^{(k−1)}}\,(x − x^{(k)}) ,   (8.3.23)

x^{(k+1)} = x^{(k)} − \frac{F(x^{(k)})\,(x^{(k)} − x^{(k−1)})}{F(x^{(k)}) − F(x^{(k−1)})} .   (8.3.24)
17      fo = fn;
18    }
19    return x1;
20  }
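A complete sketch in the spirit of Code 8.3.25 (the function name, the maximum iteration count, and the exact termination criterion are assumptions):

#include <algorithm>
#include <cmath>

// Secant method (8.3.24): derivative-free 2-point iteration for F(x) = 0.
template <typename Func>
double secant(Func &&F, double x0, double x1, double rtol, double atol,
              unsigned int maxit = 50) {
  double fo = F(x0);
  for (unsigned int k = 0; k < maxit; ++k) {
    const double fn = F(x1);
    const double s = fn * (x1 - x0) / (fn - fo); // secant correction
    x0 = x1;
    x1 = x1 - s;
    // correction-based termination criterion
    if (std::abs(s) < std::max(atol, rtol * std::abs(x1)))
      return x1;
    fo = fn;
  }
  return x1;
}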
Remember: F( x ) may only be available as output of a (complicated) procedure. In this case it is difficult
to find a procedure that evaluates F′ ( x ). Thus the significance of methods that do not involve evaluations
of derivatives.
Model problem: find zero of F( x ) = xe x − 1, using secant method of Code 8.3.25 with initial guesses
x ( 0 ) = 0, x ( 1 ) = 5.
k    x^{(k)}             F(x^{(k)})           e^{(k)} := x^{(k)} − x^∗    \frac{\log|e^{(k+1)}| − \log|e^{(k)}|}{\log|e^{(k)}| − \log|e^{(k−1)}|}
2 0.00673794699909 -0.99321649977589 -0.56040534341070
3 0.01342122983571 -0.98639742654892 -0.55372206057408 24.43308649757745
4 0.98017620833821 1.61209684919288 0.41303291792843 2.70802321457994
5 0.38040476787948 -0.44351476841567 -0.18673852253030 1.48753625853887
6 0.50981028847430 -0.15117846201565 -0.05733300193548 1.51452723840131
7 0.57673091089295 0.02670169957932 0.00958762048317 1.70075240166256
8 0.56668541543431 -0.00126473620459 -0.00045787497547 1.59458505614449
9 0.56713970649585 -0.00000990312376 -0.00000358391394 1.62641838319117
10 0.56714329175406 0.00000000371452 0.00000000134427
11 0.56714329040978 -0.00000000000001 -0.00000000000000
The rightmost column of the table provides an estimate for the order of convergence (→ Def. 8.1.17); for further explanations see Rem. 8.1.19.
A startling observation: the method seems to have a fractional (!) order of convergence, see Def. 8.1.17.
Indeed, a fractional order of convergence can be proved for the secant method, see [?, Sect. 18.2]. Here
we give an asymptotic argument that holds, if the iterates are already very close to the zero x ∗ of F.
Thanks to the asymptotic perspective we may assume that |e^{(k)}|, |e^{(k−1)}| ≪ 1, so that we can rely on a two-dimensional Taylor expansion around (x^∗, x^∗), cf. [?, Satz 7.5.2]:

Φ(x^∗ + h, x^∗ + k) = Φ(x^∗, x^∗) + \frac{∂Φ}{∂x}(x^∗, x^∗)h + \frac{∂Φ}{∂y}(x^∗, x^∗)k
 + \frac{1}{2}\frac{∂^2Φ}{∂x^2}(x^∗, x^∗)h^2 + \frac{∂^2Φ}{∂x∂y}(x^∗, x^∗)hk + \frac{1}{2}\frac{∂^2Φ}{∂y^2}(x^∗, x^∗)k^2 + R(x^∗, h, k) ,   (8.3.30)

with |R| ≤ C(h^3 + h^2 k + hk^2 + k^3) .
Computations invoking the quotient rule and product rule and using F(x^∗) = 0 show

Φ(x^∗, x^∗) = x^∗ ,   \frac{∂Φ}{∂x}(x^∗, x^∗) = \frac{∂Φ}{∂y}(x^∗, x^∗) = \frac{1}{2}\frac{∂^2Φ}{∂x^2}(x^∗, x^∗) = \frac{1}{2}\frac{∂^2Φ}{∂y^2}(x^∗, x^∗) = 0 .
We may also use MAPLE to find the Taylor expansion (assuming F sufficiently smooth):
> Phi := (x,y) -> x-F(x)*(x-y)/(F(x)-F(y));
> F(s) := 0;
> e2 = normal(mtaylor(Phi(s+e1,s+e0)-s,[e0,e1],4));
➣ truncated error propagation formula (products of three or more error terms ignored)
e^{(k+1)} ≐ \frac{1}{2}\frac{F″(x^∗)}{F′(x^∗)}\, e^{(k)} e^{(k−1)} = C\, e^{(k)} e^{(k−1)} .   (8.3.31)
How can we deduce the order of convergence from this recursion formula? We try e^{(k)} = K(e^{(k−1)})^p, inspired by the estimate in Def. 8.1.17:

⇒ e^{(k+1)} = K^{p+1}(e^{(k−1)})^{p^2}
⇒ (e^{(k−1)})^{p^2 − p − 1} = K^{−p}C  ⇒  p^2 − p − 1 = 0  ⇒  p = ½(1 ± √5) .
As e^{(k)} → 0 for k → ∞ we get the order of convergence p = ½(1 + √5) ≈ 1.62 (see Exp. 8.3.26!).
[Fig. 291: F(x) = arctan(x) — pairs (x^{(0)}, x^{(1)}) ∈ R²₊ of initial guesses]
F(x^∗) = 0 ⇔ F^{−1}(0) = x^∗ .

The interpolating polynomial p is required to satisfy

p(F(x^{(k−j)})) = x^{(k−j)} ,  j = 0, …, m − 1 .

[Fig. 292: graphs of F and F^{−1} near x^∗]

Case m = 2 (2-point method) ➢ secant method (Fig. 293)
Case m = 3: quadratic inverse interpolation, a 3-point method, see [?, Sect. 4.5].
We interpolate the points (F(x^{(k)}), x^{(k)}), (F(x^{(k−1)}), x^{(k−1)}), (F(x^{(k−2)}), x^{(k−2)}) with a parabola (polynomial of degree 2). Note the importance of monotonicity of F, which ensures that F(x^{(k)}), F(x^{(k−1)}), F(x^{(k−2)}) are mutually different.
We test the method for the model problem/initial guesses F(x) = x e^x − 1 , x^{(0)} = 0 , x^{(1)} = 2.5 , x^{(2)} = 5 .
k    x^{(k)}             F(x^{(k)})           e^{(k)} := x^{(k)} − x^∗    \frac{\log|e^{(k+1)}| − \log|e^{(k)}|}{\log|e^{(k)}| − \log|e^{(k−1)}|}
3 0.08520390058175 -0.90721814294134 -0.48193938982803
4 0.16009252622586 -0.81211229637354 -0.40705076418392 3.33791154378839
5 0.79879381816390 0.77560534067946 0.23165052775411 2.28740488912208
6 0.63094636752843 0.18579323999999 0.06380307711864 1.82494667289715
7 0.56107750991028 -0.01667806436181 -0.00606578049951 1.87323264214217
8 0.56706941033107 -0.00020413476766 -0.00007388007872 1.79832936980454
9 0.56714331707092 0.00000007367067 0.00000002666114 1.84841261527097
10 0.56714329040980 0.00000000000003 0.00000000000001
Also in this case the numerical experiment hints at a fractional rate of convergence p ≈ 1.8, as in the case
of the secant method, see Rem. 8.3.27.
Efficiency is measured by forming the ratio of gain and the effort required to achieve it. For iterative
methods for solving F(x) = 0, F : D ⊂ R n → R n , this means the following:
Ingredient ➊: W ≙ computational effort per step
(e.g., W ≈ \frac{\#\{\text{evaluations of } F\}}{\text{step}} + n · \frac{\#\{\text{evaluations of } F′\}}{\text{step}} + ⋯)
Let us consider an iterative method of order p ≥ 1 (→ Def. 8.1.17). Its error recursion can be converted into the expressions (8.3.36) and (8.3.37), which relate the error norm ‖e^{(k)}‖ to ‖e^{(0)}‖ and lead to quantitative bounds for the number of steps needed to achieve (8.3.35):
∃C > 0:  ‖e^{(k)}‖ ≤ C‖e^{(k−1)}‖^p  ∀k ≥ 1  (C < 1 for p = 1) .

Assuming C‖e^{(0)}‖^{p−1} < 1 (guarantees convergence!), we find the following minimum number of steps to achieve (8.3.35) for sure:

p = 1:  ‖e^{(k)}‖ ≤ C^k‖e^{(0)}‖  requires  k ≥ \frac{\log ρ}{\log C} ,   (8.3.36)

p > 1:  ‖e^{(k)}‖ ≤ C^{\frac{p^k − 1}{p − 1}}‖e^{(0)}‖^{p^k}  requires  p^k ≥ 1 + \frac{\log ρ}{\frac{\log C}{p − 1} + \log ‖e^{(0)}‖}

  ⇒  k ≥ \log\Big(1 + \frac{\log ρ}{\log L_0}\Big)\Big/\log p ,   (8.3.37)

L_0 := C^{1/(p−1)}‖e^{(0)}‖ < 1 .
Now we adopt an asymptotic perspective and ask for a large reduction of the error, that is ρ ≪ 1.
If ρ ≪ 1, then \log\big(1 + \frac{\log ρ}{\log L_0}\big) ≈ \log|\log ρ| − \log|\log L_0| ≈ \log|\log ρ|. This simplification will be made in the context of the asymptotic considerations ρ → 0 below.
We conclude that
• when requiring high accuracy, linearly convergent iterations should not be used, because their effi-
ciency does not increase for ρ → 0,
• for a method of order p > 1, the quotient \frac{\log p}{W} offers a gauge for efficiency.
[Fig. 294: number of iteration steps according to (8.3.37) as a function of the order p, for C = 0.5, 1.0, 1.5]
The plot displays the number of iteration steps according to (8.3.37).
We compare Newton’s method ↔ secant method.

[Fig. 295: number of iterations vs. −log₁₀(ρ) for Newton’s method and the secant method]

Newton’s method requires only marginally fewer steps than the secant method.
We set the effort for a step of Newton’s method to twice that for a step of the secant method from Code 8.3.25, because we need an additional evaluation of F′ in Newton’s method.
The multi-dimensional Newton method is also presented in [?, Sect. 19], [?, Sect. 5.6], [?, Sect. 9.1].
For F : D ⊂ R^n → R^n find x^∗ ∈ D: F(x^∗) = 0.
We assume: F : D ⊂ R^n → R^n is continuously differentiable.
[Fig. 296]
Here a correction-based a posteriori termination criterion for the Newton iteration is used; it stops the iteration if the relative size of the Newton correction drops below the prescribed relative tolerance rtol.
If x^∗ ≈ 0, also the absolute size of the Newton correction has to be tested against an absolute tolerance atol in order to avoid non-termination despite convergence of the iteration.
10    do {
11      s = DFinv(x, F(x));       // compute Newton correction
12      x -= s;                   // compute next iterate
13
14      if (callback != nullptr)
15        callback(x, s);
16    }
17    // correction based termination (relative and absolute)
18    while ((s.norm() > rtol*x.norm()) && (s.norm() > atol));
19
20    return x;
21  }
that computes the Newton correction, that is, it returns the solution of a linear system with system matrix D F(x) (x ↔ x) and right hand side f ↔ f.
☞ The argument x will be overwritten with the computed solution of the non-linear system.
The next code demonstrates the invocation of newton for a 2 × 2 non-linear system from a code relying on EIGEN. It also demonstrates the use of fixed-size Eigen matrices and vectors.
An important property of the Newton iteration (8.4.1): affine invariance → [?, Sect. 1.2.2]

Affine invariance: Newton iterations for G_A(x) := A F(x) = 0 are the same for all regular A ∈ R^{n,n}!

This is a simple computation:
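A sketch of that computation (with G_A(x) := A F(x), A regular, as above): since A is constant, D G_A(x) = A\, D F(x), hence

x − D G_A(x)^{−1} G_A(x) = x − (A\, D F(x))^{−1} A F(x) = x − D F(x)^{−1} A^{−1} A F(x) = x − D F(x)^{−1} F(x) ,

which is exactly the Newton update for F.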
• convergence theory for Newton’s method: assumptions and results should be affine invariant, too.
• modifying and extending Newton’s method: resulting schemes should preserve affine invariance.
In particular, termination criteria for Newton’s method should also be affine invariant in the sense that, when applied to G_A, they stop the iteration at exactly the same step for any choice of A.
The function F : R n → R n defining the non-linear system of equations may be given in various formats,
as explicit expression or rather implicitly. In most cases, D F has to be computed symbolically in order to
obtain concrete formulas for the Newton iteration. We now learn how these symbolic computations can be
carried out harnessing advanced techniques of multi-variate calculus.
The reader will probably agree that the derivative of a function F : I ⊂ R → R at x ∈ I is a number F′(x) ∈ R, and that the derivative of a function F : D ⊂ R^n → R^m at x ∈ D is a matrix D F(x) ∈ R^{m,n}. However, the nature of the derivative at a point is that of a linear mapping that approximates F locally up to second order:
☞ Note that D F(x)h ∈ W is the vector returned by the linear mapping D F(x) when applied to h ∈ V .
☞ In Def. 8.4.6 k·k can be any norm on V (→ Def. 1.5.70).
☞ A common shorthand notation for (8.4.7) relies on the “little-o” Landau symbol:
In the context of the Newton iteration (8.4.1) the computation of the Newton correction s in the k + 1-th
step amounts to solving a linear system of equations:
Matching this with Def. 8.4.6 we see that we need only determine expressions for D F(x(k) )h, h ∈ V ,
in order to state the LSE yielding the Newton correction. This will become important when applying the
“compact” differentiation rules discussed next.
Statement of the Newton iteration (8.4.1) for F : R n 7→ R n given as analytic expression entails computing
the Jacobian D F. The safe, but tedious way is to use the definition (8.2.11) directly and compute the
partial derivatives.
To avoid cumbersome component-oriented considerations, it is sometimes useful to know the rules of
multidimensional differentiation:
T (x) = b( F(x), G (x)) ⇒ D T (x)h = b(D F(x)h, G (x)) + b( F(x), D G (x)h) , (8.4.10)
h ∈ V, x ∈ D .
The first and second derivatives of real-valued functions occur frequently and have special names, see [?,
Def. 7.3.2] and [?, Satz 7.5.3].
“High level differentiation”: We apply the product rule (8.4.10) with F, G = Id, which means D F(x) =
D G (x) = I, and the bilinear form b(x, y) := x T Ay:
D Ψ(x)h = h^⊤ A x + x^⊤ A h = \underbrace{\big(x^⊤ A^⊤ + x^⊤ A\big)}_{=(\operatorname{grad} Ψ(x))^⊤}\, h ,
“Low level differentiation”: Using the rules of matrix×vector multiplication, Ψ can be written in terms of the vector components x_i, i = 1, …, n:

Ψ(x) = \sum_{k=1}^{n}\sum_{j=1}^{n} (A)_{k,j}\, x_k x_j = (A)_{i,i}\, x_i^2 + \sum_{j≠i} (A)_{i,j}\, x_i x_j + \sum_{k≠i} (A)_{k,i}\, x_k x_i + \sum_{k≠i}\sum_{j≠i} (A)_{k,j}\, x_k x_j ,

which yields \frac{∂Ψ}{∂x_i}(x) = 2(A)_{i,i} x_i + \sum_{j≠i} (A)_{i,j} x_j + \sum_{k≠i} (A)_{k,i} x_k = (Ax + A^⊤x)_i .
We seek the derivative of the Euclidean norm, that is, of the function F(x) := ‖x‖₂, x ∈ R^n \ {0} (F is defined but not differentiable in x = 0, just look at the case n = 1!).
“High level differentiation”: We can write F as the composition of two functions F = G ∘ H with

G : R₊ → R₊ , G(ξ) := \sqrt{ξ} ,   H : R^n → R , H(x) := x^⊤ x .

Using the rule for the differentiation of bilinear forms from Ex. 8.4.12 for the case A = I and basic calculus, we find

D H(x)h = 2x^⊤ h , x, h ∈ R^n ,   D G(ξ)ζ = \frac{ζ}{2\sqrt{ξ}} , ξ > 0, ζ ∈ R .
Finally, the chain rule (8.4.9) gives

D F(x)h = D G(H(x))\,(D H(x)h) = \frac{2x^⊤ h}{2\sqrt{x^⊤ x}} = \frac{x^⊤}{‖x‖_2}·h .   (8.4.14)

Def. 8.4.11 ⇒ grad F(x) = \frac{x}{‖x‖_2} .
This paragraph explains the use of the general product rule (8.4.10) to derive the linear system solved by
the Newton correction. It implements the insights from § 8.4.5.
We seek solutions of F(x) = 0 with F(x) := b(G (x), H (x)), where
✦ V, W are some vector spaces (finite- or even infinite-dimensional),
✦ G : D → V , H : D → W , D ⊂ R n , are continuously differentiable in the sense of Def. 8.4.6,
✦ b : V × W 7→ R n is bilinear (linear in each argument).
According to the general product rule (8.4.10) we have
This already defines the linear system of equations to be solved to compute the Newton correction s
b(D G (x(k) )s, H (x(k) )) + b(G (x(k) ), D H (x(k) )s) = −b(G (x(k) ), H (x(k) )) . (8.4.17)
Since the left-hand side is linear in s, this really represents a square linear system of n equations. The
next example will present a concrete case.
For many quasi-linear systems, for which there exist solutions, the fixed point iteration (→ Section 8.2)
x ( k + 1) = A ( x ( k ) ) − 1 b ⇔ A ( x ( k ) ) x ( k + 1) = b , (8.4.20)
D F ( x ) h = (D A ( x ) h ) x + A ( x ) h , h ∈ R n . (8.4.21)
Note that D A(x(k) ) is a mapping from R n into R n,n , which gets h as an argument. Then the Newton
iteration reads
x^{(k+1)} = x^{(k)} − s ,  D F(x^{(k)})s = (D A(x^{(k)})s)\,x^{(k)} + A(x^{(k)})s = A(x^{(k)})x^{(k)} − b .   (8.4.22)
where γ(x) := 3 + ‖x‖₂ (Euclidean vector norm), the right hand side vector b ∈ R^n is given, and x ∈ R^n is unknown.
The derivative of the first term is straightforward, because it is linear in x, see the discussion following
Def. 8.4.6.
The “pedestrian” approach to the second term starts with writing it explicitly in components as

(x‖x‖)_i = x_i\sqrt{x_1^2 + ⋯ + x_n^2} , i = 1, …, n .

Then we can compute the Jacobian according to (8.2.11) by taking partial derivatives:

\frac{∂}{∂x_i}(x‖x‖)_i = \sqrt{x_1^2 + ⋯ + x_n^2} + \frac{x_i^2}{\sqrt{x_1^2 + ⋯ + x_n^2}} ,
\frac{∂}{∂x_j}(x‖x‖)_i = \frac{x_i x_j}{\sqrt{x_1^2 + ⋯ + x_n^2}} , j ≠ i .
For the “high level” treatment of the second term x ↦ x‖x‖₂ we apply the product rule (8.4.10), together with (8.4.14):

D F(x)h = Th + ‖x‖₂ h + x\,\frac{x^⊤ h}{‖x‖₂} = \Big(A(x) + \frac{x x^⊤}{‖x‖₂}\Big) h .

Thus, in concrete terms the Newton iteration (8.4.22) becomes

x^{(k+1)} = x^{(k)} − \Big(A(x^{(k)}) + \frac{x^{(k)} (x^{(k)})^⊤}{‖x^{(k)}‖_2}\Big)^{−1}\big(A(x^{(k)})x^{(k)} − b\big) .
Note that the matrix of the linear system to be solved in each step is a rank-1-modification (2.6.17) of the symmetric positive definite tridiagonal matrix A(x^{(k)}), cf. Lemma 2.8.12. Thus the Sherman-Morrison-Woodbury formula from Lemma 2.6.22 can be used to solve it efficiently.
Given are
This relationship will provide a valid definition of F in a neighborhood of x0 ∈ W , if we assume that there
is x0 , z0 ∈ W such that b(G (x0 ), z0 ) = b, and that the linear mapping z 7→ b(G (x0 ), z) is invertible.
Then, for x close to x0 , F(x) can be computed by solving a square linear system of equations in W . In
Ex. 8.3.9 we already saw an example of an implicitly defined F for W = R .
We want to solve F(x) = 0 for this implicitly defined F by means of Newton’s method. In order to
determine the derivative of F we resort to implicit differentiation [?, Sect. 7.8] of the defining equation
(8.4.26) by means of the general product rule (8.4.10). We formally differentiate both sides of (8.4.26):
and find that the Newton correction s in the (k + 1)-th Newton step can be computed as follows:
which constitutes a dim W × dim W linear system of equations. The next example discusses a concrete application of implicit differentiation with W = R^{n,n}.
We consider matrix inversion as a mapping and (formally) compute its derivative, that is, the derivative of the function

inv : R^{n,n}_∗ → R^{n,n} ,  X ↦ X^{−1} ,

where R^{n,n}_∗ denotes the (open) set of invertible n × n-matrices, n ∈ N.

inv(X) · X = I ,  X ∈ R^{n,n}_∗ .   (8.4.29)
Differentiation on both sides of (8.4.29) by means of the product rule (8.4.10) yields (D inv(X)H)\,X + inv(X)\,H = 0, that is, D inv(X)H = −X^{−1} H X^{−1}.   (8.4.30)
For n = 1 we get D inv(x)h = −\frac{h}{x^2}, which recovers the well-known derivative of the function x ↦ x^{−1}.
Surprisingly, it is possible to obtain the inverse of a matrix as the solution of a non-linear system of equations. Thus it can be computed using Newton’s method.
Given a regular matrix A ∈ R^{n,n}, its inverse can be defined as the unique zero of a function:

X = A^{−1} ⟺ F(X) = 0 for F : R^{n,n}_∗ → R^{n,n} ,  X ↦ A − X^{−1} .
Using (8.4.30) we find for the derivative of F in X ∈ R^{n,n}_∗:  D F(X)H = X^{−1} H X^{−1} .   (8.4.32)
X^{(k+1)} = X^{(k)} − S ,  S := D F(X^{(k)})^{−1} F(X^{(k)}) .   (8.4.33)

The Newton correction S in the k-th step solves the linear system of equations

D F(X^{(k)})S \overset{(8.4.32)}{=} (X^{(k)})^{−1} S (X^{(k)})^{−1} = F(X^{(k)}) = A − (X^{(k)})^{−1} ,

hence

S = X^{(k)}\big(A − (X^{(k)})^{−1}\big)X^{(k)} = X^{(k)} A X^{(k)} − X^{(k)} ,   (8.4.34)

and, inserted in (8.4.33),

X^{(k+1)} = X^{(k)} − \big(X^{(k)} A X^{(k)} − X^{(k)}\big) = X^{(k)}\big(2I − A X^{(k)}\big) .   (8.4.35)
To study the convergence of this iteration we derive a recursion for the iteration errors E^{(k)} := X^{(k)} − A^{−1}:

E^{(k+1)} = X^{(k+1)} − A^{−1} \overset{(8.4.35)}{=} X^{(k)}\big(2I − A X^{(k)}\big) − A^{−1}
 = (E^{(k)} + A^{−1})\big(2I − A(E^{(k)} + A^{−1})\big) − A^{−1}
 = (E^{(k)} + A^{−1})(I − A E^{(k)}) − A^{−1} = −E^{(k)} A E^{(k)} .
For the norm of the iteration error (a matrix norm → Def. 1.5.76) we conclude from submultiplicativity
(1.5.77) a recursive estimate
‖E^{(k+1)}‖ ≤ ‖E^{(k)}‖^2 ‖A‖ .   (8.4.36)
This holds for any matrix norm according to Def. 1.5.76, which is induced by a vector norm. For the relative iteration error we obtain

\underbrace{\frac{‖E^{(k+1)}‖}{‖A^{−1}‖}}_{\text{relative error}} ≤ \Big(\underbrace{\frac{‖E^{(k)}‖}{‖A^{−1}‖}}_{\text{relative error}}\Big)^2 \underbrace{‖A‖\,‖A^{−1}‖}_{=\operatorname{cond}(A)} ,   (8.4.37)
From (8.4.36) we conclude that the iteration will converge (lim_{k→∞} E^{(k)} = 0) if the initial error is small enough, which gives a condition on the initial guess X^{(0)}. Now let us consider the Euclidean matrix norm ‖·‖₂, which can be expressed in terms of eigenvalues, see Cor. 1.5.82. Motivated by this relationship, we use the initial guess X^{(0)} = αA^⊤ with α > 0 still to be determined.
‖X^{(0)} A − I‖₂ = ‖αA^⊤A − I‖₂ \overset{!}{<} 1 ⇔ α‖A‖₂^2 − 1 < 1 ⇔ α < \frac{2}{‖A‖₂^2} ,
which is a sufficient condition for the initial guess X(0) = αA⊤ , in order to make (8.4.35) converge. In this
case we infer quadratic convergence from both (8.4.36) and (8.4.37).
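A small Eigen sketch of the iteration (8.4.35) with this kind of initial guess; the function name, the tolerance, and the use of the Frobenius norm to bound ‖A‖₂ are assumptions:

#include <Eigen/Dense>
#include <iostream>

// Newton iteration (8.4.35) for the inverse: X_{k+1} = X_k (2I - A X_k),
// started with X_0 = alpha*A^T. Since ||A||_F >= ||A||_2, the choice
// alpha = 1/||A||_F^2 satisfies alpha < 2/||A||_2^2.
Eigen::MatrixXd inv_newton(const Eigen::MatrixXd &A, double rtol = 1e-12,
                           unsigned int maxit = 100) {
  const int n = A.rows();
  const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n, n);
  Eigen::MatrixXd X = (1.0 / A.squaredNorm()) * A.transpose();
  for (unsigned int k = 0; k < maxit; ++k) {
    const Eigen::MatrixXd Xn = X * (2.0 * I - A * X);
    if ((Xn - X).norm() <= rtol * Xn.norm()) return Xn;
    X = Xn;
  }
  return X;
}

int main() {
  Eigen::MatrixXd A(2, 2);
  A << 4, 1, 1, 3;
  std::cout << inv_newton(A) * A << std::endl; // approximately the identity
}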
Simplified Newton Method: ☞ use the same Jacobian D F(x(k) ) for all/several steps
If D F(x) is not available (e.g. when F(x) is given only as a procedure) we may resort to approximation
by difference quotients:
Numerical Differentiation:   \frac{∂F_i}{∂x_j}(x) ≈ \frac{F_i(x + h\vec{e}_j) − F_i(x)}{h} .
Caution: Roundoff errors wreak havoc for small h → Ex. 1.5.45! Therefore use h ≈ √EPS.
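A sketch of such a difference-quotient Jacobian in Eigen (the function name and the forward-difference variant are assumptions):

#include <Eigen/Dense>
#include <cmath>
#include <limits>

// Approximate Jacobian of F : R^n -> R^n by forward difference quotients
// with h ~ sqrt(EPS), as recommended above.
template <typename Func>
Eigen::MatrixXd numjac(Func &&F, const Eigen::VectorXd &x) {
  const double h = std::sqrt(std::numeric_limits<double>::epsilon());
  const Eigen::VectorXd fx = F(x);
  Eigen::MatrixXd J(fx.size(), x.size());
  for (Eigen::Index j = 0; j < x.size(); ++j) {
    Eigen::VectorXd xp = x;
    xp(j) += h;                   // perturb the j-th coordinate
    J.col(j) = (F(xp) - fx) / h;  // j-th column of the approximate Jacobian
  }
  return J;
}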
Newton iteration (8.4.1) ≙ fixed point iteration (→ Section 8.2) with iteration function

Φ(x) = x − D F(x)^{−1} F(x) .

F(x^∗) = 0 ⇒ D Φ(x^∗) = 0 ,
that is, the derivative (Jacobian) of the iteration function of the Newton fixed point iteration vanishes in
the limit point. Thus from Lemma 8.2.18 we draw the same conclusion as in the scalar case n = 1, cf.
Section 8.3.2.1.
Jacobian (analytic computation):  D F(x) = \begin{bmatrix} ∂_{x_1}F_1(x) & ∂_{x_2}F_1(x) \\ ∂_{x_1}F_2(x) & ∂_{x_2}F_2(x) \end{bmatrix} = \begin{bmatrix} 2x_1 & −4x_2^3 \\ 1 & −3x_2^2 \end{bmatrix} ,
where x^{(k)} = [x_1, x_2]^T.
2. Set x(k+1) = x(k) + ∆x(k) .
11   k = k+1;
12 end
13
14 ld = diff(log(res(:,4))); %
15 rates = ld(2:end)./ld(1:end-1); %
Line 14, Line 15: estimation of order of convergence, see Rem. 8.1.19.
k    x^{(k)}    ε_k := ‖x^∗ − x^{(k)}‖₂    \frac{\log ε_{k+1} − \log ε_k}{\log ε_k − \log ε_{k−1}}
0 [0.7, 0.7] T 4.24e-01
1 [0.87850000000000, 1.064285714285714] T 1.37e-01 1.69
2 [1.01815943274188, 1.00914882463936] T 2.03e-02 2.23
3 [1.00023355916300, 1.00015913936075] T 2.83e-04 2.15
4 [1.00000000583852, 1.00000002726552] T 2.79e-08 1.77
5 [0.999999999999998, 1.000000000000000] T 2.11e-15
6 [ 1, 1] T
☞ (Some) evidence of quadratic convergence, see Rem. 8.1.19.
There is a sophisticated theory about the convergence of Newton’s method. For example one can find the
following theorem in [?, Thm. 4.10], [?, Sect. 2.1]):
If:
(A) D ⊂ R^n open and convex,
(B) F : D → R^n continuously differentiable,
(C) D F(x) regular ∀x ∈ D,
(D) ∃L ≥ 0:  ‖D F(x)^{−1}(D F(x + v) − D F(x))‖₂ ≤ L‖v‖₂  ∀v ∈ R^n, v + x ∈ D, ∀x ∈ D,
(E) ∃x^∗:  F(x^∗) = 0  (existence of a solution in D),
(F) the initial guess x^{(0)} ∈ D satisfies ρ := ‖x^∗ − x^{(0)}‖₂ < \frac{2}{L} ∧ B_ρ(x^∗) ⊂ D ,
then the Newton iteration (8.4.1) satisfies:
(i) x^{(k)} ∈ B_ρ(x^∗) := {y ∈ R^n : ‖y − x^∗‖ < ρ} for all k ∈ N,
(ii) lim_{k→∞} x^{(k)} = x^∗,
(iii) ‖x^{(k+1)} − x^∗‖₂ ≤ \frac{L}{2}‖x^{(k)} − x^∗‖₂^2  (local quadratic convergence).
Usually, it is hardly possible to verify the assumptions of the theorem for a concrete non-linear
system of equations, because neither L nor x ∗ are known.
An abstract discussion of ways to stop iterations for solving F(x) = 0 was presented in Section 8.1.2, with
“ideal termination” (→ § 8.1.24) as ultimate, but unfeasible, goal.
Yet, in 8.4.2 we saw that Newton’s method enjoys (asymptotic) quadratic convergence, which means rapid
decrease of the relative error of the iterates, once we are close to the solution, which is exactly the point,
when we want to STOP. As a consequence, asymptotically, the Newton correction (difference of two
consecutive iterates) yields rather precise information about the size of the error:
‖x^{(k+1)} − x^∗‖ ≪ ‖x^{(k)} − x^∗‖  ⇒  ‖x^{(k)} − x^∗‖ ≈ ‖x^{(k+1)} − x^{(k)}‖ .   (8.4.46)
→ uneconomical: one needless update, because x(k) would already be accurate enough.
Some facts about the Newton method for solving large (n ≫ 1) non-linear systems of equations:
☛ Solving the linear system to compute the Newton correction may be expensive (asymptotic compu-
tational effort O(n3 ) for direct elimination → § 2.3.5) and accounts for the bulk of numerical cost of
a single step of the iteration.
☛ In applications only very few steps of the iteration will be needed to achieve the desired accuracy
due to fast quadratic convergence.
✄ The termination criterion (8.4.47) computes the last Newton correction ∆x^{(k)} needlessly, because x^{(k)} is already accurate enough!
Therefore we would like to use an a-posteriori termination criterion that dispenses with computing (and
“inverting”) another Jacobian D F(x(k) ) just to tell us that x(k) is already accurate enough.
Due to fast asymptotic quadratic convergence, we can expect D F(x(k−1) ) ≈ D F(x(k) ) during the final
steps of the iteration.
Effort: reuse of the LU-factorization (→ Rem. 2.5.10) of D F(x^{(k−1)}) ➤ ∆x̄^{(k)} available with O(n²) operations
C++11 code 8.4.51: Generic Newton iteration with termination criterion (8.4.50)
2  template <typename FuncType, typename JacType, typename VecType>
3  void newton_stc(const FuncType &F, const JacType &DF,
4                  VecType &x, double rtol, double atol)
5  {
6    using scalar_t = typename VecType::Scalar;
7    scalar_t sn;
8    do {
9      auto jacfac = DF(x).lu();        // LU-factorize Jacobian
10     x -= jacfac.solve(F(x));         // Compute next iterate
11     // Compute norm of simplified Newton correction
12     sn = jacfac.solve(F(x)).norm();
13   }
14   // Termination based on simplified Newton correction
15   while ((sn > rtol*x.norm()) && (sn > atol));
16 }
‖F(x^{(k)})‖ ≤ τ ,
then the resulting algorithm would not be affine invariant, because for F(x) = 0 and AF(x) = 0,
A ∈ R n,n regular, the Newton iteration might terminate with different iterates.
converges asymptotically very fast: doubling of number of significant digits in each step
Potentially big problem: Newton method converges quadratically, but only locally , which may render it use-
less, if convergence is guaranteed only for initial guesses very close to exact solution, see also Ex. 8.3.32.
In this section we study a method to enlarge the region of convergence, at the expense of quadratic
convergence, of course.
The dark side of local convergence (→ Def. 8.1.8): for many initial guesses x(0) Newton’s method will not
converge!
F(x) = x e^x − 1 ⇒ F′(−1) = 0:
x^{(0)} < −1 ⇒ x^{(k)} → −∞ ,
x^{(0)} > −1 ⇒ x^{(k)} → x^∗ .

[Figure: graph of x ↦ x e^x − 1]

F(x) = arctan(ax):
[Fig. 299: diverging Newton iteration for F(x) = arctan x, with iterates x^{(k−1)}, x^{(k)}, x^{(k+1)}]
[Fig. 300: initial guesses x^{(0)} vs. parameter a]

In Fig. 300 the red zone = {x^{(0)} ∈ R : x^{(k)} → 0} is the domain of initial guesses for which Newton’s method converges.
If the Newton correction points in the wrong direction (Item ➊), no general remedy is available. If the
Newton correction is too large (Item ➋), there is an effective cure:
With λ(k) > 0: x(k+1) := x(k) − λ(k) D F(x(k) )−1 F(x(k) ) . (8.4.55)
Choice of damping factor: affine invariant natural monotonicity test [?, Ch. 3]:

choose “maximal” 0 < λ^{(k)} ≤ 1 such that  ‖∆x(λ^{(k)})‖₂ ≤ \Big(1 − \frac{λ^{(k)}}{2}\Big)‖∆x^{(k)}‖₂ .   (8.4.57)

✦ When the method converges ⇔ size of Newton correction decreases ⇔ (8.4.57) satisfied.
✦ In the case of strong damping (λ^{(k)} ≪ 1) the size of the Newton correction cannot be expected to shrink significantly, since iterates do not change much ➣ factor (1 − ½λ^{(k)}) in (8.4.57).
Note: LU-factorization of Jacobi matrix D F(x(k) ) is done once per successful iteration step (Line 12 of
the above code) and reused for the computation of the simplified Newton correction in Line 10, Line 14 of
the above M ATLAB code.
Policy: Reduce the damping factor by a factor q ∈ ]0, 1[ (usually q = ½) until the affine invariant natural monotonicity test (8.4.57) is passed, see Line 13 in the above MATLAB code.
C++11 code 8.4.58: Generic damped Newton method based on natural monotonicity test
1  template <typename FuncType, typename JacType, typename VecType>
2  void dampnewton(const FuncType &F, const JacType &DF,
3                  VecType &x, double rtol, double atol)
4  {
5    using index_t = typename VecType::Index;
6    using scalar_t = typename VecType::Scalar;
7    const index_t n = x.size();
8    const scalar_t lmin = 1E-3;         // Minimal damping factor
9    scalar_t lambda = 1.0;              // Initial and actual damping factor
10   VecType s(n), st(n);                // Newton corrections
11   VecType xn(n);                      // Tentative new iterate
12   scalar_t sn, stn;                   // Norms of Newton corrections
13
14   do {
15     auto jacfac = DF(x).lu();         // LU-factorize Jacobian
16     s = jacfac.solve(F(x));           // Newton correction
17     sn = s.norm();                    // Norm of Newton correction
18     lambda *= 2.0;
19     do {
20       lambda /= 2;
21       if (lambda < lmin) throw "No convergence: lambda -> 0";
22       xn = x - lambda*s;              // Tentative next iterate
23       st = jacfac.solve(F(xn));       // Simplified Newton correction
24       stn = st.norm();
25     }
26     while (stn > (1 - lambda/2)*sn);           // Natural monotonicity test
27     x = xn;                                    // Now: xn accepted as new iterate
28     lambda = std::min(2.0*lambda, 1.0);        // Try to mitigate damping
29   }
30   // Termination based on simplified Newton correction
31   while ((stn > rtol*x.norm()) && (stn > atol));
32 }
The arguments for Code 8.4.58 are the same as for Code 8.4.51. As termination criterion it uses (8.4.50). Note that all calls to solve boil down to forward/backward elimination for triangular matrices and incur cost of O(n²) only.
We test the damped Newton method for Item ➋ of Ex. 8.4.54, where excessive Newton corrections made
Newton’s method fail.
F(x) = arctan(x), x^{(0)} = 20, q = ½, LMIN = 0.001.

k    λ^{(k)}    x^{(k)}               F(x^{(k)})
1    0.03125    0.94199967624205      0.75554074974604
2    0.06250    0.85287592931991      0.70616132170387
3    0.12500    0.70039827977515      0.61099321623952
4    0.25000    0.47271811131169      0.44158487422833
5    0.50000    0.20258686348037      0.19988168667351
6    1.00000    -0.00549825489514     -0.00549819949059
7    1.00000    0.00000011081045      0.00000011081045
8    1.00000    -0.00000000000001     -0.00000000000001

We observe that damping is effective and asymptotic quadratic convergence is recovered.
✦ As in Ex. 8.4.54: F(x) = x e^x − 1.

[Figure: graph of x ↦ x e^x − 1]
Supplementary reading. For related expositions refer to [?, Sect. 7.1.4], [?, 2.3.2].
How can we solve F(x) = 0 iteratively, in case D F(x) is not available and numerical differentiation (see
Rem. 8.4.41) is too expensive?
In 1D (n = 1) we can choose among many derivative-free methods that rely on F-evaluations alone, for instance the secant method (8.3.24) from Section 8.3.2.3:
Recall that the secant method converges locally with order p ≈ 1.6 and beats Newton’s method in terms
of efficiency (→ Section 8.3.3).
F′(x^{(k)}) ≈ \frac{F(x^{(k)}) − F(x^{(k−1)})}{x^{(k)} − x^{(k−1)}}  “difference quotient”   (8.4.61)

(already computed! → cheap)
J_k(x^{(k)} − x^{(k−1)}) = F(x^{(k)}) − F(x^{(k−1)}) .   (8.4.62)

Iteration:  x^{(k+1)} := x^{(k)} − J_k^{−1} F(x^{(k)}) .   (8.4.63)
Reasoning: If we assume that J_k is a good approximation of D F(x^{(k)}), then it would be foolish not to use the information contained in J_k for the construction of J_{k+1}.
What can “small modification” mean: Demand that Jk acts like Jk−1 on a complement of the span of
x ( k ) − x ( k − 1) !
To start the iteration we have to initialize J0 , e.g. with the exact Jacobi matrix D F(x(0) ).
in another sense, J_k is closest to J_{k−1} under the constraint of the secant condition (8.4.62):
Let x(k) and Jk be the iterates and matrices, respectively, from Broyden’s method (8.4.66), and let J ∈ R n,n
satisfy the same secant condition (8.4.62) as Jk+1 :
J ( x ( k + 1) − x ( k ) ) = F ( x ( k + 1) ) − F ( x ( k ) ) . (8.4.68)
(I − J_k^{−1} J)(x^{(k+1)} − x^{(k)}) = −J_k^{−1} F(x^{(k)}) − J_k^{−1}\big(F(x^{(k+1)}) − F(x^{(k)})\big) = −J_k^{−1} F(x^{(k+1)}) .   (8.4.69)
Using the submultiplicative property (1.5.77) of the Euclidean matrix norm, we conclude
which we saw in Ex. 1.5.86. This estimate holds for all matrices J satisfying (8.4.68).
We may read this as follows: (8.4.65) gives the ‖·‖₂-minimal relative correction of J_{k−1}, such that the secant condition (8.4.62) holds.
We revisit the 2 × 2 non-linear system of the Exp. 8.4.42 and take x(0) = [0.7, 0.7] T . As starting value for
the matrix iteration we use J0 = D F(x(0) ).
[Fig. 302: Euclidean norms of the iteration errors vs. step of the iteration]
In general, the convergence of any iterative method for non-linear systems of equations can fail, that is, it may stall or even diverge.
Demand on good numerical software: Algorithms should warn users of impending failure. For iterative
methods this is the task of convergence monitors, that is, conditions, cheaply verifiable a posteriori during
the iteration, that indicate stalled convergence or divergence.
For the damped Newton’s method this role can be played by the natural monotonicity test, see Code 8.4.58;
if it fails repeatedly, then the iteration should terminate with an error status.
For Broyden’s quasi-Newton method, a similar strategy can rely on the relative size of the “simplified Broyden correction” J_k^{−1} F(x^{(k+1)}):

Convergence monitor for (8.4.66):  µ := \frac{‖J_{k−1}^{−1} F(x^{(k)})‖}{‖∆x^{(k−1)}‖} < 1 ?   (8.4.72)
We rely on the setting of Exp. 8.4.70. We track
1. the Euclidean norm of the iteration error,
2. and the value of the convergence monitor from (8.4.72).

[Fig. 303: error norm and convergence monitor vs. step of the iteration]

✁ Decay of the (norm of the) iteration error and µ are well correlated.
damped Broyden method (cf. the same idea for Newton’s method, Section 8.4.4):

‖J_k^{−1} F(x^{(k+1)})‖₂ < ‖∆x^{(k)}‖₂ .   (8.4.77)
Iterated application of (8.4.76) pays off if the iteration terminates after only a few steps. For large n ≫ 1 it is not advisable to form the matrices J_k^{−1} (which will usually be dense, in contrast to J_k); instead we employ fast successive multiplications with rank-1-matrices (→ Ex. 1.4.11) to apply J_k^{−1} to a vector. This is implemented in the following code.
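A sketch of one possible realization of this idea: J_0 is LU-factorized once, and the rank-1 Broyden updates (8.4.66) are inverted on the fly with the Sherman-Morrison formula, so that J_k^{−1} is only ever applied to vectors. Function names, the storage layout, and the termination criterion are assumptions.

#include <Eigen/Dense>
#include <vector>

// Broyden quasi-Newton method: J_0 is LU-factorized once; the rank-1 updates
// are inverted via the Sherman-Morrison formula, so J_k^{-1} is never formed.
template <typename FuncType>
Eigen::VectorXd broyden(FuncType &&F, const Eigen::MatrixXd &J0,
                        Eigen::VectorXd x, double rtol, double atol,
                        unsigned int maxit = 30) {
  const Eigen::PartialPivLU<Eigen::MatrixXd> lu(J0);   // factorize J_0 once
  std::vector<Eigen::VectorXd> dx, u;                  // stored update vectors
  std::vector<double> denom;                           // dx^T dx + dx^T u
  // apply the current J_k^{-1} to a vector by successive rank-1 corrections
  auto applyJinv = [&](const Eigen::VectorXd &y) -> Eigen::VectorXd {
    Eigen::VectorXd z = lu.solve(y);
    for (std::size_t j = 0; j < dx.size(); ++j)
      z -= u[j] * (dx[j].dot(z) / denom[j]);
    return z;
  };
  Eigen::VectorXd s = applyJinv(F(x));                 // first correction
  for (unsigned int k = 0; k < maxit; ++k) {
    x -= s;                                            // quasi-Newton step (8.4.63)
    const Eigen::VectorXd w = applyJinv(F(x));         // J_k^{-1} F(x^{(k+1)})
    if (w.norm() <= rtol * x.norm() || w.norm() <= atol) break;
    dx.push_back(-s);                                  // Delta x^{(k)}
    u.push_back(w);
    denom.push_back(dx.back().squaredNorm() + dx.back().dot(w));
    // next correction J_{k+1}^{-1} F(x^{(k+1)}) in closed form
    s = w * (dx.back().squaredNorm() / denom.back());
  }
  return x;
}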
Memory cost for N steps:
✦ LU-factors of J + auxiliary vectors ∈ R^n,
✦ N vectors x^{(k)} ∈ R^n.

F : R^n → R^n ,  F(x) = diag(x)Ax − b ,
b = [1, 2, …, n] ∈ R^n ,  A = I + aa^T ∈ R^{n,n} ,  a = \frac{1}{\sqrt{1·b − 1}}(b − 1) .

Initial guess: h = 2/n; x0 = (2:h:4-h)’;

The results resemble those of Exp. 8.4.70 ✄
[Fig. 304: norms ‖F(x^{(k)})‖ and error norms vs. iteration step, for Broyden’s method, Newton’s method, and the simplified Newton method]
[Fig. 305: number of steps vs. n;  Fig. 306: runtime [s] vs. n — Broyden’s method vs. Newton’s method]
☞ In conclusion,
the Broyden method is worthwhile for dimensions n ≫ 1 and low accuracy requirements.
Learning Outcomes
• Knowledge about concepts related to the speed of convergence of an iteration for solving a non-
linear system of equations.
• Ability to estimate type and orders of convergence from empiric data.
• Ability to predict asymptotic linear, quadratic and cubic convergence by inspection of the iteration
function.
• Familiarity with (damped) Newton’s method for general non-linear systems of equations and with the
secant method in 1D.
• Ability to derive the Newton iteration for an (implicitly) given non-linear system of equations.
• Knowledge about quasi-Newton methods as multi-dimensional generalizations of the secant method.
(x_1^∗, …, x_n^∗) = \operatorname*{argmin}_{x ∈ R^n} \sum_{i=1}^{m} |f(x_1, …, x_n; t_i) − y_i|^2 .   (8.6.2)
Example 8.6.3 (Non-linear data fitting (parametric statistics) → Ex. 8.6.1 revisited)
Given:  F : D ⊂ R^n → R^m , m, n ∈ N, m > n.
Find:  x^∗ ∈ D:  x^∗ = \operatorname*{argmin}_{x ∈ D} Φ(x) ,  Φ(x) := ½‖F(x)‖₂^2 .   (8.6.5)

Terminology: D ≙ parameter space, x_1, …, x_n ≙ parameters.
As in the case of linear least squares problems (→ Section 3.1.1): a non-linear least squares problem is
related to an overdetermined non-linear system of equations F(x) = 0.
As for non-linear systems of equations (→ Chapter 8): existence and uniqueness of x∗ in (8.6.5) has to
be established in each concrete case!
We require “independence for each parameter” (→ Rem. 3.1.27):

∃ neighbourhood U(x^∗) such that D F(x) has full rank n  ∀x ∈ U(x^∗) .   (8.6.6)

(It means: the columns of the Jacobi matrix D F(x) are linearly independent.)
If (8.6.6) is not satisfied, then the parameters are redundant in the sense that fewer parameters would be
enough to model the same dependence (locally at x∗ ), cf. Rem. 3.1.27.
Simple idea: use Newton’s method (→ Section 8.4) to determine a zero of grad Φ : D ⊂ R^n → R^n:

x^{(k+1)} = x^{(k)} − HΦ(x^{(k)})^{−1} grad Φ(x^{(k)}) ,  (HΦ(x) ≙ Hessian matrix) .   (8.6.7)
Recommendation, cf. § 8.4.8: when in doubt, differentiate components of matrices and vectors!
Newton’s method (8.6.7) for (8.6.5) can be read as successive minimization of a local quadratic approximation of Φ:

Φ(x) ≈ Q(s) := Φ(x^{(k)}) + grad Φ(x^{(k)})^T s + \frac{1}{2} s^T HΦ(x^{(k)}) s ,   (8.6.10)

grad Q(s) = 0 ⇔ HΦ(x^{(k)}) s + grad Φ(x^{(k)}) = 0 ⇔ (8.6.8) .
➣ So we deal with yet another model function method (→ Section 8.3.2), with the quadratic model function Q.
Note: This approach is different from local quadratic approximation of Φ underlying Newton’s method for
(8.6.5), see Section 8.6.1, Rem. 8.6.9.
Gauss-Newton iteration (under assumption (8.6.6))
For A ∈ R^{m,n}:  x = A\b  ⟺  x is the minimizer of ‖Ax − b‖₂ with minimal 2-norm.

16    return x;
17  }
Note: Code 8.6.12 also implements Newton’s method (→ Section 8.4.1) in the case m = n!
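A sketch of a Gauss-Newton loop in Eigen, in which the correction is obtained from the linearized least squares problem via a QR decomposition; names, solver choice, and termination criterion are assumptions:

#include <Eigen/Dense>

// Gauss-Newton iteration for min_x 0.5*||F(x)||_2^2 with F: R^n -> R^m, m > n.
template <typename FuncType, typename JacType>
Eigen::VectorXd gauss_newton(FuncType &&F, JacType &&DF, Eigen::VectorXd x,
                             double rtol = 1e-10, double atol = 1e-14,
                             unsigned int maxit = 100) {
  for (unsigned int k = 0; k < maxit; ++k) {
    // correction s = argmin_s || DF(x) s - F(x) ||_2 ; update x <- x - s
    const Eigen::VectorXd s = DF(x).colPivHouseholderQr().solve(F(x));
    x -= s;
    if (s.norm() <= rtol * x.norm() || s.norm() <= atol) break;
  }
  return x;
}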
Summary:
C++-code 8.6.14:
1 #include <Eigen/Dense>
2 using Eigen::VectorXd;
3
[Fig. 307, Fig. 308: value of ‖F(x^{(k)})‖₂ and norm of grad Φ(x^{(k)}) vs. number of steps of the undamped Newton method]
initial value (1.8, 1.8, 0.1)T (red curve) ➤ Newton method caught in local minimum,
initial value (1.5, 1.5, 0.1)T (cyan curve) ➤ fast (locally quadratic) convergence.
[Fig. 309, Fig. 310: value of ‖F(x^{(k)})‖₂ vs. number of steps of the damped Newton method]
initial value (1.8, 1.8, 0.1)T (red curve) ➤ fast (locally quadratic) convergence,
initial value (1.5, 1.5, 0.1)T (cyan curve) ➤ Newton method caught in local minimum.
Second experiment: iterative solution of non-linear least squares data fitting problem by means of the
Gauss-Newton method (8.6.11), see Code 8.6.12.
[Fig. 311, Fig. 312: norm of the corrector and value of ‖F(x^{(k)})‖₂ vs. number of steps of the Gauss-Newton method]
We observe: linear convergence for all initial values, cf. Def. 8.1.9, Rem. 8.1.13.
As in the case of Newton’s method for non-linear systems of equations, see Section 8.4.4: often over-
shooting of Gauss-Newton corrections occurs.
λ = γ‖F(x^{(k)})‖₂ ,   γ := \begin{cases} 10 , & \text{if } ‖F(x^{(k)})‖₂ ≥ 10 , \\ 1 , & \text{if } 1 < ‖F(x^{(k)})‖₂ < 10 , \\ 0.01 , & \text{if } ‖F(x^{(k)})‖₂ ≤ 1 . \end{cases}
Chapter 9
Eigenvalues
Supplementary reading. [?] offers a comprehensive presentation of numerical methods for the solution of eigenvalue problems.
Simple electric circuit, cf. Ex. 2.1.3 ✄ [Figure: circuit with nodes ➀, ➁, ➂ and elements R, L, C]
Ex. 2.1.3: nodal analysis of linear (↔ composed of resistors, inductors, capacitors) electric circuit in fre-
quency domain (at angular frequency ω > 0) , see (2.1.6)
➣ linear system of equations for nodal potentials with complex system matrix A
For circuit of Code 9.0.3: three unknown nodal potentials
[Fig. 314: maximum nodal potentials |u₁|, |u₂|, |u₃| vs. angular frequency ω of the source voltage U, for R = 1, C = 1, L = 1]

Blow-up of some nodal potentials for certain ω!
5  Z = 1/R; K = 1/L;
6
20 figure('name','resonant circuit');
21 plot(res(:,1),res(:,2),'r-',res(:,1),res(:,3),'m-',res(:,1),res(:,4),'b-');
22 xlabel('{\bf angular frequency \omega of source voltage U}','fontsize',14);
23 ylabel('{\bf maximum nodal potential}','fontsize',14);
24 title(sprintf('R = %d, C= %d, L= %d',R,L,C));
25 legend('|u_1|','|u_2|','|u_3|');
26
27 print -depsc2 '../PICTURES/rescircpot.eps'
28
37 figure('name','resonances');
38 plot(real(omega),imag(omega),'r*'); hold on;
39 ax = axis;
40 plot([ax(1) ax(2)],[0 0],'k-');
41 plot([0 0],[ax(3) ax(4)],'k-');
42 grid on;
43 xlabel('{\bf Re(\omega)}','fontsize',14);
44 ylabel('{\bf Im(\omega)}','fontsize',14);
45 title(sprintf('R = %d, C= %d, L= %d',R,L,C));
46 legend('\omega');
47
48 print -depsc2 '../PICTURES/rescircomega.eps'
resonant frequencies ≙ ω ∈ {ω ∈ R : A(ω) singular}
If the circuit is operated at a real resonant frequency, the circuit equations will not possess a solution. Of
course, the real circuit will always behave in a well-defined way, but the linear model will break down due
to extremely large currents and voltages. In an experiment this breakdown manifests itself as a rather
explosive meltdown of circuit components. Hence, it is vital to determine the resonant frequencies of circuits in order to avoid their destruction.
A(ω)x = \Big(W + ıωC + \frac{1}{ıω}S\Big)x = 0 .   (9.0.4)
Substitution: y = \frac{1}{ıω}x ↔ x = ıωy [?, Sect. 3.4]:

(9.0.4) ⇔ \underbrace{\begin{bmatrix} W & S \\ I & 0 \end{bmatrix}}_{:=M} \underbrace{\begin{bmatrix} x \\ y \end{bmatrix}}_{:=z} = ω \underbrace{\begin{bmatrix} −ıC & 0 \\ 0 & −ıI \end{bmatrix}}_{:=B} \begin{bmatrix} x \\ y \end{bmatrix}   (9.0.5)
➣ generalized linear eigenvalue problem of the form: find ω ∈ C, z ∈ C2n \ {0} such that
Mz = ωBz . (9.0.6)
In this example one is mainly interested in the eigenvalues ω , whereas the eigenvectors z usually need
not be computed.
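Numerically, a generalized eigenvalue problem of the form (9.0.6) can be handed to MATLAB's two-argument eig. The following minimal sketch only illustrates the call; the block matrices are filled with made-up placeholder data, not with the circuit matrices of Code 9.0.3.

n = 3;                                          % number of nodal potentials (toy size)
W = gallery('moler',n); S = gallery('minij',n); % hypothetical stand-ins for the circuit data
C = eye(n);
M = [W, S; eye(n), zeros(n)];                   % block matrix M from (9.0.5)
B = [-1i*C, zeros(n); zeros(n), -1i*eye(n)];    % block matrix B from (9.0.5)
omega = eig(M,B);                               % generalized eigenvalues M z = omega B z, cf. (9.0.6)
disp(omega)                                     % candidate resonant frequencies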
[Fig. 315: resonant frequencies ω in the complex plane (Re(ω) versus Im(ω)) for the circuit from Code 9.0.3, R = 1, C = 1, L = 1]
ẏ = Ay ,  A ∈ C^{n,n} .   (9.0.8)
A = S \begin{pmatrix} λ_1 & & \\ & \ddots & \\ & & λ_n \end{pmatrix} S^{-1} =: S D S^{-1} ,  S ∈ C^{n,n} regular   =⇒   ẏ = Ay  ←→ (z = S^{-1} y)  ż = Dz .
In order to find the transformation matrix S all non-zero solution vectors (= eigenvectors) x ∈ C^n of the linear eigenvalue problem
Ax = λx
have to be found.
Supplementary reading. [?, Ch. 7], [?, Ch. 9], [?, Sect. 1.7]
Definition 9.1.1. Eigenvalues and eigenvectors → [?, Sects. 7.1,7.2], [?, Sect. 9.1]
• λ ∈ C eigenvalue (ger.: Eigenwert) of A ∈ K^{n,n} :⇔ det(λI − A) = 0  (characteristic polynomial χ(λ))
• spectrum of A ∈ K^{n,n}:  σ(A) := {λ ∈ C : λ eigenvalue of A}
• eigenspace (ger.: Eigenraum) associated with eigenvalue λ ∈ σ(A):  EigAλ := N(λI − A)
• x ∈ EigAλ \ {0} ⇒ x is an eigenvector
• geometric multiplicity (ger.: Vielfachheit) of an eigenvalue λ ∈ σ(A):  m(λ) := dim EigAλ
For any matrix norm ‖·‖ induced by a vector norm (→ Def. 1.5.76)
ρ(A) ≤ ‖A‖ .
Proof. Let z ∈ C^n \ {0} be an eigenvector to the largest (in modulus) eigenvalue λ of A ∈ C^{n,n}. Then
‖A‖ := sup_{x ∈ C^n \ {0}} ‖Ax‖/‖x‖ ≥ ‖Az‖/‖z‖ = |λ| = ρ(A) .
✷
Lemma 9.1.5. Gershgorin circle theorem → [?, Thm. 7.13], [?, Thm. 32.1], [?, Sect. 5.1]
For any A ∈ K^{n,n} holds true
σ(A) ⊂ ⋃_{j=1}^{n} { z ∈ C : |z − a_jj| ≤ ∑_{i≠j} |a_ji| } .
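The Gershgorin discs can be checked with a few lines of MATLAB; the test matrix below is an arbitrary example chosen purely for illustration.

A = [4 1 0; -1 6 2; 0 1 -3];                % arbitrary (hypothetical) test matrix
c = diag(A);                                 % disc centers a_jj
r = sum(abs(A - diag(c)), 2);                % disc radii: off-diagonal row sums, cf. Lemma 9.1.5
lambda = eig(A);
D = abs(bsxfun(@minus, lambda.', c));        % D(j,k) = |lambda_k - a_jj|
inDisc = any(bsxfun(@le, D, r), 1);          % is eigenvalue k contained in some disc?
disp(inDisc)                                 % must be all ones by the Gershgorin circle theorem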
Lemma 9.1.6. Similarity and spectrum → [?, Thm. 9.7], [?, Lemma 7.6], [?, Thm. 7.2]
The spectrum of a matrix is invariant with respect to similarity transformations: σ(S^{-1}AS) = σ(A) for every regular S ∈ K^{n,n}.
Lemma 9.1.7.
Existence of a one-dimensional invariant subspace
Examples of normal matrices are
• Hermitian matrices: A^H = A ➤ σ(A) ⊂ R
• unitary matrices: A^H = A^{−1} ➤ |σ(A)| = 1
• skew-Hermitian matrices: A = −A^H ➤ σ(A) ⊂ iR
➤ Normal matrices can be diagonalized by unitary similarity transformations
Eigenvalue problems (EVPs):
➊ Given A ∈ K^{n,n}, find all eigenvalues (= spectrum of A).
➋ Given A ∈ K^{n,n}, find σ(A) plus all eigenvectors.
➌ Given A ∈ K^{n,n}, find a few eigenvalues and associated eigenvectors.
x =̂ generalized eigenvector, λ =̂ generalized eigenvalue
Ax = λBx ⇔ B^{−1}Ax = λx .
However, usually it is not advisable to use this equivalence for numerical purposes!
Purpose: solution of eigenvalue problems ➊, ➋ for dense matrices “up to machine precision”
MATLAB-function: eig
Remark 9.2.1 (QR-Algorithm → [?, Sect. 7.5], [?, Sect. 10.3], [?, Ch. 26], [?, Sect. 5.5-5.7]): the QR-algorithm is the basis of numerically stable (→ Def. 1.5.85) eigensolvers for dense matrices.
(➞ =̂ affected rows/columns,  =̂ targeted vector)
[Diagram: sparsity patterns of the matrix during successive similarity transformations]
➣ transformation to tridiagonal form! (for general matrices a similar strategy can achieve a similarity transformation to upper Hessenberg form)
if D = diag(d1 , . . . , dn ).
MATLAB-code 9.2.6:
1 A = rand(500,500); B = A'*A; C = gallery('tridiag',500,1,3,1);
10 t3 = 1000; for k=1:3, tic; d = eig(Bn); t3 = min(t3, toc); end
11 t4 = 1000; for k=1:3, tic; [V,D] = eig(Bn); t4 = min(t4, toc); end
12 t5 = 1000; for k=1:3, tic; d = eig(Cn); t5 = min(t5, toc); end
13 times = [times; n t1 t2 t3 t4 t5];
14 end
15
16 figure;
17 loglog(times(:,1),times(:,2),'r+', times(:,1),times(:,3),'m*',...
18   times(:,1),times(:,4),'cp', times(:,1),times(:,5),'b^',...
19   times(:,1),times(:,6),'k.');
20 xlabel('{\bf matrix size n}','fontsize',14);
21 ylabel('{\bf time [s]}','fontsize',14);
22 title('eig runtimes');
23 legend('d = eig(A)','[V,D] = eig(A)','d = eig(B)','[V,D] = eig(B)','d = eig(C)',...
24   'location','northwest');
25
26 print -depsc2 '../PICTURES/eigtimingall.eps'
27
28 figure;
29 loglog(times(:,1),times(:,2),'r+', times(:,1),times(:,3),'m*',...
30   times(:,1),(times(:,1).^3)/(times(1,1)^3)*times(1,2),'k-');
31 xlabel('{\bf matrix size n}','fontsize',14);
32 ylabel('{\bf time [s]}','fontsize',14);
33 title('nxn random matrix');
34 legend('d = eig(A)','[V,D] = eig(A)','O(n^3)','location','northwest');
35
36 print -depsc2 '../PICTURES/eigtimingA.eps'
37
38 figure;
39 loglog(times(:,1),times(:,4),'r+', times(:,1),times(:,5),'m*',...
40   times(:,1),(times(:,1).^3)/(times(1,1)^3)*times(1,2),'k-');
41 xlabel('{\bf matrix size n}','fontsize',14);
42 ylabel('{\bf time [s]}','fontsize',14);
43 title('nxn random Hermitian matrix');
44 legend('d = eig(A)','[V,D] = eig(A)','O(n^3)','location','northwest');
45
46 print -depsc2 '../PICTURES/eigtimingB.eps'
47
48 figure;
49 loglog(times(:,1),times(:,6),'r*',...
50   times(:,1),(times(:,1).^2)/(times(1,1)^2)*times(1,2),'k-');
51 xlabel('{\bf matrix size n}','fontsize',14);
52 ylabel('{\bf time [s]}','fontsize',14);
53 title('nxn tridiagonal Hermitian matrix');
54 legend('d = eig(A)','O(n^2)','location','northwest');
55
56 print -depsc2 '../PICTURES/eigtimingC.eps'
[Fig. 316–319: measured runtimes (time [s] versus matrix size n, doubly logarithmic scale) for the different eig calls from Code 9.2.6]
☛ For the sake of efficiency: think which information you really need when computing eigenvalues/eigen-
vectors of dense matrices
Potentially more efficient methods for sparse matrices will be introduced below in Section 9.3, 9.4.
Supplementary reading. [?, Sect. 7.5], [?, Sect. 5.3.1], [?, Sect. 5.3]
Model: Random surfer visits a web page, stays there for fixed time ∆t, and then
➊ either follows each of ℓ links on a page with probabilty 1/ℓ.
➋ or resumes surfing at a randomly (with equal probability) selected page
Option ➋ is chosen with probability d, 0 ≤ d ≤ 1, option ➊ with probability 1 − d.
This number ∈ ]0, 1[ can be used to gauge the “importance” of a web page, which, in turn, offers a way to sort the hits resulting from a keyword query: the GOOGLE idea.
(G)ij = 1 ⇒ link j → i ,
[Fig. 320, Fig. 321: page rank versus page number for the harvard500 data set]
Observation: relative visit times stabilize as the number of hops in the stochastic simulation → ∞.
The limit distribution is called stationary distribution/invariant measure of the Markov chain. This is what
we seek.
✦ Numbering of pages 1, . . . , N,  ℓ_i =̂ number of links from page i
✦ N × N-matrix of transition probabilities page j → page i:  A = (a_ij)_{i,j=1}^{N} ∈ R^{N,N},
  a_ij ∈ [0, 1] =̂ probability to jump from page j to page i.
⇒  ∑_{i=1}^{N} a_ij = 1 .   (9.3.3)
A matrix A ∈ [0, 1]^{N,N} with the property (9.3.3) is called a (column) stochastic matrix.
“Meaning” of A: given x ∈ [0, 1]^N, ‖x‖_1 = 1, where x_i is the probability of the surfer to visit page i, i = 1, . . . , N, at an instance t in time, y = Ax satisfies
y_j ≥ 0 ,   ∑_{j=1}^{N} y_j = ∑_{j=1}^{N} ∑_{i=1}^{N} a_ji x_i = ∑_{i=1}^{N} x_i ∑_{j=1}^{N} a_ji = ∑_{i=1}^{N} x_i = 1 .
y_j =̂ probability for visiting page j at time t + ∆t.
Thought experiment: Instead of a single random surfer we may consider m ∈ N, m ≫ 1, of them who
visit pages independently. The fraction of time m · T they all together spend on page i will obviously be
the same for T → ∞ as that for a single random surfer.
Instead of counting the surfers we watch the proportions of them visiting particular web pages at an instance of time. Thus, after the k-th hop we can assign a number x_i^(k) ∈ [0, 1] to web page i, which gives the proportion of surfers currently on that page: x_i^(k) := n_i^(k)/m, where n_i^(k) ∈ N_0 designates the number of surfers on page i after the k-th hop.
Now consider m → ∞. The law of large numbers suggests that the (“infinitely many”) surfers visiting page j will move on to other pages proportionally to the transition probabilities a_ij: in terms of proportions, for m → ∞ the stochastic evolution becomes a deterministic discrete dynamical system and we find
x_i^(k+1) = ∑_{j=1}^{N} a_ij x_j^(k) ,   (9.3.6)
that is, the proportion of surfers ending up on page i equals the sum of the proportions on the “source pages” weighted with the transition probabilities.
Notice that (9.3.6) amounts to a matrix×vector product. Thus, writing x^(0) ∈ [0, 1]^N, ‖x^(0)‖_1 = 1, for the initial distribution of the surfers on the net, we find that
x^(k) = A^k x^(0)
will be their mass distribution after k hops. If the limit exists, the i-th component of x* := lim_{k→∞} x^(k) tells us which fraction of the (infinitely many) surfers will be visiting page i most of the time. Thus, x* yields the stationary distribution of the Markov chain.
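The iteration (9.3.6) is easy to try out; the sketch below uses a small random column-stochastic matrix instead of the harvard500 data set and the helper prbuildA used in the lecture codes.

N = 50;
A = rand(N,N);                      % non-negative random entries
A = A*diag(1./sum(A,1));            % scale columns: A is column-stochastic, cf. (9.3.3)
x = ones(N,1)/N;                    % uniform initial distribution, ||x||_1 = 1
for k = 1:100
    x = A*x;                        % one hop: x^(k) = A x^(k-1), see (9.3.6)
end
% x now approximates the stationary distribution (eigenvector of A for eigenvalue 1)
fprintf('||A*x - x||_1 = %g, sum(x) = %g\n', norm(A*x-x,1), sum(x));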
6 load harvard500.mat; A = prbuildA(G,d);
7 N = size(A,1); x = ones(N,1)/N;
8
[Fig. 322, Fig. 323: page rank vectors after step 5 and step 15 of the iteration (9.3.6) for harvard500]
Comparison:
[Fig. 324, Fig. 325: page rank from 1 000 000 hops of the stochastic simulation versus the page rank vector after step 5 of (9.3.6), harvard500]
For r ∈ EigA1, that is, Ar = r, denote by |r| the vector (|r_i|)_{i=1}^{N}. Since all entries of A are non-negative, we conclude by the triangle inequality that ‖Ar‖_1 ≤ ‖A|r|‖_1,
⇒ 1 = ‖A‖_1 = sup_{x ∈ R^N \ {0}} ‖Ax‖_1/‖x‖_1 ≥ ‖A|r|‖_1/‖|r|‖_1 ≥ ‖Ar‖_1/‖r‖_1 = 1 .
⇒ ‖A|r|‖_1 = ‖Ar‖_1  ⇒ (if a_ij > 0)  |r| = ±r .
Hence, different components of r cannot have opposite sign, which means that r can be chosen to have non-negative entries, if the entries of A are strictly positive, which is the case for A from (9.3.4). After normalization ‖r‖_1 = 1 the eigenvector can be regarded as a probability distribution on {1, . . . , N}.
[Fig. 326, Fig. 327: page rank from the stochastic simulation versus the entries of the eigenvector r of A for the eigenvalue 1, harvard500]
The possibility to compute the stationary probability distribution of a Markov chain through an eigenvector
of the transition probability matrix is due to a property of stationary Markov chains called ergodicity.
Errors: [Fig. 328: ‖A^k x_0 − r‖_1 versus iteration step k (semi-logarithmic plot)]
The computation of page rank amounts to finding the eigenvector of the matrix A of transition probabilities
that belongs to its largest eigenvalue 1. This is addressed by an important class of practical eigenvalue
problems:
Try the above iteration for a general 10 × 10-matrix with largest eigenvalue 10 of algebraic multiplicity 1.
MATLAB-code 9.3.11:
d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0) + ones(n,n));
A = S*diag(d,0)*inv(S);
[Fig.: error norm ‖ z^(k)/‖z^(k)‖ − (S)_{:,10} ‖ versus iteration step (semi-logarithmic plot)]
(Note: (S)_{:,10} =̂ eigenvector for eigenvalue 10)
Suggests direct power method (ger.: Potenzmethode): iterative method (→ Section 8.1)
Note: the “normalization” of the iterates in (9.3.12) does not change anything (in exact arithmetic) and
helps avoid overflow in floating point arithmetic.
Due to (9.3.13), for large k ≫ 1 (⇒ |λ_n^k| ≫ |λ_j^k| for j ≠ n) the contribution of v_n (size ζ_n λ_n^k) in the eigenvector expansion (9.3.15) will be much larger than the contribution (size ζ_j λ_j^k) of any other eigenvector (if ζ_n ≠ 0): the eigenvector for λ_n will swamp all others for k → ∞.
Further, (9.3.15) nurtures the expectation: v_n will become dominant in z^(k) the faster, the better |λ_n| is separated from |λ_{n−1}|, see Thm. 9.3.21 for a rigorous statement.
When (9.3.12) has converged, there are two common ways to recover λ_max → [?, Alg. 7.20]:
➊  Az^(k) ≈ λ_max z^(k)  ➣  |λ_n| ≈ ‖Az^(k)‖ / ‖z^(k)‖  (modulus only!)
➋  λ_max ≈ argmin_{θ∈R} ‖Az^(k) − θz^(k)‖_2^2  ➤  λ_max ≈ ( (z^(k))^H A z^(k) ) / ‖z^(k)‖_2^2 .
This latter formula is extremely useful, which has earned it a special name:
Definition 9.3.16.
For A ∈ K^{n,n}, u ∈ K^n, the Rayleigh quotient is defined by
ρ_A(u) := (u^H A u) / (u^H u) .
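A minimal sketch of the direct power iteration (9.3.12) combined with the Rayleigh quotient as eigenvalue estimate; the synthetic test matrix mimics the construction of Code 9.3.11, everything else is chosen for illustration only.

d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0) + ones(n,n));
A = S*diag(d,0)/S;                    % synthetic matrix with spectrum {1,...,10}
z = rand(n,1); z = z/norm(z);         % random initial guess
for k = 1:50
    w = A*z;                          % power step
    z = w/norm(w);                    % normalization avoids overflow, cf. (9.3.12)
    lmax = (z'*A*z)/(z'*z);           % Rayleigh quotient, Def. 9.3.16
end
fprintf('approximate largest eigenvalue: %f\n', lmax);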
MATLAB-code 9.3.19:
d = (1:10)';
n = length(d); S = triu(diag(n:-1:1,0)+ones(n,n)); A = S*diag(d,0)*inv(S);
[Fig. 330: errors versus iteration step k, z^(0) = random vector;
 o : error |λ_n − ρ_A(z^(k))|,  ∗ : error norm ‖z^(k) − s_{·,n}‖,  + : |λ_n − ‖Az^(k−1)‖_2/‖z^(k−1)‖_2| ]
Test matrices:
① d=(1:10)’; ➣ |λn−1 | : |λn | = 0.9
② d = [ones(9,1); 2]; ➣ |λn−1 | : |λn | = 0.5
③ d = 1-2.^(-(1:0.5:5)’); ➣ |λn−1 | : |λn | = 0.9866
17 res = [];
18
ρ_EV^(k) := ‖z^(k) − s_{·,n}‖ / ‖z^(k−1) − s_{·,n}‖ ,   ρ_EW^(k) := |ρ_A(z^(k)) − λ_n| / |ρ_A(z^(k−1)) − λ_n| .

          ①                      ②                      ③
 k    ρ_EV^(k)  ρ_EW^(k)    ρ_EV^(k)  ρ_EW^(k)    ρ_EV^(k)  ρ_EW^(k)
 22   0.9102    0.9007      0.5000    0.5000      0.9900    0.9781
 23   0.9092    0.9004      0.5000    0.5000      0.9900    0.9791
 24   0.9083    0.9001      0.5000    0.5000      0.9901    0.9800
 25   0.9075    0.9000      0.5000    0.5000      0.9901    0.9809
 26   0.9068    0.8998      0.5000    0.5000      0.9901    0.9817
 27   0.9061    0.8997      0.5000    0.5000      0.9901    0.9825
 28   0.9055    0.8997      0.5000    0.5000      0.9901    0.9832
 29   0.9049    0.8996      0.5000    0.5000      0.9901    0.9839
 30   0.9045    0.8996      0.5000    0.5000      0.9901    0.9844
Observation: linear convergence (→ Def. 8.1.9):
‖Az^(k)‖_2 → λ_n ,  z^(k) → ±v  linearly with rate |λ_{n−1}| / |λ_n| ,
where z^(k) are the iterates of the direct power iteration and y^H z^(0) ≠ 0 is assumed. (→ Section 8.1.2)
More general segmentation problem (non-local): identify parts of the image, not necessarily connected,
with the same texture.
Local similarity matrix:
W ∈ R^{N,N} ,  N := mn ,   (9.3.26)
(W)_ij = { 0 , if pixels i, j not adjacent ;  0 , if i = j ;  σ(p_i, p_j) , if pixels i, j adjacent } .
Similarity function, e.g., with α > 0:  σ(x, y) := exp(−α(x − y)^2) ,  x, y ∈ R .
[Fig. 331: m × n pixel grid with lexicographic numbering 1, 2, 3, . . . , n; n+1, n+2, . . . , 2n; . . . ; (m−1)n+1, . . . , mn; ↔ =̂ adjacent pixels]
The entries of the matrix W measure the “similarity” of neighboring pixels: if (W)_ij is large, they encode (almost) the same intensity; if (W)_ij is close to zero, then they belong to parts of the picture with very different brightness. In the latter case, the boundary of the segment may separate the two pixels.
Ncut(X) := cut(X)/weight(X) + cut(X)/weight(V \ X) ,
with cut(X) := ∑_{i∈X, j∉X} w_ij ,  weight(X) := ∑_{i∈X, j∈X} w_ij .
[Fig. 332–335: pixel plots of the test image and of Ncut values for the pixel subsets considered]
△ Ncut(X ) for pixel subsets X defined by sliding rectangles, see Fig. 333.
Equivalent reformulation:
indicator function:  z : {1, . . . , N} → {−1, 1} ,  z_i := z(i) = { 1 , if i ∈ X ;  −1 , if i ∉ X } .   (9.3.29)
Ncut(X) = ( ∑_{z_i>0, z_j<0} −w_ij z_i z_j ) / ( ∑_{z_i>0} d_i ) + ( ∑_{z_i>0, z_j<0} −w_ij z_i z_j ) / ( ∑_{z_i<0} d_i ) ,   (9.3.30)
d_i = ∑_{j∈V} w_ij = weight({i}) .   (9.3.31)
Sparse matrices:
Ncut(X) = (y^⊤ A y) / (y^⊤ D y) ,   y := (1 + z) − β(1 − z) ,   β := ( ∑_{z_i>0} d_i ) / ( ∑_{z_i<0} d_i ) ,
(1 + z)^⊤ D (1 − z) = 0 ,
4 Ncut(X) = (1 + z)^⊤ A (1 + z) ( 1/(κ 1^⊤ D 1) + 1/((1 − κ) 1^⊤ D 1) ) = (y^⊤ A y) / (β 1^⊤ D 1) ,
where κ := ( ∑_{z_i>0} d_i ) / ( ∑_i d_i ) = β/(1 + β). Also observe
✦ (9.3.33) ⇒ 1 ∈ EigA0
✦ Lemma 2.8.12: A diagonally dominant =⇒ A is positive semidefinite (→ Def. 1.1.8)
Ncut(X ) ≥ 0 and 0 is the smallest eigenvalue of A.
However, we are by no means interested in a minimizer y ∈ Span{1} (with constant entries) that does
not provide a meaningful segmentation.
y ⊥ D1 ⇔ 1⊤ Dy = 0 . (9.3.36)
➣ Minimizing Ncut(X) amounts to minimizing a (generalized) Rayleigh quotient (→ Def. 9.3.16) over a discrete set of vectors, which is still an NP-hard problem.
Idea: Relaxation
✎ ☞
Task: (9.3.38) ⇔ find minimizer of a (generalized) Rayleigh quotient under a linear constraint
✍ ✌
Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of all (real!) eigenvalues of A = A^H ∈ C^{n,n}. Then
min_{y≠0} ρ_A(y) = λ_1  and  max_{y≠0} ρ_A(y) = λ_m .
Thm. 9.3.39 is a an immediate consequence of the following more general and fundamentally important
result.
Theorem 9.3.41. Courant-Fischer min-max theorem → [?, Thm. 8.1.2]
Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of the (real!) eigenvalues of A = A^H ∈ C^{n,n}. Write
U_0 := {0} ,  U_ℓ := ∑_{j=1}^{ℓ} EigAλ_j ,  ℓ = 1, . . . , m ,  and  U_ℓ^⊥ := { x ∈ C^n : u^H x = 0 ∀u ∈ U_ℓ } .
Then
λ_ℓ = min { ρ_A(x) : x ∈ U_{ℓ−1}^⊥ \ {0} } ,  ℓ = 1, . . . , m .
Proof. For diagonal A ∈ R n,n the assertion of the theorem is obvious. Thus, Cor. 9.1.9 settles everything.
Well, in Lemma 9.3.35 we encounter a generalized Rayleigh quotient ρ_{A,D}(y)! How can Thm. 9.3.39 be applied to it?
Transformation idea:  ρ_{A,D}(D^{−1/2} z) = ρ_{D^{−1/2} A D^{−1/2}}(z) ,  z ∈ R^N .   (9.3.43)
Apply Thm. 9.3.41 to the transformed matrix Ã := D^{−1/2} A D^{−1/2}. Elementary manipulations show
(9.3.38) ⇔ argmin_{1^⊤ D y = 0} ρ_{A,D}(y)  = (z = D^{1/2} y) =  argmin_{1^⊤ D^{1/2} z = 0} ρ_{A,D}(D^{−1/2} z) = argmin_{1^⊤ D^{1/2} z = 0} ρ_{Ã}(z)  with  Ã := D^{−1/2} A D^{−1/2} .   (9.3.44)
Related: transformation of a generalized eigenvalue problem into a standard eigenvalue problem according to
Ax = λBx  = (z = B^{1/2} x) ⇒  B^{−1/2} A B^{−1/2} z = λz .   (9.3.45)
B^{1/2} =̂ square root of the s.p.d. matrix B → Rem. 10.3.2.
For the segmentation problem: B = D is diagonal with positive diagonal entries, see (9.3.32),
➥ D^{−1/2} = diag(d_1^{−1/2}, . . . , d_N^{−1/2}) and Ã := D^{−1/2} A D^{−1/2} can easily be computed.
z* = argmin_{1^⊤ D^{1/2} z = 0} ρ_{Ã}(z)   ←→   Ãz = λz .   (9.3.46)
How to deal with the constraint 1^⊤ D^{1/2} z = 0 ?
Idea: Penalization — add a term P(z) to ρ_{Ã}(z) that becomes “sufficiently large” in case the constraint is violated.
How to choose the penalty function P(z) for the segmentation problem?
z* = argmin_{z∈R^N\{0}} ( ρ_{Ã}(z) + P(z) ) = argmin_{z∈R^N\{0}} ( ρ_{Ã}(z) + (z^⊤ (D^{1/2} 1 1^⊤ D^{1/2}) z)/(z^⊤ z) )   (9.3.47)
   = argmin_{z∈R^N\{0}} ρ_{Â}(z)  with  Â := Ã + D^{1/2} 1 1^⊤ D^{1/2} .
Cor. 9.1.9 ➤ The orthogonal complement of an eigenvector of a symmetric matrix is spanned by the other eigenvectors (orthonormalization of eigenvectors belonging to the same eigenvalue is assumed).
Note: this eigenvector z* will be orthogonal to D^{1/2}1, it satisfies the constraint, and, thus, P(z*) = 0!
Note: the eigenspaces of Ã and Â agree.
Note: Lemma 2.8.12 =⇒ Ã is positive semidefinite (→ Def. 1.1.8) with smallest eigenvalue 0 and D^{1/2}1 ∈ EigÃ0. Scaling the penalty term with µ = ‖Ã‖_∞ = 2 (9.3.49), cf. (1.5.79), lifts this eigenvalue, so that D^{1/2}1 is guaranteed not to be an eigenvector belonging to the smallest eigenvalue of Â.
z* = argmin_{1^⊤ D^{1/2} z = 0} ρ_{Ã}(z) = argmin_{z≠0} ρ_{Â}(z) .   (9.3.50)
By Thm. 9.3.39:
z* = eigenvector belonging to the minimal eigenvalue of Â
⇕
z* = eigenvector ⊥ D^{1/2}1 belonging to the minimal eigenvalue of Ã
⇕
D^{−1/2} z* = minimizer for (9.3.38).
➊ Given the similarity function σ compute the (sparse!) matrices W, D, A ∈ R^{N,N}, see (9.3.26), (9.3.32).
➋ Compute y*, ‖y*‖_2 = 1, as eigenvector belonging to the smallest eigenvalue of Â := D^{−1/2} A D^{−1/2} + 2 (D^{1/2}1)(D^{1/2}1)^⊤; set x* := D^{−1/2} y* and define
X := { i ∈ {1, . . . , N} : x_i* > (1/N) ∑_{i=1}^{N} x_i* } .   (9.3.52)
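A compact sketch of steps ➊–➋ for a tiny synthetic gray-scale image. The image, the value of α, and the choice A := D − W (suggested by (9.3.33) and Lemma 2.8.12, but not spelled out in this excerpt) are assumptions made for illustration; a dense eig call is used, which is exactly the inefficiency criticized for Code 9.3.53 below.

m = 8; n = 8; N = m*n; alpha = 10;
P = [zeros(m,n/2), ones(m,n/2)] + 0.05*rand(m,n);   % toy image with two regions (hypothetical)
idx = reshape(1:N, m, n);                           % pixel numbering on the grid
sigma = @(x,y) exp(-alpha*(x-y).^2);                % similarity function
W = sparse(N,N);
for j = 1:n, for i = 1:m                            % connect each pixel to its right/lower neighbour
    if i < m, W(idx(i,j),idx(i+1,j)) = sigma(P(i,j),P(i+1,j)); end
    if j < n, W(idx(i,j),idx(i,j+1)) = sigma(P(i,j),P(i,j+1)); end
end, end
W = W + W';                                         % symmetric similarity matrix (9.3.26)
d = full(sum(W,2)); A = diag(d) - W;                % degrees and (assumed) matrix A = D - W
Ah = diag(1./sqrt(d))*A*diag(1./sqrt(d)) + 2*(sqrt(d)*sqrt(d)');  % matrix from step (2)
[V,E] = eig(full(Ah)); [~,k] = min(diag(E));        % eigenvector for the smallest eigenvalue
x = V(:,k)./sqrt(d);                                % undo the D^{1/2} scaling
X = find(x > mean(x));                              % segment according to (9.3.52) (up to the sign of x)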
[Fig. 336, Fig. 337: segmented test images on the pixel grid]
[Fig. 338: the eigenvector x* for the image from Fig. 336 plotted on the pixel grid]
To identify more segments, the same algorithm is recursively applied to segment parts of the image
already determined.
Practical segmentation algorithms rely on many more steps, of which the above algorithm is only one, preceded by substantial preprocessing. Moreover, they dispense with the strictly local perspective adopted above and take into account more distant connections between image parts, often in a randomized fashion [?].
The image segmentation problem falls into the wider class of graph partitioning problems. Methods based on (a few of) the eigenvectors of the connectivity matrix belonging to the smallest eigenvalues are known as spectral partitioning methods. The eigenvector belonging to the smallest non-zero eigenvalue that we computed above is usually called the Fiedler vector of the graph, see [?, ?].
The solution of the image segmentation problem by means of eig in Code 9.3.53 amounts to a tremendous waste of computational resources: we compute all eigenvalues/eigenvectors of dense matrices, though only a single eigenvector associated with the smallest eigenvalue is of interest.
This motivates the quest to find efficient numerical methods for the following task.
Task: given A ∈ K n,n , find smallest (in modulus) eigenvalue of regular A ∈ K n,n
and (an) associated eigenvector.
If A ∈ K^{n,n} is regular:  smallest (in modulus) EV of A = ( largest (in modulus) EV of A^{−1} )^{−1}
MATLAB-code 9.3.54: inverse iteration for computing λ_min(A) and associated eigenvector
1 function [lmin,y] = invit(A,tol)
2 [L,U] = lu(A); % single initial LU-factorization, see Rem. 2.5.10
3 n = size(A,1); x = rand(n,1); x = x/norm(x); % random initial guess
4 y = U\(L\x); lmin = 1/norm(y); y = y*lmin; lold = 0;
5 while (abs(lmin-lold) > tol*lmin) % termination, if small relative change
6   lold = lmin; x = y;
7   y = U\(L\x); % core iteration: y = A^{-1}x
8   lmin = 1/norm(y); % new approximation of λ_min(A)
9   y = y*lmin; % normalization y := y/‖y‖_2
10 end
where: (A − αI)−1 z(k−1) = ˆ solve (A − αI)w = z(k−1) based on Gaussian elimination (↔ a single
LU-factorization of A − αI as in Code 9.3.54).
Stability of Gaussian elimination/LU-factorization (→ ??) will ensure that “w from (9.3.56) points in
the right direction”
In other words, roundoff errors may badly affect the length of the solution w, but not its direction.
Practice [?]: If, in the course of Gaussian elimination/LU-factorization a zero pivot element is really en-
countered, then we just replace it with eps, in order to avoid inf values!
|λ_j − α| / min{ |λ_i − α| : i ≠ j }   with  λ_j ∈ σ(A) ,  |α − λ_j| ≤ |α − λ| ∀λ ∈ σ(A) .
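A hedged sketch of the shifted inverse iteration (9.3.56), reusing a single LU-factorization of A − αI in the spirit of Code 9.3.54; the test matrix and the shift α are arbitrary choices for illustration.

n = 50; A = gallery('tridiag',n,-1,2,-1);      % arbitrary symmetric test matrix
alpha = 0.01;                                   % shift, assumed close to the wanted eigenvalue
[L,U] = lu(A - alpha*speye(n));                 % single LU-factorization of A - alpha*I
z = rand(n,1); z = z/norm(z);
for k = 1:30
    w = U\(L\z);                                % solve (A - alpha*I) w = z^(k-1), cf. (9.3.56)
    z = w/norm(w);                              % normalization
end
lambda = z'*A*z;                                % Rayleigh quotient: eigenvalue of A closest to alpha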
MATLAB-code 9.3.61:
d = (1:10)';
n = length(d);
Z = diag(sqrt(1:n),0) + ones(n,n);
[Q,R] = qr(Z);
A = Q*diag(d,0)*Q';
[Fig.: errors versus step k (semi-logarithmic plot);  o : |λ_min − ρ_A(z^(k))|,  ∗ : ‖z^(k) − x_j‖, λ_min = λ_j, x_j ∈ EigAλ_j, ‖x_j‖_2 = 1]
Task: given A ∈ K n,n , find smallest (in modulus) eigenvalue of regular A ∈ K n,n
and (an) associated eigenvector.
Options: inverse iteration (→ Code 9.3.54) and Rayleigh quotient iteration (9.3.59).
• for large sparse A the amount of fill-in exhausts memory, despite sparse elimination techniques (→
Section 2.7.5),
We expect that an approximate solution of the linear systems of equations encountered during inverse iteration
should be sufficient, because we are dealing with approximate eigenvectors anyway.
Thus, iterative solvers for solving Aw = z(k−1) may be considered, see Chapter 10. However, the required
accuracy is not clear a priori. Here we examine an approach that completely dispenses with an iterative
solver and uses a preconditioner (→ Notion 10.3.3) instead.
➣ B=
ˆ Preconditioner for A, see Notion 10.3.3
MATLAB-code 9.3.66:
1 A = spdiags(repmat([1/n,-1,2*(1+1/n),-1,1/n],n,1), [-n/2,-1,0,1,n/2],n,n);
2 evalA = @(x) A*x;
3 % inverse iteration
4 invB = @(x) A\x;
5 % tridiagonal preconditioning
6 B = spdiags(spdiags(A,[-1,0,1]),[-1,0,1],n,n); invB = @(x) B\x;
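The preconditioned inverse iteration PINVIT (9.3.63) itself is not reproduced in this excerpt; the sketch below assumes the common residual-correction form of the update and combines it with the tridiagonal preconditioner of Code 9.3.66.

n = 200;
A = spdiags(repmat([1/n,-1,2*(1+1/n),-1,1/n],n,1), [-n/2,-1,0,1,n/2], n, n);
B = spdiags(spdiags(A,[-1,0,1]), [-1,0,1], n, n);   % tridiagonal part as preconditioner
invB = @(x) B\x;
z = rand(n,1); z = z/norm(z);
for k = 1:50
    rho = z'*A*z;                    % Rayleigh quotient (z is normalized)
    r = A*z - rho*z;                 % residual of the current eigenpair approximation
    z = z - invB(r);                 % preconditioned correction (assumed PINVIT-type update)
    z = z/norm(z);                   % normalization
end
lmin_approx = z'*A*z;                % approximation of the smallest eigenvalue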
Monitored: error decay during iteration of Code 9.3.64: |ρA (z(k) ) − λmin (A)|
[Fig. 339: error |ρ_A(z^(k)) − λ_min(A)| versus iteration step; Fig. 340: number of iteration steps versus matrix size n]
For small residual Az(k−1) − ρA (z(k−1) )z(k−1) PINVIT almost agrees with the regular inverse iteration.
ÿ + λ^2 y = cos(ωt) ,   (9.3.69)
ÿ + Ay = b cos(ωt) ,   (9.3.71)
with symmetric, positive (semi)definite matrix A ∈ R^{n,n}, b ∈ R^n. By Cor. 9.1.9 there is an orthogonal matrix Q ∈ R^{n,n} such that
Q^⊤ A Q = D := diag(λ_1, . . . , λ_n) .
(9.3.71) ⇒ z̈ + Dz = Q^⊤ b cos(ωt)  for z := Q^⊤ y .
☛ We have obtained decoupled linear 2nd-order scalar ODEs of the type (9.3.69). (9.3.71) can have growing (with time) solutions, if ω = √λ_i for some i = 1, . . . , n.
If ω = √λ_j for one j ∈ {1, . . . , n}, then the solution of the initial value problem for (9.3.71) with y(0) = ẏ(0) = 0 (↔ z(0) = ż(0) = 0) behaves like
z(t) ∼ (t/(2ω)) sin(ωt) e_j + bounded oscillations
⇕
y(t) ∼ (t/(2ω)) sin(ωt) (Q)_{:,j} + bounded oscillations ,
where (Q)_{:,j} is the j-th eigenvector of A.
Eigenvectors of A ↔ excitable states
Example 9.3.72 (Vibrations of a truss structure, cf. [?, Sect. 3], MATLAB's truss demo)
[Fig. 341: planar truss structure]
Assumptions: ✦ Truss in static equilibrium (perfect balance of forces at each point mass).
✦ Rods are perfectly elastic (i.e., frictionless).
✦ Hooke's law holds for the force in the direction of a rod:
F = α Δl/l ,   (9.3.74)
✁ deformed truss:
l_ij := ‖Δp_ji‖_2 ,  Δp_ji := p_j − p_i ,   (9.3.75)
Δl_ij(t) := ‖Δp_ji + Δu_ji(t)‖_2 − l_ij ,  Δu_ji(t) := u_j(t) − u_i(t) .   (9.3.76)
F_ij(t) = −α_ij (Δl_ij / l_ij) · (Δp_ji + Δu_ji(t)) / ‖Δp_ji + Δu_ji(t)‖_2 .   (9.3.77)
✞ ☎
Assumption: Small displacements
✝ ✆
➣ Possibility of linearization by neglecting terms of order ‖u_i‖_2^2
F_ij(t) = ((9.3.75), (9.3.76)) = α_ij ( 1/‖Δp_ji + Δu_ji(t)‖_2 − 1/‖Δp_ji‖_2 ) · (Δp_ji + Δu_ji(t)) .   (9.3.78)
1/‖x + y‖_2 = 1/‖x‖_2 − (x · y)/‖x‖_2^3 + O(‖y‖_2^2) .
Proof. Simple Taylor expansion up to the linear term for f(x) = (x_1^2 + · · · + x_d^2)^{−1/2}:  f(x + y) = f(x) + grad f(x) · y + O(‖y‖_2^2).
✷
Linearization of the force: apply Lemma 9.3.79 to (9.3.78) and drop terms O(‖Δu_ji‖_2^2):
F_ij(t) ≈ −α_ij (Δp_ji · Δu_ji(t)) / l_ij^3 · (Δp_ji + Δu_ji(t)) ≈ −α_ij (Δp_ji · Δu_ji(t)) / l_ij^3 · Δp_ji .   (9.3.80)
m_i (d^2/dt^2) u_i(t) = F_i = ∑_{j=1, j≠i}^{n} −F_ij(t) ,   (9.3.81)
m_i =̂ mass of point mass i.
m_i (d^2/dt^2) u_i(t) = ∑_{j=1, j≠i}^{n} α_ij (1/l_ij^3) Δp_ji (Δp_ji)^⊤ ( u_j(t) − u_i(t) ) .   (9.3.82)
Compact notation: collect all displacements into one vector u(t) = ( u_i(t) )_{i=1}^{n} ∈ R^{2n}:
(9.3.82) ⇒ M (d^2 u/dt^2)(t) + A u(t) = f(t) .   (9.3.83)
✛ ✘
Rem. 9.3.68: if periodic external forces f(t) = cos(ωt) f, f ∈ R^{2n} (wind, earthquake), act on the truss, they can excite vibrations of (linearly in time) growing amplitude, if ω coincides with √λ_j for an eigenvalue λ_j of A.
✚ ✙
Excited vibrations can lead to the collapse of a truss structure, cf. the notorious Tacoma-Narrows bridge disaster.
It is essential to know whether eigenvalues of a truss structure fall into a range that can be excited
by external forces.
These will typically(∗) be the low modes ↔ a few of the smallest eigenvalues.
((∗) Reason: fast oscillations will quickly be damped due to friction, which was neglected in our model.)
[Fig. 343: eigenvalues of the truss stiffness matrix plotted versus their number]
The stiffness matrix will always possess three zero eigenvalues corresponding to rigid body modes (= displacements without change of length of the rods).
[Fig. 344–347: plots of truss eigenmodes]
To compute a few of a truss's lowest resonant frequencies and excitable modes, we need efficient numerical methods for the following tasks. Obviously, Code 9.3.85 cannot be used for large trusses, because eig invariably operates on dense matrices and will be prohibitively slow and gobble up huge amounts of memory, also recall the discussion of Code 9.3.53.
Of course, we aim to tackle this task by iterative methods generalizing power iteration (→ Section 9.3.1)
and inverse iteration (→ Section 9.3.2).
9.3.4.1 Orthogonalization
According to Cor. 9.1.9: For A = A⊤ ∈ R n,n there is a factorization A = UDU⊤ with D = diag(λ1 , . . . , λn ),
λ j ∈ R, λ1 ≤ λ2 ≤ · · · ≤ λn , and U orthogonal. Thus, u j := (U):,j , j = 1, . . . , n, are (mutually orthog-
onal) eigenvectors of A.
If we just carry out the direct power iteration (9.3.12) for two vectors, both sequences will converge to an eigenvector for the largest (in modulus) eigenvalue. However, we recall that all eigenvectors are mutually orthogonal. This suggests that we orthogonalize the iterates of the second power iteration (that is to yield the eigenvector for the second largest eigenvalue) with respect to those of the first. This idea spawns the following iteration, cf. Gram-Schmidt orthogonalization in (10.2.11):
✁ Orthogonalization of two vectors: w ↦ w − (w · v/‖v‖_2) v/‖v‖_2 (see Line 4 of Code 9.3.86). [Fig. 348]
Analysis through eigenvector expansions (v, w ∈ R^n, ‖v‖_2 = ‖w‖_2 = 1):
v = ∑_{j=1}^{n} α_j u_j ,  w = ∑_{j=1}^{n} β_j u_j ,
⇒ Av = ∑_{j=1}^{n} λ_j α_j u_j ,  Aw = ∑_{j=1}^{n} λ_j β_j u_j ,
v_0 := Av/‖Av‖_2 = ( ∑_{j=1}^{n} λ_j^2 α_j^2 )^{−1/2} ∑_{j=1}^{n} λ_j α_j u_j ,
Aw − (v_0^⊤ Aw) v_0 = ∑_{j=1}^{n} ( β_j − ( ∑_{i=1}^{n} λ_i^2 α_i β_i / ∑_{i=1}^{n} λ_i^2 α_i^2 ) α_j ) λ_j u_j .
We notice that v is just mapped to the next iterate in the regular direct power iteration (9.3.12). After many
steps, it will be very close to un , and, therefore, we may now assume v = un ⇔ α j = δj,n (Kronecker
symbol).
z := Aw − (v_0^⊤ Aw) v_0 = 0 · u_n + ∑_{j=1}^{n−1} λ_j β_j u_j ,
w^(new) := z/‖z‖_2 = ( ∑_{j=1}^{n−1} λ_j^2 β_j^2 )^{−1/2} ∑_{j=1}^{n−1} λ_j β_j u_j .
The sequence w^(k) produced by repeated application of the mapping given by Code 9.3.86 asymptotically (that is, when v^(k) has already converged to u_n) agrees with the sequence produced by the direct power method for Ã := U diag(λ_1, . . . , λ_{n−1}, 0) U^⊤. Its convergence will be governed by the relative gap λ_{n−2}/λ_{n−1}, see Thm. 9.3.21.
However: if v(k) itself converges slowly, this reasoning does not apply.
MATLAB-code 9.3.88: power iteration with orthogonal projection for two vectors
1 function sppowitdriver(d,maxit)
2 % monitor power iteration with orthogonal projection for finding
3 % the two largest (in modulus) eigenvalues and associated eigenvectors
4 % of a symmetric matrix with prescribed eigenvalues passed in d
5 if (nargin < 10), maxit = 20; end
6 if (nargin < 1), d = (1:10)'; end
7 % Generate matrix
8 n = length(d);
9 Z = diag(sqrt(1:n),0) + ones(n,n);
10 [Q,R] = qr(Z); % generate orthogonal matrix
11 A = Q*diag(d,0)*Q'; % “synthetic” A = A^T with spectrum σ(A) = {d_1, . . . , d_n}
12 % Compute “exact” eigenvectors and eigenvalues
13 [V,D] = eig(A); [d,idx] = sort(diag(D)),
14 v_ex = V(:,idx(n)); w_ex = V(:,idx(n-1));
15 lv_ex = d(n); lw_ex = d(n-1);
16
29 min(norm(v-v_ex),norm(v+v_ex)), min(norm(w-w_ex),norm(w+w_ex))];
30 end
31
32 figure('name','sspowit');
33 semilogy(result(:,1),result(:,2),'m-+',...
34   result(:,1),result(:,3),'r-*',...
35   result(:,1),result(:,4),'k-^',...
36   result(:,1),result(:,5),'b-p');
37 title('d = [0.5*(1:8),9.5,10]');
38 xlabel('{\bf power iteration step}','fontsize',14);
39 ylabel('{\bf error}','fontsize',14);
40 legend('error in \lambda_n','error in \lambda_n-1','error in v','error in w','location','northeast');
41 print -depsc2 '../PICTURES/sspowitcvg1.eps';
42
[Fig. 349–352: errors in λ_n, λ_{n−1}, v, w and the corresponding error quotients versus power iteration step, for d = [0.5*(1:8),9.5,10]]
Nothing new:
Gram-Schmidt orthonormalization
(→ [?, Thm. 4.8], [?, Alg. 6.1], [?, Sect. 3.4.3])
➊ q_l^⊤ q_k = δ_{lk}  (orthonormality) ,   (9.3.89)
➋ Span{q_1, . . . , q_k} = Span{v_1, . . . , v_k}  for all k = 1, . . . , m .   (9.3.90)
z_1 = v_1 ,
z_2 = v_2 − (v_2^⊤ z_1)/(z_1^⊤ z_1) z_1 ,
z_3 = v_3 − (v_3^⊤ z_1)/(z_1^⊤ z_1) z_1 − (v_3^⊤ z_2)/(z_2^⊤ z_2) z_2 ,   (9.3.91)
  ⋮
+ normalization  q_k = z_k / ‖z_k‖_2 ,  k = 1, . . . , m .   (9.3.92)
Easy computation: the vectors q_1, . . . , q_m produced by (9.3.91) satisfy (9.3.89) and (9.3.90).
11 q = q - dot (Q(:,k),V(:,l))*Q(:,k);
12 end
13 Q = [Q,q/norm(q)]; % normalization
14 end
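For completeness, a self-contained version of the Gram-Schmidt loop (9.3.91)/(9.3.92), of which the listing above shows only the inner orthogonalization; the function name and interface are chosen here and not taken from the lecture codes.

function Q = gsorthonormalize(V)
% Classical Gram-Schmidt orthonormalization (9.3.91), (9.3.92):
% the columns of Q form an ONB of the column space of V.
[n,m] = size(V);
Q = zeros(n,0);
for l = 1:m
    q = V(:,l);
    for k = 1:size(Q,2)
        q = q - dot(Q(:,k),V(:,l))*Q(:,k);   % subtract projections onto previous q_k
    end
    Q = [Q, q/norm(q)];                      % normalization (9.3.92)
end
end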
The following two MATLAB code snippets perform the same function, cf. Code 9.3.86:
MATLAB-code 9.3.97: general subspace power iteration step with qr-based orthonormalization
1 f u n c t i o n V = sspowitstep(A,V)
2 % power iteration with orthonormalization for A = A T .
3 % columns of matrix V span subspace for power iteration.
4 V = A*V; % actual power iteration on individual columns
5 [V,R] = qr(V,0); % Gram-Schmidt orthonormalization (9.3.91)
✦ the first column of V, (V):,1 , is a sequence of vectors created by the standard direct power method
(9.3.12).
✦ reasoning: the other columns of V, after each multiplication with A can be expected to contain a
significant component in the direction of the eigenvector associated with the eigenvalue of largest
modulus.
Since the columns of V span a subspace of R n , this idea can be recast as the following task:
⇔ ∃w ∈ V \ {0}: Aw = λw
⇔ ∃u ∈ K^m \ {0}: AVu = λVu
⇒ ∃u ∈ K^m \ {0}: V^H A V u = λ V^H V u ,   (9.3.98)
If our initial assumption holds true, u solves (9.3.99), and λ is a simple eigenvalue, then a corresponding x ∈ EigAλ can be recovered as x = Vu.
Note: If V is unitary (→ Def. 6.2.2), then the generalized eigenvalue problem (9.3.99) will become a
standard linear eigenvalue problem.
We revisit m = 2, see Code 9.3.86. Recall that by the min-max theorem Thm. 9.3.41
Idea: maximize the Rayleigh quotient over Span{v, w}, where v, w are output by Code 9.3.86. This leads to the optimization problem
(α*, β*) := argmax_{α,β∈R, α^2+β^2=1} ρ_A(αv + βw) = argmax_{α,β∈R, α^2+β^2=1} ρ_{(v,w)^⊤ A (v,w)}( (α, β)^⊤ ) ,   (9.3.102)
v* := α* v + β* w .
Note that ‖v*‖_2 = 1, if both v and w are normalized, which is guaranteed in Code 9.3.86.
Again the min-max theorem Thm. 9.3.41 tells us that we can find (α*, β*)^⊤ as eigenvector to the largest eigenvalue of
(v, w)^⊤ A (v, w) (α, β)^⊤ = λ (α, β)^⊤ .   (9.3.103)
MATLAB-code 9.3.104: one step of subspace power iteration with Ritz projection, matrix version
1 f u n c t i o n V = sspowitsteprp(A,V)
2 V = A*V; % power iteration applied to columns of V
3 [Q,R] = qr(V,0); % orthonormalization, see Section 9.3.4.1
4 [U,D] = eig(Q’*A*Q); % Solve Ritz projected m × m eigenvalue problem
5 V = Q*U; % recover approximate eigenvectors
Note that the orthogonalization step in Code 9.3.104 is actually redundant, if exact arithmetic could be employed, because the Ritz projection could also be realized by solving the generalized eigenvalue problem. However, prior orthogonalization is essential for numerical stability (→ Def. 1.5.85), cf. the discussion in Section 3.3.3.
Listing 9.1: Main loop: power iteration with Ritz projection for two eigenvectors
1 % See Code 9.3.88 for generation of matrix A and output
2 for k=1:maxit
3   v_new = A*v; w_new = A*w; % “power iteration”, cf. (9.3.12)
4   [Q,R] = qr([v_new,w_new],0); % orthogonalization, see Sect. 9.3.4.1
5   [U,D] = eig(Q'*A*Q); % Solve Ritz projected eigenvalue problem
6   [ev,idx] = sort(abs(diag(D))), % Sort eigenvalues
7   w = Q*U(:,idx(1)); v = Q*U(:,idx(2)); % Recover approximate eigenvectors
8
[Fig. 353, Fig. 354: error decay versus power iteration step for the power iteration with Ritz projection]
[Fig. 355, Fig. 356: error quotients for λ_n, λ_{n−1}, v, w versus power iteration step, d = [0.5*(1:8),9.5,10]]
In Code 9.3.104: diagonal entries of D provide approximations of eigenvalues. Their (relative) changes
can be used as a termination criterion.
S.p.d. test matrix: a_ij := min{i, j}/max{i, j} (MATLAB: n=200; A = gallery('lehmer',n);)
“Initial eigenvector guesses”: V = eye(n,m);
[Fig. 357: errors in the eigenvalues λ_1, λ_2, λ_3 versus iteration step, for subspace dimensions m = 3 and m = 6]
• Observation: linear convergence of eigenvalues
• choice m > k boosts convergence of eigenvalues
Analogous to § 9.3.106: construction of subspace variants of inverse iteration (→ Code 9.3.54), PINVIT (9.3.63), and Rayleigh quotient iteration (9.3.59).
All power methods (→ Section 9.3) for the eigenvalue problem (EVP) Ax = λx only rely on the last iterate
to determine the next one (1-point methods, cf. (8.1.4))
“Memory for power iterations”: pursue same idea that led from the gradient method, § 10.1.11, to the con-
jugate gradient method, § 10.2.17: use information from previous iterates to achieve efficient minimization
over larger and larger subspaces.
u_1, . . . , u_n =̂ corresponding orthonormal eigenvectors, cf. Cor. 9.1.9:
AU = UD ,  U = (u_1, . . . , u_n) ∈ R^{n,n} ,  D = diag(λ_1, . . . , λ_n) .
We recall
V = Span{z^(0), Az^(0), . . . , A^k z^(0)} = K_{k+1}(A, z^(0)) , a Krylov space, → Def. 10.2.6 .   (9.4.2)
MATLAB-code 9.4.5:
1 n=100;
2 M=gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
3 [Q,R]=qr(M); A=Q'*diag(1:n)*Q; % synthetic matrix, σ(A) = {1, 2, 3, . . . , 100}
[Fig. 358: the three largest Ritz values µ_m, µ_{m−1}, µ_{m−2} versus dimension m of the Krylov space; Fig. 359: errors |λ_m − µ_m|, |λ_{m−1} − µ_{m−1}|, |λ_{m−2} − µ_{m−2}|]
Observation: “vaguely linear” convergence of largest Ritz values (notation µi ) to largest eigenvalues.
Fastest convergence of largest Ritz value → largest eigenvalue of A
[Fig. 360: the three smallest Ritz values µ_1, µ_2, µ_3 versus dimension m of the Krylov space; Fig. 361: errors |λ_1 − µ_1|, |λ_2 − µ_2|, |λ_3 − µ_3|]
Observation: Also the smallest Ritz values converge “vaguely linearly” to the smallest eigenvalues of A.
Fastest convergence of smallest Ritz value → smallest eigenvalue of A.
➣ u_1 can also be expected to be “well captured” by K_k(A, x) and the smallest Ritz value should provide a good approximation for λ_min(A).
Recall from Section 10.2.2, Lemma 10.2.12:
Proof. By Lemma 10.2.12, {r_0, . . . , r_{ℓ−1}} is an orthogonal basis of K_ℓ(A, r_0), if all the residuals are non-zero. As A K_{ℓ−1}(A, r_0) ⊂ K_ℓ(A, r_0), we conclude the orthogonality r_m^⊤ A r_j = 0 for all j = 0, . . . , m − 2. Since
( V_m^⊤ A V_m )_{ij} = r_{i−1}^⊤ A r_{j−1} ,  1 ≤ i, j ≤ m ,
V_l^H A V_l = \begin{pmatrix} α_1 & β_1 & & & \\ β_1 & α_2 & β_2 & & \\ & β_2 & α_3 & \ddots & \\ & & \ddots & \ddots & β_{l−1} \\ & & & β_{l−1} & α_l \end{pmatrix} =: T_l ∈ K^{l,l}   [tridiagonal matrix]   (9.4.11)
Total computational effort for l steps of Lanczos process, if A has at most k non-zero entries per row:
O(nkl )
Note: Code 9.4.12 assumes that no residual vanishes. This could happen, if z0 exactly belonged to
the span of a few eigenvectors. However, in practical computations inevitable round-off errors will always
ensure that the iterates do not stay in an invariant subspace of A, cf. Rem. 9.3.22.
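Code 9.4.12 is not reproduced in this excerpt; the following is a minimal sketch of the standard three-term Lanczos recursion that produces the basis V_l and the entries of T_l from (9.4.11). Function name and interface are chosen for illustration, and no safeguard against breakdown is included.

function [V,alpha,beta] = lanczossketch(A,z0,l)
% Lanczos process: builds an (in exact arithmetic) orthonormal basis V of
% K_l(A,z0) and the entries alpha, beta of the tridiagonal matrix T_l, cf. (9.4.11).
n = size(A,1);
V = zeros(n,l); alpha = zeros(l,1); beta = zeros(l-1,1);
v = z0/norm(z0); vold = zeros(n,1); b = 0;
for k = 1:l
    V(:,k) = v;
    w = A*v - b*vold;              % three-term recursion
    alpha(k) = v'*w;
    w = w - alpha(k)*v;
    if k < l
        beta(k) = norm(w);         % assumed non-zero (no premature breakdown)
        vold = v; v = w/beta(k); b = beta(k);
    end
end
end
% Ritz values: eig(diag(alpha) + diag(beta,1) + diag(beta,-1))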
Convergence (what we expect from the above considerations) → [?, Sect. 8.5]:
in the l-th step:  λ_n ≈ µ_l^(l) ,  λ_{n−1} ≈ µ_{l−1}^(l) ,  . . . ,  λ_1 ≈ µ_1^(l) ,
σ(T_l) = { µ_1^(l), . . . , µ_l^(l) } ,  µ_1^(l) ≤ µ_2^(l) ≤ · · · ≤ µ_l^(l) .
[Fig. 362, Fig. 363: errors of the Ritz values approximating λ_n, λ_{n−1}, λ_{n−2}, λ_{n−3} versus step of the Lanczos process]
However for A ∈ R 10,10 , aij = min{i, j} good initial convergence, but sudden “jump” of Ritz values off
eigenvalues!
σ(A) = {0.255680,0.273787,0.307979,0.366209,0.465233,0.643104,1.000000,1.873023,5.048917,44.766069}
σ(T) = {0.263867,0.303001,0.365376,0.465199,0.643104,1.000000,1.873023,5.048917,44.765976,44.766069}
l σ (Tl )
1 38.500000
2 3.392123 44.750734
10 0.263867 0.303001 0.365376 0.465199 0.643104 1.000000 1.873023 5.048917 44.765976 44.766069
Idea: ✦ do not rely on the orthogonality relations of Lemma 10.2.12
     ✦ use explicit Gram-Schmidt orthogonalization [?, Thm. 4.8], [?, Alg. 6.1]
ṽ_{l+1} := A v_l − ∑_{j=1}^{l} (v_j^H A v_l) v_j ,   v_{l+1} := ṽ_{l+1} / ‖ṽ_{l+1}‖_2   ⇒  v_{l+1} ⊥ K_l(A, z) .   (9.4.15)
➣ Computational cost for l steps, if at most k non-zero entries in each row of A: O(nkl^2)
16 end
✎ ☞
If it does not stop prematurely, the Arnoldi process of Code 9.4.16 will yield an orthonormal basis (ONB) of K_{k+1}(A, v_0) for a general A ∈ C^{n,n}.
✍ ✌
[Diagram: matrix view of the Arnoldi process, relating A, the orthonormal columns v_1, . . . , v_{l+1}, and the matrix of orthogonalization coefficients]
function [dn,V,Ht] = arnoldieig(A,v0,k,tol)
n = size(A,1); V = [v0/norm(v0)];
Ht = zeros(1,0); dn = zeros(k,1);
for l = 1:n
  d = dn;
  Ht = [Ht, zeros(l,1); zeros(1,l)];
  vt = A*V(:,l);
  for j = 1:l
    Ht(j,l) = dot(V(:,j),vt);
    vt = vt - Ht(j,l)*V(:,j);
  end
  ev = sort(eig(Ht(1:l,1:l)));
[Fig. 364, Fig. 365: approximation errors of the Ritz values for λ_n, λ_{n−1}, λ_{n−2}, λ_{n−3}: Lanczos process versus Arnoldi process]
l σ (Hl )
1 38.500000
2 3.392123 44.750734
10 0.255680 0.273787 0.307979 0.366209 0.465233 0.643104 1.000000 1.873023 5.048917 44.766069
For the above examples both the Arnoldi process and the Lanczos process are algebraically equivalent,
because they are applied to a symmetric matrix A = A T . However, they behave strikingly differently,
which indicates that they are not numerically equivalent.
The Arnoldi process is much less affected by roundoff than the Lanczos process, because it does not take
for granted orthogonality of the “residual vector sequence”. Hence, the Arnoldi process enjoys superior
numerical stability (→ ??, Def. 1.5.85) compared to the Lanczos process.
Eigenvalue approximation from Arnoldi process for non-symmetric A, initial vector ones(100,1);
MATLAB-code 9.4.23:
1 n=100;
2 M=full(gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1)));
3 A=M*diag(1:n)*inv(M);
[Fig. 366: largest Ritz values versus step of the Arnoldi process; Fig. 367: approximation errors of the Ritz values for λ_n, λ_{n−1}, λ_{n−2}]
[Fig. 368: smallest Ritz values versus step of the Arnoldi process; Fig. 369: approximation errors of the Ritz values for λ_1, λ_2, λ_3]
Observation: “vaguely linear” convergence of largest and smallest eigenvalues, cf. Ex. 9.4.4.
Krylov subspace iteration methods (= Arnoldi process, Lanczos process) attractive for computing a
few of the largest/smallest eigenvalues and associated eigenvectors of large sparse matrices.
Adaptation of Krylov subspace iterative eigensolvers to generalized EVP: Ax = λBx, B s.p.d.: replace
Euclidean inner product with “B-inner product” (x, y) 7→ x H By.
MATLAB-function: eigs (Krylov subspace eigensolver for large sparse matrices)
Chapter 10
Krylov Methods for Linear Systems of Equations
Supplementary reading. There is a wealth of literature on iterative methods for the solution of
linear systems of equations: The two books [?] and [?] offer a comprehensive treatment of the topic
(the latter is available online for ETH students and staff).
Concise presentations can be found in [?, Ch. 4] and [?, Ch. 13].
Learning outcomes:
• Understanding when and why iterative solution of linear systems of equations may be preferred to
direct solvers based on Gaussian elimination.
=̂ a class of iterative methods (→ Section 8.1) for the approximate solution of large linear systems of equations Ax = b, A ∈ K^{n,n}.
BUT, we have reliable direct methods (Gauss elimination → Section 2.3, LU-factorization → § 2.3.30,
QR-factorization → ??) that provide an (apart from roundoff errors) exact solution with a finite number
of elementary operations!
Alas, direct elimination may not be feasible, or may be grossly inefficient, because
• it may be too expensive (e.g. for A too large, sparse), → (2.3.25),
• inevitable fill-in may exhaust main memory,
• the system matrix may be available only as procedure y=evalA(x) ↔ y = Ax
Contents
10.1 Descent Methods [?, Sect. 4.3.3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
10.1.1 Quadratic minimization context . . . . . . . . . . . . . . . . . . . . . . . . . 671
10.1.2 Abstract steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
10.1.3 Gradient method for s.p.d. linear system of equations . . . . . . . . . . . . . 673
10.1.4 Convergence of the gradient method . . . . . . . . . . . . . . . . . . . . . . . 674
10.2 Conjugate gradient method (CG) [?, Ch. 9], [?, Sect. 13.4], [?, Sect. 4.3.4] . . . . . 678
10.2.1 Krylov spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
10.2.2 Implementation of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
10.2.3 Convergence of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
10.3 Preconditioning [?, Sect. 13.5], [?, Ch. 10], [?, Sect. 4.3.5] . . . . . . . . . . . . . . . 688
Focus:
Linear system of equations Ax = b, A ∈ R n,n , b ∈ R n , n ∈ N given,
with symmetric positive definite (s.p.d., → Def. 1.1.8) system matrix A
➨
A-inner product (x, y) 7→ x⊤ Ay ⇒ “A-geometry”
Definition 10.1.1. Energy norm → [?, Def. 9.1]
A s.p.d. matrix A ∈ R^{n,n} induces an energy norm ‖x‖_A := (x^⊤ A x)^{1/2}, x ∈ R^n.
However, the (conjugate) gradient methods introduced below also work for LSE Ax = b with A ∈ C n,n ,
A = A H s.p.d. when ⊤ is replaced with H (Hermitian transposed). Then, all theoretical statements remain
valid unaltered for K = C.
Lemma 10.1.3. S.p.d. LSE and quadratic minimization problem [?, (13.37)]
The quadratic functional J(x) := ½ x^⊤ A x − b^⊤ x (10.1.4) has the unique global minimizer x* = A^{−1}b.
Proof (sketch). A direct computation shows J(x) − J(x*) = ½ ‖x − x*‖_A^2. Then the assertion follows from the properties of the energy norm.
[Fig. 370: surface plot of a quadratic functional J(x_1, x_2); Fig. 371: its level lines]
✞ ☎
Level lines of quadratic functionals with s.p.d. A are (hyper)ellipses
✝ ✆
Fig. 372
However, for the quadratic minimization problem (10.1.4) § 10.1.7 will converge:
(“Geometric intuition”, see Fig. 370: quadratic functional J with s.p.d. A has unique global minimum,
grad J 6= 0 away from minimum, pointing towards it.)
Adaptation: steepest descent algorithm § 10.1.7 for quadratic minimization problem (10.1.4), see [?,
Sect. 7.2.4]:
➣ For the descent direction in § 10.1.7 applied to the minimization of J from (10.1.4) holds
d_k = b − Ax^(k) =: r_k , the residual (→ Def. 2.4.1) for x^(k) .
§ 10.1.7 for F = J from (10.1.4): the function to be minimized in the line search step is
φ(t) := J(x^(k) + t d_k) = J(x^(k)) + t d_k^⊤ (Ax^(k) − b) + ½ t^2 d_k^⊤ A d_k   ➙ a parabola!
dφ/dt (t*) = 0  ⇔  t* = (d_k^⊤ d_k)/(d_k^⊤ A d_k)   (unique minimizer) .   (10.1.10)
✬ ✩
One step of gradient method involves
✦ A single matrix×vector product with A ,
✦ 2 AXPY-operations (→ Section 1.3.2) on vectors of length n,
✦ 2 dot products in R n .
✫ ✪
Computational cost (per step) = cost(matrix×vector) + O(n)
➣ If A ∈ R n,n is a sparse matrix (→ ??) with “O(n) nonzero entries”, and the data structures allow
to perform the matrix×vector product with a computational effort O(n), then a single step of the
gradient method costs O(n) elementary operations.
➣ Gradient method of § 10.1.11 only needs A×vector in procedural form y = evalA(x).
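Code 10.1.12 is not part of this excerpt; a minimal sketch of the gradient method with the optimal step size (10.1.10), accessing A only through a handle y = evalA(x) as required above (names chosen for illustration):

function x = graditsketch(evalA, b, x, tol, maxit)
% Gradient method for Ax = b, A s.p.d., with exact line search (10.1.10).
% evalA: handle realizing x |-> A*x;  x: initial guess.
r = b - evalA(x);                          % initial residual = descent direction
for k = 1:maxit
    Ar = evalA(r);                         % the single matrix-vector product per step
    t = (r'*r)/(r'*Ar);                    % optimal step size t*, see (10.1.10)
    x = x + t*r;                           % AXPY update of the iterate
    r = r - t*Ar;                          % AXPY update of the residual
    if norm(r) <= tol*norm(b), break; end  % simple residual-based stopping rule
end
end

For a sparse matrix one would pass, e.g., evalA = @(x) A*x, so that each step indeed costs O(n) operations as stated above.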
[Fig. 373, Fig. 374: iterates x^(0), x^(1), x^(2), x^(3), . . . of the gradient method plotted on the level lines of J in the (x_1, x_2)-plane]
J(Qŷ) = ½ ŷ^⊤ D ŷ − (Q^⊤ b)^⊤ ŷ = ∑_{i=1}^{n} ( ½ d_i ŷ_i^2 − b̂_i ŷ_i ) ,   b̂ := Q^⊤ b .
Hence, a rigid transformation (rotation, reflection) maps the level surfaces of J from (10.1.4) to ellipses with principal axes d_i. As A is s.p.d., d_i > 0 is guaranteed.
Observations:
• Larger spread of spectrum leads to more elongated ellipses as level lines ➣ slower convergence
of gradient method, see Fig. 374.
r_k^⊤ r_{k+1} = r_k^⊤ r_k − ( (r_k^⊤ r_k)/(r_k^⊤ A r_k) ) r_k^⊤ A r_k = 0 .   (10.1.16)
[Fig. 375, Fig. 376: energy norm of the error and 2-norm of the residual versus iteration step k, for A = diag(1:0.01:2), A = diag(1:0.1:11), A = diag(1:1:101)]
1. (10.1.15): for every A ∈ R^{n,n} with A^⊤ = A there is an orthogonal matrix Q ∈ R^{n,n} such that A = Q^⊤ D Q with a diagonal matrix D (principal axis transformation), → Cor. 9.1.9, [?, Thm. 7.8], [?, Satz 9.15],
2. when applying the gradient method § 10.1.11 to both Ax = b and Dx̃ = b̃, b̃ := Qb, the iterates x^(k) and x̃^(k) are related by Qx^(k) = x̃^(k).
Observation:
✦ linear convergence (→ Def. 8.1.9), see also Rem. 8.1.13
✦ rate of convergence increases (↔ speed of convergence decreases) with the spread of the spectrum of A
Impact of distribution of diagonal entries (↔ eigenvalues) of (diagonal matrix) A
(b = x∗ = 0, x0 = cos((1:n)’);)
Test matrix #1: A=diag(d); d = (1:100);
Test matrix #2: A=diag(d); d = [1+(0:97)/97 , 50 , 100];
Test matrix #3: A=diag(d); d = [1+(0:49)*0.05, 100-(0:49)*0.05];
Test matrix #4: eigenvalues exponentially dense at 1
[Fig. 377: distribution of the diagonal entries (↔ eigenvalues) of the four test matrices, and 2-norms of error and residual versus iteration step k for test matrices #1–#4]
Observation: Matrices #1, #2 & #4 ➣ little impact of distribution of eigenvalues on asymptotic con-
vergence (exception: matrix #2)
‖x^(k+1) − x*‖_A ≤ L ‖x^(k) − x*‖_A ,   L := ( cond_2(A) − 1 ) / ( cond_2(A) + 1 ) ,
that is, the iteration converges at least linearly (→ Def. 8.1.9) w.r.t. the energy norm (→ Def. 10.1.1).
Remark 10.1.19 (2-norm from eigenvalues → [?, Sect. 10.6], [?, Sect. 7.4])
‖A^{−1}‖_2 = min(|σ(A)|)^{−1} , if A regular.
✎ other notation: κ(A) := λ_max(A)/λ_min(A) =̂ spectral condition number of A
(for general A: λ_max(A)/λ_min(A) = largest/smallest eigenvalue in modulus)
These results are an immediate consequence of the fact that
∀A ∈ R n,n , A⊤ = A ∃U ∈ R n,n , U−1 = U⊤ : U⊤ AU is diagonal,
see (10.1.15), Cor. 9.1.9, [?, Thm. 7.8], [?, Satz 9.15].
Please note that for general regular M ∈ R n,n we cannot expect cond2 (M) = κ (M).
10.2 Conjugate gradient method (CG) [?, Ch. 9], [?, Sect. 13.4], [?,
Sect. 4.3.4]
Again we consider a linear system of equations Ax = b with s.p.d. (→ Def. 1.1.8) system matrix A ∈
R n,n and given b ∈ R n .
1D line search in § 10.1.11 is oblivious of former line searches, which rules out reuse of information gained
in previous steps of the iteration. This is a typical drawback of 1-point iterative methods.
Idea: Replace the 1D line search with a subspace correction
Given:
✦ initial guess x(0)
✦ nested subspaces U1 ⊂ U2 ⊂ U3 ⊂ · · · ⊂ Un = R n , dim Uk = k
Lemma 10.2.3. rk ⊥ Uk
With x(k) according to (10.2.1), Uk from (10.2.2) the residual rk := b − Ax(k) satisfies
r_k^⊤ u = 0  ∀u ∈ U_k   (“r_k ⊥ U_k”).
Geometric consideration: since x(k) is the minimizer of J over the affine space Uk + x(0) , the projection of
the steepest descent direction grad J (x(k) ) onto Uk has to vanish:
Proof. Consider
Corollary 10.2.5.
Lemma 10.2.3 also implies that, if U0 = {0}, then dim Uk = k as long as x(k) 6= x∗ , that is, before we
have converged to the exact solution.
(10.2.1) and (10.2.2) define the conjugate gradient method (CG) for the iterative solution of Ax = b
(hailed as a “top ten algorithm” of the 20th century, SIAM News, 33(4))
Lemma 10.2.7.
The subspaces U_k ⊂ R^n, k ≥ 1, defined by (10.2.1) and (10.2.2) satisfy U_k = K_k(A, r_0) = Span{r_0, Ar_0, . . . , A^{k−1} r_0} .
Since Uk+1 = Span{Uk , rk }, we obtain Uk+1 ⊂ Kk+1 (A, r0 ). Dimensional considerations based on
Lemma 10.2.3 finish the proof.
✷
10.2.2 Implementation of CG
(10.2.1) ⇔ ∂ψ/∂γ_j = 0 ,  j = 1, . . . , l .
This leads to a linear system of equations by which the coefficients γ_j can be computed:
\begin{pmatrix} p_1^⊤ A p_1 & \cdots & p_1^⊤ A p_l \\ \vdots & & \vdots \\ p_l^⊤ A p_1 & \cdots & p_l^⊤ A p_l \end{pmatrix} \begin{pmatrix} γ_1 \\ \vdots \\ γ_l \end{pmatrix} = \begin{pmatrix} p_1^⊤ r \\ \vdots \\ p_l^⊤ r \end{pmatrix} ,   r := b − Ax^(0) .   (10.2.8)
Recall: s.p.d. A induces an inner product ➣ concept of orthogonality [?, Sect. 4.4], [?, Sect. 6.2].
“A-geometry” like standard Euclidean space.
Span{p1 , . . . , pl } = Kl (A, r) .
(Efficient) successive computation of x(l ) becomes possible, see [?, Lemma 13.24]
(LSE (10.2.8) becomes diagonal !)
r_0 := b − Ax^(0) ;
for j = 1 to l do { x^(j) := x^(j−1) + ( (p_j^⊤ r_0)/(p_j^⊤ A p_j) ) p_j }   (10.2.9)
From linear algebra we already know a way to construct A-orthogonal basis vectors:
(10.2.10) ⇒ Idea: Gram-Schmidt orthogonalization [?, Thm. 4.8], [?, Alg. 6.1] of the residuals r_j := b − Ax^(j) w.r.t. the A-inner product:
p_1 := r_0 ,   p_{j+1} := r_j − ∑_{k=1}^{j} ( (p_k^⊤ A r_j)/(p_k^⊤ A p_k) ) p_k ,  j = 1, . . . , l − 1 ,  with r_j := b − Ax^(j) .   (10.2.11)
Geometric interpretation of (10.2.11): the subtracted sum =̂ orthogonal projection (w.r.t. the A-inner product) of r_j onto the subspace Span{p_1, . . . , p_j}.   [Fig. 378]
(10.2.9) & (10.2.11) ⇒ p_{j+1} = r_0 − ∑_{k=1}^{j} ( (p_k^⊤ r_0)/(p_k^⊤ A p_k) ) A p_k − ∑_{k=1}^{j} ( (p_k^⊤ A r_j)/(p_k^⊤ A p_k) ) p_k
⇒ p_{j+1} ∈ Span{ r_0, p_1, . . . , p_j, A p_1, . . . , A p_j } .
✷
Orthogonalities from Lemma 10.2.12 ➤ short recursions for pk , rk , x(k) !
(10.2.10) ⇒ (10.2.11) collapses to  p_{j+1} := r_j − ( (p_j^⊤ A r_j)/(p_j^⊤ A p_j) ) p_j ,  j = 1, . . . , l .
(10.2.9) ⇒ r_j = r_{j−1} − ( (p_j^⊤ r_0)/(p_j^⊤ A p_j) ) A p_j .
Lemma 10.2.12, (i) ⇒  r_{j−1}^H p_j = ( r_0 − ∑_{k=1}^{j−1} ( (r_0^⊤ p_k)/(p_k^⊤ A p_k) ) A p_k )^⊤ p_j = r_0^⊤ p_j .   (10.2.16)
The orthogonality (10.2.16) together with (10.2.15) permits us to replace r0 with r j−1 in the actual imple-
mentation.
In the CG algorithm r_j = b − Ax^(j) agrees with the residual associated with the current iterate (in exact arithmetic, cf. Ex. 10.2.21), but computation through the short recursion is more efficient.
➣ We find that the CG method possesses all the algorithmic advantages of the gradient method, cf. the
discussion in Section 10.1.3.
✎ ☞
1 matrix×vector product, 3 dot products, 3 AXPY-operations per step:
✍ ✌
If A sparse, nnz(A) ∼ n ➤ computational effort O(n) per step
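Code 10.2.18 is not contained in this excerpt; the following sketch of the standard CG recursion (names chosen here) realizes one matrix×vector product plus a few AXPY operations and dot products per step, in line with the cost statement above.

function x = cgsketch(evalA, b, x, tol, maxit)
% Conjugate gradient method for Ax = b, A s.p.d. (short recursions of Section 10.2.2).
r = b - evalA(x); p = r; rho = r'*r; nb = norm(b);
for k = 1:maxit
    Ap = evalA(p);                     % the single matrix-vector product
    t = rho/(p'*Ap);                   % step length along the A-orthogonal direction p
    x = x + t*p;                       % update of the iterate
    r = r - t*Ap;                      % short recursion for the residual
    rho_new = r'*r;
    if sqrt(rho_new) <= tol*nb, break; end
    p = r + (rho_new/rho)*p;           % new search direction
    rho = rho_new;
end
end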
MATLAB-function: pcg (preconditioned conjugate gradient method)
For any vector norm and associated matrix norm (→ Def. 1.5.76) holds (with residual r_l := b − Ax^(l))
(1/cond(A)) ‖r_l‖/‖r_0‖ ≤ ‖x^(l) − x*‖ / ‖x^(0) − x*‖ ≤ cond(A) ‖r_l‖/‖r_0‖ .   (10.2.20)
(10.2.20) can easily be deduced from the error equation A(x(k) − x∗ ) = rk , see Def. 2.4.1 and (2.4.12).
10.2.3 Convergence of CG
Note: CG is a direct solver, because (in exact arithmetic) x(k) = x∗ for some k ≤ n
Residual norms during the CG iteration ✄: [Fig. 379: 2-norm of the residual versus iteration step k]
R := [ r_0, . . . , r_10 ]
R⊤ R =
  1.000000 −0.000000  0.000000 −0.000000  0.000000 −0.000000  0.016019 −0.795816 −0.430569  0.348133
 −0.000000  1.000000 −0.000000  0.000000 −0.000000  0.000000 −0.012075  0.600068 −0.520610  0.420903
  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000000  0.001582 −0.078664  0.384453 −0.310577
 −0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000024  0.001218 −0.024115  0.019394
  0.000000 −0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000002  0.000151 −0.000118
 −0.000000  0.000000 −0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000 −0.000000  0.000000
  0.016019 −0.012075  0.001582 −0.000024  0.000000 −0.000000  1.000000 −0.000000 −0.000000  0.000000
 −0.795816  0.600068 −0.078664  0.001218 −0.000002  0.000000 −0.000000  1.000000 −0.000000  0.000000
 −0.430569 −0.520610  0.384453 −0.024115  0.000151 −0.000000 −0.000000  0.000000  1.000000  0.000000
  0.348133  0.420903 −0.310577  0.019394 −0.000118  0.000000  0.000000 −0.000000  0.000000  1.000000
➣ Roundoff
✦ destroys orthogonality of residuals
✦ prevents computation of exact solution after n steps.
Numerical instability (→ Def. 1.5.85) ➣ pointless to (try to) use CG as direct solver!
Practice: CG is used for large n as an iterative solver: x(k) for some k ≪ n is expected to provide a good
approximation of x∗ .
CG (Code 10.2.18) & gradient method (Code 10.1.12) for LSE with sparse s.p.d. “Poisson matrix”
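A possible setup for such an experiment, sketched with Eigen's sparse matrices and its built-in
ConjugateGradient solver (the m×m 5-point-stencil "Poisson matrix" is assembled by hand here; this is
our own illustration, not Code 10.2.18 or Code 10.1.12).

#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>
#include <iostream>
#include <vector>

int main() {
  const int m = 10, n = m * m;               // "Poisson matrix" of size n = m^2
  std::vector<Eigen::Triplet<double>> trip;
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < m; ++j) {
      const int k = i * m + j;               // lexicographic numbering of grid points
      trip.emplace_back(k, k, 4.0);          // diagonal of the 5-point stencil
      if (j > 0)     trip.emplace_back(k, k - 1, -1.0);
      if (j < m - 1) trip.emplace_back(k, k + 1, -1.0);
      if (i > 0)     trip.emplace_back(k, k - m, -1.0);
      if (i < m - 1) trip.emplace_back(k, k + m, -1.0);
    }
  Eigen::SparseMatrix<double> A(n, n);
  A.setFromTriplets(trip.begin(), trip.end());

  Eigen::VectorXd b = Eigen::VectorXd::Ones(n);
  Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
                           Eigen::Lower | Eigen::Upper> cg;
  cg.setTolerance(1e-10);
  cg.compute(A);
  Eigen::VectorXd x = cg.solve(b);
  std::cout << "#iterations = " << cg.iterations()
            << ", estimated error = " << cg.error() << std::endl;
}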
(Fig. 380: spy plot of the sparse s.p.d. Poisson matrix for m = 10, n = 100, nz = 460;
Fig. 381: its eigenvalues on a logarithmic scale.
Further plots: normalized (!) 2-norms recorded during the CG and gradient-method iterations.)
• CG is much faster than the gradient method (as expected, because it has "memory").
• Both CG and the gradient method converge more slowly for larger Poisson matrices.
The minimum in (10.2.24) can be bounded for λ ∈ [λmin(A), λmax(A)] by inserting suitable "polynomial
candidates". Tool: Chebychev polynomials (→ Section 6.1.3.1) ➣ they lead to the following estimate
[?, Satz 9.4.2], [?, Satz 13.29]:
Theorem 10.2.25. Convergence of the CG method

The iterates of the CG method for solving Ax = b (see Code 10.2.18) with A = A⊤ s.p.d. satisfy

   ‖x − x(l)‖A  ≤  2 (1 − 1/√κ(A))^l (1 + 1/√κ(A))^l / [ (1 + 1/√κ(A))^{2l} + (1 − 1/√κ(A))^{2l} ] · ‖x − x(0)‖A

                ≤  2 ( (√κ(A) − 1) / (√κ(A) + 1) )^l · ‖x − x(0)‖A .
The estimate of this theorem confirms asymptotic linear convergence of the CG method (→ Def. 8.1.9)
with a rate of (√κ(A) − 1)/(√κ(A) + 1).
Plots of bounds for error reduction (in energy norm) during CG iteration from Thm. 10.2.25:
(Surface and contour plots: bound for the error reduction factor in the energy norm as a function of
κ(A)^{1/2} and the CG step l, for l = 1, . . . , 10 and κ(A)^{1/2} up to 100.)
(Plot: measured convergence rate of CG vs. cond2(A).) The rate L is read off from the reduction of the
residual norms:

   ‖rk+1‖2 ≈ L ‖rk‖2   ⇒   ‖rk+m‖2 ≈ L^m ‖rk‖2 .
(Fig.: left, diagonal entries (eigenvalue distributions) of the test matrices #1–#4 vs. entry index;
right, 2-norms of error and residual vs. number of CG steps for each of the matrices #1–#4.)
✞ ☎
(in stark contrast to the behavior of the gradient method, see Ex. 10.1.17)
✝ ✆
CG convergence boosted by clustering of eigenvalues
10.3 Preconditioning [?, Sect. 13.5], [?, Ch. 10], [?, Sect. 4.3.5]
Idea:  Preconditioning

Apply the CG method to the transformed linear system

   Ã x̃ = b̃ ,   Ã := B^{−1/2} A B^{−1/2} ,   x̃ := B^{1/2} x ,   b̃ := B^{−1/2} b ,      (10.3.1)

with "small" κ(Ã);  B = B⊤ ∈ R^{n,n} s.p.d. =ˆ preconditioner.
Recall (10.1.15): for every B ∈ R^{n,n} with B⊤ = B there is an orthogonal matrix Q ∈ R^{n,n} such that
B = Q⊤DQ with a diagonal matrix D (→ Cor. 9.1.9, [?, Thm. 7.8], [?, Satz 9.15]). If B is s.p.d. the
(diagonal) entries of D are strictly positive and we can define

   D = diag(λ1 , . . . , λn ) ,  λi > 0   ⇒   D^{1/2} := diag(√λ1 , . . . , √λn ) .

This is generalized to

   B^{1/2} := Q⊤ D^{1/2} Q ,

and one easily verifies, using Q⊤ = Q^{−1}, that (B^{1/2})² = B and that B^{1/2} is s.p.d. In fact, these two
properties already determine B^{1/2} uniquely.
2. the evaluation of B−1 x is about as expensive (in terms of elementary operations) as the
matrix×vector multiplication Ax, x ∈ R n .
Recall: spectral condition number  κ(A) := λmax(A)/λmin(A) ,  see (10.1.21).

There are several equivalent ways to express that κ(B^{−1/2} A B^{−1/2}) is "small":
• κ(B^{−1}A) is "small",
  because the spectra agree, σ(B^{−1}A) = σ(B^{−1/2} A B^{−1/2}), due to similarity (→ Lemma 9.1.6).
☛ ✟
“Reader’s digest” version of Notion 10.3.3:
✡ ✠
S.p.d. B preconditioner :⇔ B−1 = cheap approximate inverse of A
Problem: B^{1/2}, which occurs prominently in (10.3.1), is usually not available with acceptable computational
costs.

However, when the CG method is formally applied to the transformed system from (10.3.1), it becomes
apparent that, after suitable transformation of the iteration variables pj and rj , B^{1/2} and B^{−1/2} invariably
occur in the products B^{−1/2}B^{−1/2} = B^{−1} and B^{1/2}B^{−1/2} = I. Thus, thanks to this intrinsic
transformation, square roots of B are not required for the implementation!
CG for Ã x̃ = b̃ :
   Input : initial guess x̃(0) ∈ R^n ;   Output : approximate solution x̃(l) ∈ R^n
   p̃1 := r̃0 := b̃ − B^{−1/2} A B^{−1/2} x̃(0) ;
   for j = 1 to l do {
      α := (p̃j⊤ r̃j−1) / (p̃j⊤ B^{−1/2} A B^{−1/2} p̃j) ;
      x̃(j) := x̃(j−1) + α p̃j ;
      r̃j := r̃j−1 − α B^{−1/2} A B^{−1/2} p̃j ;
      p̃j+1 := r̃j − ((B^{−1/2} A B^{−1/2} p̃j)⊤ r̃j) / (p̃j⊤ B^{−1/2} A B^{−1/2} p̃j) · p̃j ;
   }

Equivalent CG with transformed variables:
   Input : initial guess x(0) ∈ R^n ;   Output : approximate solution x(l) ∈ R^n
   B^{1/2} r̃0 := B^{1/2} b̃ − A B^{−1/2} x̃(0) ;   B^{−1/2} p̃1 := B^{−1} (B^{1/2} r̃0) ;
   for j = 1 to l do {
      α := ((B^{−1/2} p̃j)⊤ B^{1/2} r̃j−1) / ((B^{−1/2} p̃j)⊤ A B^{−1/2} p̃j) ;
      B^{−1/2} x̃(j) := B^{−1/2} x̃(j−1) + α B^{−1/2} p̃j ;
      B^{1/2} r̃j := B^{1/2} r̃j−1 − α A B^{−1/2} p̃j ;
      B^{−1/2} p̃j+1 := B^{−1} (B^{1/2} r̃j)
                       − ((B^{−1/2} p̃j)⊤ A B^{−1} (B^{1/2} r̃j)) / ((B^{−1/2} p̃j)⊤ A B^{−1/2} p̃j) · B^{−1/2} p̃j ;
   }
(10.3.5) Preconditioned CG method (PCG) [?, Alg. 13.32], [?, Alg. 10.1]
r := b − Ax ;  p := B^{−1} r ;  q := p ;  τ0 := p⊤ r ;
for l = 1 to lmax do {
   β := r⊤ q ;  h := Ap ;  α := β / (p⊤ h) ;
   x := x + α p ;
   r := r − α h ;                                                             (10.3.6)
   q := B^{−1} r ;  β := (r⊤ q) / β ;
   if |q⊤ r| ≤ τ · τ0 then stop ;
   p := q + β p ;
}
✛                                                                             ✘
  Computational effort per step:  1 evaluation A×vector,  1 evaluation B^{−1}×vector,
                                  3 dot products,  3 AXPY-operations
✚                                                                             ✙
The assertions of Thm. 10.2.25 remain valid with κ(A) replaced by κ(B^{−1}A) and the energy norm based
on Ã instead of A.
   x = x̃ + triu(A)^{−1} (b − A x̃) .

   x = (LA^{−1} + UA^{−1} − UA^{−1} A LA^{−1}) b   ➤   B^{−1} = LA^{−1} + UA^{−1} − UA^{−1} A LA^{−1} .      (10.3.10)
For all these approaches the evaluation of B^{−1}r can be done with effort O(n) in the case of a sparse
matrix A (e.g. with O(1) non-zero entries per row). However, there is absolutely no guarantee that
κ(B^{−1}A) will be reasonably small; whether this can be expected depends crucially on A.
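For illustration, a sketch of how the application of B^{−1} from (10.3.10) could be realized in C++/Eigen
with two sparse triangular solves and one matrix×vector product (O(nnz(A)) work for sparse A); we
assume LA = tril(A) and UA = triu(A), both including the diagonal. This is our own illustration, not one
of the lecture codes.

#include <Eigen/Sparse>

// z = B^{-1} r for the preconditioner of (10.3.10):
// B^{-1} = L^{-1} + U^{-1} - U^{-1} A L^{-1}, with L = tril(A), U = triu(A).
Eigen::VectorXd applyInvB(const Eigen::SparseMatrix<double> &A,
                          const Eigen::VectorXd &r) {
  // y = L^{-1} r  (sparse forward substitution)
  Eigen::VectorXd y = A.triangularView<Eigen::Lower>().solve(r);
  // z = y + U^{-1} (r - A y)  realizes  L^{-1} r + U^{-1} r - U^{-1} A L^{-1} r
  return y + A.triangularView<Eigen::Upper>().solve(r - A * y);
}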
The Code 10.3.12 highlights the use of a preconditioner in the context of the PCG method; it only takes a
function that realizes the application of B−1 to a vector. In Line 10 of the code this function is passed as
function handle invB.
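Code 10.3.12 itself is not contained in this excerpt; the following C++/Eigen sketch conveys the same
idea for the loop (10.3.6): both A and the preconditioner are passed in procedural form, the latter as a
functor invB realizing r ↦ B^{−1}r. Names and interface are our own choices.

#include <Eigen/Dense>
#include <cmath>
#include <functional>

// Sketch of the preconditioned CG method (10.3.6). evalA and invB provide
// x -> A*x and r -> B^{-1}*r; tau is the tolerance in |q^T r| <= tau * tau_0.
Eigen::VectorXd pcg(const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &evalA,
                    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &invB,
                    const Eigen::VectorXd &b, Eigen::VectorXd x,
                    double tau, unsigned int lmax) {
  Eigen::VectorXd r = b - evalA(x);
  Eigen::VectorXd q = invB(r);
  Eigen::VectorXd p = q;
  const double tau0 = q.dot(r);
  for (unsigned int l = 1; l <= lmax; ++l) {
    double beta = r.dot(q);
    const Eigen::VectorXd h = evalA(p);
    const double alpha = beta / p.dot(h);
    x += alpha * p;                 // update iterate
    r -= alpha * h;                 // update residual
    q = invB(r);                    // one application of the preconditioner
    beta = r.dot(q) / beta;
    if (std::abs(q.dot(r)) <= tau * tau0) break;   // termination criterion
    p = q + beta * p;
  }
  return x;
}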
(Fig. 384, Fig. 385: B-norm of the residuals and A-norm of the error vs. number of (P)CG steps, for plain
CG and PCG with n = 50, 100, 200.)
       n    #CG steps   #PCG steps
      16        8            3
      32       16            3
      64       25            4
     128       38            4
     256       66            4
     512      106            4
    1024      149            4
    2048      211            4
    4096      298            3
    8192      421            3
   16384      595            3
   32768      841            3

(Fig. 386: number of (P)CG steps vs. n, doubly logarithmic scale.)
Clearly, in this example the tridiagonal part of the matrix is dominant for large n. In addition, its condition
number grows ∼ n², as is revealed by a closer inspection of the spectrum.
Preconditioning with the tridiagonal part suppresses this growth of the condition number of B^{−1}A and
ensures fast convergence of the preconditioned CG method.
   1/cond(A) · ‖rl‖/‖r0‖  ≤  ‖x(l) − x∗‖ / ‖x(0) − x∗‖  ≤  cond(A) · ‖rl‖/‖r0‖ .      (10.2.20)
(10.2.20) ➣ estimate for the 2-norm of the transformed iteration errors:  ‖ẽ(l)‖2² = (e(l))⊤ B e(l) .

Analogous to (10.2.20), estimates for the energy norm (→ Def. 10.1.1) of the error e(l) := x − x(l), x∗ := A^{−1}b:

   1/κ(B^{−1}A) · ‖e(l)‖A² / ‖e(0)‖A²  ≤  ((B^{−1}rl)⊤ rl) / ((B^{−1}r0)⊤ r0)  ≤  κ(B^{−1}A) · ‖e(l)‖A² / ‖e(0)‖A² .      (10.3.14)
Theorem 10.4.1.

Note: similar formula for the (linear) rate of convergence as for CG, see Thm. 10.2.25, but with √κ(A)
replaced by κ(A) !
➤ GMRES method for general matrices A ∈ R n,n → [?, Ch. 16], [?, Sect. 4.4.2]
M ATLAB-function: • [x,flag,relr,it,rv] = gmres(A,b,rs,tol,maxit,B,[],x0);
• [. . .] = gmres(Afun,b,rs,tol,maxit,Binvfun,[],x0);
After many steps of GMRES we face considerable computational costs and memory requirements for
every further step. Thus, the iteration may be restarted with the current iterate x(l ) as initial guess →
rs-parameter triggers restart after every rs steps (Danger: failure to converge).
Zoo of methods with short recursions (i.e. constant effort per step)
Computational costs : 2 A×vector, 2 B−1 ×vector, 4 dot products, 6 SAXPYs per step
Memory requirements: 8 vectors ∈ R n
Computational costs : 2 A×vector, 2 B−1 ×vector, 2 dot products, 12 SAXPYs per step
Memory requirements: 10 vectors ∈ R n
        ( 0  1  0  ···  ···  0 )
        ( 0  0  1   0        ⋮ )
   A =  ( ⋮      ⋱   ⋱   ⋱   ⋮ ) ∈ R^{n,n} ,   b = (0, . . . , 0, 1)⊤ = en   ➤   x = e1 .
        ( ⋮           ⋱  ⋱   0 )
        ( 0               0  1 )
        ( 1  0  ···  ···     0 )
☛ ✟
✡ ✠
TRY & PRAY
Example 10.4.4 (Convergence of Krylov subspace methods for non-symmetric system ma-
trix)
A = gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
B = gallery('tridiag',0.5*ones(n-1,1),2*ones(n,1),1.5*ones(n-1,1));

Plotted:  ‖rl‖2 / ‖r0‖2 :
(Two plots: relative 2-norm of the residual ‖rl‖2/‖r0‖2 vs. iteration step for bicgstab and qmr, for the two
test matrices defined above.)
Summary:
Advantages of Krylov methods vs. direct elimination (IF they converge at all/sufficiently fast).
• They require system matrix A in procedural form y=evalA(x) ↔ y = Ax only.
• They can perfectly exploit sparsity of system matrix.
• They can cash in on low accuracy requirements (IF viable termination criterion available).
• They can benefit from a good initial guess.
Chapter 11
Numerical Integration – Single Step Methods

Contents
11.1 Initial value problems (IVP) for ODEs . . . . . . . . . . . . . . . . . . . . . . . . . 698
11.1.1 Modeling with ordinary differential equations: Examples . . . . . . . . . . . 699
11.1.2 Theory of initial value problems . . . . . . . . . . . . . . . . . . . . . . . . . 703
11.1.3 Evolution operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
11.2 Introduction: Polygonal Approximation Methods . . . . . . . . . . . . . . . . . . 709
11.2.1 Explicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
11.2.2 Implicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
11.2.3 Implicit midpoint method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
11.3 General single step methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
11.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
11.3.2 Convergence of single step methods . . . . . . . . . . . . . . . . . . . . . . . 717
11.4 Explicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
11.5 Adaptive Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
In our parlance, a (first-order) ordinary differential equation (ODE) is an equation of the form

   ẏ = f(t, y) ,      (11.1.2)

with a right hand side function f : I × D → R^d defined on the extended state space I × D, where I ⊂ R is
a (time) interval and D ⊂ R^d is the state space.
In the context of mathematical modeling the state vector y ∈ R d is supposed to provide a complete (in the
sense of the model) description of a system. Then (11.1.2) models a finite-dimensional dynamical system.
A solution of the ODE ẏ = f(t, y) with continuous right hand side function f is a continuously
differentiable function “of time t” y : J ⊂ I → D, defined on an open interval J , for which ẏ(t) =
f(t, y(t)) holds for all t ∈ J .
A solution describes a continuous trajectory in state space, a one-parameter family of states, parameter-
ized by time.
It goes without saying that smoothness of the right hand side function f is inherited by solutions of the
ODE:
Supplementary reading. Some grasp of the meaning and theory of ordinary differential equa-
tions (ODEs) is indispensable for understanding the construction and properties of numerical meth-
ods. Relevant information can be found in [?, Sect. 5.6, 5.7, 6.5].
Example 11.1.5 (Growth with limited resources [?, Sect. 1.1], [?, Ch. 60])
   y(t) = α y0 / ( β y0 + (α − β y0) exp(−αt) )   for all t ∈ R .      (11.1.7)
Note that by fixing the initial value y(0) we can single out a unique representative from the family of
solutions. This will turn out to be a general principle, see Section 11.1.2.
An ODE of the from ẏ = f(y), that is, with a right hand side function that does not depend on time,
but only on state, is called autonomous.
For an autonomous ODE the right hand side function defines a vector field (“velocity field”) y 7→ f(y) on
state space.
Example 11.1.9 (Predator-prey model [?, Sect. 1.1],[?, Sect. 1.1.1],[?, Ch. 60], [?, Ex. 11.3])
Predators and prey coexist in an ecosystem. Without predators the population of prey would be gov-
erned by a simple exponential growth law. However, the growth rate of prey will decrease with increasing
numbers of predators and, eventually, become negative. Similar considerations apply to the predator
population and lead to an ODE model.
population densities:   u(t) → density of prey at time t,   v(t) → density of predators at time t.

Solution curves are trajectories of particles carried along by the velocity field f.

(Fig. 389: solution t ↦ (u(t), v(t)) for y0 := (u(0), v(0)) = (4, 2);  Fig. 390: solution curves of (11.1.10)
in the (u, v)-phase plane, with the stationary point marked.  Parameter values for Fig. 389, 390:
α = 2, β = 1, δ = 1, γ = 1.)
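A sketch of the corresponding right hand side as C++ code, assuming the standard Lotka–Volterra form
u̇ = (α − βv)u, v̇ = (δu − γ)v; this particular form is our assumption, chosen to match the parameters
α, β, γ, δ and the qualitative description above.

#include <Eigen/Dense>

int main() {
  const double alpha = 2.0, beta = 1.0, gamma = 1.0, delta = 1.0;
  // autonomous right hand side f(y) of the predator-prey model,
  // state vector y = (u, v) = (prey density, predator density)
  auto f = [=](const Eigen::Vector2d &y) -> Eigen::Vector2d {
    return Eigen::Vector2d((alpha - beta * y(1)) * y(0),
                           (delta * y(0) - gamma) * y(1));
  };
  Eigen::Vector2d y0(4.0, 2.0);   // initial state (u(0), v(0)) as in Fig. 389
  Eigen::Vector2d dy = f(y0);     // one evaluation of the vector field
  (void)dy;
}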
Phenomenological model:    l̇ = −(l³ − αl + p) ,
                           ṗ = β l ,                                          (11.1.12)

with parameters:  α =ˆ pre-tension of the muscle fiber,  β =ˆ (phenomenological) feedback parameter.
This is the so-called Zeeman model: it is a phenomenological model entirely based on macroscopic
observations without relying on knowledge about the underlying molecular mechanisms.
(Fig. 391/392: phase flow in the (l, p)-plane and solutions l(t), p(t) of the Zeeman model for one set of
parameters; Fig. 393/394: phase flow and "heartbeat" solutions l(t), p(t) for α = 0.5, β = 0.1.)
In Chapter 1 and Chapter 8 we discussed circuit analysis as a source of linear and non-linear systems of
equations, see Ex. 2.1.3 and Ex. 8.0.1. In the former example we admitted time-dependent currents and
potentials, but dependence on time was confined to be “sinusoidal”. This enabled us to switch to frequency
domain, see (2.1.6), which gave us a complex linear system of equations for the complex nodal potentials.
Yet, this trick is only possible for linear circuits. In the general case, circuits have to be modelled by ODEs
connecting time-dependent potentials and currents. This will be briefly explained now.
   diR/dt (t) − diL/dt (t) − diC/dt (t) = 0 ,

and plug in the above constitutive relations for the circuit elements:

   R^{−1} duR/dt (t) − L^{−1} uL(t) − C d²uC/dt² (t) = 0 .
We continue following the policy of nodal analysis and express all voltages by potential differences between
nodes of the circuit.
u R ( t ) = Us ( t ) − u ( t ) , u C ( t ) = u ( t ) − 0 , u L ( t ) = u ( t ) − 0 .
For this simple circuit there is only one node with unknown potential, see Fig. 395. Its time-dependent
potential will be denoted by u(t) and this is the unknown of the model, a function of time obeying the
ordinary differential equation
   R^{−1} (U̇s(t) − u̇(t)) − L^{−1} u(t) − C d²u/dt² (t) = 0 .
This is a 2nd-order ordinary differential equation (non-autonomous whenever the source voltage Us varies in time):
The attribute “2nd-order” refers to the occurrence of a second derivative with respect to time.
A generic initial value problem (IVP) for a first-order ordinary differential equation (ODE) (→ [?,
Sect. 5.6], [?, Sect. 11.1]) can be stated as: find a function y : I → D that satisfies, cf. Def. 11.1.3,

   ẏ(t) = f(t, y(t)) ,   y(t0) = y0 .      (11.1.20)
Recall Def. 11.1.8: an ODE ẏ = f(y) is autonomous if its right hand side f does not depend on
time t.
Hence, for autonomous ODEs we have I = R and the right hand side function y 7→ f(y) can be regarded
as a stationary vector field (velocity field), see Fig. 388 or Fig. 391.
An important observation: If t 7→ y(t) is a solution of an autonomous ODE, then, for any τ ∈ R , also the
shifted function t 7→ y(t − τ ) is a solution.
➣ For initial value problems for autonomous ODEs the initial time is irrelevant and therefore we can
always make the canonical choice t0 = 0.
Autonomous ODEs naturally arise when modeling time-invariant systems or phenomena. All examples for
Section 11.1.1 belong to this class.
In fact, autonomous ODEs already represent the general case, because every ODE can be converted into
an autonomous one:
Remark 11.1.23 (From higher order ODEs to first order systems [?, Sect. 11.2])
✎ Notation: superscript (n) =ˆ n-th temporal derivative d^n/dt^n .

No special treatment of higher order ODEs is necessary, because (11.1.24) can be turned into a 1st-order
ODE (a system of size nd) by adding all derivatives up to order n − 1 as additional components to the
state vector. This extended state vector z(t) ∈ R^{nd} is defined as

   z(t) := ( y(t), y(1)(t), . . . , y(n−1)(t) )⊤ = ( z1 , z2 , . . . , zn )⊤ ∈ R^{dn} :

   (11.1.24)  ↔  ż = g(z) ,   g(z) := ( z2 , z3 , . . . , zn , f(t, z1 , . . . , zn) )⊤ .      (11.1.25)
Note that the extended system requires initial values y(t0 ), ẏ(t0 ), . . . , y(n−1) (t0 ): for ODEs of order n ∈
N well-posed initial value problems need to specify initial values for the first n − 1 derivatives.
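A minimal C++ sketch of this reduction for a scalar 2nd-order ODE ÿ = f(t, y, ẏ) (d = 1, n = 2): the
extended state is z = (y, ẏ) and the first-order right hand side g is assembled as a lambda. Function
names are our own.

#include <Eigen/Dense>
#include <functional>

// Turn a scalar second-order ODE  y'' = f(t, y, y')  into a first-order
// system  z' = g(t, z)  with extended state z = (z1, z2) = (y, y').
std::function<Eigen::Vector2d(double, const Eigen::Vector2d &)>
toFirstOrder(const std::function<double(double, double, double)> &f) {
  return [f](double t, const Eigen::Vector2d &z) -> Eigen::Vector2d {
    return Eigen::Vector2d(z(1),               // z1' = z2
                           f(t, z(0), z(1)));  // z2' = f(t, y, y')
  };
}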
Now we review results about existence and uniqueness of solutions of initial value problems for first-order
ODEs. These are surprisingly general and do not impose severe constraints on right hand side functions.
The property of local Lipschitz continuity means that the function (t, y) 7→ f(t, y) has “locally finite slope”
in y.
The meaning of local Lipschitz continuity is best explained by giving an example of a function that fails to
possess this property.
Consider the square root function t ↦ √t on the closed interval [0, 1]. Its slope at t = 0 is infinite and so
it is not locally Lipschitz continuous on [0, 1].
However, if we consider the square root on the open interval ]0, 1[, then it is locally Lipschitz continuous
there.
The next lemma gives a simple criterion for local Lipschitz continuity, which can be proved by the mean
value theorem, cf. the proof of Lemma 8.2.12.
If f and Dy f are continuous on the extended state space Ω, then f is locally Lipschitz continuous
(→ Def. 11.1.28).
Theorem 11.1.32. Theorem of Peano & Picard-Lindelöf [?, Satz II(7.6)], [?, Satz 6.5.1], [?,
Thm. 11.10], [?, Thm. 73.1]
If the right hand side function f : Ω̂ 7→ R d is locally Lipschitz continuous (→ Def. 11.1.28) then
for all initial conditions (t0 , y0 ) ∈ Ω̂ the IVP (11.1.20) has a solution y ∈ C1 ( J (t0 , y0 ), R d ) with
maximal (temporal) domain of definition J (t0 , y0 ) ⊂ R .
Notation: for autonomous ODE we always have t0 = 0, and therefore we write J (y0 ) := J (0, y0 ).
Let us explain the still mysterious "maximal domain of definition" in the statement of Thm. 11.1.32. It is
related to the fact that every solution of an initial value problem (11.1.33) has its own largest possible time
interval J(y0) ⊂ R on which it is defined naturally.
As an example we consider the autonomous scalar (d = 1) initial value problem, modeling “explosive
growth” with a growth rate increasing linearly with the density:
ẏ = y2 , y(0) = y0 ∈ R . (11.1.36)
The solution is

   y(t) = 1/(y0^{−1} − t)   if y0 ≠ 0 ,      y(t) = 0   if y0 = 0 ,      (11.1.37)

with domains of definition

   J(y0) = ]−∞, y0^{−1}[   if y0 > 0 ,      J(y0) = R   if y0 = 0 ,      J(y0) = ]y0^{−1}, ∞[   if y0 < 0 .

(Fig. 396: solutions y(t) for y0 = −0.5 and y0 = 0.5.)
In this example, for y0 > 0 the solution experiences a blow-up in finite time and ceases to exists afterwards.
For the sake of simplicity we restrict the discussion to autonomous IVPs (11.1.33) with locally Lipschitz
continuous right hand side and make the following assumption. A more general treatment is given in [?].
Now we return to the study of a generic ODE (11.1.2) instead of an IVP (11.1.20). We do this by temporarily
changing the perspective: we fix a “time of interest” t ∈ R \ {0} and follow all trajectories for the duration
t. This induces a mapping of points in state space:
   ➣ mapping  Φ^t : D → D ,  y0 ↦ y(t) ,   where t ↦ y(t) is the solution of the IVP (11.1.33) .
This is a well-defined mapping of the state space into itself, by Thm. 11.1.32 and Ass. 11.1.38.
Now, we may also let t vary, which spawns a family of mappings Φ t of the state space into itself.
However, it can also be viewed as a mapping with two arguments, a duration t and an initial state value
y0 !
The mapping Φ : R × D → D, (t, y0) ↦ Φ^t y0 := y(t), where t ↦ y(t) ∈ C¹(R, R^d) is the unique (global)
solution of the IVP ẏ = f(y), y(0) = y0 , is the evolution operator/mapping for the autonomous ODE
ẏ = f(y).

Note that t ↦ Φ^t y0 describes the solution of ẏ = f(y) for y(0) = y0 (a trajectory). Therefore, by virtue
of the definition, we have

   ∂Φ/∂t (t, y) = f(Φ^t y) .
For d = 2 the action of an evolution operator can be visualized by tracking the movement of point sets in
state space. Here this is done for the Lotka-Volterra ODE (11.1.10):
(Fig. 397: trajectories t ↦ Φ^t y0 in the (u, v)-plane;  Fig. 398: flow map for the Lotka-Volterra system
(α = 2, β = γ = δ = 1): images of a point set X under the state mappings y ↦ Φ^t y for
t = 0, 0.5, 1, 1.5, 2, 3.)
Under Ass. 11.1.38 the evolution operator gives rise to a group of mappings D → D:  Φ^s ∘ Φ^t = Φ^{s+t} ,  Φ^0 = Id .
This is a consequence of the uniqueness theorem Thm. 11.1.32. It is also intuitive: following an evolution
up to time t and then for some more time s leads us to the same final state as observing it for the whole
time s + t.
We target an initial value problem (11.1.20) for a first-order ordinary differential equation
As usual, the right hand side function f may be given only in procedural form, in M ATLAB as
function v = f(t,y),
or in a C++ code as an object providing an evaluation operator, see Rem. 5.1.6. An evaluation of f may
involve costly computations.
Two basic tasks can be identified in the field of numerical integration = approximate solution of initial value
problems for ODEs (Please distinguish from “numerical quadrature”, see Chapter 7.):
(I) Given initial time t0 , final time T , and initial state y0 compute an approximation of y(T ), where
t 7→ y(t) is the solution of (11.1.20). A corresponding function in C++ could look like
State solveivp( double t0, double T,State y0);
Here statedim is the dimension d of the state space that has to be known at compile time.
(II) Output an approximate solution t → yh (t) of (11.1.20) on [t0 , T ] up to final time T 6= t0 for “all
times” t ∈ [t0 , T ] (actually for many times t0 = τ0 < τ1 < τ2 < · · · < τm−1 < τm = T
consecutively): “plot solution”!
s t d :: v e c t o r <State>
solveivp(State y0, const s t d :: v e c t o r < double > &tauvec);
This section presents three methods that provide a piecewise linear, that is, “polygonal” approximation of
solution trajectories t 7→ y(t), cf. Ex. 5.1.10 for d = 1.
As in Section 6.5.1 the polygonal approximation in this section will be based on a (temporal) mesh (→
§ 6.5.1)
covering the time interval of interest between initial time t0 and final time T > t0 . We assume that the
interval of interest is contained in the domain of definition of the solution of the IVP: [t0 , T ] ⊂ J (t0 , y0 ).
For d = 1 polygonal methods can be constructed by geometric considerations in the t-y plane, a model
for the extended state space. We explain this for the Riccati differential equation, a scalar ODE:

   ẏ = y² + t²   ➤   d = 1 ,  I, D = R⁺ .      (11.2.5)

(Fig. 399, Fig. 400: tangent field and solution curves of the Riccati ODE in the t-y plane.)
Temporal mesh:  M := { tj := j/5 : j = 0, . . . , 5 } ,  applied to the Riccati ODE

   ẏ = y² + t² .      (11.2.5)

Here: y0 = 1/2, t0 = 0, T = 1.  ✄ Fig. 402:  — =ˆ "Euler polygon" for uniform timestep h = 0.2,
↦ =ˆ tangent field of the Riccati ODE, together with the exact solution.
Formula: When applied to a general IVP of the form (11.1.20) the explicit Euler method generates a
sequence (yk)_{k=0}^{N} by the recursion

   yk+1 = yk + hk f(tk , yk) ,   k = 0, . . . , N − 1 ,   hk := tk+1 − tk .      (11.2.7)

One can obtain (11.2.7) by approximating the derivative d/dt by a forward difference quotient on the
(temporal) mesh M := {t0 , t1 , . . . , tN}:

   ẏ = f(t, y)   ←→   (yk+1 − yk)/hk = f(tk , yh(tk)) ,   k = 0, . . . , N − 1 .      (11.2.9)
Difference schemes follow a simple policy for the discretization of differential equations: replace all deriva-
tives by difference quotients connecting solution values on a set of discrete points (the mesh).
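A minimal C++/Eigen realization of the explicit Euler recursion (11.2.7) on a prescribed temporal mesh
(a sketch with our own function name, not one of the lecture codes):

#include <Eigen/Dense>
#include <functional>
#include <vector>

// Explicit Euler method (11.2.7) on the temporal mesh t[0] < ... < t[N];
// returns the sequence y_0, ..., y_N of approximate states.
std::vector<Eigen::VectorXd>
explicitEuler(const std::function<Eigen::VectorXd(double, const Eigen::VectorXd &)> &f,
              const std::vector<double> &t, const Eigen::VectorXd &y0) {
  std::vector<Eigen::VectorXd> y{y0};
  for (std::size_t k = 0; k + 1 < t.size(); ++k) {
    const double h = t[k + 1] - t[k];                // local stepsize h_k
    y.push_back(y.back() + h * f(t[k], y.back()));   // y_{k+1} = y_k + h_k f(t_k, y_k)
  }
  return y;
}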
To begin with, the explicit Euler recursion (11.2.7) produces a sequence y0 , . . . , y N of states. How does it
deliver on the task (I) and (II) stated in § 11.2.1? By “geometric insight” we expect
yk ≈ y(tk ) .
(As usual, we use the notation t 7→ y(t) for the exact solution of an IVP.)
Task (II): The trajectory t ↦ y(t) is approximated by the piecewise linear function ("Euler polygon")

   yh : [t0 , tN] → R^d ,   yh(t) := yk (tk+1 − t)/(tk+1 − tk) + yk+1 (t − tk)/(tk+1 − tk)   for t ∈ [tk , tk+1] ,      (11.2.11)
see Fig. 402. This function can easily be sampled on any grid of [t0 , t N ]. In fact, it is the M-piecewise
linear interpolant of the data points (tk , yk ), k = 0, . . . , N , see Section 5.3.2).
The same considerations apply to the methods discussed in the next two sections and will not be repeated
there.
Why a forward difference quotient and not a backward difference quotient? Let's try! Using the backward
difference quotient we obtain

   ẏ = f(t, y)   ←→   (yk+1 − yk)/hk = f(tk+1 , yh(tk+1)) ,   k = 0, . . . , N − 1 ,      (11.2.12)

which yields the recursion of the implicit Euler method

   yk+1 = yk + hk f(tk+1 , yk+1) ,   k = 0, . . . , N − 1 .      (11.2.13)

Note: (11.2.13) requires solving a (possibly non-linear) system of equations to obtain yk+1 !
(➤ Terminology "implicit")
Geometry of the implicit Euler method (✁ Fig. 403): approximate the solution through (t0 , y0) on [t0 , t1] by
• a straight line through (t0 , y0)
• with slope f(t1 , y1).
(In Fig. 403: — =ˆ trajectory through (t0 , y0),  — =ˆ trajectory through (t1 , y1),  — =ˆ tangent to the latter in (t1 , y1).)
Issue: Is (11.2.13) well defined, that is, can we solve it for yk+1 and is this solution unique?
Intuition: for small timesteps h > 0 the right hand side of (11.2.13) is a “small perturbation of the identity”.
Formal: Consider an autonomous ODE ẏ = f(y), assume a continuously differentiable right hand side
function f, f ∈ C¹(D, R^d), and regard (11.2.13) as an h-dependent non-linear system of equations:

   G(h, z) := z − h f(tk+1 , z) − yk = 0 .

To investigate the solvability of this non-linear equation we start with an observation about a partial
derivative of G:

   dG/dz (h, z) = I − h Dy f(tk+1 , z)   ⇒   dG/dz (0, z) = I .
In addition, G (0, yk ) = 0. Next, recall the implicit function theorem [?, Thm. 7.8.1]:
If the Jacobian ∂G/∂y (p0) ∈ R^{ℓ,ℓ} is invertible, then there is an open neighborhood U of x0 ∈ R^k and
a continuously differentiable function g : U → R^ℓ with g(x0) = y0 such that G(x, g(x)) = 0 for all x ∈ U.
For sufficiently small |h| it permits us to conclude that the equation G (h, z) = 0 defines a continuous
function g = g(h) with g(0) = yk .
➣ for sufficiently small h > 0 the equation (11.2.13) has a unique solution yk+1 .
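For illustration, a sketch of one implicit Euler step (11.2.13) for an autonomous ODE: the non-linear
equation z − h f(z) − yk = 0 is solved by a plain Newton iteration started from the explicit Euler
predictor; the Jacobian Df of f must be supplied. Interface, iteration count and tolerance are our own
choices, not part of the lecture codes.

#include <Eigen/Dense>
#include <functional>

// One step of the implicit Euler method (11.2.13) for an autonomous ODE:
// solve z - h*f(z) - y = 0 for z = y_{k+1} by Newton's method.
Eigen::VectorXd implicitEulerStep(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd &)> &Df,
    const Eigen::VectorXd &y, double h, double tol = 1e-10) {
  Eigen::VectorXd z = y + h * f(y);   // initial guess: explicit Euler predictor
  for (int i = 0; i < 20; ++i) {
    Eigen::VectorXd G = z - h * f(z) - y;                       // G(h, z)
    Eigen::MatrixXd DG = Eigen::MatrixXd::Identity(y.size(), y.size())
                         - h * Df(z);                           // dG/dz = I - h Df(z)
    Eigen::VectorXd s = DG.lu().solve(G);                       // Newton correction
    z -= s;
    if (s.norm() <= tol * z.norm()) break;                      // correction-based stop
  }
  return z;
}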
Besides using forward or backward difference quotients, the derivative ẏ can also be approximated by the
symmetric difference quotient, see also (5.2.44),

   ẏ(t) ≈ ( y(t + h) − y(t − h) ) / (2h) .      (11.2.16)

The idea is to apply this formula at t = ½(tk + tk+1) with h = hk/2, which transforms the ODE into

   ẏ = f(t, y)   ←→   (yk+1 − yk)/hk = f( ½(tk + tk+1), yh(½(tk + tk+1)) ) ,   k = 0, . . . , N − 1 .      (11.2.17)

The trouble is that the value yh(½(tk + tk+1)) does not seem to be available, unless we recall that the
approximate trajectory t ↦ yh(t) is supposed to be piecewise linear, which implies yh(½(tk + tk+1)) =
½(yh(tk) + yh(tk+1)). This gives the recursion formula for the implicit midpoint method in analogy to
(11.2.7) and (11.2.13):

   yk+1 = yk + hk f( ½(tk + tk+1), ½(yk + yk+1) ) ,   k = 0, . . . , N − 1 ,      (11.2.18)
Now we fit the numerical schemes introduced in the previous section into a more general class of methods
for the solution of (autonomous) initial value problems (11.1.33) for ODEs. Throughout we assume that all
times considered belong to the domain of definition of the unique solution t → y(t) of (11.1.33), that is,
for T > 0 we take for granted [0, T ] ⊂ J (y0 ) (temporal domain of definition of the solution of an IVP is
explained in § 11.1.34).
11.3.1 Definition
If y0 is the initial value, then y1 := Ψ(h, y0) can be regarded as an approximation of y(h), the value
returned by the evolution operator (→ Def. 11.1.39) for ẏ = f(y) applied to y0 over the period h.
In a sense, the polygonal approximation methods are based on approximations of the evolution operator
associated with the ODE.
This is what every single step method does: it tries to approximate the evolution operator Φ for an ODE
by a mapping of the type (11.3.2).
➙ mapping Ψ from (11.3.2) is called discrete evolution.
The adjective “discrete” used above designates (components of) methods that attempt to approximate the
solution of an IVP by a sequence of finitely many states. “Discretization” is the process of converting an
ODE into a discrete model. This parlance is adopted for all procedures that reduce a “continuous model”
involving ordinary or partial differential equations to a form with a finite number of unknowns.
Above we identified the discrete evolutions underlying the polygonal approximation methods. Vice versa,
a mapping Ψ as given in (11.3.2) defines a single step method.
Definition 11.3.5. Single step method (for autonomous ODE) → [?, Def. 11.2]

Given a discrete evolution Ψ : Ω ⊂ R × D → R^d, the recursion

   yk+1 := Ψ(hk , yk) ,   hk := tk+1 − tk ,   k = 0, . . . , N − 1 ,

defines a single step method (SSM) for the autonomous IVP ẏ = f(y), y(0) = y0 on the interval
[0, T].
☞ In a sense, a single step method defined through its associated discrete evolution does not ap-
proximate a concrete initial value problem, but tries to approximate an ODE in the form of its
evolution operator.
In M ATLAB syntax a discrete evolutions can be incarnated by a function of the following form:
Ψh y ←→ function y1 = discevl(h,y0) .
( function y1 = discevl(@(y) rhs(y),h,y0) )
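In C++ a discrete evolution can analogously be incarnated as a functor (h, y) ↦ Ψ^h y, and a generic
single step method is then just a loop applying it over the mesh. A sketch with our own type and function
names:

#include <Eigen/Dense>
#include <functional>
#include <vector>

using State = Eigen::VectorXd;
using DiscEvl = std::function<State(double, const State &)>;  // y1 = Psi(h, y0)

// Generic single step method: y_{k+1} = Psi^{h_k} y_k on the mesh t[0..N]
std::vector<State> singleStepMethod(const DiscEvl &Psi,
                                    const std::vector<double> &t,
                                    const State &y0) {
  std::vector<State> y{y0};
  for (std::size_t k = 0; k + 1 < t.size(); ++k)
    y.push_back(Psi(t[k + 1] - t[k], y.back()));
  return y;
}

// Example: discrete evolution of the explicit Euler method for y' = f(y)
DiscEvl eulerEvolution(const std::function<State(const State &)> &f) {
  return [f](double h, const State &y) { return State(y + h * f(y)); };
}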
The concept of single step method according to Def. 11.3.5 can be generalized to non-autonomous ODEs,
which leads to recursions of the form:
yk+1 := Ψ(tk , tk+1 , yk ) , k = 0, . . . , N − 1 ,
for a discrete evolution operator Ψ defined on I × I × D.
All meaningful single step methods turn out to be modifications of the explicit Euler method (11.2.7).
   Ψ^h y = y + h ψ(h, y)   with   ψ : I × D → R^d continuous ,   ψ(0, y) = f(y) .      (11.3.9)
A single step method according to Def. 11.3.5 based on a discrete evolution of the form (11.3.9) is
called consistent with the ODE ẏ = f(y).
The discrete evolution Ψ and, hence, the function ψ = ψ(h, y) for the implicit midpoint method are defined
only implicitly, of course. Thus, consistency cannot immediately be seen from a formula for ψ.
   yk+1 = yk + h f( ½(tk + tk+1), ½(yk + yk+1) ) ,   k = 0, . . . , N − 1 .      (11.2.18)

Assume that the timestep h is sufficiently small. Then, for an autonomous ODE,

   yk+1  =(11.2.18)  yk + h f( ½(yk + yk+1) )  =  yk + h f( yk + ½ h f( ½(yk + yk+1) ) )  =:  yk + h ψ(h, yk) .
Since, by the implicit function theorem, yk+1 continuously depends on h and yk , ψ(h, yk ) has the desired
properties, in particular ψ(0, y) = f(y) is clear.
Many authors specify a single step method by writing down the first step for a general stepsize h
y1 = expression in y0 , h and f .
Actually, this fixes the underlying discrete evolution. Also this course will sometimes adopt this practice.
Here we resume and continue the discussion of Rem. 11.2.10 for general single step methods according
to Def. 11.3.5. Assuming unique solvability of the systems of equations faced in each step of an implicit
method, every single step method based on a mesh M = {0 = t0 < t1 < · · · < t N := T } produces a
finite sequence (y0 , y1 , . . . , y N ) of states, where the first agrees with the initial state y0 .
We expect that the states provide a pointwise approximation of the solution trajectory t → y(t):
yk ≈ y(tk ) , k = 1, . . . , N .
Thus task (I) from § 11.2.1, computing an approximation for y(T ), is again easy: output y N as an
approximation of y(T ).
Task (II) from § 11.2.1, computing the solution trajectory, requires interpolation of the data points (tk , yk )
using some of the techniques presented in Chapter 5. The natural option is M-piecewise polynomial
interpolation, generalizing the polygonal approximation (11.2.11) used in Section 11.2.
Note that from the ODE ẏ = f(y) the derivatives ẏh (tk ) = f(yk ) are available without any further
approximation. This facilitates cubic Hermite interpolation (→ Def. 5.4.1), which yields
   yh ∈ C¹([0, T]) :   yh|[tk−1 , tk] ∈ P3 ,   yh(tk) = yk ,   dyh/dt (tk) = f(yk) .
Summing up, an approximate trajectory t ↦ yh(t) is built in two stages: first compute the sequence (yk)
with the single step method, then post-process by interpolation of the data points (tk , yk).
Supplementary reading. See [?, Sect. 11.5] and [?, Sect. 11.3] for related presentations.
Errors in numerical integration are called discretization errors, cf. Rem. 11.3.4.
Depending on the objective of numerical integration as stated in § 11.2.1, different notions of discretization
error are appropriate.
(I) If only the solution at final time is sought, the discretization error is
ǫ N : = k y( T ) − y N k ,
(II) If we want to approximate the solution trajectory for (11.1.33), the discretization error is the function
     t ↦ e(t) := y(t) − yh(t) .
(III) Between (I) and (II) is the pointwise discretization error, which is the sequence (grid function)
     e : M → D ,   ek := y(tk) − yk ,   k = 0, . . . , N .      (11.3.15)
In this case we may consider the maximum error in the mesh points,  max_k ‖ek‖ ,
where ‖·‖ is a suitable vector norm on R^d, usually the Euclidean vector norm.
Once the discrete evolution Ψ associated with the ODE ẏ = f(y) is specified, the single step method
according to Def. 11.3.5 is fixed. The only way to control the accuracy of the solution y N or t 7→ yh (t) is
through the selection of the mesh M = {0 = t0 < t1 < · · · < t N = T }.
Hence we study convergence of single step methods for families of meshes {Mℓ } and track the decay of
(a norm) of the discretization error (→ § 11.3.14) as a function of the number N := ♯M of mesh points.
In other words, we examine h-convergence. We already did this in the case of piecewise polynomial
interpolation in Section 6.5.1 and composite numerical quadrature in Section 7.4.
When investigating asymptotic convergence of single step methods we often resort to families of equidis-
tant meshes of [0, T ]:
   M_N := { tk := (k/N)·T : k = 0, . . . , N } .      (11.3.17)

We also call this the use of uniform timesteps of size h := T/N.
✦ We consider the following IVP for the logistic ODE, see Ex. 11.1.5
✦ We apply explicit and implicit Euler methods (11.2.7)/(11.2.13) with uniform timestep h = 1/N ,
N ∈ {5, 10, 20, 40, 80, 160, 320, 640}.
✦ Monitored: Error at final time E(h) := |y(1) − y N |
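A sketch of how such a convergence study can be coded for the explicit Euler method (the exact logistic
solution serves as reference; the concrete values λ = 3, y(0) = 0.01 and the range of N are our choices
modelled on the experiment above, and the rate estimation via log2 of consecutive error quotients is our
own addition):

#include <cmath>
#include <iostream>

// Empirical convergence study for the scalar IVP y' = lambda*y*(1-y), y(0)=y0:
// refine the uniform timestep and monitor the error at final time T = 1.
int main() {
  const double lambda = 3.0, y0 = 0.01, T = 1.0;
  auto yexact = [&](double t) {
    return y0 / (y0 + (1.0 - y0) * std::exp(-lambda * t));  // exact logistic solution
  };
  double errOld = 0.0;
  for (int N = 5; N <= 640; N *= 2) {
    const double h = T / N;
    double y = y0;
    for (int k = 0; k < N; ++k) y += h * lambda * y * (1.0 - y);  // explicit Euler
    const double err = std::abs(y - yexact(T));                   // E(h)
    if (errOld > 0.0)
      std::cout << "h = " << h << ", E(h) = " << err
                << ", estimated rate = " << std::log2(errOld / err) << std::endl;
    errOld = err;
  }
}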
(Fig. 405, Fig. 406: error |y(1) − yN| vs. timestep h, doubly logarithmic, for the explicit (left) and implicit
(right) Euler method with λ = 1, 3, 6, 9; both error curves behave like O(h). A corresponding experiment
for the implicit midpoint method does better: its error decays like O(h²).)
Parlance: based on the observed rate of algebraic convergence, the two Euler methods are said to
“converge with first order”, whereas the implicit midpoint method is called “second-order con-
vergent”.
The observations made for polygonal timestepping methods reflect a general pattern: assume that the
right hand side f is sufficiently smooth. Then customary single step methods (→ Def. 11.3.5) will enjoy
algebraic convergence in the mesh-width; more precisely, see [?, Thm. 11.25],

there is a p ∈ N such that the sequence (yk)k generated by the single step method
for ẏ = f(t, y) on a mesh M := {t0 < t1 < · · · < tN = T} satisfies

   max_k ‖yk − y(tk)‖ ≤ C h^p   for sufficiently small mesh width h := max_k (tk − tk−1) ,      (11.3.20)

with a constant C > 0 that does not depend on the mesh.
The minimal integer p ∈ N for which (11.3.20) holds for a single step method when applied to an
ODE with (sufficiently) smooth right hand side, is called the order of the method.
As in the case of quadrature rules (→ Def. 7.3.1) their order is the principal intrinsic indicator for the
“quality” of a single step method.
(11.3.22) Convergence analysis for the explicit Euler method [?, Ch. 74]
We consider the explicit Euler method (11.2.7) on a mesh M := {0 = t0 < t1 < · · · < tN = T} for a
generic autonomous IVP (11.1.20) with sufficiently smooth and (globally) Lipschitz continuous f, that is,

   ∃ L > 0 :   ‖f(y) − f(z)‖ ≤ L ‖y − z‖   for all y, z ∈ D ,

and exact solution t ↦ y(t). Throughout we assume that solutions of ẏ = f(y) are defined on [0, T] for
all initial states y0 ∈ D.
We argue that in this context the abstraction pays off, because it helps elucidate a general technique for
the convergence analysis of single step methods.
Fundamental error splitting (cf. Fig. 408):

   ek+1 = Ψ^{hk} yk − Φ^{hk} y(tk)
        = [ Ψ^{hk} yk − Ψ^{hk} y(tk) ]     (propagated error)
        + [ Ψ^{hk} y(tk) − Φ^{hk} y(tk) ]  (one-step error) .      (11.3.25)
A generic one-step error expressed through continuous and discrete evolutions:

   τ(h, y) := Ψ^h y − Φ^h y .      (11.3.26)

Geometric considerations (Fig. 410, Fig. 411): the distance of a smooth curve and its tangent shrinks as
the square of the distance to the intersection point (locally the curve looks like a parabola in the ξ-η
coordinate system).
The geometric considerations can be made rigorous by analysis: recall Taylor's formula for a function
y ∈ C^{K+1} [?, Satz 5.5.1]:

   y(t + h) − y(t) = Σ_{j=1}^{K} (h^j / j!) y^{(j)}(t) + ∫_t^{t+h} ((t + h − τ)^K / K!) y^{(K+1)}(τ) dτ ,      (11.3.27)

where the remainder equals (y^{(K+1)}(ξ) / (K+1)!) h^{K+1} for some ξ ∈ [t, t + h]. We conclude that, if
y ∈ C²([0, T]), which is ensured for smooth f, see Lemma 11.1.4, then

   y(tk+1) − y(tk) = ẏ(tk) hk + ½ ÿ(ξk) hk² = f(y(tk)) hk + ½ ÿ(ξk) hk² ,

for some tk ≤ ξk ≤ tk+1. This leads to an expression for the one-step error from (11.3.26):
   ǫk ≤ Σ_{l=1}^{k} ( Π_{j=1}^{l−1} (1 + L hj) ) ρl ,   k = 1, . . . , N .      (11.3.31)

(11.3.31)  ⇒  ǫk ≤ Σ_{l=1}^{k} ( Π_{j=1}^{l−1} exp(L hj) ) ρl = Σ_{l=1}^{k} exp( L Σ_{j=1}^{l−1} hj ) ρl .

Note:  Σ_{j=1}^{l−1} hj ≤ T for final time T, and conclude

   ǫk ≤ exp(LT) Σ_{l=1}^{k} ρl ≤ exp(LT) max_l (ρl / hl) Σ_{l=1}^{k} hl ≤ T exp(LT) max_{l=1,...,k} hl · max_{t0 ≤ τ ≤ tk} ‖ÿ(τ)‖ .
We can summarize the insight gleaned through this theoretical analysis as follows:
✦ Error bound grows exponentially with the length T of the integration interval.
In the analysis of the global discretization error of the explicit Euler method in § 11.3.22 a one-step error
of size O(h2k ) led to a total error of O(h) through the effect of error accumulation over N ≈ h−1 steps.
This relationship remains valid for almost all single step methods:
Consider an IVP (11.1.20) with solution t ↦ y(t) and a single step method defined by the
discrete evolution Ψ (→ Def. 11.3.5). If the one-step error along the solution trajectory satisfies (Φ
is the evolution map associated with the ODE, see Def. 11.1.39)

   ‖Ψ^h y(t) − Φ^h y(t)‖ ≤ C h^{p+1}   uniformly for sufficiently small h ,

then the discretization error in the mesh points behaves like max_k ‖yk − y(tk)‖ = O(h^p).
A rigorous statement as a theorem would involve some particular assumptions on Ψ, which we do not
want to give here. These assumptions are satisfied, for instance, for all the methods presented in the
sequel.
Supplementary reading. [?, Sect. 11.6], [?, Ch. 76], [?, Sect. 11.8]
So far we only know first and second order methods from 11.2: the explicit and implicit Euler method
(11.2.7) and (11.2.13), respectively, are of first order, the implicit midpoint rule of second order. We
observed this in Ex. 11.3.18 and it can be proved rigorously for all three methods adapting the arguments
of § 11.3.22.
Thus, barring the impact of roundoff, the low-order polygonal approximation methods are guaranteed to
achieve any prescribed accuracy provided that the mesh is fine enough. Why should we need any other
timestepping schemes?
Remark 11.4.1 (Rationale for high-order single step methods cf. [?, Sect. 11.5.3])
We argue that the use of higher-order timestepping methods is highly advisable for the sake of efficiency.
The reasoning is very similar to that of Rem. 7.3.48, when we considered numerical quadrature. The
reader is advised to study that remark again.
As we saw in § 11.3.16 error bounds for single step methods for the solution of IVPs will inevitably feature
unknown constants “C > 0”. Thus they do not give useful information about the discretization error for
a concrete IVP and mesh. Hence, it is too ambitious to ask how many timesteps are needed so that
ky(T ) − y N k stays below a prescribed bound, cf. the discussion in the context of numerical quadrature.
The usual concept of "computational effort" for single step methods (→ Def. 11.3.5) is as follows:

Computational effort  ∼  total number of f-evaluations for approximately solving the IVP,
                      ∼  number of timesteps, if the evaluation of the discrete evolution Ψ^h (→ Def. 11.3.5)
                         requires a fixed number of f-evaluations,
                      ∼  h^{−1}, in the case of uniform timestep size h > 0 (equidistant mesh (11.3.17)).
Now, let us consider a single step method of order p ∈ N, employed with a uniform timestep hold . We
focus on the maximal discretization error in the mesh points, see § 11.3.14. As in (7.3.49) we assume that
the asymptotic error bounds are sharp:
Goal:   err(hnew) / err(hold)  =!  1/ρ   for a reduction factor ρ > 1 .

(11.3.20)  ⇒  (hnew / hold)^p  =!  1/ρ   ⇔   hnew = ρ^{−1/p} hold .
☞ the larger the order p, the less effort for a prescribed reduction of the error!
We remark that another (minor) rationale for using higher-order methods [?, Sect. 11.5.3]: curb impact of
roundoff errors (→ Section 1.5.3) accumulating during timestepping.
Now we will build a class of methods that are explicit and achieve orders p > 2. The starting point is
a simple integral equation satisfied by any solution t 7→ y(t) of an initial value problems for the ODE
ẏ = f(y):
   IVP:   ẏ(t) = f(t, y(t)) ,  y(t0) = y0   ⇒   y(t1) = y0 + ∫_{t0}^{t1} f(τ, y(τ)) dτ .
What error can we afford in the approximation of y(t0 + ci h) (under the assumption that f is Lipschitz
continuous)? We take the cue from the considerations in § 11.3.22.
Note that there is a factor h in front of the quadrature sum in (11.4.3). Thus, our goal can already be
achieved, if only
This is accomplished by a less accurate discrete evolution than the one we are about to build. Thus,
we can construct discrete evolutions of higher and higher order, in turns, starting with the explicit Euler
method. All these methods will be explicit, that is, y1 can be computed directly from point values of f.
Now we apply the boostrapping idea outlined above. We write kℓ ∈ R d for the approximations of y(t0 +
c i h ).
• Quadrature formula = trapezoidal rule (7.2.5):

   Q(f) = ½ (f(0) + f(1))   ↔   s = 2 :  c1 = 0, c2 = 1 ,  b1 = b2 = ½ ,      (11.4.5)

  and y(t1) approximated by an explicit Euler step (11.2.7):

   k1 = f(t0 , y0) ,   k2 = f(t0 + h, y0 + h k1) ,   y1 = y0 + (h/2)(k1 + k2) .      (11.4.6)

  (11.4.6) = explicit trapezoidal rule (for numerical integration of ODEs).

• Quadrature formula → simplest Gauss quadrature formula = midpoint rule (→ Ex. 7.2.3) & y(½(t0 + t1))
  approximated by an explicit Euler step (11.2.7):

   k1 = f(t0 , y0) ,   k2 = f(t0 + h/2 , y0 + (h/2) k1) ,   y1 = y0 + h k2 .      (11.4.7)

  (11.4.7) = explicit midpoint method (for numerical integration of ODEs) [?, Alg. 11.18].
We perform an empiric study of the order of the explicit single step methods constructed in Ex. 11.4.4.
✦ IVP: ẏ = 10y(1 − y) (logistic ODE (11.1.6)), y(0) = 0.01, T = 1,
✦ Explicit single step methods, uniform timestep h.
(Fig. 412: exact solution y(t) and approximate solutions for the explicit Euler, explicit trapezoidal and
explicit midpoint rules;  Fig. 413: errors |yh(1) − y(1)| vs. stepsize h, doubly logarithmic: O(h) for the
s = 1 explicit Euler method, O(h²) for the two s = 2 methods.)
Definition 11.4.9. Explicit Runge-Kutta method

For coefficients bi , aij , ci ∈ R, i, j = 1, . . . , s, the recursion

   ki := f( t0 + ci h , y0 + h Σ_{j=1}^{i−1} aij kj ) ,  i = 1, . . . , s ,      y1 := y0 + h Σ_{i=1}^{s} bi ki ,

defines an s-stage explicit Runge-Kutta single step method (RK-SSM) for the ODE ẏ = f(t, y).
The vectors ki ∈ R^d, i = 1, . . . , s, are called increments, h > 0 is the size of the timestep.
Recall Rem. 11.3.12 to understand how the discrete evolution for an explicit Runge-Kutta method is spec-
ified in this definition by giving the formulas for the first step. This is a convention widely adopted in the
literature about numerical methods for ODEs. Of course, the increments ki have to be computed anew in
each timestep.
The implementation of an s-stage explicit Runge-Kutta single step method according to Def. 11.4.9 is
straightforward: The increments ki ∈ R d are computed successively, starting from k1 = f(t0 + c1 h, y0 ).
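A generic single step coded directly from Def. 11.4.9, with the Butcher data (A, b, c) passed as dense
Eigen objects (a sketch with our own interface, not the ode45 class mentioned below):

#include <Eigen/Dense>
#include <functional>
#include <vector>

// One step of an explicit s-stage Runge-Kutta method given by its Butcher
// data (A strictly lower triangular, b, c): the increments k_i are computed
// successively, see Def. 11.4.9.
Eigen::VectorXd rkStep(
    const std::function<Eigen::VectorXd(double, const Eigen::VectorXd &)> &f,
    const Eigen::MatrixXd &A, const Eigen::VectorXd &b, const Eigen::VectorXd &c,
    double t0, const Eigen::VectorXd &y0, double h) {
  const int s = static_cast<int>(b.size());
  std::vector<Eigen::VectorXd> k(s);
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd tmp = y0;
    for (int j = 0; j < i; ++j) tmp += h * A(i, j) * k[j];  // only a_ij with j < i are used
    k[i] = f(t0 + c(i) * h, tmp);                           // increment k_i
  }
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) y1 += h * b(i) * k[i];        // y1 = y0 + h * sum b_i k_i
  return y1;
}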
Shorthand notation for (explicit) Runge-Kutta methods [?, (11.75)]: the Butcher scheme ✄

      c | A        c1 |  0
      --+---  :=   c2 | a21  0                                                 (11.4.11)
        | b⊤        ⋮ |  ⋮    ⋱   ⋱
                   cs | as1  ···  as,s−1  0
                   ---+-------------------
                      | b1   ···  bs−1    bs

(Note: A is a strictly lower triangular s × s-matrix.)
Note that in Def. 11.4.9 the coefficients bi can be regarded as weights of a quadrature formula on [0, 1]:
apply explicit Runge-Kutta single step method to “ODE” ẏ = f (t). The quadrature rule with these weights
and nodes c j will have order ≥ 1, if the weights add up to 1!
A Runge-Kutta single step method according to Def. 11.4.9 is consistent (→ Def. 11.3.10) with the
ODE ẏ = f(t, y) if and only if

   Σ_{i=1}^{s} bi = 1 .
Example 11.4.13 (Butcher schemes for some explicit RK-SSM [?, Sect. 11.6.1])
The following explicit Runge-Kutta single step methods are often mentioned in literature.
• Explicit Euler method (11.2.7):        0 | 0
                                         --+--            ➣ order = 1
                                           | 1

• Explicit trapezoidal rule (11.4.6):    0 | 0   0
                                         1 | 1   0
                                         --+--------      ➣ order = 2
                                           | 1/2 1/2

• Explicit midpoint rule (11.4.7):       0   | 0   0
                                         1/2 | 1/2 0
                                         ----+------      ➣ order = 2
                                             | 0   1

• Classical 4th-order RK-SSM:            0   | 0   0   0   0
                                         1/2 | 1/2 0   0   0
                                         1/2 | 0   1/2 0   0
                                         1   | 0   0   1   0
                                         ----+--------------      ➣ order = 4
                                             | 1/6 2/6 2/6 1/6

• Kutta's 3/8-rule:                      0   | 0    0   0   0
                                         1/3 | 1/3  0   0   0
                                         2/3 | −1/3 1   0   0
                                         1   | 1   −1   1   0
                                         ----+----------------    ➣ order = 4
                                             | 1/8  3/8 3/8 1/8
Hosts of (explicit) Runge-Kutta methods can be found in the literature, see for example the Wikipedia page.
They are stated in the form of Butcher schemes (11.4.11) most of the time.
Runge-Kutta single step methods of order p > 2 are not found by bootstrapping as in Ex. 11.4.4, because
the resulting methods would have quite a lot of stages compared to their order.
Rather one derives order conditions yielding large non-linear systems of equations for the coefficients aij
and bi in Def. 11.4.9, see [?, Sect .4.2.3] and [?, Ch. III]. This approach is similar to the construction of a
Gauss quadrature rule in Ex. 7.3.13. Unfortunately, the systems of equations are very difficult to solve and
no universal recipe is available. Nevertheless, through massive use of symbolic computation, Runge-Kutta
methods of order up to 19 have been constructed in this way.
The following table gives lower bounds for the number of stages needed to achieve order p for an explicit
Runge-Kutta method.
order p 1 2 3 4 5 6 7 8 ≥9
minimal no. s of stages 1 2 3 4 6 7 9 11 ≥ p+3
No general formula has been discovered. What is known is that for explicit Runge-Kutta single step
methods according to Def. 11.4.9

   order p ≤ number s of stages of the RK-SSM.
An implementation of an explicit embedded Runge-Kutta single-step method with adaptive stepsize con-
trol for solving an autonomous IVP is provided by the utility class ode45. The terms “embedded” and
“adaptive” will be explained in Section 11.5.
(i) StateType: type for vectors in state space V , e.g. a fixed size vector type of E IGEN:
Eigen::Matrix<double,N,1>, where N is an integer constant § 11.2.1.
(ii) RhsType: a functor type, see Section 0.2.3, for the right hand side function f; must match State-
Type, default type provided.
The functor for the right hand side f : D ⊂ V → V of the ODE ẏ = f(y) is specified as an argument of
the constructor.
2. T: the final time T , initial time t0 = 0 is assumed, because the class can deal with autonomous
ODEs only, recall § 11.1.21.
3. norm: a functor returning a suitable norm for a state vector. Defaults to E IGEN’s maximum vector
norm.
The method returns a vector of 2-tuples (yk , tk ) (note the order!), k = 0, . . . , N , of temporal mesh points
tk , t0 = 0, t N = T , see § 11.2.2, and approximate states yk ≈ y(tk ), where t 7→ y(t) stands for the
exact solution of the initial value problem.
The next self-explanatory code snippet uses the numerical integrator class ode45 for solving a scalar
autonomous ODE.
  for (auto state : states)
    std::cout << "t = " << state.second << ", y = " << state.first
              << ", |err| = " << fabs(state.first - y(state.second)) << std::endl;
}
M ATLAB provides a built-in numerical integrator based on explicit RK-SSM, see [?] and [?, Sect. 7.2]. Its
calling syntax is
[t,y] = ode45(odefun,tspan,y0);
tnew = t + hA(6);
if done, tnew = tfinal; end   % Hit end point exactly.
h = tnew - t;                 % Purify h.
ynew = y + f*hB(:,6);
% ... (stepsize control, see Sect. 11.5, dropped)
Chemical reaction kinetics is a field where ODE based models are very common. This example presents
a famous reaction with extremely abrupt dynamics. Refer to [?, Ch. 62] for more information about the
ODE-based modelling of kinetics of chemical reactions.
y1 := c(BrO3−):   ẏ1 = −k1 y1 y2 − k3 y1 y3 ,
y2 := c(Br−):     ẏ2 = −k1 y1 y2 − k2 y2 y3 + k5 y5 ,
y3 := c(HBrO2):   ẏ3 =  k1 y1 y2 − k2 y2 y3 + k3 y1 y3 − 2 k4 y3² ,        (11.5.3)
y4 := c(Org):     ẏ4 =  k2 y2 y3 + k4 y3² ,
y5 := c(Ce(IV)):  ẏ5 =  k3 y1 y3 − k5 y5 ,
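The right hand side f of (11.5.3) can be implemented directly as a function for a numerical integrator. The following EIGEN-based sketch keeps the rate constants k1, . . . , k5 as parameters, since their values are not listed here; the function name is illustrative.

#include <Eigen/Dense>

using Vector5d = Eigen::Matrix<double, 5, 1>;

// Right hand side f(y) of (11.5.3); y = (c(BrO3-), c(Br-), c(HBrO2), c(Org), c(Ce(IV)))
Vector5d oregonator_rhs(const Vector5d &y,
                        double k1, double k2, double k3, double k4, double k5) {
  Vector5d dy;
  dy(0) = -k1 * y(0) * y(1) - k3 * y(0) * y(2);
  dy(1) = -k1 * y(0) * y(1) - k2 * y(1) * y(2) + k5 * y(4);
  dy(2) =  k1 * y(0) * y(1) - k2 * y(1) * y(2) + k3 * y(0) * y(2) - 2.0 * k4 * y(2) * y(2);
  dy(3) =  k2 * y(1) * y(2) + k4 * y(2) * y(2);
  dy(4) =  k3 * y(0) * y(2) - k5 * y(4);
  return dy;
}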
[Fig. 414, Fig. 415: concentrations c(t) plotted over t ∈ [0, 200] on a logarithmic scale; the curves exhibit abrupt transients.]
This is very common for evolutions arising from practical models (circuit models, chemical reaction models, mechanical systems).
Consider the scalar autonomous IVP

ẏ = y² ,   y(0) = y0 > 0 ,

with solution y(t) = y0 / (1 − y0 t), which blows up as t approaches 1/y0.

[Fig. 416: solutions y(t) for y0 = 1 and y0 = 0.5 over t ∈ [−1, 2.5].]
How should one choose the temporal mesh {t0 < t1 < · · · < tN−1 < tN} for a single step method when J(y0) is not
known, or, even worse, when it is not clear a priori that a blow-up will happen at all?
Just imagine what will result from equidistant explicit Euler integration (11.2.7) applied to the above IVP.
Simulation with MATLAB's ode45:

MATLAB-code 11.5.5:
fun = @(t,y) y.^2;
[t1,y1] = ode45(fun,[0 2],1);
[t2,y2] = ode45(fun,[0 2],0.5);
[t3,y3] = ode45(fun,[0 2],2);

[Fig. 417: solutions returned by ode45 for y0 = 1, y0 = 0.5, y0 = 2 over t ∈ [−1, 2.5].]
We observe: ode45 manages to reduce the stepsize more and more as it approaches the singularity of the
solution! How can it accomplish this feat?
Be efficient! Be accurate!
Why local-in-time timestep control (based on estimating only the one-step error)?
Consideration: if a small time-local error in a single timestep leads to a large error ‖yk − y(tk)‖ at later
times, then local-in-time timestep control is powerless to prevent it and will not even notice!
We “recycle” heuristics already employed for adaptive quadrature, see Section 7.5, § 7.5.10. There we
tried to get an idea of the local quadrature error by comparing two approximations of different order. Now
we pursue a similar idea over a single timestep.
Φh y(tk) − Ψh y(tk)   ≈   ESTk := Ψ̃h y(tk) − Ψh y(tk) ,        (11.5.8)

where the left-hand side is the one-step error of the lower-order discrete evolution Ψ and Ψ̃ denotes the higher-order discrete evolution.
Compare   ESTk ↔ ATOL (absolute tolerance)   and   ESTk ↔ RTOL·‖yk‖ (relative tolerance)
➣ reject/accept current step.        (11.5.10)
For a similar use of absolute and relative tolerances see Section 8.1.2: termination criteria for iterations,
in particular (8.1.25).
☞ Simple algorithm:
ESTk < max{ATOL, ‖yk‖·RTOL}:  carry out the next timestep (stepsize h);
                              use a larger stepsize (e.g., αh with some α > 1) for the following step (∗)
ESTk > max{ATOL, ‖yk‖·RTOL}:  repeat the current step with a smaller stepsize < h, e.g., ½h
Rationale for (∗): if the current stepsize guarantees sufficiently small one-step error, then it might be
possible to obtain a still acceptable one-step error with a larger timestep, which would enhance efficiency
(fewer timesteps for total numerical integration). This should be tried, since timestep control will usually
provide a safeguard against undue loss of accuracy.
C++11 code 11.5.11: Simple local stepsize control for single step methods
// Auxiliary function: default norm for an Eigen vector type
template <class State>
double _norm(const State &y) { return y.norm(); }
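The loop body of Code 11.5.11 is not reproduced in this excerpt. The following sketch implements the accept/reject policy stated above for a generic pair of discrete evolutions of orders p and p + 1 acting on EIGEN vector types; the functor signatures and the growth factor 1.1 are assumptions, not taken from the lecture code.

#include <algorithm>
#include <utility>
#include <vector>

// Simple local stepsize control (sketch): halve the stepsize on rejection,
// enlarge it by a fixed factor after an accepted step.
template <class DiscEvol, class DiscEvolHigh, class State>
std::vector<std::pair<double, State>>
odeintssctrl_sketch(DiscEvol Psi, DiscEvolHigh PsiHigh, State y0, double T,
                    double h0, double reltol, double abstol, double hmin) {
  std::vector<std::pair<double, State>> states{{0.0, y0}};
  double t = 0.0, h = h0;
  State y = y0;
  while (t < T && h > hmin) {
    h = std::min(h, T - t);            // never step beyond the final time T
    State yl = Psi(h, y);              // low-order step  (order p)
    State yh = PsiHigh(h, y);          // high-order step (order p+1)
    double est = (yh - yl).norm();     // EST_k, cf. (11.5.8)
    if (est < std::max(abstol, reltol * y.norm())) {
      t += h; y = yh;                  // accept; use the better (higher-order) value
      states.push_back({t, y});
      h *= 1.1;                        // try a somewhat larger step next time
    } else {
      h /= 2.0;                        // reject; repeat the step with half the stepsize
    }
  }
  return states;
}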
We face the same conundrum as in the case of adaptive numerical quadrature, see Rem. 7.5.17:

!  By the heuristic considerations, see (11.5.8), it seems that ESTk measures the one-step error for
   the low-order method Ψ and that we should use yk+1 = Ψhk yk, if the timestep is accepted.

However, it would be foolish not to use the better value yk+1 = Ψ̃hk yk, since it is available for free. This is
what is done in every implementation of adaptive methods, also in Code 11.5.11, and this choice can be
justified by control theoretic arguments [?, Sect. 5.2].
We test the adaptive timestepping routine from Code 11.5.11 for a scalar IVP and compare the estimated local
error with the true local error.
✦ IVP for the ODE ẏ = cos²(αy), α > 0, with solution y(t) = arctan(α(t − c))/α for y(0) ∈ ]−π/(2α), π/(2α)[
✦ Simple adaptive timestepping based on the explicit Euler method (11.2.7) and the explicit trapezoidal rule (11.4.6)
[Fig. 418: adaptive timestepping with rtol = 0.01, atol = 0.0001, α = 20: exact solution y(t), approximations yk and rejected steps over t ∈ [0, 2].
Fig. 419: true error |y(tk) − yk| and estimated error ESTk over t ∈ [0, 2].]
Observations:
☞ Adaptive timestepping resolves the local features of the solution y(t) at t = 1 well.
☞ Estimated error (an estimate for the one-step error) and true error are not related! To understand
this recall Rem. 11.5.12.
In this experiment we want to explore whether adaptive timestepping is worthwhile as regards reducing
the computational effort without sacrificing accuracy.
We retain the simple adaptive timestepping from the previous experiment, Ex. 11.5.13, and also study the same
IVP.
New: initial state y(0) = 0!
Now we examine the dependence of the maximal discretization error in mesh points on the computational
effort. The latter is proportional to the number of timesteps.
[Fig. 420: solutions (yk)k produced by simple adaptive timestepping for
rtol ∈ {0.4, 0.2, 0.1, 0.05, 0.025, 0.0125, 0.00625} (α = 40).
Fig. 421: maximal error vs. number N of timesteps, for uniform and adaptive timesteps.]
Observations:
☞ Adaptive timestepping achieves much better accuracy for a fixed computational effort.
Same ODE and simple adaptive timestepping as in the previous experiment, Ex. 11.5.14:

ẏ = cos²(αy)   ⇒   y(t) = arctan(α(t − c))/α ,   y(0) ∈ ]−π/(2α), π/(2α)[ ,

for α = 40.

Now: initial state y(0) = −0.0386 ≈ −π/(2α), as in Ex. 11.5.13.
[Fig. 422: solutions (yk)k produced by simple adaptive timestepping for
rtol ∈ {0.4, 0.2, 0.1, 0.05, 0.025, 0.0125, 0.00625} (α = 40).
Fig. 423: maximal error max_k |y(tk) − yk| vs. number N of timesteps, for uniform and adaptive timesteps.]
Observations:
☞ Adaptive timestepping leads to larger errors at the same computational cost as uniform timestepping!
Explanation: the position of the steep step of the solution depends sensitively on the initial value,
if y(0) ≈ −π/(2α):

y(t) = (1/α) arctan(α t + tan(α y0)) ,   with the steep step located at t ≈ −tan(α y0)/α .

Hence, small local errors in the initial timesteps will lead to large errors at around time t ≈ 1. The stepsize
control is mistaken in condoning these small one-step errors in the first few steps and, therefore, incurs
huge errors later.
However, the perspective of backward error analysis (→ § 1.5.84) rehabilitates adaptive stepsize control
in this case: it gives us a numerical solution that is very close to the exact solution of the ODE with slightly
perturbed initial state y0 .
The above algorithm (Code 11.5.11) is simple, but the rule for increasing/shrinking the timestep "squanders"
the information contained in the ratio ESTk : TOL.

More ambitious goal!   When ESTk > TOL: find a better stepsize hk = ? for repeating the step.
                       When ESTk < TOL: predict a good stepsize hk+1 = ? for the next step.
Heuristics: the timestep hk is small ➥ "higher order terms" O(h_k^{p+2}) can be ignored.

Ψ^{hk} y(tk) − Φ^{hk} y(tk)  ≐  c h_k^{p+1} + O(h_k^{p+2}) ,
Ψ̃^{hk} y(tk) − Φ^{hk} y(tk)  ≐  O(h_k^{p+2}) ,
   ⇒   ESTk ≐ c h_k^{p+1} .        (11.5.18)

✎ notation: ≐ means equality up to higher order terms in hk

ESTk ≐ c h_k^{p+1}   ⇒   c ≐ ESTk / h_k^{p+1} .        (11.5.19)
For the sake of accuracy (which stipulates "ESTk < TOL") and efficiency (which favours "ESTk > TOL") we aim for

ESTk ≐ TOL := max{ATOL, ‖yk‖·RTOL} .        (11.5.20)

What timestep h∗ can actually achieve (11.5.20), if we "believe" in (11.5.18) (and, therefore, in (11.5.19))?

(11.5.19) & (11.5.20)   ⇒   TOL = (ESTk / h_k^{p+1}) · h∗^{p+1} .

"Optimal timestep" (stepsize prediction):   h∗ = h · (TOL/ESTk)^{1/(p+1)} .        (11.5.21)
C++11 code 11.5.22: Refined local stepsize control for single step methods
2  // Auxiliary function: default norm for an Eigen vector type
3  template <class State>
4  double _norm(const State &y) { return y.norm(); }
5
29  }
30  if (h < hmin) {
31    cerr << "Warning: Failure at t = "
32         << states.back().first
33         << ". Unable to meet integration tolerances without reducing the step"
34         << " size below the smallest value allowed (" << hmin << ") at time t." << endl;
35  }
36  return states;
37  }
Comments on Code 11.5.22 (see the comments on Code 11.5.11 for more explanations):
• Input arguments as for Code 11.5.11, except for p ≙ order of the lower-order discrete evolution.
• line 26: compute the presumably better local stepsize according to (11.5.21),
• line 27: decide whether to repeat the step or to advance,
• line 27: extend the output arrays if the current step has not been rejected.
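The core loop of Code 11.5.22 is not reproduced in this excerpt; its essential difference from Code 11.5.11 is the stepsize update (11.5.21), which boils down to a one-liner. The safety factor and the cap on the increase in the following sketch are common practice, not taken from the lecture code.

#include <algorithm>
#include <cmath>

// Stepsize suggestion according to (11.5.21): p = order of the lower-order discrete
// evolution, est = EST_k, tol = max(ATOL, RTOL*norm(y_k)).
double suggest_stepsize(double h, double est, double tol, unsigned p) {
  double h_star = h * std::pow(tol / est, 1.0 / (p + 1));
  return std::min(0.9 * h_star, 2.0 * h);   // damp the prediction and limit the growth
}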
The name of MATLAB's standard integrator ode45 already indicates the orders of the pair of single step
methods used for adaptive stepsize control:

ode45:   Ψ ≙ RK-method of order 4 ,   Ψ̃ ≙ RK-method of order 5 .

Specifying tolerances for MATLAB's integrators is done as follows:

options = odeset('abstol',atol,'reltol',rtol,'stats','on');
[t,y] = ode45(@(t,x) f(t,x),tspan,y0,options);

(f ≙ function handle, tspan ≙ [t0, T], y0 ≙ y0, t ≙ temporal mesh points tk, y ≙ approximate states yk)
The possibility to pass tolerances to numerical integrators based on adaptive timestepping may tempt
one into believing that they allow one to control the accuracy of the solutions. However, as is clear from
Rem. 11.5.16, these tolerances are applied solely to local error estimates and, inherently, have nothing to
do with global discretization errors, see Ex. 11.5.13.

The absolute/relative tolerances imposed for local-in-time adaptive timestepping do not permit a
prediction of the accuracy of the solution!
For higher order RK-SSMs with a considerable number of stages, computing different sets of increments
(→ Def. 11.4.9) for two methods of different order just for the sake of local-in-time stepsize control would
mean a disproportionate effort.

Embedding idea: use two RK-SSMs based on the same increments, that is, built with the same coefficients
aij, but with different weights bi, see Def. 11.4.9 for the formulas, and with different orders p and p + 1.

The following two embedded RK-SSMs, presented in the form of their extended Butcher schemes, provide
single step methods of orders 4 & 5.
(The two extended Butcher schemes are not reproduced here: each lists the nodes c and the coefficient
matrix A, followed by two rows of weights, one producing the order-4 approximation y1 and one producing
the order-5 approximation ŷ1.)
We test the effect of adaptive stepsize control in MATLAB for the equations of motion describing the planar
movement of a point mass in a conservative force field x ∈ R² ↦ F(x) ∈ R². Let t ↦ y(t) ∈ R² be
the trajectory of the point mass (in the plane).

From Newton's law:   ÿ = F(y) := −2y / ‖y‖₂²   (acceleration = force) .        (11.5.28)
As in Rem. 11.1.23 we can convert the second-order ODE (11.5.28) into an equivalent 1st-order ODE by
introducing the velocity v := ẏ as an extra solution component:
" v #
ẏ
(11.5.28) ⇒ = − 2y . (11.5.29)
v̇ k y k2 2
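A sketch of the right hand side of (11.5.29) as an EIGEN function, with the state u = [y; v] ∈ R⁴ (the function name is illustrative):

#include <Eigen/Dense>

// Right hand side of (11.5.29); u = (y1, y2, v1, v2)^T
Eigen::Vector4d rhs(const Eigen::Vector4d &u) {
  Eigen::Vector2d y = u.head<2>();             // position
  Eigen::Vector2d v = u.tail<2>();             // velocity
  Eigen::Vector4d du;
  du.head<2>() = v;                            // dy/dt = v
  du.tail<2>() = -2.0 * y / y.squaredNorm();   // dv/dt = -2y/||y||_2^2
  return du;
}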
[Two rows of plots over t ∈ [0, 4]: solution components yi(t) (left) and the stepsize ("Zeitschrittweite")
chosen by the adaptive integrator (right), for two different tolerance settings.]
[Two plots in the (y1, y2)-plane: exact trajectory ("Exakte Bahn") and numerical approximation
("Naeherung") for two different tolerance settings.]
Observations:
☞ Fast changes in solution components captured by adaptive approach through very small timesteps.
☞ Completely wrong solution, if tolerance reduced slightly.
In this example we face a rather sensitive dependence of the trajectories on initial states or intermediate
states. Small perturbations at one instance in time can have a massive impact on the solution at later
times. Local stepsize control is powerless to prevent this.
• know the concept of evolution operator for an ODE and its relationship with solutions of associated
initial value problems.
• be able to convert higher-order and non-autonomous ODEs into the form ẏ = f(y).
• know about discrete evolutions and how they induce single step methods (SSMs).
• remember that single step methods converge asymptotically algebraically for stepsize h → 0 and
that the rate of convergence is called the order of the SSM.
• know the general form of explicit Runge-Kutta methods and Butcher schemes.
• understand when adaptive timestep control is essential for meaningful numerical integration of initial
value problems.
• be able to describe the policy of time-local adaptive timestep control for embedded Runge-Kutta
methods.
Chapter 12
Single Step Methods for Stiff Initial Value Problems
Explicit Runge-Kutta methods with stepsize control (→ Section 11.5) seem to be able to provide approxi-
mate solutions for any IVP with good accuracy provided that tolerances are set appropriately.
Everything settled about numerical integration?
In this example we will witness the near failure of a high-order adaptive explicit Runge-Kutta method for a
simple scalar autonomous ODE.
This is a logistic ODE as introduced in Ex. 11.1.5. We try to solve it by means of an explicit adaptive
embedded Runge-Kutta-Fehlberg method (→ Rem. 11.5.25) using the class ode45 from § 11.4.16 (Pre-
processor switch MATLABCOEFF activated).
The following plots have been generated with MATLAB using its built-in adaptive explicit RK-SSM:

[Fig. 424: solution y(t) and approximations yk over t ∈ [0, 1]. Fig. 425: stepsize used by the integrator over t ∈ [0, 1].]

Stepsize control of ode45 running amok!

?   The solution is virtually constant for t > 0.2 and, nevertheless, the integrator uses tiny timesteps
    until the end of the integration interval.
Contents
12.1 Model problem analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
12.2 Stiff Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
12.3 Implicit Runge-Kutta Single Step Methods . . . . . . . . . . . . . . . . . . . . . . 766
12.3.1 The implicit Euler method for stiff IVPs . . . . . . . . . . . . . . . . . . . . . 767
12.3.2 Collocation single step methods . . . . . . . . . . . . . . . . . . . . . . . . . 768
12.3.3 General implicit RK-SSMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
12.3.4 Model problem analysis for implicit RK-SSMs . . . . . . . . . . . . . . . . . 774
12.4 Semi-implicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 780
Supplementary reading. See also [?, Ch. 77], [?, Sect. 11.3.3].
To rule out that what we observed in Ex. 12.0.1 might have been a quirk of the IVP (12.0.2) we conduct
the same investigations for the simple linear, scalar, autonomous IVP (12.1.2).

We use the ode45 class to solve (12.1.2) with the same parameters as in Code 12.0.3. Statistics of the run:
  number of steps: 33
  number of rejected steps: 32
  function calls: 231

[Fig. 426: solution y(t) and approximations yk over t ∈ [0, 1]. Fig. 427: stepsize over t ∈ [0, 1].]
Observation: Though y(t) ≈ 0 for t > 0.1, the integrator keeps on using “unreasonably small” timesteps
even then.
In this section we will discover a simple explanation for the startling behavior of ode45 in Ex. 12.0.1.
The simplest explicit RK-SSM is the explicit Euler method, see Section 11.2.1. We know that it should
converge like O(h) for meshwidth h → 0. In this example we will see that this may be true only for
sufficiently small h, which may be extremely small.
ẏ = f (y) := λy , y(0) = 1 .
✦ We apply the explicit Euler method (11.2.7) with uniform timestep h = 1/N , N ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
[Fig. 428: error at final time T = 1 (Euclidean norm) vs. timestep h for the explicit Euler method applied
to the scalar model problem, for λ ∈ {−10, −30, −60, −90}, with an O(h) reference line.
Fig. 429: exact solution and explicit Euler approximation for a single (large) value of the timestep h.]
✦ Now we look at an IVP for the logistic ODE, see Ex. 11.1.5:
✦ As before, we apply the explicit Euler method (11.2.7) with uniform timestep h = 1/N , N ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
[Fig. 430: error (Euclidean norm) vs. timestep h for the explicit Euler method applied to the logistic ODE,
for λ ∈ {10, 30, 60, 90}. Fig. 431: exact solution and explicit Euler approximation over t ∈ [0, 1].]
For large timesteps h we also observe oscillatory blow-up of the sequence (yk )k .
Deeper analysis:
For y ≈ 1: f (y) ≈ λ(1 − y) ➣ If y(t0 ) ≈ 1, then the solution of the IVP will behave like the solution
of ẏ = λ(1 − y), which is a linear ODE. Similarly, z(t) := 1 − y(t) will behave like the solution of the
“decay equation” ż = −λz. Thus, around the stationary point y = 1 the explicit Euler method behaves
like it did for ẏ = λy in the vicinity of the stationary point y = 0; it grossly overshoots.
The phenomenon observed in the two previous examples is accessible to a remarkably simple rigorous
analysis: motivated by the considerations in Ex. 12.1.3 we study the explicit Euler method (11.2.7) for the

linear model problem:   ẏ = λy ,   y(0) = y0 ,   with λ ≪ 0 ,        (12.1.5)

which has the exponentially decaying exact solution

y(t) = y0 exp(λt) → 0   for t → ∞ .

Recall the recursion of the explicit Euler method with uniform timestep h > 0 for (12.1.5):

(11.2.7) for f(y) = λy:   yk+1 = yk (1 + λh) .        (12.1.6)

We easily get a closed-form expression for the approximations yk:

yk = y0 (1 + λh)^k   ⇒   |yk| →  0 , if λh > −2 (qualitatively correct) ,
                                  ∞ , if λh < −2 (qualitatively wrong) .

Only if |λ|h < 2 do we obtain a decaying solution by the explicit Euler method!
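The threshold λh = −2 is easy to observe numerically; a minimal sketch (the values of λ and h are chosen for illustration only):

#include <cmath>
#include <cstdio>

// Explicit Euler for y' = lambda*y, y(0) = 1: return y_N at T = 1 for stepsize h.
// The modulus of y_N blows up exactly when |1 + lambda*h| > 1.
double explicit_euler_final(double lambda, double h) {
  double y = 1.0;
  const int N = static_cast<int>(std::round(1.0 / h));
  for (int k = 0; k < N; ++k) y *= (1.0 + lambda * h);
  return y;
}

int main() {
  const double lambda = -60.0;
  const double hs[] = {0.2, 0.05, 0.02, 0.005};   // lambda*h = -12, -3, -1.2, -0.3
  for (double h : hs)
    std::printf("h = %6.3f :  y_N = %e\n", h, explicit_euler_final(lambda, h));
  return 0;
}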
Could it be that the timestep control is desperately trying to enforce the qualitatively correct behavior of the
numerical solution in Ex. 12.1.3? Let us examine how the simple stepsize control of Code 11.5.11 fares
for model problem (12.1.5):
[Fig. 432, Fig. 433: behaviour of the simple stepsize control of Code 11.5.11 for the model problem (12.1.5):
approximations and errors (left) and chosen timesteps (right) over t ∈ [0, 2].]
Observation: in fact, stepsize control enforces small timesteps even if y(t) ≈ 0 and persistently triggers
rejections of timesteps. This is necessary to prevent overshooting in the Euler method, which contributes
to the estimate of the one-step error.
We see the purpose of stepsize control thwarted, because after only a very short time the solution is
almost zero and then, in fact, large timesteps should be chosen.
Are these observations a particular “flaw” of the explicit Euler method? Let us study the behavior of another
simple explicit Runge-Kutta method applied to the linear model problem.
Example 12.1.9 (Explicit trapezoidal method for decay equation → [?, Ex. 11.29])
Recall the recursion for the explicit trapezoidal method derived in Ex. 11.4.4. Apply it to the model problem
(12.1.5), that is, the scalar autonomous ODE with right hand side function f(y) = λy, λ < 0:
the sequence of approximations generated by the explicit trapezoidal rule can be expressed in
closed form as

yk = S(hλ)^k y0 ,   k = 0, . . . , N ,   with   S(z) := 1 + z + ½z² .        (12.1.11)

Since |S(hλ)| < 1  ⇔  −2 < hλ < 0, qualitatively correct decay behaviour of (yk)k is obtained only under
the timestep constraint

h ≤ 2/|λ| .        (12.1.12)
(12.1.13) Model problem analysis for general explicit Runge-Kutta single step methods

Apply the explicit Runge-Kutta method (→ Def. 11.4.9) encoded by the Butcher scheme with coefficients
A, b, c to the autonomous scalar linear ODE (12.1.5) (ẏ = λy). We write down the equations for the increments and y1
from Def. 11.4.9 for f(y) := λy and then convert the resulting system of equations into matrix form:

ki = λ(y0 + h ∑_{j=1}^{i−1} aij kj) , i = 1, . . . , s ,
y1 = y0 + h ∑_{i=1}^{s} bi ki ,
   ⇒   [ I − zA , 0 ; −z b⊤ , 1 ] [ k ; y1 ] = y0 [ 1 ; 1 ] ,        (12.1.14)

where 1 = [1, . . . , 1]⊤ ∈ R^s in the first block row, k ∈ R^s denotes the vector [k1, . . . , ks]⊤/λ of increments,
and z := λh. Next we apply block Gaussian elimination (→ Rem. 2.3.11) to solve for y1 and obtain
Theorem 12.1.17. Stability function of explicit Runge-Kutta methods → [?, Thm. 77.2], [?, Sect. 11.8.4]
The discrete evolution Ψhλ of an explicit s-stage Runge-Kutta single step method (→ Def. 11.4.9)
with Butcher scheme (c, A, b) (see (11.4.11)) for the ODE ẏ = λy amounts to a multiplication with
the number

S(z) := 1 + z b⊤(I − zA)^{−1} 1 = det(I − zA + z 1 b⊤) ,   z := λh ,   1 = [1, . . . , 1]⊤ ∈ R^s ,

that is, y1 = S(z) y0.
From Thm. 12.1.17 and their Butcher schemes we can instantly compute the stability functions of explicit
RK-SSMs. We do this for a few methods whose Butcher schemes were listed in Ex. 11.4.13; a small
numerical cross-check is sketched below.

• Explicit Euler method (11.2.7):
    0 | 0
      | 1
  ➣  S(z) = 1 + z .

• Explicit trapezoidal method (11.4.6):
    0 | 0    0
    1 | 1    0
      | 1/2  1/2
  ➣  S(z) = 1 + z + ½z² .

• Classical RK4 method:
    0   | 0    0    0    0
    1/2 | 1/2  0    0    0
    1/2 | 0    1/2  0    0
    1   | 0    0    1    0
        | 1/6  2/6  2/6  1/6
  ➣  S(z) = 1 + z + ½z² + (1/6)z³ + (1/24)z⁴ .
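The listed polynomials can be cross-checked numerically from the Butcher data via Thm. 12.1.17; a small EIGEN sketch (for explicit methods det(I − zA) = 1, so the quotient reduces to the determinant in the numerator):

#include <Eigen/Dense>
#include <complex>

// Stability function S(z) = det(I - z*A + z*1*b^T) / det(I - z*A) of an RK-SSM
// with Butcher coefficients A and b, evaluated at a complex argument z.
std::complex<double> stabfn(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                            std::complex<double> z) {
  const int s = A.rows();
  Eigen::MatrixXcd I = Eigen::MatrixXcd::Identity(s, s);
  Eigen::MatrixXcd Ac = A.cast<std::complex<double>>();
  Eigen::VectorXcd bc = b.cast<std::complex<double>>();
  Eigen::VectorXcd one = Eigen::VectorXcd::Ones(s);
  return (I - z * Ac + z * one * bc.transpose()).determinant() /
         (I - z * Ac).determinant();
}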
These examples confirm an immediate consequence of the determinant formula for the stability function
S ( z ).
For a consistent (→ Def. 11.3.10) s-stage explicit Runge-Kutta single step method according to
Def. 11.4.9 the stability function S defined by (12.1.18) is a non-constant polynomial of degree ≤ s:
S ∈ Ps .
Φh y = e^{λh} y   ←→   Ψh y = S(λh) y .

In light of Ψ ≈ Φ, see (11.3.3), we expect that S(z) approximates exp(z):

Let S denote the stability function of an s-stage explicit Runge-Kutta single step method of order
q ∈ N. Then S(z) − exp(z) = O(z^{q+1}) for z → 0.

This means that the lowest q + 1 coefficients of S(z) must be equal to the first coefficients of the exponential
series:

S(z) = ∑_{j=0}^{q} (1/j!) z^j + z^{q+1} p(z)   with some p ∈ P_{s−q−1} .
In § 12.1.13 we established that for the sequence (yk)_{k=0}^∞ produced by an explicit Runge-Kutta single step
method applied to the linear scalar model ODE ẏ = λy, λ ∈ R, with uniform timestep h > 0, it holds that

(yk)_{k=0}^∞ non-increasing              ⇔   |S(λh)| ≤ 1 ,
(yk)_{k=0}^∞ exponentially increasing    ⇔   |S(λh)| > 1 .        (12.1.27)

So, for any λ ≠ 0 there will be a threshold hmax > 0 so that |yk| → ∞ whenever h > hmax.
Reversing the argument we arrive at a timestep constraint, as already observed for the explicit Euler
method in § 12.1.4.
Only if one ensures that |λh| is sufficiently small can one avoid exponentially increasing approximations
yk (qualitatively wrong for λ < 0) when applying an explicit RK-SSM to the model problem
(12.1.5) with uniform timestep h > 0.
For λ ≪ 0 this stability induced timestep constraint may force h to be much smaller than required by
demands on accuracy : in this case timestepping becomes inefficient.
Ex. 12.0.1, Ex. 12.1.8 send the message that local-in-time stepsize control as discussed in Section 11.5
selects timesteps that avoid blow-up, with a hefty price tag however in terms of computational cost and
poor accuracy.
Objection: simple linear scalar IVP (12.1.5) may be an oddity rather than a model problem: the weakness
of explicit Runge-Kutta methods discussed above may be just a peculiar response to an unusual situation.
Let us extend our investigations to systems of linear ODEs, d > 1.
A generic linear ordinary differential equation on state space R^d has the form

ẏ = My ,   M ∈ C^{d,d} .        (12.1.31)
As explained in [?, Sect. 8.1], (12.1.31) can be solved by diagonalization: If we can find a regular matrix
V ∈ C^{d,d} such that

MV = VD   with diagonal matrix   D = diag(λ1, . . . , λd) ∈ C^{d,d} .        (12.1.32)

The columns of V are a basis of eigenvectors of M, the λj ∈ C, j = 1, . . . , d, are the associated eigenvalues
of M, see Def. 9.1.1.

The idea behind diagonalization is the transformation of (12.1.31) into d decoupled scalar linear ODEs:

ẏ = My   ⟶ (z(t) := V^{−1}y(t)) ⟶   ż = Dz   ↔   żi = λi zi , i = 1, . . . , d ,   since M = VDV^{−1} .
ü + αu̇ + βu = g(t) ,   with coefficients α := (RC)^{−1} ,  β := (LC)^{−1} ,  g(t) := α U̇s(t) .

[Fig. 435: circuit diagram of the RCL circuit driven by the voltage source Us(t).]
We integrate IVPs for this ODE by means of M ATLAB’s adaptive integrator ode45.
[Figure: u(t) and v(t)/100 computed by ode45.]

R = 100 Ω, L = 1 H, C = 1 µF, Us(t) = 1 V · sin(t), u(0) = v(0) = 0 ("switch on")
ode45 statistics: 17897 successful steps, 1090 failed attempts, 113923 function evaluations
Maybe the time-dependent right hand side due to the time-harmonic excitation severely affects ode45?
Let us try a constant exciting voltage:
[Fig. 437: u(t) and v(t)/100 over t ∈ [0, 6] for the RCL circuit with R = 100 Ω, L = 1 H, C = 1 µF,
constant excitation Us(t) = 1 V, u(0) = v(0) = 0 ("switch on").]

ode45 statistics: tiny timesteps despite a virtually constant solution!
We make the same observation as in Ex. 12.0.1 and Ex. 12.1.8: the local-in-time stepsize control of ode45
(→ Section 11.5) enforces extremely small timesteps although the solution is almost constant except at t = 0.
To understand the structure of the solutions for this transient circuit example, let us apply the diagonaliza-
tion technique from § 12.1.30 to the linear ODE
ẏ = [ 0  1 ; −β  −α ] y =: My ,   y(0) = y0 ∈ R² .        (12.1.37)

We can obtain the general solution of ẏ = My, M ∈ R^{2,2}, by diagonalization of M (if possible):

MV = M(v1, v2) = (v1, v2) diag(λ1, λ2) ,        (12.1.38)

where v1, v2 ∈ R²\{0} are the eigenvectors of M, and λ1, λ2 are the eigenvalues of M, see Def. 9.1.1.
For the latter we find

λ_{1/2} = ½(−α ± D) ,   D := √(α² − 4β)  if α² ≥ 4β ,   D := ı√(4β − α²)  if α² < 4β .

Note that the eigenvalues have a non-vanishing imaginary part in the setting of the experiment.
Recall the discrete evolution of the explicit Euler method (11.2.7) for the ODE ẏ = My, M ∈ R^{d,d}:
yk+1 = yk + hMyk.

As in § 12.1.30 we assume that M can be diagonalized, that is, (12.1.32) holds: V^{−1}MV = D with a
diagonal matrix D ∈ C^{d,d} containing the eigenvalues of M on its diagonal. Next, apply the decoupling by
diagonalization idea to the recursion of the explicit Euler method. With zk := V^{−1}yk:

V^{−1}yk+1 = V^{−1}yk + h (V^{−1}MV)(V^{−1}yk)   ⇔   (zk+1)i = (zk)i + hλi (zk)i        (12.1.42)

(≙ explicit Euler step for żi = λi zi).
Crucial insight:

The explicit Euler method generates uniformly bounded solution sequences (yk)_{k=0}^∞ for ẏ = My
with diagonalizable matrix M ∈ R^{d,d} with eigenvalues λ1, . . . , λd, if and only if it generates uniformly
bounded sequences for all the scalar ODEs ż = λi z, i = 1, . . . , d.
So far we conducted the model problem analysis under the premise λ < 0.
However, in Ex. 12.1.35 we face λ_{1/2} = −½α ± ½ı√(4β − α²) (complex eigenvalues!). Let us now
examine how the explicit Euler method and even general explicit RK-methods respond to them.
The model problem analysis from Ex. 12.1.3 and Ex. 12.1.9 can be extended verbatim to the case of λ ∈ C.
It yields the following insight for the explicit Euler method and λ ∈ C:
The sequence generated by the explicit Euler method (11.2.7) for the model problem (12.1.5) decays
(exponentially) if and only if |1 + λh| < 1.

[Fig. 438: the disk {z ∈ C : |1 + z| < 1} in the complex plane (green). The green region marks the values
of λh for which the explicit Euler method will produce exponentially decaying solutions.]
Now we can conjecture what happens in Ex. 12.1.35: the eigenvalues λ_{1/2} = −½α ± ı√(β − ¼α²) of
M have a very large (in modulus) negative real part. Since ode45 can be expected to behave as if it
integrated ż = λ2 z, it faces a severe timestep constraint, if exponential blow-up is to be avoided, see
Ex. 12.1.3. Thus stepsize control must resort to tiny timesteps.
(12.1.44) Extended model problem analysis for explicit Runge-Kutta single step methods

We apply an explicit s-stage RK-SSM (→ Def. 11.4.9) described by the Butcher scheme with coefficients
A, b, c to the autonomous linear ODE ẏ = My, M ∈ C^{d,d}, and obtain (for the first step with timestep size h > 0)

kℓ = M(y0 + h ∑_{j=1}^{ℓ−1} aℓj kj) ,  ℓ = 1, . . . , s ,       y1 = y0 + h ∑_{ℓ=1}^{s} bℓ kℓ .        (12.1.45)
Now assume that M can be diagonalized, that is, (12.1.32) holds: V^{−1}MV = D with a diagonal matrix
D ∈ C^{d,d} containing the eigenvalues λi ∈ C of M on its diagonal. Then apply the substitutions

k̂ℓ := V^{−1}kℓ , ℓ = 1, . . . , s ,       ŷk := V^{−1}yk , k = 0, 1 ,

to (12.1.45). We infer that, if (yk)k is the sequence produced by an explicit RK-SSM applied to ẏ = My, then

yk = V [ yk^[1] , . . . , yk^[d] ]⊤ ,

where (yk^[i])k is the sequence generated by the same RK-SSM with the same sequence of timesteps for
the IVP ẏ = λi y, y(0) = (V^{−1}y0)i.
The RK-SSM generates uniformly bounded solution sequences (yk)_{k=0}^∞ for ẏ = My with diagonalizable
matrix M ∈ R^{d,d} with eigenvalues λ1, . . . , λd, if and only if it generates uniformly bounded
sequences for all the scalar ODEs ż = λi z, i = 1, . . . , d.
Hence, understanding the behavior of RK-SSM for autonomous scalar linear ODEs ẏ = λy with λ ∈ C
is enough to predict their behavior for general autonomous linear systems of ODEs.
Theorem 12.1.48. (Absolute) stability of explicit RK-SSM for linear systems of ODEs
The sequence (yk )k of approximations generated by an explicit RK-SSM (→ Def. 11.4.9) with
stability function S (defined in (12.1.18)) applied to the linear autonomous ODE ẏ = My, M ∈ C d,d ,
with uniform timestep h > 0 decays exponentially for every initial state y0 ∈ C d , if and only if
|S(λi h)| < 1 for all eigenvalues λi of M.
for any solution of ẏ = My. This is obvious from the representation formula (12.1.33).
We consider an explicit Runge-Kutta single step method with stability function S for the model linear scalar
IVP ẏ = λy, y(0) = y0, λ ∈ C. From Thm. 12.1.17 we learn that for uniform stepsize h > 0 we have
yk = S(λh)^k y0, and conclude that the sequence decays exponentially if and only if |S(λh)| < 1.
Hence, the modulus |S(λh)| tells us for which combinations of λ and stepsize h we achieve exponential
decay yk → 0 for k → ∞, which is the desirable behaviour of the approximations for Re λ < 0.
Let the discrete evolution Ψ for a single step method applied to the scalar linear ODE ẏ = λy,
λ ∈ C, be of the form

Ψ^h y = S(λh) y ,   y ∈ C , h > 0 ,

with a function S : C → C. Then the region of (absolute) stability of the single step method is given
by

SΨ := {z ∈ C : |S(z)| < 1} ⊂ C .
Of course, by Thm. 12.1.17, in the case of explicit RK-SSM the function S will coincide with their stability function
from (12.1.18).
We can easily combine the statement of Thm. 12.1.48 with the concept of a region of stability and conclude
that an explicit RK-SSM will generate exponentially decaying solutions for the linear ODE ẏ = My, M ∈
C^{d,d}, for every initial state y0 ∈ C^d, if and only if λi h ∈ SΨ for all eigenvalues λi of M.
The green domains ⊂ C depict the bounded regions of stability for some RK-SSM from Ex. 11.4.13.
[Three plots: the bounded regions of stability (green) in the complex z-plane, Re z on the horizontal and
Im z on the vertical axis, for the explicit RK-SSMs from Ex. 11.4.13.]
In general we have for a consistent RK-SSM (→ Def. 11.3.10) that its stability function satisfies S(z) =
1 + z + O(z²) for z → 0. Therefore, SΨ ≠ ∅ and the imaginary axis is tangent to SΨ at z = 0.
This section will reveal that the behavior observed in Ex. 12.0.1 and Ex. 12.1.3 is typical for a large class
of problems and that the model problem (12.1.5) really represents a “generic case”. This justifies the
attention paid to linear model problem analysis in Section 12.1.
In Ex. 11.5.1 we already saw an ODE model for the dynamics of a chemical reaction. Now we study an
abstract reaction.
reaction:   A + B ⇌ C   (forward rate k1, backward rate k2; fast reaction) ,
            A + C ⇌ D   (forward rate k3, backward rate k4; slow reaction) .        (12.2.2)

If cA(0) > cB(0) ➢ the 2nd reaction determines the overall long-term reaction dynamics.

Mathematical model: non-linear ODE involving the concentrations y(t) = (cA(t), cB(t), cC(t), cD(t))⊤:

ẏ := d/dt (cA, cB, cC, cD)⊤ = f(y) :=
( −k1 cA cB + k2 cC − k3 cA cC + k4 cD ,
  −k1 cA cB + k2 cC ,
   k1 cA cB − k2 cC − k3 cA cC + k4 cD ,
   k3 cA cC − k4 cD )⊤ .        (12.2.3)
[Fig. 439: concentrations cA(t), cC(t) and the corresponding approximations produced by the adaptive
integrator over t ∈ [0, 1]. Fig. 440: timestep used by the integrator over t ∈ [0, 1].]
Observations: After a fast initial transient phase, the solution shows only slow dynamics. Nevertheless,
the explicit adaptive integrator ode113 insists on using a tiny timestep. It behaves very much like ode45
in Ex. 12.0.1.
(12.2.7) provides a solution even for λ ≠ 0, if ‖y(0)‖₂ = 1, because in this case the term
λ(1 − ‖y‖₂) y vanishes identically on the solution trajectory.
[Fig. 441, Fig. 442: solution trajectories in the (y1, y2)-plane; they stay on the unit circle.]

We study the response of ode45 to different choices of λ with initial state y0 = (1, 0)⊤. According to the
above considerations this initial state should completely "hide the impact of λ from our view".
[Fig. 443: "ode45 for attractive limit cycle": solution components y1,k, y2,k (left axis) and timestep (right axis) over t ∈ [0, 7].
Fig. 444: "ode45 for rigid motion": solution components and timestep over t ∈ [0, 7].]
Thus, the term of the right hand side, which is multiplied by λ will always vanish on the exact solution
trajectory, which stays on the unit circle.
Nevertheless, ode45 is forced to use tiny timesteps by the mere presence of this term!
We want to find criteria that allow us to predict the massive problems haunting explicit single step methods
in the case of the non-linear IVPs of Ex. 12.0.1, Ex. 12.2.1, and Ex. 12.2.5. Recall that for linear IVPs of
the form ẏ = My, y(0) = y0, the model problem analysis of Section 12.1 tells us that, given knowledge of
the region of stability of the timestepping scheme, the eigenvalues of the matrix M ∈ C^{d,d} provide full
information about the timestep constraint we are going to face. Refer to Thm. 12.1.48 and § 12.1.49.
We start with a “phenomenological notion”, just a keyword to refer to the kind of difficulties presented by
the IVPs of Ex. 12.0.1, Ex. 12.2.1, Ex. 12.1.8, and Ex. 12.2.5.
z(0) = 0 ,   ż = f(y∗ + z) = f(y∗) + D f(y∗)z + R(y∗, z) ,   with ‖R(y∗, z)‖ = O(‖z‖²) .

This is obtained by Taylor expansion of f at y∗, see [?, Satz 7.5.2]. Hence, in a neighbourhood of a state
y∗ on a solution trajectory t ↦ y(t), the deviation z(t) = y(t) − y∗ satisfies

ż ≈ f(y∗) + D f(y∗)z .        (12.2.11)

The short-time evolution of y with y(0) = y∗ is approximately governed by the affine-linear ODE

ẏ = f(y∗) + D f(y∗)(y − y∗) .        (12.2.12)
We consider one step of a general s-stage RK-SSM according to Def. 11.4.9 for the autonomous ODE
ẏ = f(y), with smooth right hand side function f : D ⊂ R^d → R^d:

ki = f(y0 + h ∑_{j=1}^{i−1} aij kj) , i = 1, . . . , s ,       y1 = y0 + h ∑_{i=1}^{s} bi ki .
We perform linearization at y∗ := y0 and ignore all terms that are at least quadratic in the timestep size h:

ki ≈ f(y∗) + D f(y∗) h ∑_{j=1}^{i−1} aij kj , i = 1, . . . , s ,       y1 = y0 + h ∑_{i=1}^{s} bi ki .

The defining equations for the same RK-SSM applied to

ż = Mz + b ,   M := D f(y∗) ∈ R^{d,d} ,   b := f(y∗) ,

which agrees with (12.2.12) after the substitution z(t) = y(t) − y∗, are

ki = b + Mh ∑_{j=1}^{i−1} aij kj , i = 1, . . . , s ,       y1 = y0 + h ∑_{i=1}^{s} bi ki .

We find that for small timesteps

the discrete evolution of the RK-SSM for ẏ = f(y) in the state y∗ is close to the discrete
evolution of the same RK-SSM applied to the linearization (12.2.12) of the ODE in y∗.
wk = yk − y0 + M^{−1}b .

➣ The analysis of the behaviour of an RK-SSM for an affine-linear ODE can be reduced to understanding
its behaviour for a linear ODE with the same matrix.
for small timestep the behavior of an explicit RK-SSM applied to ẏ = f(y) close to the state y∗
is determined by the eigenvalues of the Jacobian D f(y∗ ).
In particular, if D f(y∗ ) has at least one eigenvalue whose modulus is large, then an exponential drift-off
of the approximate states yk away from y∗ can only be avoided for sufficiently small timestep, again a
timestep constraint.
An initial value problem for an autonomous ODE ẏ = f(y) will probably be stiff, if, for substantial
periods of time,

min{Re λ : λ ∈ σ(D f(y(t)))} ≪ 0 ,        (12.2.15)
max{Re λ : λ ∈ σ(D f(y(t)))} is of small modulus ,        (12.2.16)

where t ↦ y(t) is the solution trajectory and σ(M) is the spectrum of the matrix M, see Def. 9.1.1.

The condition (12.2.16) has to be read as "the real parts of all eigenvalues are below a bound of small
modulus". If this is not the case, then the exact solution will experience blow-up. It will change drastically
over very short periods of time and small timesteps will be required anyway in order to resolve this.
Thus, for λ ≫ 1, D f(y(t)) will always have an eigenvalue with large negative real part, whereas
the other eigenvalue is close to zero: the IVP is stiff.
Often one can already tell from the expected behavior of the solution of an IVP, which is often clear from
the modeling context, that one has to brace for stiffness.
Explicit Runge-Kutta single step methods cannot escape tight timestep constraints for stiff IVPs, which may
render them inefficient, see § 12.1.49. In this section we augment the class of Runge-Kutta
methods by timestepping schemes that can cope well with stiff IVPs.
We revisit the setting of Ex. 12.1.3 and again consider Euler methods for the decay IVP
ẏ = λy , y(0) = 1 , λ < 0 .
We apply both the explicit Euler method (11.2.7) and the implicit Euler method (11.2.13) with uniform
timesteps h = 1/N , N ∈ {5, 10, 20, 40, 80, 160, 320, 640} and monitor the error at final time T = 1 for
different values of λ.
Explicit Euler method (11.2.7) vs. implicit Euler method (11.2.13):

[Fig. 445: error of the explicit Euler method at final time T = 1 (Euclidean norm) vs. timestep h for
λ ∈ {−10, −30, −60, −90}: blow-up of yk for large timesteps h.
Fig. 446: error of the implicit Euler method at final time T = 1 vs. timestep h for the same values of λ,
together with an O(h) reference line: stable for all timesteps h > 0!]
We observe onset of convergence of the implicit Euler method already for large timesteps h.
We follow the considerations of § 12.1.4 and consider the implicit Euler method (11.2.13) for the model
problem (12.1.5): the recursion yk+1 = yk + hλ yk+1 yields yk = (1 − λh)^{−k} y0, and |1 − λh| > 1 for
every h > 0 whenever Re λ < 0.

No timestep constraint: qualitatively correct behaviour of (yk)k for Re λ < 0 and any h > 0!
As in § 12.1.41 this analysis can be extended to linear systems of ODEs ẏ = My, M ∈ C d,d , by means
of diagonalization.
As in § 12.1.30 and § 12.1.41 we assume that M can be diagonalized, that is, (12.1.32) holds: V^{−1}MV =
D with a diagonal matrix D ∈ C^{d,d} containing the eigenvalues of M on its diagonal. Next, apply the
decoupling by diagonalization idea to the recursion of the implicit Euler method. With zk := V^{−1}yk:

V^{−1}yk+1 = V^{−1}yk + h (V^{−1}MV)(V^{−1}yk+1)   ⇔   (zk+1)i = 1/(1 − λi h) · (zk)i        (12.3.6)

(≙ implicit Euler step for żi = λi zi).

Crucial insight:

For any timestep, the implicit Euler method generates exponentially decaying solution sequences
(yk)_{k=0}^∞ for ẏ = My with diagonalizable matrix M ∈ R^{d,d} with eigenvalues λ1, . . . , λd, if Re λi < 0
for all i = 1, . . . , d.
Thus we expect that the implicit Euler method will not face stability induced timestep constraints for stiff
problems (→ Notion 12.2.9).
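For a linear system ẏ = My a single implicit Euler step requires the solution of a linear system of equations with the matrix I − hM; a minimal EIGEN sketch (for a general non-linear right hand side a non-linear system has to be solved instead, see below):

#include <Eigen/Dense>

// One implicit Euler step y_new = (I - h*M)^{-1} * y for dy/dt = M*y.
// When many steps with the same h are taken, the LU decomposition of I - h*M
// should be computed once and reused.
Eigen::VectorXd implicit_euler_step(const Eigen::MatrixXd &M,
                                    const Eigen::VectorXd &y, double h) {
  const int d = M.rows();
  Eigen::MatrixXd A = Eigen::MatrixXd::Identity(d, d) - h * M;
  return A.partialPivLu().solve(y);
}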
Unfortunately the implicit Euler method is of first order only, see Ex. 11.3.18. This section presents an
algorithm for designing higher order single step methods generalizing the implicit Euler method.
Setting: We consider the general ordinary differential equation ẏ = f(t, y), f : I × D → R d locally
Lipschitz continuous, which guarantees the local existence of unique solutions of initial value problems,
see Thm. 11.1.32.
We define the single step method through specifying the first step y0 = y(t0 ) → y1 ≈ y(t1 ), where
y0 ∈ D is the initial step at initial time t0 ∈ I . We assume that the exact solution trajectory t 7→ y(t)
exists on [t0 , t1 ]. Use as a timestepping scheme on a temporal mesh (→ § 11.2.2) in the sense of
Def. 11.3.5 is straightforward.
Our choice (the "standard option"): the (componentwise) polynomial trial space V = (Ps)^d.

Recalling dim Ps = s + 1 from Thm. 5.2.2 we see that our choice makes the number N := d(s + 1) of
collocation conditions match the dimension of the trial space V.
Now we want to derive a concrete representation for the polynomial yh. We draw on concepts introduced
in Section 5.2.2. We define the collocation points as t0 + cj h, j = 1, . . . , s, for (normalized) collocation
nodes 0 ≤ c1 < c2 < · · · < cs ≤ 1.

In each of its d components, the derivative ẏh is a polynomial of degree s − 1: ẏh ∈ (Ps−1)^d. Hence, it
has the following representation, compare (5.2.13):

ẏh(t0 + τh) = ∑_{j=1}^{s} ẏh(t0 + cj h) Lj(τ) .        (12.3.10)

This yields the following formulas for the computation of y1, which characterize the s-stage collocation
single step method induced by the (normalized) collocation points cj ∈ [0, 1], j = 1, . . . , s:

ki = f(t0 + ci h, y0 + h ∑_{j=1}^{s} aij kj) ,   aij := ∫_0^{ci} Lj(τ) dτ ,
y1 := yh(t1) = y0 + h ∑_{i=1}^{s} bi ki ,        bi := ∫_0^{1} Li(τ) dτ .        (12.3.11)
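The coefficients aij and bi in (12.3.11) depend only on the nodes cj. A small EIGEN sketch that computes them by determining the monomial coefficients of each Lagrange polynomial Lj from a Vandermonde system and integrating exactly; this is adequate for the small numbers of stages considered here (the helper name is illustrative):

#include <Eigen/Dense>
#include <cmath>
#include <utility>

// Collocation coefficients a_ij = int_0^{c_i} L_j(t) dt and b_i = int_0^1 L_i(t) dt
// of (12.3.11) for given nodes c_1, ..., c_s in [0,1].
std::pair<Eigen::MatrixXd, Eigen::VectorXd> colloc_coeffs(const Eigen::VectorXd &c) {
  const int s = c.size();
  Eigen::MatrixXd V(s, s);                      // Vandermonde matrix, V(k,m) = c_k^m
  for (int k = 0; k < s; ++k)
    for (int m = 0; m < s; ++m) V(k, m) = std::pow(c(k), m);
  Eigen::MatrixXd A(s, s);
  Eigen::VectorXd b(s);
  for (int j = 0; j < s; ++j) {
    Eigen::VectorXd e = Eigen::VectorXd::Zero(s);
    e(j) = 1.0;                                 // L_j(c_k) = delta_{jk}
    Eigen::VectorXd x = V.partialPivLu().solve(e);   // monomial coefficients of L_j
    b(j) = 0.0;
    for (int m = 0; m < s; ++m) b(j) += x(m) / (m + 1);
    for (int i = 0; i < s; ++i) {
      A(i, j) = 0.0;
      for (int m = 0; m < s; ++m) A(i, j) += x(m) * std::pow(c(i), m + 1) / (m + 1);
    }
  }
  return std::make_pair(A, b);
}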
Note that, since arbitrary y0 ∈ D, t0 , t1 ∈ I were admitted, this defines a discrete evolution Ψ : I × I ×
D → R d by Ψt0 ,t1 y0 := yh (t1 ).
Note that (12.3.11) represents a generically non-linear system of s · d equations for the s · d components
of the vectors ki , i = 1, . . . , s. Usually, it will not be possible to obtain ki by a fixed number of evaluations
of f. For this reason the single step methods defined by (12.3.11) are called implicit.
With similar arguments as in Rem. 11.2.14 one can prove that for sufficiently small |t1 − t0 | a unique
solution for k1 , . . . , ks can be found.
Clearly, in the case d = 1, f(t, y) = f(t), y0 = 0, the computation of y1 boils down to the evaluation of a
quadrature formula on [t0, t1], because from (12.3.11) we get

y1 = h ∑_{i=1}^{s} bi f(t0 + ci h) ,   bi := ∫_0^{1} Li(τ) dτ ,        (12.3.14)

which is a polynomial quadrature formula (7.2.2) on [0, 1] with nodes cj transformed to [t0, t1] according
to (7.1.5).
We consider the scalar logistic ODE (11.1.6) with parameter λ = 10 (→ only mildly stiff), initial state
y0 = 0.01, T = 1.
Numerical integration by timestepping with uniform timestep h based on collocation single step method
(12.3.11).
➊ Equidistant collocation points cj = j/(s+1), j = 1, . . . , s.

We observe algebraic convergence with the empiric rates
s = 1 : p = 1.96 ,   s = 2 : p = 2.03 ,   s = 3 : p = 4.00 ,   s = 4 : p = 4.04 .

[Fig. 447: maximal error max_k |yh(tk) − y(tk)| vs. h for s = 1, . . . , 4.]

In this case we conclude the following (empiric) order (→ Def. 11.3.21) of the collocation single step
method:

(empiric) order = s for even s ,   s + 1 for odd s .
➋ Gauss points in [0, 1] as normalized collocation points cj, j = 1, . . . , s.

We observe algebraic convergence with the empiric rates
s = 1 : p = 1.96 ,   s = 2 : p = 4.01 ,   s = 3 : p = 6.00 ,   s = 4 : p = 8.02 .

[Fig. 448: maximal error max_k |yh(tk) − y(tk)| vs. h for s = 1, . . . , 4.]

Obviously, for the (empiric) order (→ Def. 11.3.21) of the Gauss collocation single step method it holds that

(empiric) order = 2s .

Note that the 1-stage Gauss collocation single step method is the implicit midpoint method from Section 11.2.3.
What we have observed in Exp. 12.3.15 reflects a fundamental result on collocation single step methods
as defined in (12.3.11).
Theorem 12.3.17. Order of collocation single step method [?, Satz 6.40]
Provided that f ∈ C p ( I × D ), the order (→ Def. 11.3.21) of an s-stage collocation single step
method according to (12.3.11) agrees with the order (→ Def. 7.3.1) of the quadrature formula on
[0, 1] with nodes c j and weights b j , j = 1, . . . , s.
➣ By Thm. 7.3.22 the s-stage Gauss collocation single step method whose nodes c j are chosen as the s
Gauss points on [0, 1] is of order 2s.
The notations in (12.3.11) have deliberately been chosen to allude to Def. 11.4.9: one merely has to let the
sum in the formula for the increments run up to s in order to capture (12.3.11).
Definition 12.3.18. General Runge-Kutta single step method (cf. Def. 11.4.9)
For coefficients bi, aij ∈ R and ci := ∑_{j=1}^{s} aij, i, j = 1, . . . , s, s ∈ N, the recursion

ki := f(t0 + ci h, y0 + h ∑_{j=1}^{s} aij kj) , i = 1, . . . , s ,       y1 := y0 + h ∑_{i=1}^{s} bi ki ,

defines an s-stage Runge-Kutta single step method (RK-SSM); the ki ∈ R^d are called increments.

Note: the computation of the increments ki may now require the solution of (non-linear) systems of equations of
size s · d (→ "implicit" method, cf. Rem. 12.3.12).
Many of the techniques and much of the theory discussed for explicit RK-SSMs carry over to general
(implicit) Runge-Kutta single step methods:
• Sufficient condition for consistence from Cor. 11.4.12
• Algebraic convergence for meshwidth h → 0 and the related concept of order (→ Def. 11.3.21)
• Embedded methods and algorithms for adaptive stepsize control from Section 11.5
This leads to the equivalent defining equations in "stage form" for an implicit RK-SSM:

gi = h ∑_{j=1}^{s} aij f(t0 + cj h, y0 + gj) ,       y1 = y0 + h ∑_{i=1}^{s} bi f(t0 + ci h, y0 + gi) .        (12.3.23)
We reformulate the increment equations in stage form (12.3.23) as a non-linear system of equations in
standard form F(x) = 0. The unknowns are the s · d components of the stage vectors gi, i = 1, . . . , s, as
defined in (12.3.22). With g := [g1; . . . ; gs] ∈ R^{s·d},

gi = h ∑_{j=1}^{s} aij f(t0 + cj h, y0 + gj)   ⇔   F(g) := g − h (A ⊗ I) [ f(t0 + c1 h, y0 + g1) ; . . . ; f(t0 + cs h, y0 + gs) ] = 0 ,

where I is the d × d identity matrix and ⊗ designates the Kronecker product introduced in Def. 1.4.17.

We compute an approximate solution of F(g) = 0 iteratively by means of the simplified Newton method
presented in Rem. 8.4.39. This is a Newton method with "frozen Jacobian". As g → 0 for h → 0, we
choose zero as the initial guess:

g^(0) := 0 ,   g^(k+1) := g^(k) − D F(0)^{−1} F(g^(k)) , k = 0, 1, 2, . . . .

Obviously, D F(0) → I for h → 0. Thus, D F(0) will be regular for sufficiently small h.

In each step of the simplified Newton method we have to solve a linear system of equations with coefficient
matrix D F(0). If s · d is large, an efficient implementation has to reuse the LU-decomposition of D F(0),
see Code 8.4.40 and Rem. 2.5.10.
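A sketch of one step of an implicit RK-SSM in which the stage equations F(g) = 0 are solved by the simplified Newton method with a frozen Jacobian of the form I − h (A ⊗ Df(y0)), as described above. For brevity the ODE is assumed to be autonomous; the functor interfaces for f and its Jacobian Df are assumptions.

#include <Eigen/Dense>

// One step y0 -> y1 of an s-stage implicit RK-SSM with Butcher data (A, b) for the
// autonomous ODE y' = f(y); the stage form F(g) = 0 is solved by the simplified
// Newton method, reusing the LU decomposition of the frozen Jacobian.
template <class Func, class Jac>
Eigen::VectorXd irk_step(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                         Func &&f, Jac &&Df, const Eigen::VectorXd &y0,
                         double h, int maxit = 10) {
  const int s = A.rows(), d = y0.size();
  const Eigen::MatrixXd J = Df(y0);                     // Jacobian frozen at y0
  Eigen::MatrixXd DF0 = Eigen::MatrixXd::Identity(s * d, s * d);
  for (int i = 0; i < s; ++i)                           // DF0 = I - h*(A kron J)
    for (int j = 0; j < s; ++j)
      DF0.block(i * d, j * d, d, d) -= h * A(i, j) * J;
  const auto lu = DF0.partialPivLu();                   // factorize once, reuse below
  Eigen::VectorXd g = Eigen::VectorXd::Zero(s * d);     // initial guess g = 0
  for (int it = 0; it < maxit; ++it) {
    Eigen::VectorXd fg(s * d), Fg(s * d);
    for (int i = 0; i < s; ++i)
      fg.segment(i * d, d) = f(y0 + g.segment(i * d, d));
    for (int i = 0; i < s; ++i) {
      Eigen::VectorXd acc = Eigen::VectorXd::Zero(d);
      for (int j = 0; j < s; ++j) acc += A(i, j) * fg.segment(j * d, d);
      Fg.segment(i * d, d) = g.segment(i * d, d) - h * acc;   // residual F(g)
    }
    Eigen::VectorXd dg = lu.solve(Fg);
    g -= dg;                                            // simplified Newton update
    if (dg.norm() < 1e-12 * (1.0 + g.norm())) break;    // crude termination criterion
  }
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) y1 += h * b(i) * f(y0 + g.segment(i * d, d));
  return y1;
}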
Model problem analysis for general Runge-Kutta single step methods (→ Def. 12.3.18) runs parallel to
that for explicit RK-methods as elaborated in Section 12.1, § 12.1.13. Familiarity with the techniques and
results of this section is assumed. The reader is asked to recall the concept of stability function from
Thm. 12.1.17, the diagonalization technique from § 12.1.44, and the definition of region of (absolute)
stability from Def. 12.1.51.
S(z) := 1 + z b⊤(I − zA)^{−1} 1 = det(I − zA + z 1 b⊤) / det(I − zA) ,   z := λh ,   1 = [1, . . . , 1]⊤ ∈ R^s
(the stability function).
We determine the Butcher schemes (12.3.20) for simple implicit RK-SSM and apply the formula from
Thm. 12.3.27 to compute their stability functions.
• Implicit Euler method:   Butcher scheme   1 | 1
                                              | 1
  ➣  S(z) = 1/(1 − z) .

• Implicit midpoint method:   Butcher scheme   1/2 | 1/2
                                                   | 1
  ➣  S(z) = (1 + ½z)/(1 − ½z) .

Their regions of stability SΨ as defined in Def. 12.1.51 can easily be found from the respective stability
functions:

[Fig. 449, Fig. 450: SΨ of the implicit Euler method (11.2.13) and of the implicit midpoint method (11.2.18);
the former is the exterior of the disk of radius 1 centred at z = 1, the latter is the left half-plane.]
From the determinant formula for the stability function S(z) we can conclude a generalization of Cor. 12.1.20.
For a consistent (→ Def. 11.3.10) s-stage general Runge-Kutta single step method according to
Def. 12.3.18 the stability function S is a non-constant rational function of the form S(z) = P(z)/Q(z)
with polynomials P ∈ Ps, Q ∈ Ps.
Of course, a rational function z ↦ S(z) can satisfy lim_{|z|→∞} |S(z)| < 1, as we have seen in Ex. 12.3.28.
As a consequence, the region of stability of an implicit RK-SSM need not be bounded.
(12.3.30) A-stability

A general RK-SSM with stability function S applied to the scalar linear IVP ẏ = λy, y(0) = y0 ∈ C,
λ ∈ C, with uniform timestep h > 0 will yield the sequence (yk)_{k=0}^∞ defined by

yk = S(z)^k y0 ,   z = λh .        (12.3.31)

Hence, the next property of an RK-SSM guarantees that the sequence of approximations decays exponentially
whenever the exact solution of the model problem IVP (12.1.5) does so. A Runge-Kutta single step
method is called A-stable, if

C− := {z ∈ C : Re z < 0} ⊂ SΨ .        (SΨ ≙ region of stability, Def. 12.1.51)
From Ex. 12.3.28 we conclude that both the implicit Euler method and the implicit midpoint method are
A-stable.
A-stable Runge-Kutta single step methods will not be affected by stability induced timestep constraints
when applied to stiff IVP (→ Notion 12.2.9).
In order to reproduce the qualitative behavior of the exact solution, a single step method when applied to
the scalar linear IVP ẏ = λy, y(0) = y0 ∈ C, λ ∈ C, with uniform timestep h > 0,
Regions of stability of Gauss collocation single step methods, see Exp. 12.3.15:
[Figs. 451–453: regions of stability of Gauss collocation single step methods, shown as level lines of
|S(z)| in the complex plane (Re z horizontal, Im z vertical).]
Theorem 12.3.35. Region of stability of Gauss collocation single step methods [?, Satz 6.44]
The s-stage Gauss collocation single step methods defined by (12.3.11), with the nodes cj given by the
s Gauss points on [0, 1], feature the "ideal" stability domain:

SΨ = C− .        (12.3.34)
whose solution essentially is the smooth function t 7→ sin(2πt). Applying the criteria (12.2.15) and
(12.2.16) we immediately see that this IVP is extremely stiff.
We solve it with different implicit RK-SSMs on [0, 1] with the large uniform timestep h = 1/20.
[Fig. 454: exact solution y(t) and numerical solutions computed with the implicit Euler method
("Impliziter Euler") and Gauss collocation RK-SSMs ("Kollokations RK-ESV") with s = 1, . . . , 4 stages.
Fig. 455: Re(S(z)) for z ∈ [−1000, 0] for the implicit Euler method and the Gauss collocation RK-SSMs
with s = 1, . . . , 4, compared with exp(z).]
We observe that Gauss collocation RK-SSMs incur a huge discretization error, whereas the simple implicit
Euler method provides a perfect approximation!
The stability functions of the Gauss collocation RK-SSMs satisfy
\[ \lim_{|z|\to\infty} |S(z)| = 1\,. \]
Hence, when they are applied to ẏ = λy with λ < 0 of extremely large modulus, they produce sequences that decay only very slowly or even oscillate, which misses the very rapid decay of the exact solution. The stability function of the implicit Euler method, by contrast, is S(z) = (1 − z)^{−1} and satisfies lim_{|z|→∞} S(z) = 0, which means fast exponential decay of the y_k.
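A quick numerical illustration of this difference (using the standard stability functions S(z) = (1 + z/2)/(1 − z/2) of the implicit midpoint method, the 1-stage Gauss collocation SSM, and S(z) = (1 − z)^{−1} of the implicit Euler method): at z = λh = −1000
\[ S_{\mathrm{MP}}(-1000) = \frac{1 - 500}{1 + 500} \approx -0.996\,, \qquad S_{\mathrm{IE}}(-1000) = \frac{1}{1001} \approx 10^{-3}\,, \]
so the midpoint approximations are damped by less than one percent per step (and alternate in sign), while the implicit Euler approximations are annihilated almost immediately.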
(12.3.37) L-stability
In light of what we learned in the previous experiment we can now state what we expect from the stability function of a Runge-Kutta method that is suitable for stiff IVPs (→ Notion 12.2.9).

Consider a Runge-Kutta single step method (→ Def. 12.3.18) described by the Butcher scheme
\[ \begin{array}{c|c} \mathbf{c} & A \\ \hline & \mathbf{b}^{\top} \end{array}\,. \]
Assume that A ∈ ℝ^{s,s} is regular, which can be fulfilled only for an implicit RK-SSM.
For a rational function S(z) = P(z)/Q(z) the limit for |z| → ∞ exists and can easily be expressed through the leading coefficients of the polynomials P and Q. In particular one finds:
\[ \text{If } \mathbf{b}^{\top} = (A)_{s,:} \ (\text{last row of } A) \;\Rightarrow\; S(-\infty) = 0\,. \qquad (12.3.43) \]
A closer look at the coefficient formulas of (12.3.11) reveals that the algebraic condition (12.3.43) will automatically be satisfied for a collocation single step method with c_s = 1!
There is a family of s-point quadrature formulas on [0, 1] with a node located at 1 and (maximal) order 2s − 1: the Gauss-Radau formulas. They induce the L-stable Gauss-Radau collocation single step methods of order 2s − 1 according to Thm. 12.3.17.
Butcher schemes of the Gauss-Radau collocation single step methods for s = 1, 2, 3:
\[
\begin{array}{c|c} 1 & 1 \\ \hline & 1 \end{array}
\qquad
\begin{array}{c|cc}
\tfrac{1}{3} & \tfrac{5}{12} & -\tfrac{1}{12} \\
1 & \tfrac{3}{4} & \tfrac{1}{4} \\ \hline
  & \tfrac{3}{4} & \tfrac{1}{4}
\end{array}
\qquad
\begin{array}{c|ccc}
\frac{4-\sqrt{6}}{10} & \frac{88-7\sqrt{6}}{360} & \frac{296-169\sqrt{6}}{1800} & \frac{-2+3\sqrt{6}}{225} \\
\frac{4+\sqrt{6}}{10} & \frac{296+169\sqrt{6}}{1800} & \frac{88+7\sqrt{6}}{360} & \frac{-2-3\sqrt{6}}{225} \\
1 & \frac{16-\sqrt{6}}{36} & \frac{16+\sqrt{6}}{36} & \frac{1}{9} \\ \hline
  & \frac{16-\sqrt{6}}{36} & \frac{16+\sqrt{6}}{36} & \frac{1}{9}
\end{array}
\]
The stability functions of these Gauss-Radau collocation single step methods are of the form
\[ S(z) = \frac{P(z)}{Q(z)}\,, \quad P \in \mathcal{P}_{s-1}\,,\ Q \in \mathcal{P}_s\,, \]
so that indeed S(−∞) = 0.
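For instance, for the 2-stage Gauss-Radau method of the tableau above a routine evaluation of the determinant formula (carried out here for illustration) gives
\[ S(z) = \frac{1 + \tfrac{z}{3}}{1 - \tfrac{2z}{3} + \tfrac{z^2}{6}}\,, \]
with deg P = 1 < deg Q = 2, so that S(z) → 0 as |z| → ∞, confirming L-stability.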
Plots: Re(S(z)) and level lines of |S(z)| (levels 0.4, 0.7, 0.9, 1, 1.1, 1.5; Re on the horizontal, Im on the vertical axis) in the complex plane for the Gauss-Radau collocation single step methods with different numbers of stages.
We compare the sequences generated by 1-stage and 2-stage Gauss collocation and Gauss-Radau
collocation SSMs, respectively (uniform timestep).
Two plots (equidistant mesh, h = 0.016667; t on the horizontal, y on the vertical axis): the exact solution y(t) together with the approximations of the Gauss collocation SSMs with s = 1, 2 (left) and of the Gauss-Radau ("RADAU") collocation SSMs with s = 1, 2 (right).
The second-order Gauss collocation SSM (implicit midpoint method) suffers from spurious oscillations when homing in on the stable stationary state y = 1. The explanation from Exp. 12.3.36 also applies to this example.
The fourth-order Gauss method is already so accurate that potential overshoots when approaching y = 1
are damped fast enough.
Remember that we compute approximate solutions anyway, and the increments are weighted with the stepsize h ≪ 1, see Def. 12.3.18. So there is no point in determining them with high accuracy!

Idea: use only a fixed small number of Newton steps to solve for the k_i, i = 1, …, s.

✦ We consider an initial value problem for the logistic ODE, see Ex. 11.1.5.

One Newton step (8.4.1) applied to the defining equation F(y) := y − hf(y) − y_k = 0 of the implicit Euler method with initial guess y_k yields
\[ y_{k+1} = y_k + \bigl(\mathbf{I} - h\,Df(y_k)\bigr)^{-1} h\,f(y_k)\,. \]
Note: for a linear ODE with f(y) = Ay, A ∈ ℝ^{d,d}, we recover the original implicit Euler method!

Observation: approximate evaluation of the defining equation for y_{k+1} preserves first-order convergence.
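A minimal self-contained sketch of this semi-implicit (linearly implicit) Euler step, applied to the logistic ODE treated as a 1×1 system; this is not one of the lecture codes, and the values λ = 50, y_0 = 0.1, N = 50 below are placeholders chosen only for illustration:

#include <Eigen/Dense>
#include <iostream>
#include <vector>

// Semi-implicit Euler: y_{k+1} = y_k + (I - h*Df(y_k))^{-1} * h * f(y_k),
// i.e. one Newton step for the defining equation of the implicit Euler method
template <class Func, class Jac>
std::vector<Eigen::VectorXd> semiImplicitEuler(Func &&f, Jac &&Df, Eigen::VectorXd y0,
                                               double T, unsigned N) {
  const double h = T / N;
  const long d = y0.size();
  std::vector<Eigen::VectorXd> y{y0};
  for (unsigned k = 0; k < N; ++k) {
    Eigen::MatrixXd M = Eigen::MatrixXd::Identity(d, d) - h * Df(y.back());
    y.push_back(y.back() + M.lu().solve(h * f(y.back())));
  }
  return y;
}

int main() {
  // Logistic ODE y' = lambda*y*(1-y) as a 1D system; large lambda makes it stiff
  const double lambda = 50.0;
  auto f = [lambda](const Eigen::VectorXd &y) {
    return (lambda * y.array() * (1.0 - y.array())).matrix().eval();
  };
  auto Df = [lambda](const Eigen::VectorXd &y) {
    return Eigen::MatrixXd((lambda * (1.0 - 2.0 * y.array())).matrix().asDiagonal());
  };
  Eigen::VectorXd y0(1); y0 << 0.1;
  auto y = semiImplicitEuler(f, Df, y0, 1.0, 50);
  std::cout << "y(1) ~ " << y.back()(0) << std::endl;  // exact solution tends to 1
}

Despite the large λ the sequence approaches the stationary state y = 1 without any stability problems.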
✦ Now: implicit midpoint method (11.2.18), uniform timestep, and approximate computation of y_{k+1} by one Newton step with initial guess y_k.

Linearizing the increment equations of a general implicit RK-SSM around y_0 in the same spirit leads to
\[ k_i = f(y_0) + h\,Df(y_0)\Bigl(\sum_{j=1}^{s} a_{ij}\,k_j\Bigr)\,, \quad i = 1, \dots, s\,. \qquad (12.4.2) \]
The good news is that all results about stability derived from model problem analysis (→ Section 12.1)
remain valid despite linearization of the increment equations:
Linearization does nothing for linear ODEs ➢ stability function (→ Thm. 12.3.27) not affected!
The bad news is that the preservation of the order observed in Ex. 12.4.1 will no longer hold in the general
case.
✦ 2-stage Gauss-Radau collocation RK-SSM ("RADAU"), order = 3, see Ex. 12.3.44.
✦ Increments computed from the linearized equations (12.4.2) ("semi-implicit RADAU").
✦ We monitor the error through err := max_{j=1,…,n} |y_j − y(t_j)|.

Fig. 462: err versus timestep h (doubly logarithmic) for RADAU (s = 2) and semi-implicit RADAU, together with reference slopes O(h^3) and O(h^2).
We have just seen that the simple linearization according to (12.4.2) will degrade the order of implicit
RK-SSMs and leads to a substantial loss of accuracy. This is not an option.
Yet, the idea behind (12.4.2) has been refined. One does not start from a known RK-SSM, but introduces
general coefficients for structurally linear increment equations.
\[
(\mathbf{I} - h\,a_{ii}\,\mathbf{J})\,k_i = f\Bigl(y_0 + h \sum_{j=1}^{i-1} (a_{ij} + d_{ij})\,k_j\Bigr) - h\,\mathbf{J} \sum_{j=1}^{i-1} d_{ij}\,k_j\,, \qquad \mathbf{J} = Df(y_0)\,, \qquad (12.4.6)
\]
\[
y_1 := y_0 + h \sum_{j=1}^{s} b_j\,k_j\,.
\]
Single step methods of this form are known as Rosenbrock-Wanner (ROW) methods.
Then the coefficients aij , dij , and bi are determined from order conditions by solving large non-linear
systems of equations.
In each step s linear systems with coefficient matrices I − haii J have to be solved. For methods used in
practice one often demands that aii = γ for all i = 1, . . . , s. As a consequence, we have to solve s linear
systems with the same coefficient matrix I − hγJ ∈ ℝ^{d,d}, which permits us to reuse LU-factorizations,
see Rem. 2.5.10.
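The following sketch shows how one step of such a ROW method can exploit the common coefficient matrix; it is not a lecture code, the coefficient arrays a, d, b and the demo data in main are placeholders rather than a particular ROW method, and no error control is included:

#include <Eigen/Dense>
#include <iostream>
#include <vector>

// One step of a ROW-type method (12.4.6) with a_ii = gamma for all i:
// a single LU-factorization of I - h*gamma*J is reused for all s stages (Rem. 2.5.10)
template <class Func>
Eigen::VectorXd rowStep(Func &&f, const Eigen::MatrixXd &J, const Eigen::VectorXd &y0,
                        double h, double gamma, const Eigen::MatrixXd &a,
                        const Eigen::MatrixXd &d, const Eigen::VectorXd &b) {
  const int s = b.size();
  const long n = y0.size();
  Eigen::PartialPivLU<Eigen::MatrixXd> lu(Eigen::MatrixXd::Identity(n, n) -
                                          h * gamma * J);  // factor once, O(n^3)
  std::vector<Eigen::VectorXd> k;
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd arg = y0;                         // y0 + h*sum_j (a_ij + d_ij)*k_j
    Eigen::VectorXd corr = Eigen::VectorXd::Zero(n);  // sum_j d_ij*k_j
    for (int j = 0; j < i; ++j) {
      arg += h * (a(i, j) + d(i, j)) * k[j];
      corr += d(i, j) * k[j];
    }
    k.push_back(lu.solve(f(arg) - h * J * corr));     // each stage only O(n^2)
    y1 += h * b(i) * k[i];
  }
  return y1;
}

int main() {
  // Demo with s = 1, gamma = 1, b = [1], a = d = [0]; for f(y) = -y this reduces to
  // the semi-implicit Euler step from Ex. 12.4.1
  Eigen::MatrixXd a0 = Eigen::MatrixXd::Zero(1, 1), d0 = Eigen::MatrixXd::Zero(1, 1);
  Eigen::VectorXd b(1); b << 1.0;
  Eigen::MatrixXd J(1, 1); J << -1.0;
  auto f = [](const Eigen::VectorXd &y) { return Eigen::VectorXd(-y); };
  Eigen::VectorXd y(1); y << 1.0;
  y = rowStep(f, J, y, 0.1, 1.0, a0, d0, b);
  std::cout << "one step: " << y(0) << std::endl;  // 1 - 0.1/1.1 = 0.9091
}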
A ROW method is the basis for the standard integrator that MATLAB offers for stiff problems:
opts = odeset('abstol',atol,'reltol',rtol,'Jacobian',J);
[t,y] = ode23s(odefun,tspan,y0,opts);

ode23s ≙ integrator for stiff IVPs; Ψ ≙ RK-method of order 2, Ψ̃ ≙ RK-method of order 3.
Many relevant ordinary differential equations feature a right hand side function that is the sum of two (or more) terms. Consider an autonomous IVP with a right hand side function that can be split in an additive fashion:
\[ \dot{y} = f(y) + g(y)\,, \qquad y(0) = y_0\,. \]
Let us introduce the evolution operators (→ Def. 11.1.39) for both summands:

(Continuous) evolution maps:  Φ^t_f ↔ ODE ẏ = f(y),  Φ^t_g ↔ ODE ẏ = g(y).

Temporarily we assume that both Φ^t_f and Φ^t_g are available in the form of analytic formulas or highly accurate approximations.
Idea: build single step methods (→ Def. 11.3.5) based on the following discrete evolutions:
\[ \Psi^h = \Phi^h_g \circ \Phi^h_f \qquad \text{(Lie-Trotter splitting)}\,, \qquad (12.5.3) \]
\[ \Psi^h = \Phi^{h/2}_f \circ \Phi^h_g \circ \Phi^{h/2}_f \qquad \text{(Strang splitting)}\,. \qquad (12.5.4) \]
(Figs. 463, 464: composition diagrams for (12.5.3) and (12.5.4), starting from y_0.)
Note that over many timesteps the Strang splitting approach is not more expensive than Lie-Trotter split-
ting, because the actual implementation of (12.5.4) should be done as follows:
\[
\begin{aligned}
y_{1/2} &:= \Phi^{h/2}_f y_0\,, & y_1 &:= \Phi^h_g\, y_{1/2}\,, \\
y_{3/2} &:= \Phi^h_f\, y_1\,, & y_2 &:= \Phi^h_g\, y_{3/2}\,, \\
y_{5/2} &:= \Phi^h_f\, y_2\,, & y_3 &:= \Phi^h_g\, y_{5/2}\,, \\
&\ \ \vdots & &\ \ \vdots
\end{aligned}
\]
because Φ^{h/2}_f ∘ Φ^{h/2}_f = Φ^h_f. This means that a Strang splitting SSM differs from a Lie-Trotter splitting SSM essentially only in the first and the last (half-)step.
We consider the following IVP whose right hand side function is the sum of two functions for which the ODEs can be solved analytically:
\[ \dot{y} = \underbrace{\lambda y (1 - y)}_{=: f(y)} + \underbrace{\sqrt{1 - y^2}}_{=: g(y)}\,, \qquad y(0) = 0\,. \]
\[ \Phi^t_f y = \frac{1}{1 + (y^{-1} - 1)\,e^{-\lambda t}}\,, \quad t > 0\,,\ y \in\ ]0, 1] \qquad \text{(logistic ODE (11.1.6))}, \]
\[ \Phi^t_g y = \begin{cases} \sin\bigl(t + \arcsin(y)\bigr)\,, & \text{if } t + \arcsin(y) < \tfrac{\pi}{2}\,, \\ 1\,, & \text{else,} \end{cases} \qquad t > 0\,,\ y \in [0, 1]\,. \]
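A compact sketch of both splitting schemes for this IVP based on the exact evolutions above (not a lecture code; the number of steps N = 100 is arbitrary, and Φ^t_f is extended by the fixed point 0 so that the initial value y(0) = 0 can be used):

#include <cmath>
#include <iostream>

int main() {
  const double lambda = 1.0, T = 1.0;
  const double PI = std::acos(-1.0);
  // exact evolution of y' = lambda*y*(1-y) (logistic ODE); 0 is a fixed point
  auto Phi_f = [lambda](double t, double y) {
    return (y == 0.0) ? 0.0 : 1.0 / (1.0 + (1.0 / y - 1.0) * std::exp(-lambda * t));
  };
  // exact evolution of y' = sqrt(1 - y^2)
  auto Phi_g = [PI](double t, double y) {
    const double s = t + std::asin(y);
    return (s < PI / 2.0) ? std::sin(s) : 1.0;
  };
  const unsigned N = 100;
  const double h = T / N;
  double yLT = 0.0, yST = 0.0;  // initial value y(0) = 0
  for (unsigned k = 0; k < N; ++k) {
    yLT = Phi_g(h, Phi_f(h, yLT));                    // Lie-Trotter splitting (12.5.3)
    yST = Phi_f(h / 2, Phi_g(h, Phi_f(h / 2, yST)));  // Strang splitting (12.5.4)
  }
  std::cout << "Lie-Trotter: " << yLT << ", Strang: " << yST << std::endl;
}

Halving h and comparing against a reference value at T = 1 should reproduce the first- and second-order convergence reported below.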
Numerical experiment: for T = 1, λ = 1, we compare the two splitting methods on uniform timesteps with a very accurate reference solution computed by

f = @(t,x) lambda*x*(1-x)+sqrt(1-x^2);
options = odeset('reltol',1.0e-10,'abstol',1.0e-12);
[t,yex] = ode45(f,[0,1],y0,options);

Fig. 465: error |y(T) − y_h(T)| at the final time T = 1 versus timestep h (doubly logarithmic) for Lie-Trotter splitting and Strang splitting, with reference slopes O(h) and O(h²).
We observe algebraic convergence of the two splitting methods, of order 1 for (12.5.3) and of order 2 for (12.5.4).

The single step methods defined by (12.5.3) and (12.5.4) are of order (→ Def. 11.3.21) 1 and 2, respectively.
Of course, the assumption that ẏ = f(y) and ẏ = g(y) can be solved exactly will hardly ever be met. However, it should be clear that a "sufficiently accurate" approximation of the evolution maps Φ^h_g and Φ^h_f is all we need.
Again we consider the IVP of Ex. 12.5.5 and inexact splitting methods based on different single step methods for the two ODEs corresponding to the summands.
-2
10 LTS-Eul explicit Euler method (11.2.7) → Ψ hh,g ,
Ψhh, f + Lie-Trotter splitting (12.5.3)
-3
10 SS-Eul explicit Euler method (11.2.7) → Ψ hh,g ,
Ψhh, f + Strang splitting (12.5.4)
|y(T)-y (T)|
-4
10
method (11.2.7) ◦ exact evolution Φ hg ◦
implicit Euler method (11.2.13)
-5
10 LTS-Eul LTS-EMP explicit midpoint method (11.2.18) →
SS-Eul
SS-EuEI
Ψhh,g , Ψhh, f + Lie-Trotter splitting (12.5.3)
-6
LTS-EMP
SS-EMP SS-EMP explicit midpoint method (11.4.7) → Ψ hh,g ,
10
Fig. 466
10
-2 -1
10 Ψhh, f + Strang splitting (12.5.4)
Zeitschrittweite h
☞ The order of splitting methods may be (but need not be) limited by the order of the SSMs used for Φ^h_f, Φ^h_g.
“Splittable” ODEs
Fig. 467: solution y(t) computed by ode45, see Ex. 12.0.1, together with the timestep sizes used (annotation in the plot: "small perturbation"). Fig. 468: solutions (y_k) of the inexact splitting methods LT-Eulex (h = 0.04 and h = 0.02) and ST-MPRexpl (h = 0.05).
Total number of timesteps: ode45: 152; LT-Eulex, h = 0.04: 25; LT-Eulex, h = 0.02: 50; ST-MPRexpl, h = 0.05: 20.
Details of the methods:

LT-Eulex: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → explicit Euler method (11.2.7), combined by Lie-Trotter splitting (12.5.3).
ST-MPRexpl: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → explicit midpoint rule (11.4.7), combined by Strang splitting (12.5.4).
We observe that this splitting scheme can cope well with the stiffness of the problem, because the stiff
term on the right hand side is integrated exactly.
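A sketch of the LT-Eulex scheme along these lines (not a lecture code; the right hand side is assumed to be ẏ = λy(1 − y) + α sin y as suggested by the method descriptions above, and the values of λ, α, and the initial value are placeholders, not those of Ex. 12.0.1):

#include <cmath>
#include <iostream>

int main() {
  // Placeholder parameters, NOT those of Ex. 12.0.1
  const double lambda = 500.0, alpha = 1.0, T = 1.0, h = 0.04;
  // exact evolution of the stiff term y' = lambda*y*(1-y); 0 is a fixed point
  auto Phi_f = [lambda](double t, double y) {
    return (y == 0.0) ? 0.0 : 1.0 / (1.0 + (1.0 / y - 1.0) * std::exp(-lambda * t));
  };
  // explicit Euler step (11.2.7) for the non-stiff term y' = alpha*sin(y)
  auto Psi_g = [alpha](double t, double y) { return y + t * alpha * std::sin(y); };
  double y = 0.01;  // placeholder initial value
  for (double t = 0.0; t < T - h / 2; t += h)
    y = Psi_g(h, Phi_f(h, y));  // Lie-Trotter splitting (12.5.3)
  std::cout << "y(T) ~ " << y << std::endl;
}

No stepsize restriction is triggered by the large λ, because the stiff logistic term is propagated by its exact evolution.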
In the numerical treatment of partial differential equations one commonly encounters ODEs of the form
\[
\dot{y} = f(y) := -\mathbf{A}y + \begin{bmatrix} g(y_1) \\ \vdots \\ g(y_d) \end{bmatrix}, \qquad \mathbf{A} = \mathbf{A}^{\top} \in \mathbb{R}^{d,d} \text{ positive definite (→ Def. 1.1.8)}\,, \qquad (12.5.13)
\]
with state space D = ℝ^d, where λ_min(A) ≈ 1, λ_max(A) ≈ d², and the derivative of g : ℝ → ℝ is bounded. Then IVPs for (12.5.13) will be stiff, since the Jacobian
\[
Df(y) = -\mathbf{A} + \begin{bmatrix} g'(y_1) & & \\ & \ddots & \\ & & g'(y_d) \end{bmatrix} \in \mathbb{R}^{d,d}
\]
will have eigenvalues “close to zero” and others that are large (in modulus) and negative. Hence, D f(y)
will satisfy the criteria (12.2.15) and (12.2.16) for any state y ∈ ℝ^d.
This suggests a splitting approach based on the additive decomposition in (12.5.13); a sketch is given after this list.

• For the linear ODE ẏ = −Ay we have to use an L-stable (→ Def. 12.3.38) single step method, for instance a second-order implicit Runge-Kutta method. Its increments can be obtained by solving a linear system of equations, whose coefficient matrix will be the same for every step, if uniform timesteps are used.

• The ODE ẏ = (g(y_1), …, g(y_d))^⊤ boils down to decoupled scalar ODEs ẏ_j = g(y_j), j = 1, …, d. For them we can use an inexpensive explicit RK-SSM like the explicit trapezoidal method (11.4.6). According to our assumptions on g these ODEs are not haunted by stiffness.
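A minimal sketch of one possible realization of this strategy (not a lecture code): Lie-Trotter splitting (12.5.3) combining the L-stable implicit Euler method for the linear part with the explicit trapezoidal method (11.4.6) for the decoupled scalar ODEs; as a concrete stand-in, A is taken to be a scaled 1D finite-difference Laplacian and g(x) = sin x.

#include <Eigen/Sparse>
#include <Eigen/SparseCholesky>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const int d = 100;
  const double h = 0.01, T = 1.0;
  auto g = [](double x) { return std::sin(x); };  // stand-in for g with bounded g'
  // A: scaled 1D finite-difference Laplacian, s.p.d. with lambda_max ~ d^2
  std::vector<Eigen::Triplet<double>> trip;
  const double s = static_cast<double>(d) * d;
  for (int i = 0; i < d; ++i) {
    trip.emplace_back(i, i, 2.0 * s);
    if (i > 0) trip.emplace_back(i, i - 1, -s);
    if (i < d - 1) trip.emplace_back(i, i + 1, -s);
  }
  Eigen::SparseMatrix<double> A(d, d);
  A.setFromTriplets(trip.begin(), trip.end());
  Eigen::SparseMatrix<double> I(d, d);
  I.setIdentity();
  // Implicit Euler for y' = -A*y: y <- (I + h*A)^{-1} y; factor I + h*A only once
  Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> solver(
      Eigen::SparseMatrix<double>(I + h * A));
  Eigen::VectorXd y = Eigen::VectorXd::Constant(d, 0.5);
  for (double t = 0.0; t < T - h / 2; t += h) {
    y = solver.solve(y);           // L-stable step for the stiff linear part
    for (int j = 0; j < d; ++j) {  // explicit trapezoidal (11.4.6) for y_j' = g(y_j)
      const double k1 = g(y[j]), k2 = g(y[j] + h * k1);
      y[j] += 0.5 * h * (k1 + k2);
    }
  }
  std::cout << "y_0(T) ~ " << y[0] << std::endl;
}

Only one sparse Cholesky factorization is needed for the whole integration, while the nonlinear part costs a few function evaluations per component and step.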