
Programming Basics and AI

with Matlab and Python

Lectures on YouTube:
https://www.youtube.com/@mathtalent

Seongjai Kim

Department of Mathematics and Statistics


Mississippi State University
Mississippi State, MS 39762 USA
Email: [email protected]

Updated: December 2, 2023


Coding techniques may improve a computer program by tens of percent,
while an effective algorithmic design can improve it by tens or hundreds of times.

In computational literacy, the coding ability is to the tip of the iceberg
as the ability of algorithmic design is to its remainder.

Seongjai Kim, Professor of Mathematics, Department of Mathematics and Statistics, Mississippi State University, Mississippi State, MS 39762 USA. Email: [email protected].
Prologue
This lecture note provides an overview of scientific computing, i.e., of modern information
engineering tasks to be tackled by powerful computer simulations. The emphasis throughout
is on the understanding of modern algorithmic designs and their efficient implementation.
As is well known in the computational methods community, computer programming is
the process of constructing an executable computer program in order to accomplish a specific
computational task. Programming in practice cannot be realized without incorporating
computational languages. However, it is not a simple process of experiencing computational
languages; it involves concerns such as

• mathematical analysis,
• generating computational algorithms,
• profiling algorithms’ accuracy and cost, and
• the implementation of algorithms in selected programming languages
(commonly referred to as coding).

The source code of a program can be written in one or more programming languages.
The manuscript is conceived as an introduction to the thriving field of information engineering,
particularly for early-year college students who are interested in mathematics,
engineering, and other sciences, without an already strong background in computational
methods. It will also be suitable for talented high school students. All examples
to be treated in this manuscript are implemented in Matlab and Python, and occasionally
in Maple.

Currently, the lecture note is growing.


December 2, 2023


Level of Lectures
• The target audience is undergraduate students.
• However, talented high school students would be able to follow
the lectures.
• Persons with no programming experience will understand
most of the lectures.

Goals of Lectures
Let the students understand
• Mathematical Basics: Calculus & Linear Algebra
• Programming, with Matlab and Python
• Artificial Intelligence (AI)
• Basics of Machine Learning (ML)
• An ML Software: Scikit-Learn

Programming: the Programmers’ Work

• Understand and analyze the problem


• Convert mathematical terms to computer programs
• Verify the code
• Get insights

You will learn programming!


Contents

Title ii

Prologue iv

Table of Contents ix

1 Programming Basics 1
1.1. What is Programming or Coding? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1. Programming: Some Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2. Functions: Generalization and Reusability . . . . . . . . . . . . . . . . . . . . . 5
1.1.3. Becoming a Good Programmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2. Matlab: A Powerful Computer Language . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1. Introduction to Matlab/Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2. Repetition: Iteration Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.3. Anonymous Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2.4. Open Source Alternatives to Matlab . . . . . . . . . . . . . . . . . . . . . . . . . 25
Exercises for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Programming Examples 29
2.1. Area Estimation of the Region Defined by a Closed Curve . . . . . . . . . . . . . . . . . 30
2.2. Visualization of Complex-Valued Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3. Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.1. Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2. Short-Time Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4. Computational Algorithms and Their Convergence . . . . . . . . . . . . . . . . . . . . . 47
2.4.1. Computational Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.2. Big O and little o notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5. Inverse Functions: Exponentials and Logarithms . . . . . . . . . . . . . . . . . . . . . 53
2.5.1. Inverse functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5.2. Logarithmic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3 Programming with Calculus 65


3.1. Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.1. The Slope of the Tangent Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


3.1.2. Derivative and Differentiation Rules . . . . . . . . . . . . . . . . . . . . . . . . . 71


3.2. Basis Functions and Taylor Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.1. Change of Variables & Basis Functions . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.2. Power Series and the Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.3. Taylor Series Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.3. Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.1. Lagrange Form of Interpolating Polynomials . . . . . . . . . . . . . . . . . . . . 87
3.3.2. Polynomial Interpolation Error Theorem . . . . . . . . . . . . . . . . . . . . . . 90
3.4. Numerical Differentiation: Finite Difference Formulas . . . . . . . . . . . . . . . . . . 93
3.5. Newton’s Method for the Solution of Nonlinear Equations . . . . . . . . . . . . . . . . 99
3.6. Zeros of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.6.1. Horner’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4 Linear Algebra Basics 113


4.1. Solutions of Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.1.1. Solving a linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.1.2. Matrix equation Ax = b . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.2. Row Reduction and the General Solution of Linear Systems . . . . . . . . . . . . . . . 119
4.2.1. Echelon Forms and the Row Reduction Algorithm . . . . . . . . . . . . . . . . . 120
4.2.2. The General Solution of Linear Systems . . . . . . . . . . . . . . . . . . . . . . . 123
4.3. Linear Independence and Span of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4. Invertible Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

5 Programming with Linear Algebra 135


5.1. Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.2. Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2.1. Characteristic Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.2. Matrix Similarity and The Diagonalization Theorem . . . . . . . . . . . . . . . 144
5.3. Dot Product, Length, and Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.4. Vector Norms, Matrix Norms, and Condition Numbers . . . . . . . . . . . . . . . . . . 151
5.5. Power Method and Inverse Power Method for Eigenvalues . . . . . . . . . . . . . . . . 155
5.5.1. The Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.5.2. The Inverse Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

6 Multivariable Calculus 163


6.1. Multi-Variable Functions and Their Partial Derivatives . . . . . . . . . . . . . . . . . . 164
6.1.1. Functions of Several Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.1.2. First-order Partial Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.2. Directional Derivatives and the Gradient Vector . . . . . . . . . . . . . . . . . . . . . . 168

6.3. Optimization: Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . 173


6.3.1. Optimization Problems with Equality Constraints . . . . . . . . . . . . . . . . . 174
6.3.2. Optimization Problems with Inequality Constraints . . . . . . . . . . . . . . . . 177
6.4. The Gradient Descent Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.4.1. Introduction to the Gradient Descent Method . . . . . . . . . . . . . . . . . . . . 181
6.4.2. The Gradient Descent Method in Multi-Dimensions . . . . . . . . . . . . . . . . 185
6.4.3. The Gradient Descent Method for Positive Definite Linear Systems . . . . . . . 188
Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

7 Least-Squares and Regression Analysis 193


7.1. The Least-Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.2. Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.2.1. Regression line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.2.2. Least-squares fitting of other curves . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.2.3. Nonlinear regression: Linearization . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.3. Scene Analysis with Noisy Data: Weighted Least-Squares and RANSAC . . . . . . . . 204
7.3.1. Weighted Least-Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.3.2. RANdom SAmple Consensus (RANSAC) . . . . . . . . . . . . . . . . . . . . . . . 206
Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

8 Python Basics 211


8.1. Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.2. Python Essentials in 30 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
8.3. Zeros of a Polynomial in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
8.4. Python Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

9 Vector Spaces and Orthogonality 233


9.1. Subspaces of Rn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
9.2. Orthogonal Sets and Orthogonal Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.3. Orthogonal Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.4. The Gram-Schmidt Process and QR Factorization . . . . . . . . . . . . . . . . . . . . . 248
9.5. QR Iteration for Finding Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

10 Introduction to Machine Learning 259


10.1.What is Machine Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
10.2.Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.2.1. The Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.2.2. Adaline: ADAptive LInear NEuron . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.3.Popular Machine Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.3.1. Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.3.2. Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

10.3.3. k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281


10.4.Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
10.4.1. A Simple Network to Classify Hand-written Digits: MNIST Dataset . . . . . . 284
10.4.2. Implementation for MNIST Digits Dataset [9] . . . . . . . . . . . . . . . . . . . 288
10.5.Scikit-Learn: A Python Machine Learning Library . . . . . . . . . . . . . . . . . . . . . 292
A Machine Learning Modelcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Exercises for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

11 Principal Component Analysis 303


11.1.Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
11.1.1. The covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
11.1.2. Computation of principal components . . . . . . . . . . . . . . . . . . . . . . . . 309
11.1.3. Dimensionality reduction: Data compression . . . . . . . . . . . . . . . . . . . . 311
11.2.Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
11.2.1. Algebraic interpretation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . 317
11.2.2. Computation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
11.3.Applications of the SVD to LS Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Exercises for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

A Appendices 333
A.1. Optimization: Primal and Dual Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 334
A.1.1. The Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
A.1.2. Lagrange Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
A.2. Weak Duality, Strong Duality, and Complementary Slackness . . . . . . . . . . . . . . 338
A.2.1. Weak Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
A.2.2. Strong Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
A.2.3. Complementary Slackness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
A.3. Geometric Interpretation of Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
A.4. Rank-One Matrices and Structure Tensors . . . . . . . . . . . . . . . . . . . . . . . . . 349
A.5. Boundary-Effects in Convolution Functions in Matlab and Python SciPy . . . . . . . . 353
A.6. From Python, Call C, C++, and Fortran . . . . . . . . . . . . . . . . . . . . . . . . . . . 357

P Projects 365
P.1. Project: Canny Edge Detection Algorithm for Color Images . . . . . . . . . . . . . . . . 366
P.1.1. Noise Reduction: Image Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
P.1.2. Gradient Calculation: Sobel Gradient . . . . . . . . . . . . . . . . . . . . . . . . 372
P.1.3. Edge Thinning: Non-maximum Suppression . . . . . . . . . . . . . . . . . . . . 375
P.1.4. Double Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
P.1.5. Edge Tracking by Hysteresis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
P.2. Project: Text Extraction from Images, PDF Files, and Speech Data . . . . . . . . . . . 380

Bibliography 383

Index 385
Chapter 1. Programming Basics

In this chapter, you will learn


• what programming is
• what coding is
• what programming languages are
• how to convert mathematical terms to computer programs
• how to control repetitions

Contents of Chapter 1
1.1. What is Programming or Coding? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Matlab: A Powerful Computer Language . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Exercises for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


1.1. What is Programming or Coding?


Definition 1.1. Computer programming is the process of building
an executable computer program in order to accomplish a specific
computational task.
• Programming involves various concerns such as
– mathematical/physical analysis,
– generating computational algorithms,
– profiling algorithms’ accuracy and cost, and
– the implementation of algorithms in a chosen programming
language (commonly referred to as coding).
• The purpose of programming is to find a sequence of instructions
that will automate the performance of a task for solving a given
problem.
• Thus, the process of programming often requires expertise in
several different subjects, including
– knowledge of the application domain,
– specialized algorithms, and
– formal logic.

1.1.1. Programming: Some Examples


Example 1.2. Assume that we need to find the sum of integers from 2 to
5:
2 + 3 + 4 + 5.

Solution. You may start with 2; add 3, add 4, and finally add 5; the answer
is 14. This simple procedure is the result of programming in your brain.
Programming is thinking.
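In Matlab, the same computation is a one-liner on the command window; for instance,
using the built-in function sum over the vector 2:5:

>> 2 + 3 + 4 + 5
ans = 14
>> sum(2:5)
ans = 14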

Example 1.3. Let’s try to get √5. Your calculator must have a square-root
function key √ . When you input 5 and push Enter , your calculator displays the
answer on the spot. How can the calculator get the answer?
Solution. Calculators or computers cannot keep a table to look the answer
up. They compute the answer on the spot as follows.

Let Q = 5.
1. initialization: p (e.g., p = 1)
2. for i = 1, 2, · · · , itmax
       p ← (p + Q/p)/2;
3. end for
squareroot_Q.m
1 Q=5;
2

3 p = 1;
4 for i=1:8
5 p = (p+Q/p)/2;
6 fprintf("%3d %.20f\n",i,p)
7 end

Output
1 1 3.00000000000000000000
2 2 2.33333333333333348136
3 3 2.23809523809523813753
4 4 2.23606889564336341891
5 5 2.23606797749997809888
6 6 2.23606797749978980505
7 7 2.23606797749978980505
8 8 2.23606797749978980505

The algorithm has converged (all 20 printed decimal digits no longer change) in just
six iterations.

Note: The above example shows what really happens in your calculators
and computers (sqrt). In general, √Q can be found in a few iterations of
simple mathematical operations.
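The iteration is easy to wrap as a reusable function. Below is a minimal sketch; the
function name mysqrt, the stopping tolerance, and the iteration cap are choices made
here only for illustration.
mysqrt.m
function p = mysqrt(Q)
% function p = mysqrt(Q)
%   Approximates sqrt(Q), Q > 0, by the iteration p <- (p + Q/p)/2.
p = Q;                       % initialization
for i = 1:50                 % a few iterations suffice in practice
    pnew = (p + Q/p)/2;
    if abs(pnew - p) <= 1e-15*max(1,p), p = pnew; break; end
    p = pnew;
end

For example, mysqrt(5) returns 2.236067977499790, in agreement with the table above.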

Remark 1.4. Note that

        p ← (p + Q/p)/2 = p − (p^2 − Q)/(2p),        (1.1)

which can be interpreted as follows.
1. Square the current iterate p;
2. Measure the difference from Q;
3. Scale the difference by twice the current iterate (2p);
4. Update p by subtracting the scaled difference (correction term).

Question. How could we know that

a good scaling factor in the correction term is 2p?

• The answer comes from a mathematical analysis: the update (1.1) is exactly
Newton’s method applied to f(p) = p^2 − Q (see Section 3.5).


• In general, programming consists of
(a) mathematical analysis,
(b) algorithmic design,
(c) implementation on the computer (coding), and
(d) verification for accuracy and efficiency.
• Once you have done the mathematical analysis and the algorithmic design,
the next step is to implement the algorithm as a code (coding).
• For the implementation, you can use one (or more) computer languages,
such as Matlab, Python, C, C++, and Java.

• Through the course, you will learn programming techniques,


using simple languages such as Matlab and Python.
• Why simple languages?

To focus on mathematical logic and algorithmic design



Remark 1.5. (Coding vs. Programming)


At this moment, you may ask questions like:
• What is coding?
• How is it related to programming?
Though the terms are often used interchangeably, coding and programming
are two different things.
Particularly, in Software Development Industries,
• Coding refers to writing codes for applications, but
programming is a much broader term.
– Coding is basically the process of creating codes from one language
to another,
– while programming is to find solutions of problems and determine
how they should be solved.
• Programmers generally deal with the big picture in applications.
So you will learn programming!

1.1.2. Functions: Generalization and Reusability


A good programmer must implement codes that are effective, easy to
modify, and reusable. In order to understand reusability, let us consider
the following simple programming example.
Example 1.6. Find the sum of the squares of consecutive integers from
1 to 10.

Solution.
• This example asks to evaluate the quantity:

        1^2 + 2^2 + · · · + 10^2 = Σ_{i=1}^{10} i^2.        (1.2)

• A Matlab code can be written as



1 sqsum = 0;
2 for i=1:10
3     sqsum = sqsum + i^2;
4 end

• When the code is executed, the variable sqsum stores 385.
• The code is a simple form of repetition, one of the most common
building blocks in programming.

Remark 1.7. Reusability.


• The above Matlab program produces the square sum of integers
from 1 to 10, which may not be useful for other occasions.
• In order to make programs reusable for various situations, an operation
or a group of operations must be
– implemented with variable inputs, and
– saved in the form of a function.

Example 1.8. (Generalization of Example 1.6). Find the sum of the
squares of consecutive integers from m to n.

Solution. As a generalization of the above Matlab code, it can be implemented
and saved in squaresum.m as follows.
squaresum.m
1 function sqsum = squaresum(m,n)
2 %function sqsum = squaresum(m,n)
3 % Evaluates the square sum of consecutive integers: m to n.
4 % input: m,n
5 % output: sqsum
6

7 sqsum = 0;
8 for i=m:n
9 sqsum = sqsum + i^2;
10 end

• In Matlab, each saved function file is called an M-file, of which the
first line specifies

– the function name (squaresum),


– input variables (m,n),
– outputs (sqsum).

• Lines 2–5 of squaresum.m, beginning with the percent sign (%), are for
a convenient user interface. The built-in function help can be utilized
whenever we want to see what the programmer has commented for the
function. For example,
help
1 >> help squaresum
2 function sqsum = squaresum(m,n)
3 Evaluates the square sum of consecutive integers: m to n.
4 input: m,n
5 output: sqsum

• The last four lines of squaresum.m include the required operations for
the given task.

• On the command window, the function is called for various m and


n. For example,

1 >> squaresum(1,10)
2 ans = 385

1.1.3. Becoming a Good Programmer

In-Reality 1.9. Aspects of Programming


As aforementioned, computer programming (or programming) is
the process of building an executable computer program for accomplishing
a specific computational task. A task may consist of numerous
sub-tasks, each of which can be implemented as a function; some
functions may be used more than once or repeatedly in a program.
The reader should consider the following aspects of programming,
before coding.

• Task modularization: The given computational task can be partitioned
into several small sub-tasks (modules), each of which is manageable
conveniently and effectively in both mathematical analysis and
computer implementation. The major goal of task modularization is to
build a backbone of the program.
• Development of algorithms: For each module, computational algorithms
must be developed and saved in functions.
• Choice of computer languages: One can choose a computer language
in which all the sub-tasks are implemented. However, it
is occasionally the case that sub-tasks are implemented in more than
one computer language, in order to maximize the performance
of the resulting program and/or to minimize human effort.
• Debugging: Once all the modules are implemented and linked for the
given computational task, the code must be verified for correctness
and effectiveness. Such a process of finding and resolving defects or
issues within a computer program is called debugging.

Note: It is occasionally the case that verification and debugging take
much longer than the implementation itself.

Remark 1.10. Tips for Programming:


• Add functions one-by-one: Building a program is not a simple problem
but a difficult project, particularly when the program should be
constructed from scratch. A good strategy for effective programming
is:
(a) Add functions one-by-one.
(b) Check if the program is correct, each time a function is added.
That is, the programmer should keep the program in a working condition
for the whole period of the implementation.
• Use/modification of functions: One can build a new program quite
effectively by modifying and reusing old functions written for the same
or similar projects. When that is the case,
you may have to start the work by copying old functions
to newly-named functions to modify, rather than
adding/replacing lines of the original functions.
Such a strategy will make debugging much easier
and keep the program in a working condition all the time.

Note: For successful programming, the programmer may consider the
following, before he/she starts the implementation (coding).
• Understanding the problem: inputs, operations, & outputs
• Required algorithms: reusable or new
• Required mathematical methods/derivation
• Program structure: how to place operations/functions
• Verification: how to verify the code to ensure correctness

Example 1.11. Let us write a program for sorting an array of numbers


from smallest to largest.
Solution. We should consider the following before coding.
• The goal: A sorting algorithm.
• Method: Comparison of component pairs, for the smaller to move up.
• Verification: How can I verify that the program works correctly?
Let’s use, e.g., a randomly-generated array of numbers.
• Parameters: Overall, what could be the input/output parameters?

All being considered, a program is coded as follows.


mysort.m
1 function S = mysort(R)
2 %function S = mysort(R)
3 % which sorts an array from smallest to largest
4

5 %% initial setting
6 S = R;
7

8 %% get the length


9 n = length(R);
10

11 %% begin sorting
12 for j=n:-1:2 %index for the largest among remained
13 for i=1:j-1
14 if S(i) > S(i+1)
15 tmp = S(i);
16 S(i) = S(i+1);
17 S(i+1) = tmp;
18 end
19 end
20 end

SortArray.m
1 % User parameter
2 n=10;
3

4 % An array of random numbers


5 % (1,n) vector of integer random values <= 100
6 R = randi(100,1,n)
7

8 % Call "mysort"
9 S = mysort(R)

Output
1 >> SortArray
2 R =
3 33 88 75 17 91 94 79 36 2 72
4 S =
5 2 17 33 36 72 75 79 88 91 94

Note: You may have to run “SortArray.m” a few times, to make sure that
“mysort” works correctly.

Summary 1.12. Programming vs. Coding

• Programming consists of analysis, design, coding, & verification.


It requires creative thinking and reasoning, on top of coding.

• It is better to begin with a simple computer language.



1.2. Matlab: A Powerful Computer Language


Matlab (matrix laboratory) is a multi-paradigm numerical computing
environment and a proprietary programming language developed by
MathWorks.
• Flexibility: Matlab allows matrix manipulations, plotting of func-
tions and data, implementation of algorithms, creation of user in-
terfaces, and interfacing with programs written in other languages,
including C, C++, C#, Java, Fortran, and Python; it is particularly
good at matrix manipulations.
• Computer Algebra: Although Matlab is intended primarily for nu-
merical computing, an optional toolbox uses the MuPAD symbolic
engine, allowing access to symbolic computing abilities.
• Most Convenient Computer Language: Overall, Matlab is about
the easiest computer language to learn and to use as well.

Remark 1.13. For each programming language, there are four
essential components to learn.
1. Looping – repetition
2. Conditional statements – dealing with cases
3. Input/Output – using data and saving/visualizing results
4. Functions – reusability and programming efficiency (§1.1.2)

1.2.1. Introduction to Matlab/Octave


Vectors and Matrices
The most basic thing you will need to do is to enter vectors and matrices.
You would enter commands to Matlab at a prompt that looks like >>.
• Rows are separated by semicolons (;) or Enter .
• Entries in a row are separated by commas (,) or Space .

Vectors and Matrices


1 >> v = [1; 2; 3] % column vector
2 v =
3 1
4 2
5 3
6 >> w = [5, 6, 7, 8] % row vector
7 w =
8 5 6 7 8
9 >> A = [2 1; 1 2] % matrix
10 A =
11 2 1
12 1 2
13 >> B = [2, 1; 1, 2]
14 B =
15 2 1
16 1 2

• The symbols (,) and (;) can be used to combine more than one command
in the same line.
• If we use semicolon (;), Matlab sets the variable but does not print the
output.

1 >> p = [2; -3; 1], q = [2; 0; -3];


2 p =
3 2
4 -3
5 1
6 >> p+q
7 ans =
8 4
9 -3
10 -2
11 >> d = dot(p,q);

where dot computes the dot product of two vectors.



• Instead of entering a matrix at once, we can build it up from either its


rows or its columns.

1 >> c1=[1; 2]; c2=[3; 4];


2 >> M=[c1,c2]
3 M =
4 1 3
5 2 4
6 >> c3=[5; 6];
7 >> M=[M,c3]
8 M =
9 1 3 5
10 2 4 6
11 >> c4=c1; r3=[2 -1 5 0];
12 >> N=[M, c4; r3]
13 N =
14 1 3 5 1
15 2 4 6 2
16 2 -1 5 0

Operations with Vectors and Matrices


• Matlab uses the symbol (*) for both scalar multiplication and matrix-
vector multiplication.
• In Matlab, to retrieve the (i, j)-th entry of a matrix M, type M(i,j).
• To retrieve more than one element at a time, give a list of columns and
rows that you want.
– For example, 2:4 is the same as [2 3 4].
– A colon (:) by itself means all. Thus, M(i,:) extracts the i-th row
of M. Similarly, M(:,j) extracts the j-th column of M.

1 >> M=[1 2 3 4; 5 6 7 8; 9 10 11 12], v=[1;-2;2;1];


2 M =
3 1 2 3 4
4 5 6 7 8
5 9 10 11 12
6 >> M(2,3)
7 ans =
8 7
9 >> M(3,[2 4])
10 ans =
11 10 12
12 >> M(:,2)
13 ans =
14 2
15 6
16 10
17 >> 3*v
18 ans =
19 3
20 -6
21 6
22 3
23 >> M*v
24 ans =
25 7
26 15
27 23

• To multiply two matrices in Matlab, use the symbol (*).


• The n × n identity matrix is formed with the command eye(n).
• You can ask Matlab for its reasoning using the command why. Unfortunately,
Matlab usually takes an attitude and gives a random response.

1 >> A=[1 2; 3 4], B=[4 5; 6 7],


2 A =
3 1 2
4 3 4
5 B =
6 4 5
7 6 7
8 >> A*B
9 ans =
10 16 19
11 36 43
12 >> I=eye(3)
13 I =
14 1 0 0
15 0 1 0
16 0 0 1
17 >> C=[2 4 6; 1 3 5; 0 1 1];
18 >> C_inv = inv(C)
19 C_inv =
20 1.0000 -1.0000 -1.0000
21 0.5000 -1.0000 2.0000
22 -0.5000 1.0000 -1.0000
23 >> C_inv2=C\I
24 C_inv2 =
25 1.0000 -1.0000 -1.0000
26 0.5000 -1.0000 2.0000
27 -0.5000 1.0000 -1.0000
28 >> C_inv*C
29 ans =
30 1 0 0
31 0 1 0
32 0 0 1

Graphics with Matlab


In Matlab, the most popular graphic command is plot, which creates
a 2D line plot of the data in Y versus the corresponding values in X. A
general syntax for the command is
plot(X1,Y1,LineSpec1,...,Xn,Yn,LineSpecn)
fig_plot.m
1 close all
2

3 %% a curve
4 X1=linspace(0,2*pi,11); % n=11
5 Y1=cos(X1);
6

7 %% another curve
8 X2=linspace(0,2*pi,51);
9 Y2=sin(X2);
10

11 %% plot together
12 plot(X1,Y1,'-or','linewidth',2, X2,Y2,'-b','linewidth',2)
13 legend({'y=cos(x)','y=sin(x)'})
14 axis tight
15 print -dpng 'fig_cos_sin.png'

Figure 1.1: plot of y = cos x and y = sin x.



The above fig_plot.m is a typical M-file for figuring with plot.


• Line 1: It closes all figures currently open.
• Lines 3, 7, and 11 (comments): When the percent sign (%) appears, the
rest of the line will be ignored by Matlab.
• Lines 4 and 8: The command linspace(x1,x2,n) returns a row vector
of n evenly spaced points between x1 and x2.
• Line 12: Its result is a figure shown in Figure 1.1.
• Line 15: it saves the figure into a png format, named fig_cos_sin.png.
The first function (y = cos x) is plotted with 11 points so that its curve
shows the local linearity, while the graph of y = sin x looks smooth with
51 points.

• For contour plots, you may use contour.


• For figuring 3D objects, you may try surf and mesh.
• For function plots, you can use fplot, fsurf, and fmesh.
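For example, a 3D surface can be drawn with a few lines; a minimal sketch, where the
test function z = x exp(-x^2-y^2) is chosen only for illustration:

surf_example.m
% A quick 3D surface plot over a square grid
[X,Y] = meshgrid(linspace(-2,2,41));
Z = X.*exp(-X.^2-Y.^2);
figure, surf(X,Y,Z)
xlabel('x'), ylabel('y'), zlabel('z'), title('z = x exp(-x^2-y^2)')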

Remark 1.14. (help and doc).


Matlab is powerful and well-documented as well. To see what a built-in
function does or how you can use it, type

help <name> or doc <name>

The command doc opens the Help browser. If the Help browser is already
open, but not visible, then doc brings it to the foreground and opens a
new tab. Try doc surf, followed by doc contour.

1.2.2. Repetition: Iteration Loops

Recall: (Remark 1.13 on p. 12) For each programming language,
there are four essential components to learn.
1. Looping – repetition
2. Conditional statements – dealing with cases
3. Input/Output – using data and saving/visualizing results
4. Functions – reusability and programming efficiency (§1.1.2)

Note: Repetition
• In scientific computing, one of the most frequently occurring events is
repetition.
• Each repetition of the process is also called an iteration.
• It is the act of repeating a process, to generate a (possibly unbounded)
sequence of outcomes, with the aim of approaching a desired
goal, target, or result. Thus,
(a) Iteration must start with an initialization (starting point), and
(b) Perform a step-by-step marching in which the results of one iteration
are used as the starting point for the next iteration.

In the context of mathematics or computer science, iteration (along with the
related technique of recursion) is a very basic building block in programming.
As in other computer languages, Matlab provides a few types of loops to
handle looping requirements including: while loops, for loops, and nested
loops.

While loop
The while loop repeatedly executes statements while a specified condition
is true. The syntax of a while loop in Matlab is as follows.
while <expression>
<statements>
end
An expression is true when the result is nonempty and contains all
nonzero elements, logical or real numeric; otherwise the expression is
false.

Example 1.15. Here is an example for the while loop.


%% while loop
a=10; b=15;
fprintf('while loop execution: a=%d, b=%d\n',a,b);

while a<=b
fprintf(' The value of a=%d\n',a);
a = a+1;
end
When the code above is executed, the result will be:
while loop execution: a=10, b=15
The value of a=10
The value of a=11
The value of a=12
The value of a=13
The value of a=14
The value of a=15

For loop
A for loop is a repetition control structure that allows you to efficiently
write a loop that needs to execute a specific number of times. The syntax
of a for loop in Matlab is as follows:
for index = values
<program statements>
end
Here values can be any list of numbers. For example:
• initval:endval – increments the index variable from initval to
endval by 1, and repeats execution of program statements while index
is not greater than endval.
• initval:step:endval – increments index by the value step on each
iteration, or decrements when step is negative.

Example 1.16. The code in Example 1.15 can be rewritten as a for loop.
%% for loop
a=10; b=15;
fprintf('for loop execution: a=%d, b=%d\n',a,b);

for i=a:b
fprintf(' The value of i=%d\n',i);
end
When the code above is executed, the result will be:
for loop execution: a=10, b=15
The value of i=10
The value of i=11
The value of i=12
The value of i=13
The value of i=14
The value of i=15

Nested loops
Matlab also allows you to use one loop inside another loop. The syntax for a
nested loop in Matlab is as follows:
for n = n0:n1
for m = m0:m1
<statements>;
end
end
The syntax for a nested while loop statement in Matlab is as follows:
while <expression1>
while <expression2>
<statements>;
end
end
For a nested loop, you can
• combine for loops and while loops, and
• nest more than two loops.
A concrete example is sketched below.
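For instance, a pair of nested for loops that fills a 3-by-4 multiplication table (a
minimal sketch; the array name T is arbitrary):

%% nested for loops: build a 3-by-4 multiplication table
T = zeros(3,4);
for n = 1:3
    for m = 1:4
        T(n,m) = n*m;
    end
end
disp(T)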

Loop Control Statements


Break Statement
The break statement terminates execution of for or while loops.
• Statements in the loop that appear after the break statement are
not executed.
• In nested loops, break exits only from the loop in which it occurs.
• Control passes to the statement following the end of that loop.

Example 1.17. Let’s modify the code in Example 1.15 to involve a break
statement.
%% "break" statement with while loop
a=10; b=15; c=12.5;
fprintf('while loop execution: a=%d, b=%d, c=%g\n',a,b,c);

while a<=b
fprintf(' The value of a=%d\n',a);
if a>c, break; end
a = a+1;
end
When the code above is executed, the result is:
while loop execution: a=10, b=15, c=12.5
The value of a=10
The value of a=11
The value of a=12
The value of a=13
When the condition a>c is satisfied, break is invoked, which stops the while
loop.

Continue Statement
continue passes control to the next iteration of a for or while loop.
• It skips any remaining statements in the body of the loop for the
current iteration; the program continues execution from the next
iteration.
• continue applies only to the body of the loop where it is called.
• In nested loops, continue skips remaining statements only in the
body of the loop in which it occurs.

Example 1.18. Consider a modification of the code in Example 1.16.


%% for loop with "continue"
a=10; b=15;
fprintf('for loop execution: a=%d, b=%d\n',a,b);

for i=a:b
if mod(i,2), continue; end % even integers, only
disp([' The value of i=' num2str(i)]);
end
When the code above is executed, the result is:
for loop execution: a=10, b=15
The value of i=10
The value of i=12
The value of i=14

Note: In the above, mod(i,2) returns the remainder when i is divided


by 2 (so that the result is either 0 or 1). In general,
• mod(a,m) returns the remainder after division of a by m, where a is
the dividend and m is the divisor.
• This mod function is often called the modulo operation.

1.2.3. Anonymous Function


Matlab-code 1.19. In Matlab, one can define an anonymous function,
which is a function that is not stored in a program file.
anonymous_function.m
1 %% Define an anonymous function
2 f = @(x) x.^3-x-2;
3

4 %% Evaluate the function


5 f1 = f(1)
6 X = 1:6;
7 fX = feval(f,X)
8

9 %% Calculus
10 q = integral(f,1,3)

Output
1 >> anonymous_function
2 f1 =
3 -2
4 fX =
5 -2 4 22 58 118 208
6 q =
7 12
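Anonymous functions can also be passed to built-in solvers. For example, with the
same f as above, a root of x^3 − x − 2 = 0 can be located by fzero; a minimal sketch,
where the initial guess 1.5 is arbitrary:

>> f = @(x) x.^3-x-2;
>> r = fzero(f,1.5)
r =
    1.5214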

1.2.4. Open Source Alternatives to Matlab

• Octave is the best-known alternative to Matlab. Octave strives for


exact compatibility, so many of your projects developed for Matlab may
run in Octave with no modification necessary.
• NumPy is the main package for scientific computing with Python. It
can process n-dimensional arrays, complex matrix transforms, linear
algebra, Fourier transforms, and can act as a gateway for C and C++
integration. It provides the fundamental data-array structure for the SciPy
Stack, an ecosystem of Python-based math, science, and engineering
software. Python basics will be considered in Chapter 8, p. 211.

Exercises for Chapter 1

1.1. On Matlab command window, perform the following

• 1:20
• 1:1:20
• 1:2:20
• 1:3:20;
• isprime(12)
• isprime(13)
• for i=3:3:30, fprintf('[i,i^2]=[%d, %d]\n',i,i^2), end
The above is the same as
    for i=3:3:30
        fprintf('[i,i^2]=[%d, %d]\n',i,i^2)
    end
• for i=1:10,if isprime(i),fprintf('prime=%d\n',i);end,end
Rewrite it with linebreaks, rather than using comma (,).

1.2. Compose a code, and write it as a function, for the sum of prime numbers not larger than
a positive integer n.
1.3. Modify the function you made in Exercise 1.2 to count the number of prime numbers
and return the result along with the sum. For multiple outputs, the function may
start with
function [sum, number] = <function_name>(inputs)

1.4. Let, for k, n positive integers,

        S_k = 1 + 2 + · · · + k = Σ_{i=1}^{k} i

and

        T_n = Σ_{k=1}^{n} S_k.

Write a code to find and print out S_n and T_n for n = 1 : 10.
1.5. The golden ratio is the number φ = (1 + √5)/2.
(a) Verify that the golden ratio is the positive solution of x^2 − x − 1 = 0.
(b) Evaluate the golden ratio to 12-digit decimal accuracy.

1.6. The Fibonacci sequence is a series of numbers, defined by

        f_0 = 0, f_1 = 1;   f_n = f_{n−1} + f_{n−2},   n = 2, 3, · · ·        (1.3)

The Fibonacci sequence has interesting properties; two of them are

(i) The ratio r_n = f_n / f_{n−1} approaches the golden ratio, as n increases.

(ii) Let x_1 and x_2 be the two solutions of x^2 − x − 1 = 0:

        x_1 = (1 − √5)/2   and   x_2 = (1 + √5)/2.

Then
        t_n := ((x_2)^n − (x_1)^n) / √5 = f_n,   for all n ≥ 0.        (1.4)

(a) Compose a code to print out the following in a table format.

n fn rn tn

for n ≤ K = 20.
You may start with
Fibonacci_sequence.m
1 K = 20;
2 F = zeros(K,1);
3 F(1)=1; F(2)=F(1);
4

5 for n=3:K
6 F(n) = F(n-1)+F(n-2);
7 rn = F(n)/F(n-1);
8 fprintf("n =%3d; F = %7d; rn = %.12f\n",n,F(n),rn);
9 end

(b) Find n such that rn has 12-digit decimal accuracy to the golden ratio φ.
Ans: (b) n = 32
Chapter 2. Programming Examples

Contents of Chapter 2
2.1. Area Estimation of the Region Defined by a Closed Curve . . . . . . . . . . . . . . . . . 30
2.2. Visualization of Complex-Valued Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3. Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4. Computational Algorithms and Their Convergence . . . . . . . . . . . . . . . . . . . . . 47
2.5. Inverse Functions: Exponentials and Logarithms . . . . . . . . . . . . . . . . . . . . . 53
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62


2.1. Area Estimation of the Region Defined by a Closed Curve

Problem 2.1. It is common in reality that a region is saved by a sequence
of points: For some n > 0,

        (x_0, y_0), (x_1, y_1), · · · , (x_n, y_n),   (x_n, y_n) = (x_0, y_0).        (2.1)

Figure 2.1: A region and its approximation.

Here the question is:


If a sequence of points (2.1) represents a region, how can we compute
its area accurately?

Derivation of Computational Formulas


Example 2.2. Let’s begin with a very simple example.

(a) Find the area of a rectangle [a, b] × [c, d].

Solution. We know the area = (b − a) · (d − c).

It can be rewritten as

        b · (d − c) − a · (d − c) = b · (d − c) + a · (c − d),

from which we may guess that

        Area = Σ_i x_i^* · Δy_i,        (2.2)

where the sum is carried out over the line segments L_i and x_i^* denotes the
mid value of x on L_i.
(b) Find the area of a triangle.
Solution. We know the area = (1/2)(b − a) · (d − c).
Now, let’s try to find the area using the formula (2.2):

        Area = Σ_i x_i^* · Δy_i.

Let L_1, L_2, L_3 be the bottom side, the vertical side,
and the hypotenuse, respectively.
Then

        Area = (a+b)/2 · (c − c) + (b+b)/2 · (d − c) + (b+a)/2 · (c − d)
             = 0 + b · (d − c) + (b+a)/2 · (c − d)
             = (b − (b+a)/2) · (d − c) = (1/2)(b − a) · (d − c).
Okay. The formula is correct!
Note: Horizontal line segments makes no contribution to the area.

(c) Let’s verify the formula once more.

The area of the M-shaped region is 30.

Let’s collect only the nonzero values:

        2 · 3 − 2.5 · 2 + 3.5 · 2 − 4 · 3 + 6 · 6 − 3.5 · 2 + 2.5 · 2
        = 6 − 5 + 7 − 12 + 36 − 7 + 5
        = 30.

Again, the formula is correct!!



Summary 2.3. The above work can be summarized as follows.

• Let a region R be represented as a sequence of points

        (x_0, y_0), (x_1, y_1), · · · , (x_n, y_n),   (x_n, y_n) = (x_0, y_0).        (2.3)

• Let L_i be the i-th line segment connecting (x_{i−1}, y_{i−1}) and (x_i, y_i),
i = 1, 2, · · · , n. Then the area of R can be computed using the formula

        Area(R) = Σ_{i=1}^{n} x_i^* · Δy_i,        (2.4)

where
        x_i^* = (x_{i−1} + x_i)/2,   Δy_i = y_i − y_{i−1}.

Note: The formula (2.4) is a result of Green’s Theorem for the line
integral and numerical approximation.
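Formula (2.4) takes only a few lines of code. Below is a minimal sketch in the spirit
of the function area_closed_curve used in circle.m below; its actual implementation
is left as an exercise, so this is only one possible version.

area_closed_curve.m (a sketch)
function area = area_closed_curve(DATA)
% function area = area_closed_curve(DATA)
%  DATA: (n+1)-by-2 array of points (x_i,y_i), with the last point
%        equal to the first; implements formula (2.4).

X = DATA(:,1); Y = DATA(:,2);
xstar = (X(1:end-1)+X(2:end))/2;    % midpoints x_i^*
dY = Y(2:end)-Y(1:end-1);           % increments Delta y_i
area = sum(xstar.*dY);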

Example 2.4. We will generate a dataset, save it to a file, and read it to


plot and measure the area.

(a) Generate a dataset that represents the circle of radius 1 centered at the
origin. For example, for i = 0, 1, 2, · · · , n,

        (x_i, y_i) = (cos θ_i, sin θ_i),   θ_i = i · (2π/n).        (2.5)

Note that (x_n, y_n) = (x_0, y_0).
(b) Analyze accuracy improvement of the area as n grows. The larger n you
choose, the more accurately the data would represent the region.

Solution.
circle.m
1 n = 10;
2 %%---- Data generation -----------------
3 theta = linspace(0,2*pi,n+1)'; % a column vector
4 data = [cos(theta),sin(theta)];
5

6 %%---- Plot it & Save the image --------


7 figure,
8 plot(data(:,1),data(:,2),'r-','linewidth',2);
9 daspect([1 1 1]); axis tight;
10 xlim([-1 1]), ylim([-1 1]);
11 title(['Circle: n=' int2str(n)])
12 image_name = 'circle.png';
13 saveas(gcf,image_name);
14

15 %%---- Save the data -------------------


16 filename = 'circle-data.txt';
17 csvwrite(filename,data)
18 %writematrix(data,filename,'Delimiter',' ');
19

20 %%======================================
21 %%---- Read the data -------------------
22 %%======================================
23 DATA = load(filename);
24 X = DATA(:,1);
25 Y = DATA(:,2);
26

27 figure,
28 plot(X,Y,'b--','linewidth',2);
29 daspect([1 1 1]); axis tight
30 xlim([-1 1]), ylim([-1 1]);
31 title(['Circle: n=' int2str(n)])
32 yticks(-1:0.5:1)
33 saveas(gcf,'circle-dashed.png');
34

35 %%---- Area computation ----------------


36 area = area_closed_curve(DATA); %See an Exercise problem
37 fprintf('n = %3d; area = %.12f, misfit = %.12f\n', ...
38 size(DATA,1)-1,area, abs(pi-area));

Figure 2.2: An approximation of the unit circle, with n = 10.

Accuracy Improvement
1 n = 10; area = 2.938926261462, misfit = 0.202666392127
2 n = 20; area = 3.090169943749, misfit = 0.051422709840
3 n = 40; area = 3.128689300805, misfit = 0.012903352785
4 n = 80; area = 3.138363829114, misfit = 0.003228824476
5 n = 160; area = 3.140785260725, misfit = 0.000807392864

Note: The misfit becomes about a quarter when the number of points is doubled.

Remark 2.5. From Example 2.4, you learn how to


• Generate datasets
• Save data into a file
• Read a data file
• Set figure environments
• Call functions

2.2. Visualization of Complex-Valued Solutions

Problem 2.6. Seeking the solutions of

        f(x) = x^2 − x + 1 = 0,        (2.6)

we can easily find that the equation has no real solutions. However,
by using the quadratic formula, the complex-valued solutions are

        x = (1 ± √3 i)/2.
Here we have questions:
What do the complex-valued solutions mean?
Can we visualize them?

Remark 2.7. Complex Number System:


The most complete number system is the system of complex numbers:

        C = {x + yi | x, y ∈ R},

where i = √−1, called the imaginary unit.
• Seeking a real-valued solution of f(x) = 0 is the same as finding
a solution of f(z) = 0, z = x + yi, restricted to the x-axis (y = 0).
• If
f (z) = A(x, y) + B(x, y) i, (2.7)
then the complex-valued solutions are the points x + yi such that
A(x, y) = B(x, y) = 0.

Example 2.8. For f(x) = x^2 − x + 1, express f(x + yi) in the form of (2.7).

Solution.

Ans: f(z) = (x^2 − x − y^2 + 1) + (2x − 1)y i

Example 2.9. Implement a code to visualize complex-valued solutions of
f(x) = x^2 − x + 1 = 0.
Solution. From Example 2.8,

        f(z) = A(x, y) + B(x, y) i,   A(x, y) = x^2 − x − y^2 + 1,   B(x, y) = (2x − 1)y.
visualize_complex_solution.m
1 close all
2 if exist('OCTAVE_VERSION','builtin'), pkg load symbolic; end
3

4 syms x y real
5

6 %% z^2 -z +1 = 0
7 A = @(x,y) x.^2-x-y.^2+1;
8 B = @(x,y) (2*x-1).*y;
9 T = 'z^2-z+1=0';
10

11 figure, % Matlab 'fmesh' is not yet in Octave


12 np=41; X=linspace(-5,5,np); Y=linspace(-5,5,np);
13 mesh(X,Y,A(X,Y'),'EdgeColor','r'), hold on
14 mesh(X,Y,B(X,Y'),'EdgeColor','b'),
15 mesh(X,Y,zeros(np,np),'EdgeColor','k'),
16 legend("A","B","0"),
17 xlabel('x'), ylabel('y'), title(['A and B for ' T])
18 hold off
19 print -dpng 'complex-solutions-A-B-fmesh.png'
20

21 %%--- Solve A=0 and B=0 --------------


22 [xs,ys] = solve(A(x,y)==0,B(x,y)==0,x,y);
23

24 figure,
25 np=101; X=linspace(-5,5,np); Y=linspace(-5,5,np);
26 contour(X,Y,A(X,Y'), [0 0],'r','linewidth',2), hold on
27 contour(X,Y,B(X,Y'), [0 0],'b--','linewidth',2)
28 plot(double(xs),double(ys),'r.','MarkerSize',30) % the solutions
29 grid on
30 %ax=gca; ax.GridAlpha=0.5; ax.FontSize=13;
31 legend("A=0","B=0")
32 xlabel('x'), ylabel('yi'), title(['Compex solutions of ' T])
33 hold off
34 print -dpng 'complex-solutions-A-B=0.png'

Figure 2.3: The two solutions are 1/2 − (√3/2) i and 1/2 + (√3/2) i.

Remark 2.10. You can easily find the real part and the imaginary
part of polynomials of z = x + iy as follows.
Real and Imaginary Parts
1 syms x y real
2 z = x + 1i*y;
3

4 g = z^2 -z +1;
5 simplify(real(g))
6 simplify(imag(g))

Here “1i” (the number 1 and the letter i), which appears in Line 2, means the imaginary
unit i = √−1.
Output
1 ans = x^2 - x - y^2 + 1
2 ans = y*(2*x - 1)

Summary 2.11. Visualization of Complex-Valued Solutions


Seeking a real-valued solution of f(x) = 0 is the same as finding a
solution of f(z) = 0, z = x + yi, restricted to the x-axis (y = 0).

2.3. Discrete Fourier Transform


Note: In spectral analysis of audio data, our goal is to determine the
frequency content of a signal.
• For analog signals: Use Fourier transform
• For digital signals: Use discrete Fourier transform

Definition 2.12. Fourier Transform

• For a function x(t), a continuous signal, the Fourier transform is
defined as

        X(ω) = ∫_{−∞}^{∞} x(t) e^{−iωt} dt,        (2.8)

where ω = 2πf is the angular frequency and f is the frequency.
• The inverse Fourier transform is defined as

        x(t) = (1/(2π)) ∫_{−∞}^{∞} X(ω) e^{iωt} dω.        (2.9)

The identity

        e^{iθ} = cos θ + i sin θ,   θ ∈ R,        (2.10)

is called Euler’s identity; see Exercise 3.3 on p. 109.

2.3.1. Discrete Fourier Transform


Definition 2.13. Discrete Fourier Transform (DFT)
For a discrete signal {x(nΔt)}, the discrete Fourier transform is defined
as

        X(kΔf) = Σ_{n=0}^{N−1} x(nΔt) e^{−i(2πkΔf)(nΔt)},   k = 0, 1, 2, · · · , N − 1,        (2.11)

where
• N = the total number of discrete data points taken
• T = the total sampling time (second)
• ∆t = time between data points, ∆t = T /N
• ∆f = the frequency increment (frequency resolution)
∆f = 1/T (Hz)
• fs = the sampling frequency (per second), fs = 1/∆t = N/T .
Note: (k∆f )(n∆t) = kn/N

Definition 2.14. Inverse Discrete Fourier Transform (IDFT)


The inverse discrete Fourier transform of X is defined as

        x(nΔt) = (1/N) Σ_{k=0}^{N−1} X(kΔf) e^{i(2πkΔf)(nΔt)},   n = 0, 1, 2, · · · , N − 1,        (2.12)

Remark 2.15. Fast Fourier Transform (FFT)


The fast Fourier transform is simply a DFT that is fast.
• All the rules and details about DFTs apply to FFTs as well.
• Power-of-2 Restriction
– Many FFTs (e.g., in Microsoft Excel) restrict N to a power of 2,
such as 64, 128, 256, and so on.
– The FFT in Matlab has no such restriction.
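Matlab’s built-in fft can be used to cross-check the hand-written DFT routine of
Example 2.16 below; a small sanity-check sketch (any test signal will do):

x = rand(1,8);                                 % a random test signal
err = max(abs(discrete_Fourier(x) - fft(x)))   % should be near machine epsilon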

Example 2.16. Generate a synthetic signal for T = 4 seconds, at a sampling
rate of fs = 100 Hz. Then, compute its DFT and restore the original
signal by applying the IDFT.
Solution. Let’s begin with the DFT and the IDFT.
discrete_Fourier.m
1 function X = discrete_Fourier(x)
2 % function X = discrete_Fourier(x)
3 % Calculate the full DFT
4

5 N = length(x); % Length of input sequence


6 X = zeros(1,N); % Initialize output sequence
7

8 % (k*Df)*n*Dt = (k/T)*n*(T/N) = k*n/N


9 for k = 0:N-1 % Loop over all frequency components
10 for n = 0:N-1 % Loop over all time-domain samples
11 X(k+1) = X(k+1) + x(n+1)*exp(-1i*2*pi*k*n/N);
12 end
13 end

discrete_Fourier_inverse.m
1 function X = discrete_Fourier_inverse(x)
2 % function X = discrete_Fourier_inverse(x)
3 % Calculate the inverse DFT
4

5 N = length(x); % Length of input sequence


6 X = zeros(1,N); % Initialize output sequence
7

8 % (k*Df)*n*Dt = (k/T)*n*(T/N) = k*n/N


9 for n = 0:N-1 % Loop over all time-domain samples
10 for k = 0:N-1 % Loop over all frequency components
11 X(n+1) = X(n+1) + x(k+1)*exp(1i*2*pi*k*n/N);
12 end
13 end
14

15 X = X/N;

For a synthetic signal, we combine two sinusoids, with frequencies
f = 10, 20 Hz and magnitudes 1, 2.
signal_DFT.m
1 close all
2

3 T = 4; fs = 100;
4 t = 0:1/fs:T-1/fs; % Time vector
5 x = sin(2*pi*10*t) + 2*sin(2*pi*20*t); % Signal
6

7 figure, plot(t,x,'-k')
8 print -dpng 'dft-data-signal.png'
9

10 %-------------------------------------
11 X = discrete_Fourier(x);
12

13 % Compute magnitude and phase


14 N = length(X);
15 mag_X = zeros(1,N); % Initialize magnitude
16 phi_X = zeros(1,N); % Initialize phase
17 for k = 1:N
18 mag_X(k) = sqrt(real(X(k))^2 + imag(X(k))^2);
19 phi_X(k) = atan2(imag(X(k)), real(X(k)));
20 end
21

22 %-------------------------------------
23 x_restored = discrete_Fourier_inverse(X);
24 misfit = max(abs(x-x_restored))
25

26 % Plots for Spectra


27 %-------------------------------------
28 f = (0:N-1)*fs/N; % Frequency vector
29 figure, subplot(2,1,1);
30 plot(f, mag_X,'-r','linewidth',1.5);
31 xlabel('Frequency (Hz)'); ylabel('Magnitude'); title('Magnitude Spectrum');
32 subplot(2,1,2);
33 plot(f, phi_X,'-b');
34 xlabel('Frequency (Hz)'); ylabel('Phase (rad)'); title('Phase Spectrum');
35 print -dpng 'dft-magnitude-phase.png'

Output
1 misfit = 2.6083e-13

Figure 2.4: A synthetic signal and its spectra.



Remark 2.17. Nyquist Criterion


• The Nyquist criterion is important in DFT analysis.
– When sampling at frequency fs , we can obtain reliable
frequency information only for frequencies less than
fs /2. (Here, reliable means without aliasing problems.)
• We can calculate at what value of k the frequency kΔf = fs/2:

        k = (fs/2)/Δf = ((N/T)/2)/(1/T) = N/2.        (2.13)

• Therefore, we conclude that the maximum useful frequency fmax
from a DFT output (also called the folding frequency f_folding) is

        fmax = f_folding = fs/2 = (N/2) Δf.        (2.14)
In other words, only half of the N available DFT output values are
useful – for k = 0:N/2-1.

Example 2.18. Suppose we sample a signal for T = 4 seconds, at a sampling
rate of fs = 100 Hz.

(a) How many data points are taken?


N = T fs = 4 · 100 = 400
(b) How many useful DFT output values are obtained?
N/2 = 200
(c) What is ∆f ?
∆f = 1/T = 1/4 = 0.25 Hz
(d) What is the maximum frequency for which the DFT output is useful
and reliable?
fmax = (N/2)∆f = (400/2) · 0.25 = 50 Hz
Note: The other half of the output values (f > ffolding ) are thrown out or
ignored.
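The quantities in Example 2.18 are easy to check with a few lines of code. Below is a minimal Python sketch (the file and variable names are ours, not from the Matlab listings) that reproduces N, ∆f, and fmax for T = 4 s and fs = 100 Hz.

nyquist_check.py
T, fs = 4, 100             # duration (s) and sampling rate (Hz), as in Example 2.18

N = T*fs                   # number of samples taken
Df = 1/T                   # frequency resolution Delta f (Hz)
f_max = (N/2)*Df           # folding (Nyquist) frequency (Hz)

print("N      =", N)       # 400
print("useful =", N//2)    # 200 useful DFT output values
print("Df     =", Df)      # 0.25 Hz
print("f_max  =", f_max)   # 50.0 Hz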

2.3.2. Short-Time Fourier Transform


The short-time Fourier transform (STFT) is used to analyze how
the frequency content of a signal changes over time.
• The procedure for computing the STFT:
(a) Divide the signal into short segments of equal length.
(b) Compute the DFT separately on each short segment.
• The magnitude squared of the STFT is known as the spectrogram,
a time-frequency representation of the signal.

Remark 2.19. Short-Time Fourier Transform.


• It requires setting
– An analysis window g(n) of length M
– R: The window hops over the original signal by R samples.
⇒ The overlap L = M − R
• Most window functions taper off at the edges to avoid spectral ring-
ing.
• The DFT of each windowed segment is stored into a complex-valued matrix of ⌊(Ns − L)/(M − L)⌋ columns, where Ns is the length of the original signal.

Figure 2.5: iscola_stft.png from Matlab.



Example 2.20. Implement a code for the STFT.


Solution.

We first select an analysis window.


win_cos.m
1 function g = win_cos(M)
2 % function g = win_cos(M)
3

4 t = linspace(0,2*pi,M);
5 g = 0.55-0.45*cos(t);

short_time_DFT.m
1 close all; clear all
2

3 fs=10000; %sampling frequency in Hz


4 T0=1;
5 t0=0:1/fs:T0-1/fs; % Time vector
6 x0=sin((100+3900*t0/2).*(2*pi*t0)); % a chirp signal
7

8 %-------------------------------------------------
9 x = [x0,x0,x0]; %t = [t0,t0+T0,t0+2*T0];
10

11 %-------------------------------------------------
12 M = 150; % window length
13 R = 60; % sliding length
14

15 g = win_cos(M);
16 F = stft2(x,g,R);
17 S = abs(F).^2; % spectrogram
18

19 %-- Plots ----------------------------------------


20 figure,plot(t0(1:1000), x0(1:1000),'-k','linewidth',1.5)
21 title('First 1000 Samples of the Chirp Signal')
22 print -dpng 'stft-chirp-signal.png'
23

24 figure,plot(1:M,g,'-b','linewidth',1.5);
25 ylim([0,1]); title('win\_cos, a window function')
26 print -dpng 'stft-window-g.png'
27

28 Df = fs/M; Dt = R/fs;
29 figure, imagesc((0:size(S,2)-1)*Dt,(0:M-1)*Df,S)
30 xlabel('Time (second)'); ylabel('Frequency (Hz)'); title('Spectrogram');
31 colormap('pink'); colorbar; set(gca,'YDir','normal')
32 print -dpng 'stft-spectrogram.png'

stft2.m
1 function F = stft2(x,g,R)
2 % function F = stft2(x,g,R)
3 % x: the signal
4 % g: the window function of length M
5 % R: sliding length
6 % Output: F = the short-time DFT
7

8 Ns = length(x); M = length(g);
9 L = M-R; % overlap
10

11 Col = floor((Ns-L)/(M-L));
12 F = zeros(M,Col); c = 1;
13

14 while 1
15 if c==1, n0=1; else, n0=n0+R; end
16 n1=n0+M-1;
17 if n1>Ns, break; end
18 signal = x(n0:n1).*g;
19 F(:,c) = discrete_Fourier(signal)'; c=c+1;
20 end

Figure 2.6: The first 1000 samples of the chirp signal and the spectrogram from the STFT.
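For readers following along in Python, a minimal sketch of the same windowed-DFT idea is given below. It uses NumPy's fft in place of the nested-loop discrete_Fourier, and the file, function, and variable names are ours, not part of the Matlab listings.

stft_sketch.py
import numpy as np

def stft_frames(x, g, R):
    """Short-time DFT: slide a window g over x by R samples and
    store the DFT of each windowed segment as a column of F."""
    Ns, M = len(x), len(g)
    L = M - R                      # overlap
    cols = (Ns - L) // (M - L)     # number of full segments
    F = np.zeros((M, cols), dtype=complex)
    for c in range(cols):
        n0 = c*R
        F[:, c] = np.fft.fft(x[n0:n0+M] * g)
    return F

# Small demonstration on a chirp-like signal (values chosen for illustration).
fs = 10000
t = np.arange(0, 1, 1/fs)
x = np.sin((100 + 3900*t/2) * (2*np.pi*t))
g = 0.55 - 0.45*np.cos(np.linspace(0, 2*np.pi, 150))   # tapered window
S = np.abs(stft_frames(x, g, 60))**2                   # spectrogram values
print(S.shape)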

2.4. Computational Algorithms and Their Convergence
Definition 2.21. Suppose that p∗ is an approximation to p. Then

• The absolute error is |p − p∗|, and
• the relative error is |p − p∗|/|p|, provided that p ≠ 0.

2.4.1. Computational Algorithms


Definition 2.22. An algorithm is a procedure that describes, in an
unambiguous manner, a finite sequence of steps to be carried out in a
specific order.

Algorithms consist of various steps for inputs, outputs, and functional oper-
ations, which can be described effectively by a so-called pseudocode.

Definition 2.23. An algorithm is called stable, if small changes in the


initial data produce correspondingly small changes in the final results.
Otherwise, it is called unstable. Some algorithms are stable only for
certain choices of data/parameters, and are called conditionally stable.

Notation 2.24. (Growth rates of the error): Suppose that E0 > 0


denotes an error introduced at some stage in the computation and En
represents the magnitude of the error after n subsequent operations.
• If En = C × n E0 , where C is a constant independent of n, then the
growth of error is said to be linear, for which the algorithm is stable.
• If En = C n E0 , for some C > 1, then the growth of error is exponential,
which turns out unstable.
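A tiny numerical illustration (our own, with arbitrarily chosen constants) of the two growth patterns: with E0 = 10^-6, linear growth En = C·n·E0 stays small, while exponential growth En = C^n·E0 blows up quickly.

error_growth.py
E0 = 1.0e-6
C = 2.0                            # an illustrative constant

for n in (10, 20, 30):
    E_linear = C * n * E0          # linear growth: stable
    E_expon  = C**n * E0           # exponential growth: unstable
    print(f"n={n:2d}  linear={E_linear:.2e}  exponential={E_expon:.2e}")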

Rates (Orders) of Convergence


Definition 2.25. Let {xn } be a sequence of real numbers tending to a
limit x∗ .
• The rate of convergence is at least linear if there are a constant c1 < 1
and an integer N such that
|xn+1 − x∗ | ≤ c1 |xn − x∗ |, ∀ n ≥ N. (2.15)

• We say that the rate of convergence is at least superlinear if there


exist a sequence εn tending to 0 and an integer N such that
|xn+1 − x∗ | ≤ εn |xn − x∗ |, ∀ n ≥ N. (2.16)

• The rate of convergence is at least quadratic if there exist a constant


C (not necessarily less than 1) and an integer N such that
|xn+1 − x∗ | ≤ C|xn − x∗ |2 , ∀ n ≥ N. (2.17)

• In general, we say that the rate of convergence is at least of order α if there


exist a constant C (not necessarily less than 1 for α > 1) and an integer
N such that
|xn+1 − x∗ | ≤ C|xn − x∗ |α , ∀ n ≥ N. (2.18)

Example 2.26. Consider a sequence defined recursively as
    x_1 = 2,  x_{n+1} = x_n/2 + 1/x_n.    (2.19)
(a) Find the limit of the sequence; (b) show that the convergence is quadratic.
Hint: You may first check the behavior of the sequence. Then prove its convergence, by verifying x_n > √2 for all n ≥ 1 (∵ x_{n+1}^2 − 2 > 0) and x_{n+1} < x_n (∵ x_n − x_{n+1} = x_n(1/2 − 1/x_n^2) > 0).

Solution.
sequence_sqrt2.m
1 x = 2;
2 for n=1:5
3 x = x/2 + 1/x;
4 fprintf('n=%d: xn = %.10f\n',n,x)
5 end

Output
1 n=1: xn = 1.5000000000
2 n=2: xn = 1.4166666667
3 n=3: xn = 1.4142156863
4 n=4: xn = 1.4142135624
5 n=5: xn = 1.4142135624

It looks monotonically decreasing and bounded below ⇒ Converge!
(It converges to √2 ≈ 1.41421356237310.)
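A short Python check (our own, mirroring sequence_sqrt2.m) makes the quadratic convergence visible: the error is roughly squared at each step, and the ratio |x_{n+1} − √2| / |x_n − √2|^2 settles near 1/(2√2) ≈ 0.354 until rounding takes over.

sequence_sqrt2_check.py
import math

x, root = 2.0, math.sqrt(2.0)
for n in range(1, 6):
    e_old = abs(x - root)
    x = x/2 + 1/x
    e_new = abs(x - root)
    if e_old > 0:
        # the ratio e_new / e_old^2 approaches a constant (~0.354)
        print(f"n={n}  x={x:.12f}  e={e_new:.3e}  e/e_old^2={e_new/e_old**2:.3f}")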

2.4.2. Big O and little o notation


Definition 2.27.

• A sequence {αn}_{n=1}^∞ is said to be in O (big Oh) of {βn}_{n=1}^∞ if a positive number K exists for which
    |αn| ≤ K|βn|, for large n  (or equivalently, |αn|/|βn| ≤ K).    (2.20)
  In this case, we say “αn is in O(βn)” and denote αn ∈ O(βn) or αn = O(βn).
• A sequence {αn} is said to be in o (little oh) of {βn} if there exists a sequence εn tending to 0 such that
    |αn| ≤ εn|βn|, for large n  (or equivalently, lim_{n→∞} |αn|/|βn| = 0).    (2.21)
  In this case, we say “αn is in o(βn)” and denote αn ∈ o(βn) or αn = o(βn).
Example 2.28. Show that αn = (n + 1)/n^2 = O(1/n) and
    f(n) = (n + 3)/(n^3 + 20n^2) ∈ O(n^{−2}) ∩ o(n^{−1}).
Solution.

Definition 2.29. Suppose lim_{h→0} G(h) = 0. A quantity F(h) is said to be in O (big Oh) of G(h) if a positive number K exists for which
    |F(h)|/|G(h)| ≤ K, for h sufficiently small.    (2.22)
In this case, we say F(h) is in O(G(h)), and denote F(h) ∈ O(G(h)). Little oh of G(h) can be defined the same way as for sequences.

Example 2.30. Taylor’s series expansion for cos(x) is given as
    cos(x) = 1 − (1/2!)x^2 + (1/4!)x^4 − (1/6!)x^6 + ···
           = 1 − (1/2)x^2 + (1/24)x^4 − (1/720)x^6 + ··· .
If you use a computer algebra software (e.g. Maple), you will obtain
    taylor(cos(x), x = 0, 4) = 1 − (1/2!)x^2 + O(x^4),
which implies that
    F(x) := (1/24)x^4 − (1/720)x^6 + ··· = O(x^4).    (2.23)
Indeed,
    |F(x)|/|x^4| = |1/24 − (1/720)x^2 + ···| ≤ 1/24, for sufficiently small x.    (2.24)
Thus F(x) ∈ O(x^4).
Example 2.31. Choose the correct assertions (in each, n → ∞)
a. (n^2 + 1)/n^3 ∈ o(1/n)
b. (n + 1)/√n ∈ o(1)
c. 1/ln n ∈ O(1/n)
d. 1/(n ln n) ∈ o(1/n)
e. e^n/n^5 ∈ O(1/n)

Example 2.32. Let f(h) = (1/h)(1 + h − e^h). What are the limit and the rate of convergence of f(h) as h → 0?
Solution.

Self-study 2.33. Show that these assertions are not true.

a. ex − 1 = O(x2 ), as x → 0
b. x = O(tan−1 x), as x → 0
c. sin x cos x = o(x), as x → 0

Solution.

2.5. Inverse Functions: Exponentials and Logarithms
In-Reality 2.34. A function f is a rule that assigns an output y
to each input x: f (x) = y. Thus a function is a set of actions that de-
termines the system. However, in reality, it is often the case that the
equation must be solved for either the input or the function.
1. Given (f, x), getting y is the simplest and most common task.
2. Given (f, y), solving for x is to find the inverse function of f .
3. Given (x, y), solving for f is not a simple task in practice.
• Using many data points {(xi , yi )}, finding an approximation of
f is the core subject of polynomial interpolation (§ 3.3), re-
gression analysis (Ch. 7), and machine learning (Ch. 10).

Key Idea 2.35. What is the Inverse of a Function?


Let f : X → Y be a function. For simplicity, consider
y = f (x) = 2x + 1. (2.25)

• Then, f is a rule that performs two actions: ×2 and followed by +1.


• The reverse of f must be: −1 followed by ÷2.
– Let y ∈ Y . Then the reverse of f can be written as
x = (y − 1)/2 =: g(y) (2.26)

The function g is the inverse function of f .


– However, it is conventional to choose x for the independent vari-
able. Thus it can be formulated as
y = g(x) = (x − 1)/2. (2.27)

• Let’s summarize the above:


(a) Solve y = f (x) for x: x = (y − 1)/2 =: g(y).
(b) Exchange x and y: y = g(x) = (x − 1)/2.
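A two-line check (our own) of the pair f(x) = 2x + 1 and g(x) = (x − 1)/2: composing the two rules in either order returns the input, which is exactly what it means for g to be the inverse of f.

inverse_check.py
f = lambda x: 2*x + 1          # the rule: times 2, then plus 1
g = lambda x: (x - 1)/2        # the reverse: minus 1, then divide by 2

for x in (-3.0, 0.0, 2.5, 10.0):
    assert g(f(x)) == x and f(g(x)) == x
print("g(f(x)) = x and f(g(x)) = x for all test values")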

2.5.1. Inverse functions


Note: The first step for finding the inverse function of f is to solve y =
f (x) for x, to get x = g(y). Here the required is for g to be a function.

Definition 2.36. A function f is called a one-to-one function if it


never takes on the same value twice; that is,

    f(x1) ≠ f(x2) whenever x1 ≠ x2.    (2.28)

Claim 2.37. Horizontal Line Test.


A function is one-to-one if and only if no horizontal line intersects its
graph more than once.

Example 2.38. Check if the function is one-to-one.

a. f (x) = x2 b. g(x) = x2 , x ≥ 0

c. h(x) = x3

Solution.

Definition 2.39. Let f be a one-to-one function with domain X and


range Y . Then its inverse function f −1 has domain Y and range X and
is defined by
f −1 (y) = x ⇔ f (x) = y, (2.29)
for any y ∈ Y .

Solution. Write y = x^3 + 2.
Step 1: Solve it for x:
    x^3 = y − 2  ⇒  x = ∛(y − 2).
Step 2: Exchange x and y:
    y = ∛(x − 2).
Therefore the inverse function is
    f^{−1}(x) = ∛(x − 2).

Exponential Functions
Definition 2.40. A function of the form

    f(x) = a^x, where a > 0 and a ≠ 1,    (2.30)

is called an exponential function (with base a).


• All exponential functions have domain (−∞, ∞) and range (0, ∞), so
an exponential function never assumes the value 0.
• All exponential functions are either increasing (a > 1) or decreasing
(0 < a < 1) over the whole domain.

Figure 2.7: Exponential functions.



Example 2.41. Table 2.1 shows data for the population of the world
in the 20th century. Figure 2.8 shows the corresponding scatter plot.
• The pattern of the data points suggests an exponential growth.
• Use an exponential regression algorithm to find a model of the form
    P(t) = a · b^t,    (2.31)

where t = 0 corresponds to 1900.


Table 2.1: World population in the 20th century.

    t (years since 1900)    Population P (millions)
      0                       1650
     10                       1750
     20                       1860
     30                       2070
     40                       2300
     50                       2560
     60                       3040
     70                       3710
     80                       4450
     90                       5280
    100                       6080
    110                       6870

Figure 2.8: Scatter plot for world population growth.

Remark 2.42. The exponential regression (2.31) can be rewritten as


    ln P = ln(a · b^t) = ln a + t ln b = α + tβ.    (2.32)

One can find the parameters (α, β) which fit best the following:

    α +   0·β = ln 1650
    α +  10·β = ln 1750
    α +  20·β = ln 1860       ⇒   A [α; β] = r    (2.33)
       ...
    α + 110·β = ln 6870

Then recover (a, b): a = e^α, b = e^β.



Solution. We will see details of the exponential regression later.


population.m
1 Data =[0 1650; 10 1750; 20 1860; 30 2070;
2 40 2300; 50 2560; 60 3040; 70 3710;
3 80 4450; 90 5280; 100 6080; 110 6870];
4 m = size(Data,1);
5

6 % exponential model, through linearization


7 A = ones(m,2);
8 A(:,2) = Data(:,1);
9 r = log(Data(:,2));
10 p = (A'*A)\(A'*r);
11 a = exp(p(1)), b = exp(p(2)),
12

13 plot(Data(:,1),Data(:,2),'k.','MarkerSize',20)
14 xlabel('Years since 1900');
15 ylabel('Millions'); hold on
16 print -dpng 'population-data.png'
17 t = Data(:,1);
18 plot(t,a*b.^t,'r-','LineWidth',2)
19 print -dpng 'population-regression.png'
20 hold off

The program results in

    a = 1.4365 × 10^3,  b = 1.0140.

Thus the exponential model reads

    P(t) = (1.4365 × 10^9) · (1.0140)^t.    (2.34)

Figure 2.9 shows the graph of this exponential function together with the original data points. We see that the exponential curve fits the data reasonably well.

Figure 2.9: Exponential model for world population growth.
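For reference, a NumPy version of the same linearized fit is sketched below; it is our own translation of population.m, with np.linalg.lstsq replacing the normal-equation solve, and it should reproduce values close to a ≈ 1.4365 × 10^3 and b ≈ 1.0140 reported above.

population_numpy.py
import numpy as np

t = np.arange(0, 120, 10)                      # years since 1900
P = np.array([1650, 1750, 1860, 2070, 2300, 2560,
              3040, 3710, 4450, 5280, 6080, 6870], float)

# Fit ln P = alpha + beta*t in the least-squares sense.
A = np.column_stack([np.ones_like(t, dtype=float), t])
r = np.log(P)
(alpha, beta), *_ = np.linalg.lstsq(A, r, rcond=None)

a, b = np.exp(alpha), np.exp(beta)             # recover P(t) = a*b**t
print(f"a = {a:.4f} (millions), b = {b:.4f}")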

The Number e
Of all possible bases for an exponential function, there is one that is
most convenient for the purposes of calculus. The choice of a base a is
influenced by the way the graph of y = ax crosses the y-axis.
• Some of the formulas of calculus will be greatly simplified, if we
choose the base a so that the slope of the tangent line to y = ax
at x = 0 is exactly 1.
• In fact, there is such a number and it is denoted by the letter e.
(This notation was chosen by the Swiss mathematician Leonhard
Euler in 1727, probably standing for exponential.)
• It turns out that the number e lies between 2 and 3:

e ≈ 2.718282 (2.35)

Figure 2.10: The number e.

Remark 2.43. The Euler’s number e as a Limit


It can be calculated as the limit

    e = lim_{x→0} (1 + x)^{1/x}.    (2.36)

e_limit.m
1 % An increasing sequence
2
3 for n=1:8
4     x=1/10^n;
5     en = (1+x)^(1/x);
6     fprintf('e_%d = %.10f\n',n,en)
7 end

Output
1 e_1 = 2.5937424601
2 e_2 = 2.7048138294
3 e_3 = 2.7169239322
4 e_4 = 2.7181459268
5 e_5 = 2.7182682372
6 e_6 = 2.7182804691
7 e_7 = 2.7182816941
8 e_8 = 2.7182817983

2.5.2. Logarithmic Functions

Recall: The exponential function f (x) = ax is either increasing (a > 1)


or decreasing (0 < a < 1).
• It is one-to-one by the Horizontal Line Test.
• Therefore it has its inverse.

Definition 2.44. The logarithmic function with base a, written y = log_a x, is the inverse of y = a^x (a > 0, a ≠ 1). That is,
    y = log_a x  ⇔  a^y = x.    (2.37)

Example 2.45. Find the inverse of y = 2x .

Solution.

1. Solve y = 2x for x:
x = log2 y

2. Exchange x and y:
y = log2 x

Thus the graph of y = log2 x must be


the reflection of the graph of y = 2x Figure 2.11: Graphs of y = 2x and y = log2 x.
about y = x.

Note:
• Equation (2.37) represents the action of “solving for x”
• The domain of y = loga x must be the range of y = ax , which is (0, ∞).

The Natural Logarithm and the Common Logarithm


Of all possible bases a for logarithms, we will see later that the most conve-
nient choice of a base is the number e.
Definition 2.46.

• The logarithm with base e is called the natural logarithm and has
a special notation:
loge x = ln x (2.38)

• The logarithm with base 10 is called the common logarithm and


has a special notation:

log10 x = log x (2.39)

Remark 2.47.

• From your calculator, you can see buttons of LN and LOG , which
represent ln = loge and log = log10 , respectively.
• When you implement a code on computers, the functions ln and
log can be called by “log” and “log10”, respectively.

Properties of Logarithms

• Algebraic Properties: for (a > 0, a ≠ 1)
    Product Rule:     log_a(xy)  = log_a x + log_a y
    Quotient Rule:    log_a(x/y) = log_a x − log_a y
    Power Rule:       log_a(x^α) = α log_a x                (2.40)
    Reciprocal Rule:  log_a(1/x) = −log_a x
• Inverse Properties
    a^{log_a x} = x, x > 0;    log_a(a^x) = x, x ∈ R
    e^{ln x} = x,  x > 0;      ln(e^x) = x,  x ∈ R          (2.41)

Example 2.48. Solve for x.

(a) e5−3x = 3.
(b) log3 x + log3 (x − 2) = 1
(c) ln(ln x) = 0

Solution.

Ans: (a) x = (5 − ln 3)/3. (b) x = 3. (Caution: x = −1 cannot be a solution.)

Claim 2.49.
(a) Every exponential function is a power of the natural exponential function.
    a^x = e^{x ln a}.    (2.42)
(b) Every logarithmic function is a constant multiple of the natural logarithm.
    log_a x = ln x / ln a,  (a > 0, a ≠ 1)    (2.43)
which is called the Change-of-Base Formula.

Proof. (a). a^x = e^{ln(a^x)} = e^{x ln a}.
(b). ln x = ln(a^{log_a x}) = (log_a x)(ln a), from which one can get (2.43).

Remark 2.50. Based on Claim 2.49, all exponential and logarithmic


functions can be evaluated by the natural exponential function and the
natural logarithmic function; which are named “exp()” and “log()”, in
Matlab.
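A quick numerical check (our own) of (2.42) and (2.43) in Python, using only the natural exponential and the natural logarithm, in the spirit of Remark 2.50:

change_of_base_check.py
import math

a, x = 3.0, 7.5
print(a**x, math.exp(x*math.log(a)))            # a^x      vs  e^{x ln a}
print(math.log(x, a), math.log(x)/math.log(a))  # log_a x  vs  ln x / ln a
print(math.log10(100.0), math.log(100.0)/math.log(10.0))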

Note: You will work on a project, Canny Edge Detection Algorithm For
Color Images, while you are studying the next chapter.

Exercises for Chapter 2

2.1. Download a dataset saved in heart-data.txt:


https://skim.math.msstate.edu/LectureNotes/data/heart-data.txt
(a) Draw a figure for it.
(b) Use the formula (2.4) to find the area.

Hint : You may use the following. You should finish the function area_closed_curve.
Note that the index in Matlab arrays begins with 1, not 0.
heart.m
1 DATA = load('heart-data.txt');
2

3 X = DATA(:,1); Y = DATA(:,2);
4 figure, plot(X,Y,'r-','linewidth',2);
5

6 [m,n] = size(DATA);
7 area = area_closed_curve(DATA);
8

9 fprintf('# of points = %d; area = %g\n',m,area);

area_closed_curve.m
1 function area = area_closed_curve(data)
2 % compute the area of a region of closed curve
3

4 [m,n] = size(data);
5 area = 0;
6

7 for i=2:m
8 %FILL HERE APPROPRIATELY
9 end

Ans: (b) 9.41652.


2.2. Function f(x) = x^3 − 2x^2 + x − 2 has two complex-valued zeros and a real zero. Imple-
ment a code to visualize all the solutions in the complex coordinates.
Hint : Find the real and imaginary parts of f (z) as in Remark 2.10.
2.3. Either produce or download a sound file and then compute its spectrogram. For a wav
file, you may try https://www2.cs.uic.edu/∼i101/SoundFiles/ or
https://voiceage.com/Audio-Samples-AMR-WB.html.
Hint : Assume you have got StarWars3.wav. Then you may start with the following.
real_STFT.m
1 filename = 'StarWars3.wav';
2 [x,fs] = audioread(filename);
3

4 gong = audioplayer(x,fs);
5 play(gong)

The above works on both Matlab and Octave.


2.4. For the pair (xn, αn), is it true that xn = O(αn) as n → ∞?
    (a) xn = √(n^2 − 10);    αn = √n
    (b) xn = 3n − n^4 + 1;   αn = n^3
    (c) xn = n − 1/√n + 1;   αn = √n
    (d) xn = n^2 + n;        αn = n^3

2.5. The population of Starkville, Mississippi, was 2,689 in the year 1900 and 25,495 in
2020. Assume that the population in Starkville grows exponentially with the model

Pn = P0 · (1 + r)n , (2.44)

where n is the elapsed year and r denotes the growth rate per year.

(a) Find the growth rate r.


(b) Estimate the population in 1950 and 2000.
(c) Approximately when is the population going to reach 50,000?

Hint : Applying the natural log to (2.44) reads log(Pn /P0 ) = n log(1 + r). Dividing it
by n and applying the natural exponential function gives 1 + r = exp(log(Pn /P0 )/n),
where Pn = 25495, P0 = 2689, and n = 120.
Ans: (a) r = 0.018921(= 1.8921%). (c) 2056.
Chapter 3
Programming with Calculus

In modern scientific computing (particularly, for AI), calculus and linear


algebra play crucial roles.
• In this chapter, you will learn certain selected topics in calculus
which are essential for advanced computing tasks.
• We will consider basic concepts and their applications as well.

Contents of Chapter 3
3.1. Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2. Basis Functions and Taylor Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3. Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4. Numerical Differentiation: Finite Difference Formulas . . . . . . . . . . . . . . . . . . 93
3.5. Newton’s Method for the Solution of Nonlinear Equations . . . . . . . . . . . . . . . . 99
3.6. Zeros of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


3.1. Differentiation
3.1.1. The Slope of the Tangent Line

Problem 3.1. A function y = f (x) can be graphed as a curve.


• In many applications, the tangent line plays a crucial role for the
computation of approximate solutions.
• Here are questions:
– What is the tangent line?
– How can we find it?

Average Speed and Instantaneous Speed


When f (t) measures the distance traveled at time t,
• Average Speed.
    Average speed over [t0, t0 + h] = (distance traveled)/(elapsed time) = [f(t0 + h) − f(t0)] / [(t0 + h) − t0]    (3.1)
• Instantaneous Speed. For h very small,
    Instantaneous speed at t0 ≈ [f(t0 + h) − f(t0)] / h    (3.2)

Example 3.2. If y denotes the distance fallen in feet after t seconds, then Galileo’s law of free fall is
    y = 16t^2 ft.    (3.3)
Let t0 = 1.
(a) Find the average speed, i.e., the difference quotient
    [f(t0 + h) − f(t0)] / h,
for various h, positive and negative.
(b) Estimate the instantaneous speed at t = t0.

Solution.
free_fall.m
1 syms f(t) Q(h) %also, views t and h as symbols
2

3 f(t) = 16*t.^2; t0=1;


4 Int = [t0-1.5,t0+1.1];
5 fplot(f(t),Int, 'k-','LineWidth',3)
6 hold on
7

8 %%---- Difference Quotient (DQ) ----------


9 Q(h) = (f(t0+h)-f(t0))/h;
10 S(t,h) = Q(h)*(t-t0)+f(t0); % Secant line
11

12 %%---- Secant Lines, with Various h ------


13 for h0 = [-1 -0.5 0.5 1]
14 fplot(S(t,h0),Int, 'b--','LineWidth',2)
15 plot([t0+h0],[f(t0+h0)],'b.','markersize',30)
16 end
17

18 %%---- Limit of the DQ -------------------


19 tan_slope = limit(Q(h),h,0);
20 T(t) = tan_slope*(t-t0)+f(t0);
21 fplot(T(t),Int, 'r-','LineWidth',3)
22 plot([t0],[f(t0)],'r.','markersize',30)
23

24 axis tight, grid on; hold off


25 ax=gca; ax.FontSize=15; ax.GridAlpha=0.5;
26 print -dpng 'free-fall-tangent.png'
27

28 %%---- Measure Q(h) wih h=+-10^-i --------


29 for i = 1:5
30 h=-10^(-i); fprintf(" h= %.5f; Q(h) = %.8f\n",h,Q(h))
31 end
32 for i = 1:5
33 h=10^(-i); fprintf(" h= %.5f; Q(h) = %.8f\n",h,Q(h))
34 end

Difference Quotient at t0 = 1
1 h= -0.10000; Q(h) = 30.40000000
2 h= -0.01000; Q(h) = 31.84000000
3 h= -0.00100; Q(h) = 31.98400000
4 h= -0.00010; Q(h) = 31.99840000
5 h= -0.00001; Q(h) = 31.99984000
6 h= 0.10000; Q(h) = 33.60000000
7 h= 0.01000; Q(h) = 32.16000000
8 h= 0.00100; Q(h) = 32.01600000
9 h= 0.00010; Q(h) = 32.00160000
10 h= 0.00001; Q(h) = 32.00016000

Let’s confirm this algebraically.


• When f(t) = 16t^2 and t0 = 1, the difference quotient reads
    ∆y/∆t (t0 = 1) = [f(1 + h) − f(1)]/h = [16(1 + h)^2 − 16(1)^2]/h
                   = [16(1 + 2h + h^2) − 16(1)^2]/h = (32h + 16h^2)/h    (3.4)
                   = 32 + 16h

• As h gets closer and closer to 0, the average speed has the limiting
value 32 ft/sec when t0 = 1 sec.
• Thus, the slope of the tangent line is 32.

Example 3.3. Find an equation of the tangent line to the graph of y = x2


at x0 = 2.

Figure 3.1: Graph of y = x2 and the tangent line at x0 = 2.

Solution. Let’s first try to find the slope, as the limit of the difference
quotient.

Definition 3.4.
The slope of the curve y = f(x) at the point P(x0, f(x0)) is the number
    lim_{h→0} [f(x0 + h) − f(x0)]/h =: f'(x0),  provided the limit exists.    (3.5)
The tangent line to the curve at P is the line through P with this slope:
    y − f(x0) = f'(x0)(x − x0).    (3.6)
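The limit in (3.5) can also be watched numerically. The short Python sketch below (our own) repeats the free-fall computation of Example 3.2: the difference quotient of f(t) = 16t^2 at t0 = 1 approaches the tangent slope 32 as h → 0.

difference_quotient.py
f = lambda t: 16*t**2
t0 = 1.0

for i in range(1, 6):
    h = 10.0**(-i)
    Q = (f(t0 + h) - f(t0)) / h    # difference quotient; equals 32 + 16h here
    print(f"h = {h:.5f}   Q(h) = {Q:.8f}")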

Example 3.5. Can you find the tangent line to y = |x^2 − 1| at x0 = 1?
Solution. As one can see from Figure 3.2, the left-hand and the right-hand limits of the difference quotient are not the same. Or, you may say the left-hand and the right-hand secant lines converge differently. Thus no tangent line can be defined.

Figure 3.2: Graph of y = |x^2 − 1| and secant lines at x0 = 1.

secant_lines_abs_x2_minus_1.m
1 syms f(x) Q(h) %also, views t and h as symbols
2

3 f(x)=abs(x.^2-1); x0=1;
4 figure, fplot(f(x),[x0-3,x0+1.5], 'k-','LineWidth',3)
5 hold on
6

7 Q(h) = (f(x0+h)-f(x0))/h;
8 S(x,h) = Q(h)*(x-x0)+f(x0); % Secant line
9 %%---- Secant Lines, with Various h ------
10 for h0 = [-0.5 -0.25 -0.1 0.1 0.25 0.5]
11 fplot(S(x,h0),[x0-1,x0+1], 'b--','LineWidth',2)
12 plot([x0+h0],[f(x0+h0)],'b.','markersize',25)
13 end
14 plot([x0],[f(x0)],'r.','markersize',35)
15 daspect([1 2 1])
16 axis tight, grid on
17 ax=gca; ax.FontSize=15; ax.GridAlpha=0.5;
18 hold off
19 print -dpng 'secant-y=abs-x2-1.png'

3.1.2. Derivative and Differentiation Rules


You have calculated average slopes, for various interval lengths h, and
estimated the instantaneous slope by letting h approach zero.

Definition 3.6. The derivative of a function f(x), denoted f'(x) or df(x)/dx, is
    f'(x) = df(x)/dx = lim_{h→0} [f(x + h) − f(x)]/h,    (3.7)
provided that the limit exists.

Example 3.7. Use the definition to find derivatives of the functions:
Find the difference quotient [f(x + h) − f(x)]/h, simplify it, and then apply lim_{h→0}.
(a) f(x) = x    (b) f(x) = x^2    (c) f(x) = x^3

Solution.

Formula 3.8. From the last example,

f (x) = x ⇒ f 0 (x) = 1
f (x) = x2 ⇒ f 0 (x) = 2x
f (x) = x3 ⇒ f 0 (x) = 3x2
.. ..
. .
f (x) = xn ⇒ f 0 (x) = nxn−1

Example 3.9. Differentiate the following powers of x.

(a) x6 (b) x3/2 (c) x1/2

Solution.

Formula 3.10. Differentiation Rules:


• Let f(x) = au(x) + bv(x), for some constants a and b. Then
    [f(x + h) − f(x)]/h = {[au(x + h) + bv(x + h)] − [au(x) + bv(x)]}/h
                        = a [u(x + h) − u(x)]/h + b [v(x + h) − v(x)]/h    (3.8)
                        → a u'(x) + b v'(x)
• Let f(x) = u(x)v(x). Then
    [f(x + h) − f(x)]/h = [u(x + h)v(x + h) − u(x)v(x)]/h
        = [u(x + h)v(x + h) − u(x)v(x + h) + u(x)v(x + h) − u(x)v(x)]/h
        = {[u(x + h) − u(x)]/h} v(x + h) + u(x) {[v(x + h) − v(x)]/h}    (3.9)
        → u'(x)v(x) + u(x)v'(x)

Example 3.11. Use the product rule (3.9) to find the derivative of the
function
f (x) = x6 = x2 · x4

Solution.

Example 3.12. Does the curve y = x4 −2x2 +2 have any horizontal tangent
line? Use the information you just found, to sketch the graph.
Solution.

Example 3.13. Consider a computer program.


derivative_rules.m
1 syms n a b real
2 syms u(x) v(x)
3 syms f(x) Q(h)
4

5 %%-- Define f(x) ----------------------


6 f(x) = x^n;
7

8 %%-- Define Q(h) and take limit -------


9 Q(h) = (f(x+h)-f(x))/h;
10 fp = limit(Q(h),h,0); simplify(fp)

Use the program to verify various rules of differentiation.


Solution.
Table 3.1: Rules of Derivative

f(x)              | Results                                | Mathematical formula
x^n               | n*x^(n - 1)                            | (x^n)' = n x^(n-1) (power rule)
a*u(x) + b*v(x)   | a*D(u)(x) + b*D(v)(x)                  | (au + bv)' = au' + bv' (linearity rule)
u(x)*v(x)         | D(u)(x)*v(x) + u(x)*D(v)(x)            | (uv)' = u'v + uv' (product rule)
u(x)/v(x)         | (D(u)(x)*v(x) - D(v)(x)*u(x))/v(x)^2   | (u/v)' = (u'v − uv')/v^2 (quotient rule)
u(v(x))           | D(v)(x)*D(u)(v(x))                     | (u(v(x)))' = u'(v(x)) · v'(x) (chain rule)

Example 3.14. Find the derivative of g(x) = (2x + 1)10 .


Solution. Let u(x) = x^10 and v(x) = 2x + 1. Then g(x) = u(v(x)).
• u'(x) = 10x^9 and v'(x) = 2.
• Thus
    g'(x) = u'(v(x)) · v'(x) = 10 (v(x))^9 · 2 = 20(2x + 1)^9.

Remark 3.15. Differentiation Rules:


• The power rule holds for real number n.
• The quotient rule can be explained using the product rule.
    (u/v)' = (u · v^{-1})' = u' · v^{-1} + u · (v^{-1})'
           = u' · v^{-1} + u · (−v^{-2} v') = u'/v − (u · v')/v^2    (3.10)
           = (u'v − uv')/v^2
• The chain rule can be verified algebraically.
    d u(v(x))/dx = lim_{h→0} [u(v(x + h)) − u(v(x))]/h
                 = lim_{h→0} {[u(v(x + h)) − u(v(x))]/[v(x + h) − v(x)]} · {[v(x + h) − v(x)]/h}    (3.11)
                 = u'(v(x)) · v'(x).
  Here u'(v(x)) means the rate of change of u with respect to v(x).
• We may rewrite (3.11):
    d u(v(x))/dx = lim_{∆x→0} ∆u/∆x = lim_{∆x→0} (∆u/∆v) · (∆v/∆x) = u'(v(x)) · v'(x).    (3.12)
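The rules can also be confirmed symbolically, in the spirit of derivative_rules.m. A brief SymPy sketch (our own) checks the quotient rule and the chain rule on concrete functions, including g(x) = (2x + 1)^10 from Example 3.14:

rules_check.py
import sympy as sp

x = sp.symbols('x')
u = x**10            # outer function u(x) = x^10
v = 2*x + 1          # inner function v(x) = 2x + 1

# Chain rule check for g(x) = u(v(x)) = (2x+1)^10:
g = (2*x + 1)**10
chain = 10*(2*x + 1)**9 * 2
print(sp.simplify(sp.diff(g, x) - chain))        # 0

# Quotient rule check with the same concrete u, v:
quotient = (sp.diff(u, x)*v - u*sp.diff(v, x)) / v**2
print(sp.simplify(sp.diff(u/v, x) - quotient))   # 0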

Self-study 3.16. Find the derivative of the functions.

(a) f(x) = (3x − 1)^7 x^5        (b) g(x) = (3x − 1)^7 / x^5

Solution.

Ans: (b) (3x − 1)^6 (6x + 5)/x^6

3.2. Basis Functions and Taylor Series


3.2.1. Change of Variables & Basis Functions

In-Reality 3.17. Two Major Mathematical Techniques.


In the history of the engineering research and development (R&D),
there have been two major mathematical techniques: the change of
variables and representation by basis functions.
• In a nutshell, the change of variables is a basic technique used
to simplify problems.
• A function can be either represented or approximated by a lin-
ear combination of basis functions.

Example 3.18. Find solutions of the system of equations


    xy + 2x + 2y = 20
    x^2 y + x y^2 = 48        (3.13)

where x and y are positive real numbers with x < y.


Solution. Let s = xy and t = x + y (a change of variables).

Ans: (x, y) = (2, 4)


Note: Most subjects in Calculus, particularly Vector Calculus, are deeply
related to the change of variables, to handle differentiation and integra-
tion over general 2D and 3D domains more effectively.

Definition 3.19. A basis for a vector space V is a set of vectors (func-


tions) that
1. is linearly independent, and
2. spans V .
Note: Linear Independence and Span are defined in § 4.3.

Example 3.20. Basis Functions


• The monomial basis for the polynomial space Pn is given by

1, x, x2 , · · · , xn . (3.14)

Each polynomial p ∈ Pn is expressed as a linear combination of the


monomial basis functions:

p = c0 + c1 x + c2 x2 + · · · + cn xn . (3.15)

• The standard unit vectors in R^n are
    e_1 = (1, 0, ···, 0)^T,  e_2 = (0, 1, 0, ···, 0)^T,  ···,  e_n = (0, ···, 0, 1)^T,    (3.16)
  which form the standard basis for R^n; any x ∈ R^n can be written as
    x = (x_1, x_2, ···, x_n)^T = x_1 e_1 + x_2 e_2 + ··· + x_n e_n.

• Various other bases can be formulated.



3.2.2. Power Series and the Ratio Test


The monomial basis for analytic functions is given by

1, x, x2 , · · · . (3.17)

This basis is used in power series and Taylor series.

Definition 3.21. A power series about x = 0 is a series of the form
    Σ_{n=0}^∞ c_n x^n = c_0 + c_1 x + c_2 x^2 + ··· + c_n x^n + ··· .    (3.18)
A power series about x = a is a series of the form
    Σ_{n=0}^∞ c_n (x − a)^n = c_0 + c_1 (x − a) + c_2 (x − a)^2 + ··· + c_n (x − a)^n + ··· ,    (3.19)
in which the center a and the coefficients c_0, c_1, c_2, ···, c_n, ··· are constants.

Example 3.22. Taking all the coefficients to be 1 in (3.18) gives the geometric series
    Σ_{n=0}^∞ x^n = 1 + x + x^2 + ··· + x^n + ··· ,
which converges to 1/(1 − x) only if |x| < 1. That is,
    1/(1 − x) = 1 + x + x^2 + ··· + x^n + ··· ,  |x| < 1.    (3.20)

Remark 3.23. It follows from Example 3.20 that


1. A function can be approximated by a power series.
2. A power series may converge only on a certain interval, of radius R.

Theorem 3.24. The Ratio Test:
Let Σ a_n be any series and suppose that
    lim_{n→∞} |a_{n+1}/a_n| = ρ.    (3.21)
(a) If ρ < 1, then the series converges absolutely. (Σ |a_n| converges)
(b) If ρ > 1, then the series diverges.
(c) If ρ = 1, then the test is inconclusive.

Example 3.25. For what values of x do the following power series converge?
    (a) Σ_{n=1}^∞ (−1)^{n−1} x^n/n = x − x^2/2 + x^3/3 − ···
    (b) Σ_{n=0}^∞ x^n/n! = 1 + x + x^2/2! + x^3/3! + ···
Solution.

Ans: (a) x ∈ (−1, 1].
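The ratio-test limit (3.21) can be computed symbolically. Below is a brief SymPy sketch (our own) for the two series of Example 3.25; it works with the absolute value of the n-th term.

ratio_test_check.py
import sympy as sp

n, x = sp.symbols('n x', positive=True)

# (a) |a_n| = x^n / n  ->  ratio limit is x, so the series converges for x < 1
a = x**n / n
print(sp.limit(sp.simplify(a.subs(n, n+1)/a), n, sp.oo))   # x

# (b) |a_n| = x^n / n!  ->  ratio limit is 0 for every x (converges for all x)
b = x**n / sp.factorial(n)
print(sp.limit(sp.simplify(b.subs(n, n+1)/b), n, sp.oo))   # 0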



Theorem 3.26. Term-by-Term Differentiation.
If Σ_{n=0}^∞ c_n (x − a)^n has radius of convergence R > 0, it defines a function
    f(x) = Σ_{n=0}^∞ c_n (x − a)^n  on |x − a| < R.    (3.22)
This function f has derivatives of all orders inside the interval, and we obtain the derivatives by differentiating the original series term by term:
    f'(x)  = Σ_{n=1}^∞ n c_n (x − a)^{n−1},
    f''(x) = Σ_{n=2}^∞ n(n − 1) c_n (x − a)^{n−2},    (3.23)
and so on. Each of these derived series converges at every point of the interval a − R < x < a + R.

This theorem similarly holds for Term-by-Term Integration.


Example 3.27. A power series is given as in Example 3.25 (b):
    Σ_{n=0}^∞ x^n/n! = 1 + x + x^2/2! + x^3/3! + ··· .
Find its derivative.


Solution.

3.2.3. Taylor Series Expansion

Remark 3.28. We have seen how a converging power series defines


a function. In order to make infinite series more useful:
• Here we will try to express a given function as an infinite series
called the Taylor series.
• In many cases, the Taylor series provides useful polynomial ap-
proximation of the original function.
• Because approximation by polynomials is extremely useful to both
mathematicians and scientists, Taylor series are at the core of the
theory of infinite series.

Series Representations

Key Idea 3.29. Taylor Series.


• Let’s assume that a given function f is expressed as a power series about x = a:
    f(x) = Σ_{n=0}^∞ c_n (x − a)^n
         = c_0 + c_1 (x − a) + c_2 (x − a)^2 + ··· + c_n (x − a)^n + ··· ,    (3.24)
  with a positive radius of convergence R > 0.  ⇒ What are c_n?
• Term-by-term derivatives read
    f'(x)   = c_1 + 2 c_2 (x − a) + 3 c_3 (x − a)^2 + ··· + n c_n (x − a)^{n−1} + ···
    f''(x)  = 2 c_2 + 3·2 c_3 (x − a) + ··· + n(n − 1) c_n (x − a)^{n−2} + ···    (3.25)
    f'''(x) = 3·2 c_3 + 4·3·2 c_4 (x − a) + ··· + n(n − 1)(n − 2) c_n (x − a)^{n−3} + ···
  with the nth derivative being
    f^(n)(x) = n! c_n + (n + 1)! c_{n+1} (x − a) + ···    (3.26)
• Thus, when x = a,
    f^(n)(a) = n! c_n  ⇒  c_n = f^(n)(a)/n!.    (3.27)

Definition 3.30. Let f be a function with derivatives of all orders throughout some interval containing a as an interior point. Then the Taylor series generated by f at x = a is
    Σ_{n=0}^∞ [f^(n)(a)/n!] (x − a)^n = f(a) + f'(a)(x − a) + [f''(a)/2!] (x − a)^2 + ···    (3.28)
The Maclaurin series of f is the Taylor series generated by f at x = 0:
    Σ_{n=0}^∞ [f^(n)(0)/n!] x^n = f(0) + f'(0)x + [f''(0)/2!] x^2 + ···    (3.29)

Example 3.31. Find the Taylor series and Taylor polynomials generated
by f (x) = cos x at x = 0.
Solution. The cosine and its derivatives are
    f(x) = cos x,          f'(x) = −sin x,
    f''(x) = −cos x,       f^(3)(x) = sin x,
      ...
    f^(2n)(x) = (−1)^n cos x,    f^(2n+1)(x) = (−1)^{n+1} sin x.
At x = 0, the cosines are 1 and the sines are 0, so
    f^(2n)(0) = (−1)^n,  f^(2n+1)(0) = 0.    (3.30)
The Taylor series generated by cos x at x = 0 is
    1 + 0·x − (1/2!)x^2 + (0/3!)x^3 + (1/4!)x^4 + ··· = 1 − x^2/2! + x^4/4! − x^6/6! + ···    (3.31)

Figure 3.3: y = cos x and its Taylor polynomials near x = 0.
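The same truncated expansions can be generated programmatically. A short SymPy sketch (our own), analogous to the Matlab/Maple taylor calls used elsewhere in this chapter:

taylor_cos.py
import sympy as sp

x = sp.symbols('x')
for order in (4, 6, 8):
    # series() truncates with an O(x^order) term; removeO() drops it
    P = sp.series(sp.cos(x), x, 0, order).removeO()
    print(order, P)
# the order-8 truncation is 1 - x**2/2 + x**4/24 - x**6/720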



Commonly Used Taylor Series

    1/(1 − x) = 1 + x + x^2 + x^3 + ···    = Σ_{n=0}^∞ x^n,                        x ∈ (−1, 1)
    e^x       = 1 + x + x^2/2! + ···       = Σ_{n=0}^∞ x^n/n!,                     x ∈ R
    cos x     = 1 − x^2/2! + x^4/4! − ···  = Σ_{n=0}^∞ (−1)^n x^{2n}/(2n)!,        x ∈ R
    sin x     = x − x^3/3! + x^5/5! − ···  = Σ_{n=0}^∞ (−1)^n x^{2n+1}/(2n+1)!,    x ∈ R       (3.32)
    ln(1 + x) = x − x^2/2 + x^3/3 − ···    = Σ_{n=1}^∞ (−1)^{n+1} x^n/n,           x ∈ (−1, 1]
    tan⁻¹ x   = x − x^3/3 + x^5/5 − ···    = Σ_{n=0}^∞ (−1)^n x^{2n+1}/(2n+1),     x ∈ [−1, 1]

Note: The interval of convergence can be verified using e.g. the ratio
test, presented in Theorem 3.24, p. 79.
Self-study 3.32. Plot the sinc function f(x) = sin x / x and its Taylor polynomials of order 4, 6, and 8, about x = 0.
Solution. Hint : Use e.g., syms x; T4 = taylor(sin(x)/x,x,0,’Order’,5). Here “Or-
der” means the leading order of truncated terms.

Taylor Polynomials
Definition 3.33. Let f be a function with derivatives of order k =
1, 2, · · · , N in some interval containing a as an interior point. Then for
any integer n from 0 through N , the Taylor polynomial of order n
generated by f at x = a is the polynomial

    P_n(x) = f(a) + f'(a)(x − a) + [f''(a)/2!] (x − a)^2 + ··· + [f^(n)(a)/n!] (x − a)^n.    (3.33)

Theorem 3.34. Taylor’s Theorem with Lagrange Remainder


Suppose f ∈ C^n[a, b], f^(n+1) exists on (a, b), and x0 ∈ [a, b]. Then, for every x ∈ [a, b],
    f(x) = Σ_{k=0}^n [f^(k)(x0)/k!] (x − x0)^k + R_n(x),    (3.34)
where, for some ξ between x and x0,
    R_n(x) = [f^(n+1)(ξ)/(n + 1)!] (x − x0)^{n+1}.

Note: The above theorem is useful in various engineering applications;


the error of the polynomial approximation can be verified by mea-
suring the Lagrange Remainder.

Example 3.35. Let f (x) = cos(x) and x0 = 0. Determine the second and
third Taylor polynomials for f about x0 .
Maple-code
1 f := x -> cos(x):
2 fp := x -> -sin(x):
3 fpp := x -> -cos(x):
4 fp3 := x -> sin(x):
5 fp4 := x -> cos(x):
6

7 p2 := x -> f(0) + fp(0)*x/1! + fpp(0)*x^2/2!:


8 p2(x);
9 = 1 - 1/2 x^2
10 R2 := fp3(xi)*x^3/3!;
11 = 1/6 sin(xi) x^3
12 p3 := x -> f(0) + fp(0)*x/1! + fpp(0)*x^2/2! + fp3(0)*x^3/3!:
13 p3(x);
14 = 1 - 1/2 x^2
15 R3 := fp4(xi)*x^4/4!;
16 = 1/24 cos(xi) x^4
17

18 # On the other hand, you can find the Taylor polynomials easily
19 # using built-in functions in Maple:
20 s3 := taylor(f(x), x = 0, 4);
21 = 1 - 1/2 x^2 + O(x^4)
22 convert(s3, polynom);
23 = 1 - 1/2 x^2

1 plot([f(x), p3(x)], x = -2 .. 2, thickness = [2, 2],


2 linestyle = [solid, dash], color = black,
3 legend = ["f(x)", "p3(x)"],
4 legendstyle = [font = ["HELVETICA", 10], location = right])

Figure 3.4: f (x) = cos x and its third Taylor polynomial P3 (x).

Note: When n = 0, x = b, and x0 = a, the Taylor’s Theorem reads
    f(b) = f(a) + R_0(b) = f(a) + f'(c) · (b − a),    (3.35)
equivalently,
    f'(c) = [f(b) − f(a)]/(b − a),  for some c ∈ (a, b),    (3.36)
which is the Mean Value Theorem.

Figure 3.5: The Mean Value Theorem



3.3. Polynomial Interpolation


Polynomial Approximation and Interpolation

Theorem 3.36. (Weierstrass Approximation Theorem)


Suppose f ∈ C[a, b]. Then, for each ε > 0, there exists a polynomial P (x)
such that
|f (x) − P (x)| < ε, for all x ∈ [a, b]. (3.37)

Every continuous function can be approximated by a polynomial, arbi-


trarily close.

Theorem 3.37. (Polynomial Interpolation Theorem)


If x0 , x1 , x2 , · · · , xn are (n + 1) distinct real numbers, then for arbitrary
values y0 , y1 , y2 , · · · , yn , there is a unique polynomial pn of degree at
most n such that
pn (xi ) = yi (0 ≤ i ≤ n). (3.38)
The graph of y = pn (x) passes all points {(xi , yi )}.

For values at (n + 1) distinct points, the interpolating polynomial pn ∈ Pn


is unique.

Example 3.38. Find the interpolating polynomial p2 passing (−2, 3), (0, −1),
and (1, 0).
Solution.

3.3.1. Lagrange Form of Interpolating Polynomials

Let data points (x_k, y_k), 0 ≤ k ≤ n, be given, where the n + 1 abscissas x_i are distinct. The interpolating polynomial will be sought in the form
    p_n(x) = y_0 L_{n,0}(x) + y_1 L_{n,1}(x) + ··· + y_n L_{n,n}(x) = Σ_{k=0}^n y_k L_{n,k}(x),    (3.39)
where L_{n,k}(x) are basis polynomials that depend on the nodes x_0, x_1, ···, x_n, but not on the ordinates y_0, y_1, ···, y_n.
See Definition 3.19, p.77, for basis.

For example, for {(x_0, y_0), (x_1, y_1), (x_2, y_2)}, the Lagrange form of interpolating polynomial reads
    p_2(x) = y_0 L_{2,0}(x) + y_1 L_{2,1}(x) + y_2 L_{2,2}(x).    (3.40)

How to Determine the Basis {Ln,k (x)}


Observation 3.39. Let all the ordinates be 0 except for a 1 occupying
i-th position, i.e., yi = 1 and other ordinates are all zero.
• Then,
n
X
pn (x) = yk Ln,k (x) = Ln,i (x) ⇒ pn (xj ) = Ln,i (xj ). (3.41)
k=0

• On the other hand, the polynomial pn interpolating the data must sat-
isfy pn (xj ) = δij , where δij is the Kronecker delta
(
1 if i = j,
δij =
0 if i 6= j.

• Thus all the basis polynomials must satisfy

Ln,i (xj ) = δij , for all 0 ≤ i, j ≤ n. (3.42)

Polynomials satisfying such a property are known as the cardinal


functions.

Example 3.40. Construction of L_{n,0}(x): It is to be an nth-degree polynomial that takes the value 0 at x_1, x_2, ···, x_n and the value 1 at x_0. Clearly, it must be of the form
    L_{n,0}(x) = c(x − x_1)(x − x_2)···(x − x_n) = c ∏_{j=1}^n (x − x_j),    (3.43)
where c is determined so that L_{n,0}(x_0) = 1. That is,
    1 = L_{n,0}(x_0) = c(x_0 − x_1)(x_0 − x_2)···(x_0 − x_n)    (3.44)
and therefore
    c = 1/[(x_0 − x_1)(x_0 − x_2)···(x_0 − x_n)].    (3.45)
Hence, we have
    L_{n,0}(x) = [(x − x_1)(x − x_2)···(x − x_n)] / [(x_0 − x_1)(x_0 − x_2)···(x_0 − x_n)] = ∏_{j=1}^n (x − x_j)/(x_0 − x_j).    (3.46)

Summary 3.41. Each cardinal function is obtained by similar reasoning; the general formula is then
    L_{n,i}(x) = ∏_{j=0, j≠i}^n (x − x_j)/(x_i − x_j),  i = 0, 1, ···, n.    (3.47)

Example 3.42. Determine the Lagrange interpolating polynomial that


passes through (2, 4) and (5, 1).
Solution.

Example 3.43. Let x0 = 2, x1 = 4, x2 = 5. Use the points to find the second


Lagrange interpolating polynomial p2 for f (x) = 1/x.
Solution.

Ans: p_2 = (1/12)(x − 4)(x − 5) − (1/8)(x − 2)(x − 5) + (1/15)(x − 2)(x − 4)
Lagrange_interpol.py
1 import sympy
2

3 def Lagrange(Lx,Ly):
4 X=sympy.symbols('X')
5 if len(Lx)!= len(Ly):
6 print("ERROR"); return 1
7 p=0
8 for k in range(len(Lx)):
9 t=1
10 for j in range(len(Lx)):
11 if j != k:
12 t *= ( (X-Lx[j])/(Lx[k]-Lx[j]) )
13 p += t*Ly[k]
14 return p
15

16 if __name__ == "__main__":
17 Lx=[2,4,5]; Ly=[1/2,1/4,1/5]
18 p2 = Lagrange(Lx,Ly)
19 print(p2); print(sympy.simplify(p2))

Output
1 [Tue Aug.29] python Lagrange_interpol.py
2 0.5*(5/3 - X/3)*(2 - X/2) + 0.25*(5 - X)*(X/2 - 1) + 0.2*(X/3 - 2/3)*(X - 4)
3 0.025*X**2 - 0.275*X + 0.95

3.3.2. Polynomial Interpolation Error Theorem

Q: When an interpolating polynomial Pn ≈ f , what is the error |f − Pn |?

Theorem 3.44. (Polynomial Interpolation Error Theorem). Let f ∈ C^{n+1}[a, b], and let P_n be the polynomial of degree ≤ n that interpolates f at n + 1 distinct points x_0, x_1, ···, x_n in the interval [a, b]. Then, for each x ∈ (a, b), there exists a number ξ_x between x_0, x_1, ···, x_n, hence in the interval [a, b], such that
    f(x) − P_n(x) = [f^(n+1)(ξ_x)/(n + 1)!] ∏_{i=0}^n (x − x_i) =: R_n(x).    (3.48)

Example 3.45. If the function f (x) = sin(x) is approximated by a poly-


nomial of degree 5 that interpolates f at six equally distributed points in
[−1, 1] including end points, how large is the error on this interval?
Solution. The nodes xi are −1, −0.6, −0.2, 0.2, 0.6, and 1. It is easy to see
that
|f (6) (ξ)| = | − sin(ξ)| ≤ sin(1).
g := x -> (x+1)*(x+0.6)*(x+0.2)*(x-0.2)*(x-0.6)*(x-1):
gmax := maximize(abs(g(x)), x = -1..1)
0.06922606316
Thus,
    |sin(x) − P_5(x)| = |[f^(6)(ξ)/6!] ∏_{i=0}^5 (x − x_i)| ≤ [sin(1)/6!] · gmax = 0.00008090517158    (3.49)

Theorem 3.46. (Polynomial Interpolation Error Theorem for Equally Spaced Nodes): Let f ∈ C^{n+1}[a, b], and let P_n be the polynomial of degree ≤ n that interpolates f at
    x_i = a + ih,  h = (b − a)/n,  i = 0, 1, ···, n.
Then, for each x ∈ (a, b),
    |f(x) − P_n(x)| ≤ [h^{n+1}/(4(n + 1))] M,    (3.50)
where
    M = max_{ξ∈[a,b]} |f^(n+1)(ξ)|.

Proof. Recall the interpolation error R_n(x) given in (3.48). We consider bounding
    max_{x∈[a,b]} ∏_{i=0}^n |x − x_i|.
Start by picking an x. We can assume that x is not one of the nodes, because otherwise the product in question is zero. Let x ∈ (x_j, x_{j+1}), for some j. Then we have
    |x − x_j| · |x − x_{j+1}| ≤ h^2/4.    (3.51)
Now note that
    |x − x_i| ≤ (j + 1 − i)h for i < j,   |x − x_i| ≤ (i − j)h for j + 1 < i.    (3.52)
Thus
    ∏_{i=0}^n |x − x_i| ≤ (h^2/4) [(j + 1)! h^j] [(n − j)! h^{n−j−1}].    (3.53)
Since (j + 1)!(n − j)! ≤ n!, we can reach the following bound
    ∏_{i=0}^n |x − x_i| ≤ (1/4) h^{n+1} n!.    (3.54)
The result of the theorem follows from the above bound.



Example 3.47. (Revisit to Example 3.45) The function f (x) = sin(x) is


approximated by a polynomial of degree 5 that interpolates f at six equally
distributed points in [−1, 1] including end points. Use (3.50) to estimate the
upper bound of the interpolation error.
Solution.
interpol_error.py
1 from sympy import *
2 x = symbols('x')
3

4 a,b=-1,1; n=5
5

6 f = sin(x)
7 print( diff(f,x,n+1) )
8

9 h = (b-a)/n
10 M = abs(-sin(1.));
11

12 err_bound = h**(n+1)/(4*(n+1)) *M
13 print(err_bound)

Output
1 -sin(x)
2 0.000143611048073881

Compare the above error with the one in (3.49), p.90.



3.4. Numerical Differentiation: Finite Difference Formulas
Note: The derivative of f at x0 is defined as
    f'(x0) = lim_{h→0} [f(x0 + h) − f(x0)]/h.    (3.55)
This formula gives an obvious way to generate an approximation of f'(x0):
    f'(x0) ≈ [f(x0 + h) − f(x0)]/h.    (3.56)

Formula 3.48. (Two-Point Difference Formulas): Let x1 = x0 + h and P_{0,1} be the first Lagrange polynomial interpolating f on [x0, x1]. Then
    f(x) = P_{0,1}(x) + [(x − x0)(x − x1)/2!] f''(ξ)
         = [(x − x1)/(−h)] f(x0) + [(x − x0)/h] f(x1) + [(x − x0)(x − x1)/2!] f''(ξ).    (3.57)
Differentiating it, we obtain
    f'(x) = [f(x1) − f(x0)]/h + [(2x − x0 − x1)/2] f''(ξ) + [(x − x0)(x − x1)/2!] (d/dx) f''(ξ).    (3.58)
Thus
    f'(x0) = [f(x1) − f(x0)]/h − (h/2) f''(ξ(x0)),
    f'(x1) = [f(x1) − f(x0)]/h + (h/2) f''(ξ(x1)).    (3.59)

Definition 3.49. For h > 0,
    f'(x_i) ≈ D_x^+ f(x_i) = [f(x_i + h) − f(x_i)]/h,   (forward-difference)
    f'(x_i) ≈ D_x^- f(x_i) = [f(x_i) − f(x_i − h)]/h.   (backward-difference)    (3.60)

Example 3.50. Use the forward-difference formula to approximate


f (x) = x3 at x0 = 1 using h = 0.1, 0.05, 0.025.
Solution. Note that f 0 (1) = 3.
Maple-code
1 f := x -> x^3: x0 := 1:
2

3 h := 0.1:
4 (f(x0 + h) - f(x0))/h
5 3.310000000
6 h := 0.05:
7 (f(x0 + h) - f(x0))/h
8 3.152500000
9 h := 0.025:
10 (f(x0 + h) - f(x0))/h
11 3.075625000

Observe that the error becomes about half as h is halved, consistent with the O(h) accuracy of the forward difference in (3.59).
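A small Python check (our own) of the same computation: for f(x) = x^3 at x0 = 1 the forward difference quotient equals 3 + 3h + h^2, so its error 3h + h^2 is roughly halved whenever h is halved.

forward_difference_check.py
f = lambda x: x**3
x0, exact = 1.0, 3.0

h = 0.1
for _ in range(4):
    Dp = (f(x0 + h) - f(x0)) / h     # forward difference D_x^+ f(x0)
    print(f"h = {h:<8} Dp = {Dp:.9f}  error = {Dp - exact:.9f}")
    h /= 2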


Formula 3.51. (In general): Let {x0, x1, ···, xn} be (n + 1) distinct points in some interval I and f ∈ C^{n+1}(I). Then the Interpolation Error Theorem reads
    f(x) = Σ_{k=0}^n f(x_k) L_{n,k}(x) + [f^(n+1)(ξ)/(n + 1)!] ∏_{k=0}^n (x − x_k).    (3.61)
Its derivative gives
    f'(x) = Σ_{k=0}^n f(x_k) L'_{n,k}(x) + (d/dx)[f^(n+1)(ξ)/(n + 1)!] ∏_{k=0}^n (x − x_k)
            + [f^(n+1)(ξ)/(n + 1)!] (d/dx)[∏_{k=0}^n (x − x_k)].    (3.62)
Hence,
    f'(x_i) = Σ_{k=0}^n f(x_k) L'_{n,k}(x_i) + [f^(n+1)(ξ)/(n + 1)!] ∏_{k=0, k≠i}^n (x_i − x_k),    (3.63)
which is the (n + 1)-point difference formula to approximate f'(x_i).



Formula 3.52. (Three-Point Difference Formulas (n = 2)):
For convenience, let
    x0,  x1 = x0 + h,  x2 = x0 + 2h,  h > 0.
Recall the second-order cardinal basis functions
    L_{2,0}(x) = [(x − x1)(x − x2)] / [(x0 − x1)(x0 − x2)],   L_{2,1}(x) = [(x − x0)(x − x2)] / [(x1 − x0)(x1 − x2)],
    L_{2,2}(x) = [(x − x0)(x − x1)] / [(x2 − x0)(x2 − x1)].
It follows from the Polynomial Interpolation Error Theorem that
    f(x) = f(x0) L_{2,0}(x) + f(x1) L_{2,1}(x) + f(x2) L_{2,2}(x) + [f^(3)(ξ)/3!] ∏_{k=0}^2 (x − x_k),    (3.64)
and its derivative reads
    f'(x) = f(x0) L'_{2,0}(x) + f(x1) L'_{2,1}(x) + f(x2) L'_{2,2}(x) + (d/dx)[ [f^(3)(ξ)/3!] ∏_{k=0}^2 (x − x_k) ].    (3.65)
Thus, the three-point formulas read
    f'(x0) = f(x0) L'_{2,0}(x0) + f(x1) L'_{2,1}(x0) + f(x2) L'_{2,2}(x0) + [f^(3)(ξ)/3!] ∏_{k=1}^2 (x0 − x_k)
           = [−3f(x0) + 4f(x1) − f(x2)]/(2h) + (h^2/3) f^(3)(ξ0),
    f'(x1) = [f(x2) − f(x0)]/(2h) − (h^2/6) f^(3)(ξ1),
    f'(x2) = [f(x0) − 4f(x1) + 3f(x2)]/(2h) + (h^2/3) f^(3)(ξ2).    (3.66)
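A short Python check (our own) of the first formula in (3.66), again for f(x) = x^3 at x0 = 1 (exact derivative 3): the error now shrinks by about a factor of 4 when h is halved, as expected from the h^2 term.

three_point_check.py
f = lambda x: x**3
x0, exact = 1.0, 3.0

h = 0.1
for _ in range(4):
    # endpoint three-point formula: (-3 f0 + 4 f1 - f2) / (2h)
    D3 = (-3*f(x0) + 4*f(x0 + h) - f(x0 + 2*h)) / (2*h)
    print(f"h = {h:<8} D3 = {D3:.10f}  error = {D3 - exact:.2e}")
    h /= 2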

Formula 3.53. (Five-Point Difference Formulas):


Let f_i = f(x0 + i h), h > 0, −∞ < i < ∞.
    f'(x0) = [f_{−2} − 8f_{−1} + 8f_1 − f_2]/(12h) + (h^4/30) f^(5)(ξ),
    f'(x0) = [−25f_0 + 48f_1 − 36f_2 + 16f_3 − 3f_4]/(12h) + (h^4/5) f^(5)(ξ).    (3.67)

Summary 3.54. Numerical Differentiation:


1. f (x) = Pn (x) + Rn (x), Pn (x) ∈ Pn , Rn (x) = O(hn+1 )
2. f 0 (x) = Pn0 (x) + O(hn ),
3. f 00 (x) = Pn00 (x) + O(hn−1 ), and so on.

Note: We can see from the above summary that for f 00


• The three-point formula (n = 2): its accuracy is O(h)
• The five-point formula (n = 4): its accuracy is O(h3 )
These hold for every point in [x0 , xn ] including all nodal points {xi }.

Midpoint Formula for f'': A Higher-Order Accuracy

Recall: (Theorem 3.34, p.84). Taylor’s Theorem with Lagrange Remainder
Suppose f ∈ C^n[a, b], f^(n+1) exists on (a, b), and x0 ∈ [a, b]. Then, for every x ∈ [a, b],
    f(x) = Σ_{k=0}^n [f^(k)(x0)/k!] (x − x0)^k + R_n(x),    (3.68)
where, for some ξ between x and x0,
    R_n(x) = [f^(n+1)(ξ)/(n + 1)!] (x − x0)^{n+1}.

Replace x by x0 + h:

Alternative Form of Taylor’s Theorem


Remark 3.55. Replacing x by x0 + h in the Taylor’s Theorem, we have
    f(x0 + h) = Σ_{k=0}^n [f^(k)(x0)/k!] h^k + R_n(h),   R_n(h) = [f^(n+1)(ξ)/(n + 1)!] h^{n+1},    (3.69)
for some ξ between x0 and x0 + h. In detail,
    f(x0 + h) = f(x0) + f'(x0)h + [f''(x0)/2!] h^2 + [f'''(x0)/3!] h^3 + ··· + [f^(n)(x0)/n!] h^n + R_n(h).    (3.70)

Example 3.56. Use the Taylor series to derive the midpoint formula
    f''(x0) = [f_{−1} − 2f_0 + f_1]/h^2
              − (h^2/12) f^(4)(x0) − (h^4/360) f^(6)(x0) − (h^6/20160) f^(8)(x0) − ···    (3.71)
Solution. It follows from the Taylor’s series formula (3.70) that
    f(x0 + h) = f(x0) + f'(x0)h + [f''(x0)/2!] h^2 + [f'''(x0)/3!] h^3 + [f^(4)(x0)/4!] h^4 + ···
    f(x0 − h) = f(x0) − f'(x0)h + [f''(x0)/2!] h^2 − [f'''(x0)/3!] h^3 + [f^(4)(x0)/4!] h^4 − ···    (3.72)
Adding these two equations, we have
    f(x0 + h) + f(x0 − h) = 2f(x0) + 2 [f''(x0)/2!] h^2 + 2 [f^(4)(x0)/4!] h^4 + ··· .    (3.73)
Solve it for f''(x0).
Note: The higher-order accuracy in (3.71) can be achieved at the midpoint only.

Example 3.57. Use the second-derivative midpoint formula to approxi-


mate f 00 (1) for f (x) = x5 − 3x2 , using h = 0.2, 0.1, 0.05.
Solution.
Maple-code
1 f := x -> x^5 - 3*x^2:
2 x0 := 1:
3

4 eval(diff(f(x), x, x), x = x0)


5 14
6 h := 0.2:
7 (f(x0 - h) - 2*f(x0) + f(x0 + h))/h^2
8 14.40000000
9 h := 0.1:
10 (f(x0 - h) - 2*f(x0) + f(x0 + h))/h^2
11 14.10000000
12 h := 0.05:
13 (f(x0 - h) - 2*f(x0) + f(x0 + h))/h^2
14 14.02500000

3.5. Newton’s Method for the Solution of Nonlinear Equations
The Newton’s method is also called the Newton-Raphson method.
The objective is to find a zero p of f :

f (p) = 0. (3.74)

Strategy 3.58. Let p0 be an approximation of p. We will try to find a


correction term h such that (p0 + h) is a better approximation of p than
p0 ; ideally (p0 + h) = p.

• If f'' exists and is continuous, then by Taylor’s Theorem
    0 = f(p) = f(p0 + h) = f(p0) + h f'(p0) + (h^2/2) f''(ξ),    (3.75)
  where h = p − p0 and ξ lies between p and p0.
• If |h| is small, it is reasonable to ignore the last term of (3.75) and solve for h = p − p0:
    h = p − p0 ≈ −f(p0)/f'(p0).    (3.76)
• Define
    p1 = p0 − f(p0)/f'(p0);    (3.77)
then p1 may be a better approximation of p than p0 .
• The above can be repeated.

Algorithm 3.59. Newton’s method for solving f (x) = 0


For p0 chosen close to a root p, compute {p_n} repeatedly satisfying
    p_n = p_{n−1} − f(p_{n−1})/f'(p_{n−1}),  n ≥ 1.    (3.78)
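A compact Python sketch of (3.78) is given below (our own; function and tolerance names are ours). Applying it to f(x) = x^2 − 2 with p0 = 2 reproduces the √2 iteration seen in Example 2.26.

newton_sketch.py
def newton(f, fp, p0, tol=1e-12, itmax=50):
    """Newton iteration p_n = p_{n-1} - f(p_{n-1})/f'(p_{n-1})."""
    p = p0
    for it in range(1, itmax + 1):
        h = -f(p) / fp(p)     # correction term
        p += h
        if abs(h) < tol:
            break
    return p, it

p, it = newton(lambda x: x*x - 2, lambda x: 2*x, 2.0)
print(p, it)                  # 1.4142135623730951 after a handful of iterations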

Graphical interpretation
• Let p0 be the initial approximation close to p. Then, the tangent line at (p0, f(p0)) reads
    L(x) = f'(p0)(x − p0) + f(p0).    (3.79)
• To find the x-intercept of y = L(x), let
    0 = f'(p0)(x − p0) + f(p0).
  Solving the above equation for x gives
    x = p0 − f(p0)/f'(p0),    (3.80)
  of which the right-hand side is the same as in (3.77).

Figure 3.6: Graphical interpretation of the Newton’s method.


An Example of Divergence
1 f := arctan(x);
2 Newton(f, x = Pi/2, output = plot, maxiterations = 3);

Remark 3.60.
• The Newton’s method may diverge, unless the initialization is accu-
rate.
• It cannot be continued if f 0 (pn−1 ) = 0 for some n. As a matter of fact,
the Newton’s method is most effective when f 0 (x) is bounded away
from zero near p.

Convergence analysis for the Newton’s method


Define the error in the n-th iteration: e_n = p_n − p. Then
    e_n = p_n − p = p_{n−1} − f(p_{n−1})/f'(p_{n−1}) − p = [e_{n−1} f'(p_{n−1}) − f(p_{n−1})]/f'(p_{n−1}).    (3.81)
On the other hand, it follows from the Taylor’s Theorem that
    0 = f(p) = f(p_{n−1} − e_{n−1}) = f(p_{n−1}) − e_{n−1} f'(p_{n−1}) + (1/2) e_{n−1}^2 f''(ξ_{n−1}),    (3.82)
for some ξ_{n−1}. Thus, from (3.81) and (3.82), we have
    e_n = (1/2) [f''(ξ_{n−1})/f'(p_{n−1})] e_{n−1}^2.    (3.83)

Theorem 3.61. (Convergence of Newton’s method): Let f ∈ C 2 [a, b]


and p ∈ (a, b) is such that f (p) = 0 and f 0 (p) 6= 0. Then, there is a
neighborhood of p such that if the Newton’s method is started p0 in that
neighborhood, it generates a convergent sequence pn satisfying

|pn − p| ≤ C|pn−1 − p|2 , (3.84)

for a positive constant C.



Example 3.62. Apply the Newton’s method to solve f (x) = arctan(x) = 0,


with p0 = π/5.

1 Newton(arctan(x), x = Pi/5, output = sequence, maxiterations = 5)


2 0.6283185308, -0.1541304479, 0.0024295539, -9.562*10^(-9), 0., 0.

Since p = 0, en = pn and
|en | ≤ 0.67|en−1 |3 , (3.85)
which is an occasional super-convergence.
Theorem 3.63. Newton’s Method for a Convex Function
Let f ∈ C 2 (R) be increasing, convex, and of a zero p. Then, the zero p is
unique and the Newton iteration converges to p from any starting point.

Example 3.64. Use the Newton’s method to find the square root of a
positive number Q.

Solution. Let x = √Q. Then x is a root of x^2 − Q = 0. Define f(x) = x^2 − Q; set f'(x) = 2x. The Newton’s method reads
    p_n = p_{n−1} − f(p_{n−1})/f'(p_{n−1}) = p_{n−1} − (p_{n−1}^2 − Q)/(2p_{n−1}) = (1/2)(p_{n−1} + Q/p_{n−1}).    (3.86)

mysqrt.m
1 function x = mysqrt(q)
2 %function x = mysqrt(q)
3

4 x = (q+1)/2;
5 for n=1:10
6 x = (x+q/x)/2;
7 fprintf('x_%02d = %.16f\n',n,x);
8 end

Results
>> mysqrt(16);                        >> mysqrt(0.1);
x_01 = 5.1911764705882355             x_01 = 0.3659090909090910
x_02 = 4.1366647225462421             x_02 = 0.3196005081874647
x_03 = 4.0022575247985221             x_03 = 0.3162455622803890
x_04 = 4.0000006366929393             x_04 = 0.3162277665175675
x_05 = 4.0000000000000506             x_05 = 0.3162277660168379
x_06 = 4.0000000000000000             x_06 = 0.3162277660168379
x_07 = 4.0000000000000000             x_07 = 0.3162277660168379
x_08 = 4.0000000000000000             x_08 = 0.3162277660168379
x_09 = 4.0000000000000000             x_09 = 0.3162277660168379
x_10 = 4.0000000000000000             x_10 = 0.3162277660168379

Note: The function sqrt is implemented the same way as mysqrt.m.



3.6. Zeros of Polynomials


Definition 3.65. A polynomial of degree n has the form
P (x) = an xn + an−1 xn−1 + · · · + a1 x + a0 , (3.87)

where ai ’s are called the coefficients of P and an 6= 0.

The objective is to find zeros of P .

Theorem 3.66. (Theorem on Polynomials).


• Fundamental Theorem of Algebra:
Every nonconstant polynomial has at least one root (possibly, in the
complex field).
• Complex Roots of Polynomials:
A polynomial of degree n has exactly n roots in the complex plane,
being agreed that each root shall be counted a number of times equal
to its multiplicity. That is, there are unique (complex) constants
x1 , x2 , · · · , xk and unique integers m1 , m2 , · · · , mk such that
    P(x) = a_n (x − x_1)^{m_1} (x − x_2)^{m_2} ··· (x − x_k)^{m_k},   Σ_{i=1}^k m_i = n.    (3.88)
• Localization of Roots:
  All roots of the polynomial P lie in the open disk centered at the origin and of radius
    ρ = 1 + (1/|a_n|) max_{0≤i<n} |a_i|.    (3.89)
• Uniqueness of Polynomials:
Let P (x) and Q(x) be polynomials of degree n. If x1 , x2 , · · · , xr , with
r > n, are distinct numbers with P (xi ) = Q(xi ), for i = 1, 2, · · · , r, then
P (x) = Q(x) for all x.
– For example, two polynomials of degree n are the same if they
agree at (n + 1) points.
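The localization bound (3.89) is easy to test numerically. A brief Python sketch (our own) computes ρ for the polynomial P(x) = x^4 − 4x^3 + 7x^2 − 5x − 2, which is revisited in Section 3.6.1, and compares it with the numerically computed roots.

root_bound_check.py
import numpy as np

# P(x) = x^4 - 4x^3 + 7x^2 - 5x - 2, coefficients from highest degree down
coeffs = [1, -4, 7, -5, -2]

an = coeffs[0]
rho = 1 + max(abs(c) for c in coeffs[1:]) / abs(an)   # bound (3.89)

roots = np.roots(coeffs)
print("rho        =", rho)                 # 8.0
print("max |root| =", np.abs(roots).max()) # all roots lie inside |z| < rho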

3.6.1. Horner’s Method


Note: Known as nested multiplication and also as synthetic divi-
sion, Horner’s method can evaluate polynomials very efficiently. It
requires n multiplications and n additions to evaluate an arbitrary n-th
degree polynomial.

Algorithm 3.67. Let us try to evaluate P (x) at x = x0 .


• Utilizing the Remainder Theorem, we can rewrite the polynomial
as
P (x) = (x − x0 )Q(x) + r = (x − x0 )Q(x) + P (x0 ), (3.90)
where Q(x) is a polynomial of degree n − 1, say

Q(x) = bn xn−1 + · · · + b2 x + b1 . (3.91)

• Substituting the above into (3.90), utilizing (3.87), and setting equal
the coefficients of like powers of x on the two sides of the resulting
equation, we have
\[ \begin{aligned} b_n &= a_n \\ b_{n-1} &= a_{n-1} + x_0 b_n \\ &\ \vdots \\ b_1 &= a_1 + x_0 b_2 \\ P(x_0) &= a_0 + x_0 b_1 \end{aligned} \tag{3.92} \]
• Introducing b0 = P (x0 ), the above can be rewritten as

bn+1 = 0; bk = ak + x0 bk+1 , n ≥ k ≥ 0. (3.93)

• If the calculation of Horner’s algorithm is to be carried out with pencil


and paper, the following arrangement is often used (known as syn-
thetic division):
106 Chapter 3. Programming with Calculus

Example 3.68. Use Horner’s algorithm to evaluate P (3), where

P (x) = x4 − 4x3 + 7x2 − 5x − 2. (3.94)

Solution. For x0 = 3, we arrange the calculation as mentioned above:

Note that the 4-th degree polynomial in (3.94) is written as

P (x) = (x − 3)(x3 − x2 + 4x + 7) + 19.

Remark 3.69. When the Newton’s method is applied for finding an


approximate zero of P (x), the iteration reads
\[ x_n = x_{n-1} - \frac{P(x_{n-1})}{P'(x_{n-1})}. \tag{3.95} \]
Thus both P(x) and P'(x) must be evaluated in each iteration.

Strategy 3.70. How to evaluate P 0 (x): The derivative P 0 (x) can be


evaluated by using the Horner’s method with the same efficiency.
Indeed, differentiating (3.90)

P (x) = (x − x0 )Q(x) + P (x0 )

reads
P 0 (x) = Q(x) + (x − x0 )Q0 (x). (3.96)
Thus
P 0 (x0 ) = Q(x0 ). (3.97)
That is, the evaluation of Q at x0 becomes the desired quantity P 0 (x0 ).
3.6. Zeros of Polynomials 107

Example 3.71. Evaluate P 0 (3) for P (x) considered in Example 3.68, the
previous example.
Solution. As in the previous example, we arrange the calculation and carry
out the synthetic division one more time:

Example 3.72. Implement the Horner’s algorithm to evaluate P (3) and


P 0 (3), for the polynomial in (3.94): P (x) = x4 − 4x3 + 7x2 − 5x − 2.
Solution.
horner.m
1 function [p,d] = horner(A,x0)
2 % input: A = [a_0,a_1,...,a_n]
3 % output: p=P(x0), d=P'(x0)
4

5 n = size(A(:),1);
6 p = A(n); d=0;
7

8 for i = n-1:-1:1
9 d = p + x0*d;
10 p = A(i) +x0*p;
11 end

Call_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 [p,d] = horner(a,x0);
4 fprintf(" P(%g)=%g; P'(%g)=%g\n",x0,p,x0,d)
5 Result: P(3)=19; P'(3)=37
108 Chapter 3. Programming with Calculus

Example 3.73. Let P (x) = x4 − 4x3 + 7x2 − 5x − 2, as in (3.94). Use the


Newton’s method and the Horner’s method to implement a code and find an
approximate zero of P near 3.
Solution.
newton_horner.m
1 function [x,it] = newton_horner(A,x0,tol,itmax)
2 % input: A = [a_0,a_1,...,a_n]; x0: initial for P(x)=0
3 % outpue: x: P(x)=0
4

5 x = x0;
6 for it=1:itmax
7 [p,d] = horner(A,x);
8 h = -p/d;
9 x = x + h;
10 if(abs(h)<tol), break; end
11 end
Call_newton_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 tol = 10^-12; itmax=1000;
4 [x,it] = newton_horner(a,x0,tol,itmax);
5 fprintf(" newton_horner: x0=%g; x=%g, in %d iterations\n",x0,x,it)
6 Result: newton_horner: x0=3; x=2, in 7 iterations

Figure 3.7: Polynomial P (x) = x4 − 4x3 + 7x2 − 5x − 2. Its two zeros are −0.275682 and 2.
3.6. Zeros of Polynomials 109

Exercises for Chapter 3

3.1. In Example 3.5, we considered the curve y = |x² − 1|. Find the left-hand and
right-hand limits of the difference quotient at x_0 = 1.
Ans: −2 and 2.
3.2. The number e is determined so that the slope of the graph of y = ex at x = 0 is exactly
1. Let h be a point near 0. Then

\[ Q(h) := \frac{e^h - e^0}{h - 0} = \frac{e^h - 1}{h} \]
represents the average slope of the graph between the two points (0, 1) and (h, e^h).
Evaluate Q(h), for h = 0.1, 0.01, 0.001, 0.0001. What can you say about the results?
Ans: For example, Q(0.01) = 1.0050.

3.3. Recall the Taylor series for e^x, cos x and sin x in (3.32). Let x = iθ, where i = √−1.
Then
\[ e^{i\theta} = 1 + i\theta + \frac{i^2\theta^2}{2!} + \frac{i^3\theta^3}{3!} + \frac{i^4\theta^4}{4!} + \frac{i^5\theta^5}{5!} + \frac{i^6\theta^6}{6!} + \cdots \tag{3.98} \]
(a) Prove that e^{iθ} = cos θ + i sin θ, which is called the Euler's identity.
(b) Prove that e^{iπ} + 1 = 0.

3.4. Implement a code to visualize complex-valued solutions of ez = −1.

• Use fimplicit
• Visualize, with ylim([-2*pi 4*pi]), yticks(-pi:pi:3*pi)

Hint : Use the code in § 2.2, starting with


eulers_identity.m
1 syms x y real
2 z = x+1i*y;
3

4 %% ---- Euler's identity


5 g = exp(z)+1;
6 RE = simplify(real(g))
7 IM = simplify(imag(g))
8

9 A = @(x,y) <Copy RE appropriately>


10 B = @(x,y) <Copy IM appropriately>
11

12 %%--- Solve A=0 and B=0 --------------


13
110 Chapter 3. Programming with Calculus

3.5. Derive the following midpoint formula


\[ f''(x_0) = \frac{-f_{-2} + 16f_{-1} - 30f_0 + 16f_1 - f_2}{12h^2} + \frac{h^4}{90}f^{(6)}(x_0) + \frac{h^6}{1008}f^{(8)}(x_0) + \cdots \tag{3.99} \]
Hint : Use the technique in Example 3.56, with f (x0 + ih), i = −2, −1, 0, 1, 2.
3.6. Use your calculator (or pencil-and-paper) to run two iterations of Newton’s method to
find x2 for given f and x0 .
(a) f (x) = x4 − 2, x0 = 1
(b) f (x) = xex − 1, x0 = 0.5
Ans: (b) x2 = 0.56715557
3.7. The graphs of y = x2 (x + 1) and y = 1/x (x > 0) intersect at one point x = r. Use
Newton’s method to estimate the value of r to eight decimal places.

3.8. Let f(x) = cos x + sin x be defined on the interval [−1, 1].
(a) How many equally spaced nodes are required to interpolate f to within 10−8 on
the interval?
(b) Evaluate the interpolating polynomial at the midpoint of a subinterval and
verify that the error is not larger than 10−8 .
Hint: (a). Recall the formula: \(|f(x) - P_n(x)| \le \frac{h^{n+1}}{4(n+1)}M\). Then, for n, solve
\[ \frac{(2/n)^{n+1}}{4(n+1)}\sqrt{2} \le 10^{-8}. \]

3.9. Use the most accurate three-point formulas to determine the missing entries.
x f (x) f 0 (x) f 00 (x)
1.0 2.0000 6.00
1.2 1.7536
1.4 1.9616
1.6 2.8736
1.8 4.7776
2.0 8.0000
2.2 12.9056 52.08
3.6. Zeros of Polynomials 111

Hint : The central scheme is more accurate than one-sided schemes.


3.10. Consider the polynomial

P (x) = 3x5 − 7x4 − 5x3 + x2 − 8x + 2.

(a) Use the Horner’s algorithm to find P (4).


(b) Use the Newton's method to find a real-valued root, starting with x_0 = 4, and
applying the Horner's algorithm for the evaluation of P(x_k) and P'(x_k).
112 Chapter 3. Programming with Calculus
4
C HAPTER

Linear Algebra Basics

Real-world systems can be approximated/represented as a system of


linear equations
\[ A\mathbf{x} = \mathbf{b}, \qquad A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m\times n}, \tag{4.1} \]

where b is the source and x is the solution.


In this chapter, we will study topics in linear algebra basics including
• Elementary row operations
• Row reduction algorithm
• Linear independence
• Invertible matrices (m = n)

Contents of Chapter 4
4.1. Solutions of Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.2. Row Reduction and the General Solution of Linear Systems . . . . . . . . . . . . . . . 119
4.3. Linear Independence and Span of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4. Invertible Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

113
114 Chapter 4. Linear Algebra Basics

4.1. Solutions of Linear Systems


Definition 4.1. A linear equation in the variables x1 , x2 , · · · , xn is an
equation that can be written in the form

a1 x1 + a2 x2 + · · · + an xn = b, (4.2)

where b and the coefficients a1 , a2 , · · · , an are real or complex numbers.

A system of linear equations (or a linear system) is a collection of one


or more linear equations involving the same variables – say, x1 , x2 , · · · , xn .
Example 4.2.
(a) \(\begin{cases} 4x_1 - x_2 = 3 \\ 2x_1 + 3x_2 = 5 \end{cases}\)    (b) \(\begin{cases} 2x + 3y - 4z = 2 \\ x - 2y + z = 1 \\ 3x + y - 2z = -1 \end{cases}\)

• Solution: A solution of the system is a list [s1 , s2 , · · · , sn ] of numbers


that makes each equation a true statement, when
[x1 , x2 , · · · , xn ] ← [s1 , s2 , · · · , sn ].

• Solution Set: The set of all possible solutions is called the solution
set of the linear system.
• Equivalent System: Two linear systems are called equivalent if
they have the same solution set.
For example, Example 4.2 (a) is equivalent to
\[ \begin{cases} 2x_1 - 4x_2 = -2 \\ 2x_1 + 3x_2 = 5 \end{cases} \qquad (R_1 \leftarrow R_1 - R_2) \]
4.1. Solutions of Linear Systems 115

Remark 4.3. Linear systems may have
• no solution (an inconsistent system), or
• exactly one (unique) solution or infinitely many solutions (a consistent system).

Example 4.4. Consider the case of two equations in two unknowns.
(a) \(\begin{cases} -x + y = 1 \\ -x + y = 3 \end{cases}\)   (b) \(\begin{cases} x + y = 1 \\ x - y = 2 \end{cases}\)   (c) \(\begin{cases} -2x + y = 2 \\ -4x + 2y = 4 \end{cases}\)

4.1.1. Solving a linear system


Consider a simple system of 2 linear equations:
\[ \begin{cases} -2x_1 + 3x_2 = -1 \\ x_1 + 2x_2 = 4 \end{cases} \tag{4.3} \]
Such a system of linear equations can be treated much more conveniently
and efficiently with matrix form; (4.3) reads
\[ \underbrace{\begin{bmatrix} -2 & 3 \\ 1 & 2 \end{bmatrix}}_{\text{coefficient matrix}} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -1 \\ 4 \end{bmatrix}. \tag{4.4} \]
The essential information of the system can be recorded compactly in a
rectangular array called an augmented matrix:
\[ \begin{bmatrix} -2 & 3 & -1 \\ 1 & 2 & 4 \end{bmatrix} \quad\text{or}\quad \left[\begin{array}{cc|c} -2 & 3 & -1 \\ 1 & 2 & 4 \end{array}\right] \tag{4.5} \]
116 Chapter 4. Linear Algebra Basics

Solving (4.3):

  System of linear equations          Matrix form
    −2x_1 + 3x_2 = −1                 [ −2  3 | −1 ]
      x_1 + 2x_2 =  4                 [  1  2 |  4 ]

  R_1 ↔ R_2 : (interchange)
      x_1 + 2x_2 =  4                 [  1  2 |  4 ]
    −2x_1 + 3x_2 = −1                 [ −2  3 | −1 ]

  R_2 ← R_2 + 2·R_1 : (replacement)
      x_1 + 2x_2 = 4                  [  1  2 |  4 ]
            7x_2 = 7                  [  0  7 |  7 ]

  R_2 ← R_2/7 : (scaling)
      x_1 + 2x_2 = 4                  [  1  2 |  4 ]
             x_2 = 1                  [  0  1 |  1 ]

  R_1 ← R_1 − 2·R_2 : (replacement)
      x_1 = 2                         [  1  0 |  2 ]
      x_2 = 1                         [  0  1 |  1 ]

At the last step:
  LHS: the solution [x_1, x_2]^T = [2, 1]^T;    RHS: the identity I.
4.1. Solutions of Linear Systems 117

Tools 4.5. Three Elementary Row Operations (ERO):


• Replacement: Replace one row by the sum of itself and a multiple of another row:
    R_i ← R_i + k·R_j,  j ≠ i
• Interchange: Interchange two rows:
    R_i ↔ R_j,  j ≠ i
• Scaling: Multiply all entries in a row by a nonzero constant:
    R_i ← k·R_i,  k ≠ 0
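In Matlab, each ERO is a one-line row update on the matrix. A minimal sketch, applied to the augmented matrix in (4.5) (the variable name Ab is ours):

ero_demo.m (a sketch)
Ab = [-2 3 -1; 1 2 4];           % augmented matrix of (4.5)
Ab([1 2],:) = Ab([2 1],:);       % interchange:  R1 <-> R2
Ab(2,:) = Ab(2,:) + 2*Ab(1,:);   % replacement:  R2 <- R2 + 2*R1
Ab(2,:) = Ab(2,:)/7;             % scaling:      R2 <- R2/7
Ab(1,:) = Ab(1,:) - 2*Ab(2,:);   % replacement:  R1 <- R1 - 2*R2
disp(Ab)                         % [1 0 2; 0 1 1], i.e., x1 = 2, x2 = 1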

Definition 4.6. Two matrices are row equivalent if there is a se-


quence of EROs that transforms one matrix to the other.

4.1.2. Matrix equation Ax = b


A fundamental idea in linear algebra is to view a linear combination of
vectors as a product of a matrix and a vector.
Definition 4.7. Let A = [a_1 a_2 · · · a_n] be an m × n matrix and x ∈ R^n,
then the product of A and x, denoted by Ax, is the linear combination
of columns of A using the corresponding entries of x as weights, i.e.,
\[ A\mathbf{x} = [\mathbf{a}_1\ \mathbf{a}_2\ \cdots\ \mathbf{a}_n]\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n. \tag{4.6} \]

A matrix equation is of the form Ax = b, where b is a column vector


of size m × 1.

Example 4.8. x = [x_1, x_2]^T = [−3, 2]^T is the solution of the linear system.

  Linear system:     x_1 + 2x_2 = 1,  3x_1 + 4x_2 = −1
  Matrix equation:   \(\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\)
  Vector equation:   \(x_1\begin{bmatrix} 1 \\ 3 \end{bmatrix} + x_2\begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\)
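A quick Matlab check that the product Ax is exactly the column combination in (4.6), using the data of Example 4.8 (a sketch):

A = [1 2; 3 4];  x = [-3; 2];
A*x                               % matrix-vector product: [1; -1]
x(1)*A(:,1) + x(2)*A(:,2)         % same linear combination of columns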
118 Chapter 4. Linear Algebra Basics

Theorem 4.9. Let A = [a1 a2 · · · an ] be an m × n matrix, x ∈ Rn , and


b ∈ Rm . Then the matrix equation
Ax = b (4.7)

has the same solution set as the vector equation


x1 a1 + x2 a2 + · · · + xn an = b, (4.8)

which, in turn, has the same solution set as the system with augmented
matrix
[a1 a2 · · · an : b]. (4.9)

Two Fundamental Questions about a Linear System:


1. (Existence): Is the system consistent; that is, does at least one
solution exist?
2. (Uniqueness): If a solution exists, is it the only one; that is, is the
solution unique?

Example 4.10. Determine the values of h such that the given system is a
consistent linear system
x + h y = −5
2x − 8y = 6

Solution.

Ans: h 6= −4
4.2. Row Reduction and the General Solution of Linear Systems 119

4.2. Row Reduction and the General Solution


of Linear Systems
Example 4.11. Solve the following system of linear equations, using the
three EROs. Then, determine if the system is consistent.

x2 − 2x3 = 0
x1 − 2x2 + 2x3 = 3
4x1 − 8x2 + 6x3 = 14

Solution.

Ans: x = [1, −2, −1]T


Note: The system of linear equations can be solved by transforming the
augmented matrix to the reduced row echelon form (rref).
linear_equations_rref.m
1 A = [0 1 -2; 1 -2 2; 4 -8 6];
2 b = [0; 3; 14];
3

4 Ab = [A b];
5 rref(Ab)

Result
1 ans =
2 1 0 0 1
3 0 1 0 -2
4 0 0 1 -1
120 Chapter 4. Linear Algebra Basics

4.2.1. Echelon Forms and the Row Reduction Algorithm

Definition 4.12. Echelon form : A rectangular matrix is in an eche-


lon form if it has following properties.
1. All nonzero rows are above any zero rows (rows of all zeros).
2. Each leading entry in a row is in a column to the right of leading
entry of the row above it.
3. All entries below a leading entry in a column are zeros.
Row reduced echelon form : If a matrix in an echelon form sat-
isfies 4 and 5 below, then it is in the row reduced echelon form
(RREF), or the reduced echelon form (REF).
4. The leading entry in each nonzero row is 1.
5. Each leading 1 is the only nonzero entry in its column.

Example 4.13. Verify whether the following matrices are in echelon form,
row reduced echelon form.
(a) \(\begin{bmatrix} 1 & 0 & 2 & 0 & 1 \\ 0 & 1 & 3 & 0 & 4 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}\)   (b) \(\begin{bmatrix} 2 & 0 & 0 & 5 \\ 0 & 0 & 0 & 9 \\ 0 & 1 & 0 & 6 \end{bmatrix}\)   (c) \(\begin{bmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}\)
(d) \(\begin{bmatrix} 1 & 1 & 2 & 2 & 3 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 4 \end{bmatrix}\)   (e) \(\begin{bmatrix} 1 & 0 & 0 & 5 \\ 0 & 1 & 0 & 6 \\ 0 & 0 & 0 & 1 \end{bmatrix}\)   (f) \(\begin{bmatrix} 0 & 1 & 0 & 5 \\ 0 & 0 & 0 & 6 \\ 0 & 0 & 1 & 2 \end{bmatrix}\)

Solution.
4.2. Row Reduction and the General Solution of Linear Systems 121

Terminologies
1) A pivot position is a location in A that corresponds to a leading 1
in the reduced echelon form of A.
2) A pivot column is a column of A that contains a pivot position.

Example 4.14. The matrix A is given with its reduced echelon form. Find
the pivot positions and pivot columns of A.
\[ A = \begin{bmatrix} 1 & 1 & 0 & 2 & 0 \\ 1 & 1 & 1 & 3 & 0 \\ 1 & 1 & 0 & 2 & 4 \end{bmatrix} \xrightarrow{\text{R.E.F.}} \begin{bmatrix} 1 & 1 & 0 & 2 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \]
Solution.
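In Matlab, rref can return a second output with the indices of the pivot columns, which answers this example directly; a sketch:

A = [1 1 0 2 0; 1 1 1 3 0; 1 1 0 2 4];
[R,jb] = rref(A);   % R: reduced echelon form, jb: pivot column indices
R, jb               % jb = [1 3 5]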

Terminologies
3) Basic variables: In the system Ax = b, the variables that corre-
spond to pivot columns (in [A : b]) are basic variables.
4) Free variables: In the system Ax = b, the variables that correspond
to non-pivotal columns are free variables.

Example 4.15. For the system of linear equations, identify its basic vari-
ables and free variables.

 −x1 − 2x2
 = −3
2x3 = 4

 3x3 = 6
Solution. Hint : You may start with its augmented matrix, and apply row operations.
122 Chapter 4. Linear Algebra Basics

Row Reduction Algorithm


Example 4.16. Row reduce the matrix into the reduced echelon form.
\[ A = \begin{bmatrix} 0 & -3 & -6 & 4 & 9 \\ -2 & -3 & 0 & 3 & -1 \\ 1 & 4 & 5 & -9 & -7 \end{bmatrix} \]
Solution.
f:  A  −(R1↔R3)→  [1 4 5 −9 −7; −2 −3 0 3 −1; 0 −3 −6 4 9]
f:     −(R2←R2+2R1)→  [1 4 5 −9 −7; 0 5 10 −15 −15; 0 −3 −6 4 9]
f:     −(R2←R2/5)→  [1 4 5 −9 −7; 0 1 2 −3 −3; 0 −3 −6 4 9]
f:     −(R3←R3+3R2)→  [1 4 5 −9 −7; 0 1 2 −3 −3; 0 0 0 −5 0]
b:     −(R3←R3/(−5))→  [1 4 5 −9 −7; 0 1 2 −3 −3; 0 0 0 1 0]
b:     −(R1←R1+9R3, R2←R2+3R3)→  [1 4 5 0 −7; 0 1 2 0 −3; 0 0 0 1 0]
b:     −(R1←R1−4R2)→  [1 0 −3 0 5; 0 1 2 0 −3; 0 0 0 1 0]

The combination of operations in Line f is called the forward phase of


the row reduction, while that of Line b is called the backward phase.

Remark 4.17. Pivot Positions


Once a matrix is in an echelon form, further row operations do not
change the positions of leading entries. Thus, the leading entries be-
come the leading 1’s in the reduced echelon form.

Uniqueness of the Reduced Echelon Form


Theorem 4.18. Each matrix is row equivalent to one and only one
reduced echelon form.
4.2. Row Reduction and the General Solution of Linear Systems 123

4.2.2. The General Solution of Linear Systems

1) For example, for an augmented matrix, its R.E.F. is given as
\[ \begin{bmatrix} 1 & 0 & -5 & 1 \\ 0 & 1 & 1 & 4 \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{4.10} \]
2) Then, the associated system of equations reads
\[ \begin{aligned} x_1 - 5x_3 &= 1 \\ x_2 + x_3 &= 4 \\ 0 &= 0 \end{aligned} \tag{4.11} \]
where {x_1, x_2} are basic variables (∵ pivots).
3) Rewrite (4.11) as
\[ \begin{cases} x_1 = 1 + 5x_3 \\ x_2 = 4 - x_3 \\ x_3 \ \text{is free} \end{cases} \tag{4.12} \]
4) The system (4.12) can be expressed as
\[ \begin{cases} x_1 = 1 + 5x_3 \\ x_2 = 4 - x_3 \\ x_3 = x_3 \end{cases} \tag{4.13} \]
5) Thus, the solution of (4.11) can be written as
\[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 4 \\ 0 \end{bmatrix} + x_3\begin{bmatrix} 5 \\ -1 \\ 1 \end{bmatrix}, \tag{4.14} \]
in which you are free to choose any value for x_3. (That is why it is called a “free variable”.)

• The description in (4.14) is called a parametric description of solu-


tion set; the free variable x3 acts as a parameter.
• The solution in (4.14) represents all the solutions of the system
(4.10), which is called the general solution of the system.
124 Chapter 4. Linear Algebra Basics

Example 4.19. Find the general solution of the system whose augmented
matrix is
\[ [A\,|\,\mathbf{b}] = \begin{bmatrix} 1 & 0 & -5 & 0 & -8 & 3 \\ 0 & 1 & 4 & -1 & 0 & 6 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \]
Solution. Hint : You should first row reduce it for the reduced echelon form.

Example 4.20. Choose h and k such that the system has

a) No solution   b) Unique solution   c) Many solutions
\[ \begin{cases} x_1 - 3x_2 = 1 \\ 2x_1 + hx_2 = k \end{cases} \]
Solution.

Ans: (a) h = −6, k 6= 2


4.2. Row Reduction and the General Solution of Linear Systems 125

Example 4.21. Find the general solution of the system of which the augmented matrix is
\[ [A\,|\,\mathbf{b}] = \begin{bmatrix} 1 & 0 & 0 & 1 & 7 \\ 0 & 1 & 3 & 0 & -1 \\ 2 & -1 & -3 & 2 & 15 \\ 1 & 0 & -1 & 0 & 4 \end{bmatrix} \]
Solution.
linear_equations_rref.m
1 Ab = [1 0 0 1 7; 0 1 3 0 -1; 2 -1 -3 2 15; 1 0 -1 0 4];
2 rref(Ab)

Result
1 ans =
2 1 0 0 1 7
3 0 1 0 -3 -10
4 0 0 1 1 3
5 0 0 0 0 0

True-or-False 4.22.
a. The row reduction algorithm applies to only to augmented matrices for
a linear system.
b. If one row in an echelon form of an augmented matrix is [0 0 0 0 2 0],
then the associated linear system is inconsistent.
c. The pivot positions in a matrix depend on whether or not row inter-
changes are used in the row reduction process.
d. Reducing a matrix to an echelon form is called the forward phase of
the row reduction process.
Solution.
Ans: F,F,F,T
126 Chapter 4. Linear Algebra Basics

4.3. Linear Independence and Span of Vectors


Definition 4.23. A set of vectors S = {v1 , v2 , · · · , vp } in Rn is said to be
linearly independent, if the vector equation

x1 v1 + x2 v2 + · · · + xp vp = 0 (4.15)

has only the trivial solution (i.e., x1 = x2 = · · · = xp = 0). The set


of vectors S is said to be linearly dependent, if there exist weights
c1 , c2 , · · · , cp , not all zero, such that

c1 v1 + c2 v2 + · · · + cp vp = 0. (4.16)

Note: A vector in a linearly independent set S cannot be expressed as a


linear combination of other vectors in S.

Example 4.24. Determine if the set {v_1, v_2} is linearly independent.
1) v_1 = \(\begin{bmatrix} 3 \\ 0 \end{bmatrix}\), v_2 = \(\begin{bmatrix} 0 \\ 5 \end{bmatrix}\)    2) v_1 = \(\begin{bmatrix} 3 \\ 0 \end{bmatrix}\), v_2 = \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\)

Remark 4.25. Let A = [v1 v2 · · · vp ]. The matrix equation Ax = 0 is


equivalent to x1 v1 + x2 v2 + · · · + xp vp = 0.
1. Columns of A are linearly independent if and only if Ax = 0 has
only the trivial solution. ( ⇔ Ax = 0 has no free variable ⇔ Every
column in A is a pivot column.)
2. Columns of A are linearly dependent if and only if Ax = 0 has
nontrivial solution. ( ⇔ Ax = 0 has at least one free variable ⇔ A
has at least one non-pivot column.)
4.3. Linear Independence and Span of Vectors 127

Example 4.26. Determine if the vectors are linearly independent.
\[ \begin{bmatrix} 0 \\ 2 \\ 3 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ -8 \end{bmatrix}, \quad \begin{bmatrix} -1 \\ 3 \\ 1 \end{bmatrix} \]
Solution.

Example 4.27. Determine if the vectors are linearly independent.
\[ \begin{bmatrix} 1 \\ -2 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} -2 \\ 4 \\ 1 \end{bmatrix}, \quad \begin{bmatrix} 3 \\ -6 \\ -1 \end{bmatrix}, \quad \begin{bmatrix} 2 \\ 2 \\ 3 \end{bmatrix} \]
Solution.

Note: In the above example, vectors are in Rn , n = 3; the number of vectors p = 4.


As in this example, if p > n then the vectors must be linearly dependent.
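The conclusion can also be confirmed numerically: the vectors are linearly independent exactly when the rank of the matrix having them as columns equals the number of columns. A sketch for Example 4.27:

V = [1 -2 3 2; -2 4 -6 2; 0 1 -1 3];   % the four vectors, as columns
rank(V)                                 % 3
size(V,2)                               % 4 > rank  =>  linearly dependent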
128 Chapter 4. Linear Algebra Basics

Definition 4.28. Let v1 , v2 , · · · , vp be p vectors in Rn . Then


Span{v1 , v2 , · · · , vp } is the collection of all linear combination of
v1 , v2 , · · · , vp , that can be written in the form c1 v1 + c2 v2 + · · · + cp vp ,
where c1 , c2 , · · · , cp are weights. That is,

Span{v1 , v2 , · · · , vp } = {y | y = c1 v1 + c2 v2 + · · · + cp vp } (4.17)

Example 4.29. Find the value of h so that c is in Span{a, b}.
\[ \mathbf{a} = \begin{bmatrix} 3 \\ -6 \\ 1 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} -6 \\ 4 \\ -3 \end{bmatrix}, \quad \mathbf{c} = \begin{bmatrix} 9 \\ h \\ 3 \end{bmatrix} \]
Solution.

True-or-False 4.30.
a. The columns of any 3 × 4 matrix are linearly dependent.
b. If u and v are linearly independent, and if {u, v, w} is linearly depen-
dent, then w ∈ Span{u, v}.
c. Two vectors are linearly dependent if and only if they lie on a line
through the origin.
d. The columns of a matrix A are linearly independent, if the equation
Ax = 0 has the trivial solution.
Ans: T,T,T,F
4.4. Invertible Matrices 129

4.4. Invertible Matrices


Definition 4.31. An n × n (square) matrix A is said to be invertible
(nonsingular) if there is an n × n matrix B such that AB = In = BA,
where In is the identity matrix.

Note: In this case, B is the unique inverse of A denoted by A−1 .


(Thus AA^{-1} = I_n = A^{-1}A.)
Example 4.32. If A = \(\begin{bmatrix} 2 & 5 \\ -3 & -7 \end{bmatrix}\) and B = \(\begin{bmatrix} -7 & -5 \\ 3 & 2 \end{bmatrix}\), find AB and BA.
Solution.

Theorem 4.33. (Inverse of an n × n matrix, n ≥ 2) An n × n matrix


A is invertible if and only if A is row equivalent to In ; in this case,
any sequence of elementary row operations that reduces A into In will
also reduce In to A−1 .

Algorithm 4.34. Algorithm to find A−1 :


1) Row reduce the augmented matrix [A : In ]
2) If A is row equivalent to In , then [A : In ] is row equivalent to
[In : A−1 ]. Otherwise A does not have any inverse.

Note: For the system Ax = b, when A is invertible,

[A : b] → · · · → [In : x] ⇒ x = A−1 b. (4.18)


130 Chapter 4. Linear Algebra Basics
Example 4.35. Find the inverse of A = \(\begin{bmatrix} 3 & 2 \\ 8 & 5 \end{bmatrix}\).
Solution. You may begin with
\[ [A : I_2] = \begin{bmatrix} 3 & 2 & 1 & 0 \\ 8 & 5 & 0 & 1 \end{bmatrix} \]

 
Self-study 4.36. Use pencil-and-paper to find the inverse of A = \(\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 3 \\ 4 & -3 & 8 \end{bmatrix}\), if it exists.
Solution.

When it is implemented:
inverse_matrix.m
1 A = [0 1 0
2 1 0 3
3 4 -3 8];
4 I = eye(3);
5

6 AI = [A I];
7 rref(AI)

Result
1 ans =
2 1.0000 0 0 2.2500 -2.0000 0.7500
3 0 1.0000 0 1.0000 0 0
4 0 0 1.0000 -0.7500 1.0000 -0.2500
4.4. Invertible Matrices 131

Definition 4.37. Given an m × n matrix A, the transpose of A is the


matrix, denoted by AT , whose columns are formed from the correspond-
ing rows of A. That is,
A = [aij ] ∈ Rm×n ⇒ AT = [aji ] ∈ Rn×m . (4.19)

 
Example 4.38. If A = \(\begin{bmatrix} 1 & 4 & 8 & 1 \\ 0 & -2 & -1 & 3 \\ 9 & 0 & 0 & 5 \end{bmatrix}\), then A^T =

Theorem 4.39. Properties of Invertible Matrices
a. (Inverse of a 2 × 2 matrix) Let A = \(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\). If ad − bc ≠ 0, then A
is invertible and
\[ A^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{4.20} \]

b. If A is an invertible matrix, then A−1 is also invertible and


(A−1 )−1 = A.
c. If A and B are n × n invertible matrices then AB is also invertible
and (AB)−1 = B −1 A−1 .
d. If A is invertible, then AT is also invertible and (AT )−1 = (A−1 )T .
e. If A is an n × n invertible matrix, then for each b ∈ Rn , the equation
Ax = b has a unique solution x = A−1 b.
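Properties c and d are easy to sanity-check numerically; a sketch using the matrices of Examples 4.32 and 4.35:

A = [2 5; -3 -7];  B = [3 2; 8 5];
norm(inv(A*B) - inv(B)*inv(A))   % ~0 :  (AB)^{-1} = B^{-1} A^{-1}
norm(inv(A') - inv(A)')          % ~0 :  (A^T)^{-1} = (A^{-1})^T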
132 Chapter 4. Linear Algebra Basics

Theorem 4.40. Invertible Matrix Theorem


Let A be an n × n matrix. Then the following are equivalent.
a. A is an invertible matrix. (Def: There is B s.t. AB = BA = I)
b. A is row equivalent to the n × n identity matrix.
c. A has n pivot positions.
d. The equation Ax = 0 has only the trivial solution x = 0.
e. The columns of A are linearly independent.
f. The linear transformation x 7→ Ax is one-to-one.
g. The equation Ax = b has a unique solution for each b ∈ Rn .
h. The columns of A span Rn .
i. The linear transformation x 7→ Ax maps Rn onto Rn .
j. There is a matrix C ∈ Rn×n such that CA = I
k. There is a matrix D ∈ Rn×n such that AD = I
l. AT is invertible and (AT )−1 = (A−1 )T .
More statements will be added in the coming sections;
see Theorem 5.10, p.140, and Theorem 5.17, p.142.

Note: Let A and B be square matrices. If AB = I, then A and B are both


invertible, with B = A−1 and A = B −1 .

Example 4.41. Use the Invertible Matrix Theorem to decide if A is invert-


ible:
\[ A = \begin{bmatrix} 1 & 0 & -2 \\ 3 & 1 & -2 \\ -5 & -1 & 9 \end{bmatrix} \]
4.4. Invertible Matrices 133

Exercises for Chapter 4

4.1. Find the general solutions of the systems (in parametric vector form) whose aug-
mented matrices are given as
(a) \(\begin{bmatrix} 1 & -7 & 0 & 6 & 5 \\ 0 & 0 & 1 & -2 & -3 \\ -1 & 7 & -4 & 2 & 7 \end{bmatrix}\)    (b) \(\begin{bmatrix} 1 & 2 & -5 & -6 & 0 & -5 \\ 0 & 1 & -6 & -3 & 0 & 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}\)
Ans: (a) x = [5, 0, −3, 0]T + x2 [7, 1, 0, 0]T + x4 [−6, 0, 2, 1]T ;
Ans: (b) x = [−9, 2, 0, 0, 0]T + x3 [−7, 6, 1, 0, 0]T + x4 [0, 3, 0, 1, 0]T .
4.2. In the following, we use the notation for matrices in echelon form: the leading entries
with , and any values (including zero) with ∗. Suppose each matrix represents the
augmented matrix for a system of linear equations. In each case, determine if the
system is consistent. If the system is consistent, determine if the solution is unique.

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
     
0
(a) 0 ∗ ∗ (b) 0 0 ∗ ∗ (c) 0 0 ∗ ∗
0 0 ∗ 0 0 0 0 0 0 0 ∗

4.3. Suppose the coefficient matrix of a system of linear equations has a pivot position in
every row. Explain why the system is consistent.
4.4. (a) For what values of h is v3 in Span{v1 , v2 }, and (b) for what values of h is {v1 , v2 , v3 }
linearly dependent? Justify each answer.
\[ \mathbf{v}_1 = \begin{bmatrix} 1 \\ -3 \\ 2 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} -3 \\ 9 \\ -6 \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix} 5 \\ -7 \\ h \end{bmatrix}. \]
Ans: (a) No h; (b) All h
4.5. Find the inverses of the matrices, if they exist: A = \(\begin{bmatrix} 3 & -4 \\ 7 & -8 \end{bmatrix}\) and B = \(\begin{bmatrix} 1 & -2 & 1 \\ 4 & -7 & 3 \\ -2 & 6 & -4 \end{bmatrix}\)
Ans: B is not invertible.
4.6. Describe the possible echelon forms of the matrix. Use the notation of Exercise 2
above.
(a) A is a 3 × 3 matrix with linearly independent columns.
(b) A is a 2 × 2 matrix with linearly dependent columns.
(c) A is a 4 × 2 matrix, A = [a1 , a2 ] and a2 is not a multiple of a1 .

4.7. If C is 6 × 6 and the equation Cx = v is consistent for every v ∈ R6 , is it possible that


for some v, the equation Cx = v has more than one solution? Why or why not?
Ans: No
134 Chapter 4. Linear Algebra Basics
5
C HAPTER

Programming with Linear Algebra

In Chapter 4, we studied linear algebra basics. In this chapter, we will


consider popular subjects in linear algebra, which are applicable for
real-world problems through programming.

Contents of Chapter 5
5.1. Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.2. Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.3. Dot Product, Length, and Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.4. Vector Norms, Matrix Norms, and Condition Numbers . . . . . . . . . . . . . . . . . . 151
5.5. Power Method and Inverse Power Method for Eigenvalues . . . . . . . . . . . . . . . . 155
Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

135
136 Chapter 5. Programming with Linear Algebra

5.1. Determinants
Definition 5.1. Let A be an n × n square matrix. Then the determi-
nant of A is a scalar value, denoted by det A or |A|.
1) Let A = [a] ∈ R1 × 1 . Then det A = a.
" #
a b
2) Let A = ∈ R2 × 2 . Then det A = ad − bc.
c d
" #
2 1
Example 5.2. Let A = . Consider a linear transformation T : R2 → R2
0 3
defined by T (x) = Ax.

(a) Find the determinant of A.


(b) Determine the image of a rectangle R = [0, 2] × [0, 1] under T .
(c) Find the area of the image.
(d) Figure out how det A, the area of the rectangle (= 2), and the area of
the image are related.
Solution.

Ans: (c) 12
Note: The determinant can be viewed as a volume scaling factor.
5.1. Determinants 137

Definition 5.3. Let Aij be the submatrix of A obtained by deleting row


i and column j of A. Then the (i, j)-cofactor of A = [aij ] is the scalar Cij ,
given by
Cij = (−1)i+j det Aij . (5.1)

Definition 5.4. For n ≥ 2, the determinant of an n × n matrix A = [aij ]


is given by the following formulas:
1. The cofactor expansion across the first row:

det A = a11 C11 + a12 C12 + · · · + a1n C1n (5.2)

2. The cofactor expansion across the row i:

det A = ai1 Ci1 + ai2 Ci2 + · · · + ain Cin (5.3)

3. The cofactor expansion down the column j:

det A = a1j C1j + a2j C2j + · · · + anj Cnj (5.4)


 
Example 5.5. Find the determinant of A = \(\begin{bmatrix} 1 & 5 & 0 \\ 2 & 4 & -1 \\ 0 & -2 & 0 \end{bmatrix}\), by expanding
across the first row and down column 3.
Solution.

Ans: −2
138 Chapter 5. Programming with Linear Algebra

Note: If A is a triangular (upper or lower) matrix, then det A is the


product of entries on the main diagonal of A.
 
Example 5.6. Compute the determinant of A = \(\begin{bmatrix} 1 & -2 & 5 & 2 \\ 0 & -6 & -7 & 5 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{bmatrix}\).
Solution.

determinant.m
1 A = [1 -2 5 2; 0 -6 -7 5; 0 0 3 0; 0 0 0 4];
2 det(A)

Result
1 ans =
2 -72

Remark 5.7. The matrix A in Example 5.6 has a pivot position in each
column ⇒ It is invertible.
5.1. Determinants 139

Properties of Determinants

Theorem 5.8. Let A be an n × n square matrix.


a) (Replacement): If B is obtained from A by a row replacement, then
det B = det A.
A = \(\begin{bmatrix} 1 & 3 \\ 2 & 1 \end{bmatrix}\),   B = \(\begin{bmatrix} 1 & 3 \\ 0 & -5 \end{bmatrix}\)

b) (Interchange): If two rows of A are interchanged to form B, then


det B = −det A.
A = \(\begin{bmatrix} 1 & 3 \\ 2 & 1 \end{bmatrix}\),   B = \(\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\)

c) (Scaling): If one row of A is multiplied by k (6= 0), then


det B = k · det A.
A = \(\begin{bmatrix} 1 & 3 \\ 2 & 1 \end{bmatrix}\),   B = \(\begin{bmatrix} 1 & 3 \\ -4 & -2 \end{bmatrix}\)

 
Example 5.9. Compute det A, where A = \(\begin{bmatrix} 1 & -4 & 2 \\ -2 & 8 & -9 \\ -1 & 7 & 0 \end{bmatrix}\), after applying
some elementary row operations.
Solution.

Ans: 15
140 Chapter 5. Programming with Linear Algebra

Theorem 5.10. Invertible Matrix Theorem (p.132)


A square matrix A is invertible ⇔ (Ax = 0 ⇒ x = 0)
m. det A 6= 0 (Note: det is a volume scaling factor)

Remark 5.11. Let A and B be n × n matrices.


a) det A^T = det A.
A = \(\begin{bmatrix} 1 & 3 \\ 2 & 1 \end{bmatrix}\),   A^T = \(\begin{bmatrix} 1 & 2 \\ 3 & 1 \end{bmatrix}\)

b) det(AB) = det A · det B.
A = \(\begin{bmatrix} 1 & 3 \\ 2 & 1 \end{bmatrix}\),   B = \(\begin{bmatrix} 1 & 1 \\ 4 & 2 \end{bmatrix}\);  then AB = \(\begin{bmatrix} 13 & 7 \\ 6 & 4 \end{bmatrix}\).

c) If A is invertible, then det A^{-1} = 1/det A. (∵ det I_n = 1.)
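These identities can be verified in a couple of lines of Matlab; a sketch with the 2 × 2 matrices used above:

A = [1 3; 2 1];  B = [1 1; 4 2];
det(A'),     det(A)            % equal:  -5
det(A*B),    det(A)*det(B)     % equal:  10
det(inv(A)), 1/det(A)          % equal:  -0.2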

Example 5.12. Suppose the 5 × 5 matrices A, A_1, A_2, and A_3 are
related by the following elementary row operations:
\[ A \xrightarrow{R_2 \leftarrow R_2 - 3R_1} A_1 \xrightarrow{R_3 \leftarrow (1/5)R_3} A_2 \xrightarrow{R_4 \leftrightarrow R_5} A_3 \]
Find det A, if A_3 = \(\begin{bmatrix} 1 & 2 & 3 & 4 & 1 \\ 0 & -2 & 1 & -1 & 1 \\ 0 & 0 & 3 & 0 & 1 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}\)
Solution.

Ans: −30
5.2. Eigenvalues and Eigenvectors 141

5.2. Eigenvalues and Eigenvectors


Definition 5.13. Let A be an n × n matrix. An eigenvector of A is a
nonzero vector x such that
Ax = λx (5.5)

for some scalar λ. In this case, the scalar λ is an eigenvalue and x is


the corresponding eigenvector.
Example 5.14. Is \(\begin{bmatrix} -1 \\ 1 \end{bmatrix}\) an eigenvector of \(\begin{bmatrix} 5 & 2 \\ 3 & 6 \end{bmatrix}\)? What is the eigenvalue?
Solution.

Example 5.15. Let A = \(\begin{bmatrix} 1 & 6 \\ 5 & 2 \end{bmatrix}\). Show that 7 is an eigenvalue of matrix A,
and find the corresponding eigenvectors.
Solution. Hint : Start with Ax = 7x. Then (A − 7I)x = 0.
142 Chapter 5. Programming with Linear Algebra

Remark 5.16. Let λ be an eigenvalue of A. Then


(a) The homogeneous system (A−λI) x = 0 has at least one free variable
(∵ x 6= 0).
(b) det (A − λI) = 0.

Theorem 5.17. Invertible Matrix Theorem (p.132)


A square matrix A is invertible ⇔ (Ax = 0 ⇒ x = 0)
n. The number 0 is not an eigenvalue of A

5.2.1. Characteristic Equation


Definition 5.18. The scalar equation det (A − λI) = 0 is called the
characteristic equation of A; the polynomial p(λ) = det (A − λI) is
called the characteristic polynomial of A.
• The solutions of det (A − λI) = 0 are the eigenvalues of A.

Example 5.19. Find the characteristic polynomial, eigenvalues, and corresponding eigenvectors of A = \(\begin{bmatrix} 8 & 2 \\ 3 & 3 \end{bmatrix}\).
Solution.
5.2. Eigenvalues and Eigenvectors 143

Example 5.20. Find the characteristic polynomial and all eigenvalues of


 
\[ A = \begin{bmatrix} 1 & 1 & 0 \\ 6 & 0 & 5 \\ 0 & 0 & 2 \end{bmatrix} \]
eigenvalues.m
1 syms x
2 A = [1 1 0; 6 0 5; 0 0 2];
3

4 polyA = charpoly(A,x)
5 eigenA = solve(polyA)
6 [P,D] = eig(A) % A*P = P*D
7 P*D*inv(P)

Results
1 polyA =
2 12 - 4*x - 3*x^2 + x^3
3

4 eigenA =
5 -2
6 2
7 3
8

9 P =
10 0.4472 -0.3162 -0.6155
11 0.8944 0.9487 -0.6155
12 0 0 0.4924
13 D =
14 3 0 0
15 0 -2 0
16 0 0 2
17

18 ans =
19 1.0000 1.0000 -0.0000
20 6.0000 0.0000 5.0000
21 0 0 2.0000
144 Chapter 5. Programming with Linear Algebra

5.2.2. Matrix Similarity and The Diagonalization Theo-


rem
Definition 5.21. Let A and B be n × n matrices. Then, A is similar to
B, if there is an invertible matrix P such that

A = P BP −1 , or equivalently, P −1 AP = B.

Writing Q = P −1 , we have B = QAQ−1 . So B is also similar to A, and we


say simply that A and B are similar. The map A 7→ P −1 AP is called a
similarity transformation.

The next theorem illustrates one use of the characteristic polynomial, and
it provides the foundation for several iterative methods that approximate
eigenvalues.
Theorem 5.22. If n × n matrices A and B are similar, then they have
the same characteristic polynomial and hence the same eigenval-
ues (with the same multiplicities).

Proof. B = P −1 AP . Then,

B − λI = P −1 AP − λI
= P −1 AP − λP −1 P
= P −1 (A − λI)P,

from which we conclude that

det (B − λI) = det (P −1 ) det (A − λI) det (P ) = det (A − λI).


5.2. Eigenvalues and Eigenvectors 145

Diagonalization
Definition 5.23. An n × n matrix A is said to be diagonalizable if
there exists an invertible matrix P and a diagonal matrix D such that

A = P DP −1 (or P −1 AP = D) (5.6)

That is, diagonalizable matrices are those similar to a diagonal matrix.

Remark 5.24. Let A be diagonalizable, i.e., A = P DP −1 . Then

A2 = (P DP −1 )(P DP −1 ) = P D2 P −1
Ak = P Dk P −1
(5.7)
A−1 = P D−1 P −1 (when A is invertible)
det A = det D

Diagonalization enables us to compute Ak and det A quickly.
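A short Matlab illustration of (5.7), using the matrix of Example 5.20 (eig supplies a numerical P and D; a sketch):

A = [1 1 0; 6 0 5; 0 0 2];
[P,D] = eig(A);               % A = P*D*inv(P)
k = 5;
norm(A^k - P*D^k*inv(P))      % ~0 :  A^k = P D^k P^{-1}
det(A), det(D)                % equal, up to roundoff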

Self-study 5.25. Let A = \(\begin{bmatrix} 7 & 2 \\ -4 & 1 \end{bmatrix}\). Find a formula for A^k, given that
A = PDP^{-1}, where P = \(\begin{bmatrix} 1 & 1 \\ -1 & -2 \end{bmatrix}\) and D = \(\begin{bmatrix} 5 & 0 \\ 0 & 3 \end{bmatrix}\).
Solution.

Ans: A^k = \(\begin{bmatrix} 2\cdot 5^k - 3^k & 5^k - 3^k \\ 2\cdot 3^k - 2\cdot 5^k & 2\cdot 3^k - 5^k \end{bmatrix}\)
146 Chapter 5. Programming with Linear Algebra

Theorem 5.26. (The Diagonalization Theorem)


1. An n × n matrix A is diagonalizable if and only if A has n linearly
independent eigenvectors v1 , v2 , · · · , vn .
2. In fact, A = P DP −1 if and only if columns of P are n linearly inde-
pendent eigenvectors of A. In this case, the diagonal entries of D are
the corresponding eigenvalues of A. That is,

\[ P = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n], \qquad D = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_n) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}, \tag{5.8} \]

where Avk = λk vk , k = 1, 2, · · · , n.

Proof. Let P = [v1 v2 · · · vn ] and D = diag(λ1 , λ2 , · · · , λn ), arbitrary ma-


trices. Then,

AP = A[v1 v2 · · · vn ] = [Av1 Av2 · · · Avn ], (5.9)

while
 
\[ PD = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n]\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = [\lambda_1\mathbf{v}_1\ \lambda_2\mathbf{v}_2\ \cdots\ \lambda_n\mathbf{v}_n]. \tag{5.10} \]
(⇒ ) Now suppose A is diagonalizable and A = P DP −1 . Then we have
AP = P D; it follows from (5.9) and (5.10) that

[Av1 Av2 · · · Avn ] = [λ1 v1 λ2 v2 · · · λn vn ],

from which we conclude

Avk = λk vk , k = 1, 2, · · · , n. (5.11)

Furthermore, P is invertible ⇒ {v1 , v2 , · · · , vn } is linearly independent.


(⇐ ) It is almost trivial.
5.2. Eigenvalues and Eigenvectors 147

Example 5.27. Diagonalize the following matrix, if possible.


 
\[ A = \begin{bmatrix} 1 & 3 & 3 \\ -3 & -5 & -3 \\ 3 & 3 & 1 \end{bmatrix} \]

Solution.
1. Find the eigenvalues of A.
2. Find three linearly independent eigenvectors of A.
3. Construct P from the vectors in step 2.
4. Construct D from the corresponding eigenvalues.
Check: AP = P D?
Ans: λ = 1, −2, −2.  v_1 = \(\begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}\),  v_2 = \(\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}\),  v_3 = \(\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}\)
diagonalization.m
1 A = [1 3 3; -3 -5 -3; 3 3 1];
2 [P,D] = eig(A) % A*P = P*D
3 P*D*inv(P)

Results
1 P =
2 -0.5774 -0.7876 0.4206
3 0.5774 0.2074 -0.8164
4 -0.5774 0.5802 0.3957
5 D =
6 1.0000 0 0
7 0 -2.0000 0
8 0 0 -2.0000
9

10 ans =
11 1.0000 3.0000 3.0000
12 -3.0000 -5.0000 -3.0000
13 3.0000 3.0000 1.0000

Attention: Eigenvectors corresponding to λ = −2


148 Chapter 5. Programming with Linear Algebra

5.3. Dot Product, Length, and Orthogonality


Definition 5.28. Let u = [u_1, u_2, · · · , u_n]^T and v = [v_1, v_2, · · · , v_n]^T be
vectors in R^n. Then, the dot product (or inner product) of u and v is given by
\[ \mathbf{u}\bullet\mathbf{v} = \mathbf{u}^T\mathbf{v} = [u_1\ u_2\ \cdots\ u_n]\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = u_1v_1 + u_2v_2 + \cdots + u_nv_n = \sum_{k=1}^{n} u_k v_k. \tag{5.12} \]

Note: In a matrix-vector multiplication Av, the product of a row of


A and the column vector v is a dot product.
   
Example 5.29. Let u = \(\begin{bmatrix} 1 \\ -2 \\ 2 \end{bmatrix}\) and v = \(\begin{bmatrix} 3 \\ 2 \\ -4 \end{bmatrix}\). Find u•v.
Solution.

Theorem 5.30. Let u, v, and w be vectors in Rn , and c be a scalar. Then


a. u•v = v•u
b. (u + v)•w = u•w + v•w
c. (cu)•v = c(u•v) = u•(cv)
d. u•u ≥ 0, and u•u = 0 ⇔ u = 0
5.3. Dot Product, Length, and Orthogonality 149

Definition 5.31. The length (norm) of v is the nonnegative scalar ‖v‖ defined by
\[ \|\mathbf{v}\| = \sqrt{\mathbf{v}\bullet\mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} \quad\text{and}\quad \|\mathbf{v}\|^2 = \mathbf{v}\bullet\mathbf{v}. \tag{5.13} \]

Note: For any scalar c, ‖cv‖ = |c| ‖v‖.

Example 5.32. Let W be a subspace of R² spanned by v = \(\begin{bmatrix} 3 \\ 4 \end{bmatrix}\). Find a
unit vector u that is a basis for W .
Solution.

Distance in Rn
Definition 5.33. For u, v ∈ Rn , the distance between u and v is

dist(u, v) = ku − vk, (5.14)

the length of the vector u − v.

Example 5.34. Compute the distance between the vectors u = (7, 1) and
v = (3, 2).
Solution.

Figure 5.1: The distance between u and v is


the length of u − v.
150 Chapter 5. Programming with Linear Algebra

Orthogonal Vectors
Definition 5.35. Two vectors u and v in Rn are orthogonal if u•v = 0.

Theorem 5.36. The Pythagorean Theorem: Two vectors u and v


are orthogonal if and only if

ku + vk2 = kuk2 + kvk2 . (5.15)

Proof. For all u and v in Rn ,

ku + vk2 = (u + v)•(u + v) = kuk2 + kvk2 + 2u•v. (5.16)

Thus, u and v are orthogonal ⇔ (5.15) holds

Note: The inner product can be defined as

u•v = kuk kvk cos θ, (5.17)

where θ is the angle between u and v.

Example 5.37. Use (5.17) to find the angle between u and v.
(a) u = \(\begin{bmatrix} 3 \\ 4 \end{bmatrix}\), v = \(\begin{bmatrix} -4 \\ 3 \end{bmatrix}\)    (b) u = \(\begin{bmatrix} 1 \\ \sqrt{3} \end{bmatrix}\), v = \(\begin{bmatrix} -1/2 \\ \sqrt{3}/2 \end{bmatrix}\)

Solution.
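Formula (5.17) translates directly into Matlab; a sketch for the two pairs in this example:

u = [3; 4];  v = [-4; 3];
acos(dot(u,v)/(norm(u)*norm(v)))      % pi/2  (orthogonal)
u = [1; sqrt(3)];  v = [-1/2; sqrt(3)/2];
acos(dot(u,v)/(norm(u)*norm(v)))      % pi/3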
5.4. Vector Norms, Matrix Norms, and Condition Numbers 151

5.4. Vector Norms, Matrix Norms, and Condi-


tion Numbers
Vector Norms
Definition 5.38. A norm (or, vector norm) on Rn is a function that
assigns to each x ∈ Rn a nonnegative real number kxk such that the
following three properties are satisfied: for all x, y ∈ Rn and λ ∈ R,

kxk > 0 if x 6= 0 (positive definiteness)


kλxk = |λ| kxk (homogeneity) (5.18)
kx + yk ≤ kxk + kyk (triangle inequality)

Example 5.39. The most common norms are


\[ \|\mathbf{x}\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p}, \quad 1 \le p < \infty, \tag{5.19} \]

which we call the p-norms, and

\[ \|\mathbf{x}\|_\infty = \max_i |x_i|, \tag{5.20} \]

which is called the infinity-norm or maximum-norm.


Note: Two frequently used p-norms are
\[ \|\mathbf{x}\|_1 = \sum_i |x_i|, \qquad \|\mathbf{x}\|_2 = \Big(\sum_i |x_i|^2\Big)^{1/2}. \tag{5.21} \]

The 2-norm is also called the Euclidean norm, often denoted by k · k.

Example 5.40. Let x = [4, 2, −2, −4, 3]T . Find ||x||p , for p = 1, 2, ∞.
Solution.
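Matlab's norm evaluates all three norms directly; a sketch for this example:

x = [4 2 -2 -4 3];
norm(x,1)        % 15
norm(x,2)        % 7
norm(x,Inf)      % 4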

Note: In general, kxk∞ ≤ kxk2 ≤ kxk1 for all x ∈ Rn ; see Exercise 5.5.
152 Chapter 5. Programming with Linear Algebra

Matrix Norms
Definition 5.41. A matrix norm on m × n matrices is a vector norm
on the mn-dimensional space, satisfying
kAk ≥ 0, and kAk = 0 ⇔ A = 0 (positive definiteness)
kλ Ak = |λ| kAk (homogeneity) (5.22)
kA + Bk ≤ kAk + kBk (triangle inequality)
Example 5.42. \(\|A\|_F \equiv \Big(\sum_{i,j}|a_{ij}|^2\Big)^{1/2} = \sqrt{\mathrm{tr}(AA^T)}\) is called the Frobenius
norm. Here “tr(B)” is the trace of a square matrix B, the sum of elements
on the main diagonal.
Definition 5.43. Once a vector norm || · || has been specified, the in-
duced matrix norm is defined by
\[ \|A\| = \max_{\mathbf{x}\ne 0}\frac{\|A\mathbf{x}\|}{\|\mathbf{x}\|} = \max_{\|\mathbf{x}\|=1}\|A\mathbf{x}\|. \tag{5.23} \]

It is also called an operator norm or subordinate norm.

Theorem 5.44.
(a) For all operator norms and the Frobenius norm,

kAxk ≤ kAk kxk, kA Bk ≤ kAk kBk. (5.24)


(b) \(\|A\|_1 \equiv \max_{\|x\|_1=1}\|Ax\|_1 = \max_j \sum_i |a_{ij}|\)
(c) \(\|A\|_\infty \equiv \max_{\|x\|_\infty=1}\|Ax\|_\infty = \max_i \sum_j |a_{ij}|\)
(d) \(\|A\|_2 \equiv \max_{\|x\|_2=1}\|Ax\|_2 = \sqrt{\lambda_{\max}(A^TA)}\), where λ_max denotes the largest eigenvalue.
(e) \(\|A\|_2 = \|A^T\|_2\).
(f) \(\|A\|_2 = \max_i |\lambda_i(A)|\), when A^TA = AA^T (normal matrix).
5.4. Vector Norms, Matrix Norms, and Condition Numbers 153

Definition 5.45. Let A ∈ Rn × n be invertible. Then

κ(A) ≡ kAk kA−1 k (5.25)

is called the condition number of A, associated to the matrix norm.


 
Example 5.46. Let A = \(\begin{bmatrix} 1 & 2 & -2 \\ 0 & 1 & 1 \\ 1 & -2 & 2 \end{bmatrix}\). Then, we have
\[ A^{-1} = \frac{1}{8}\begin{bmatrix} 4 & 0 & 4 \\ 1 & 4 & -1 \\ -1 & 4 & 1 \end{bmatrix} \quad\text{and}\quad A^TA = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 9 & -7 \\ 0 & -7 & 9 \end{bmatrix}. \]

a. Find kAk1 , kAk∞ , and kAk2 .


b. Compute the `1 -condition number κ1 (A).

Solution.
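The requested quantities can be checked with norm and cond; a sketch for the matrix of Example 5.46:

A = [1 2 -2; 0 1 1; 1 -2 2];
norm(A,1), norm(A,Inf), norm(A,2)   % 5, 5, 4
cond(A,1)                           % ||A||_1 * ||A^{-1}||_1 = 5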

Example 5.47. One may consider the infinity-norm as the limit of p-


norms, as p → ∞.
Solution.
154 Chapter 5. Programming with Linear Algebra

The Induced Matrix 2-Norm of A


Definition 5.48. The induced matrix 2-norm can be defined by

\[ \|A\|_2 = \max_{\mathbf{x}\ne 0}\frac{\|A\mathbf{x}\|_2}{\|\mathbf{x}\|_2} = \max_{\|\mathbf{x}\|_2=1}\|A\mathbf{x}\|_2 = \max_{\|\mathbf{x}\|_2=\|\mathbf{y}\|_2=1}|\mathbf{y}^TA\mathbf{x}|. \tag{5.26} \]

It is also called an operator 2-norm or subordinate 2-norm.

Theorem 5.49. Properties of the Matrix 2-Norm


(a) ||AT ||2 = ||A||2
(b) ||AT A||2 = ||A||22
(c) ||AAT ||2 = ||A||22

Proof.

(a) The claim follows from the fact that yT Ax is a scalar and therefore
(yT Ax)T = xT AT y and |yT Ax| = |xT AT y|.
(b) Using the Cauchy-Schwarz inequality,

\[ \|A^TA\|_2 = \max_{\|\mathbf{x}\|_2=\|\mathbf{y}\|_2=1}|\mathbf{y}^TA^TA\mathbf{x}| \le \max_{\|\mathbf{x}\|_2=\|\mathbf{y}\|_2=1}\|A\mathbf{y}\|_2\|A\mathbf{x}\|_2 = \|A\|_2^2. \tag{5.27} \]

Now, we choose a unit vector z, for which ||Az||2 = ||A||2 . Then

\[ \|A\|_2^2 = \|A\mathbf{z}\|_2^2 = (A\mathbf{z})^T(A\mathbf{z}) = \mathbf{z}^TA^TA\,\mathbf{z} \le \max_{\|\mathbf{x}\|_2=\|\mathbf{y}\|_2=1}|\mathbf{y}^TA^TA\mathbf{x}| \equiv \|A^TA\|_2. \tag{5.28} \]

(c) Using the results in (a) and (b),

||A||22 = ||AT ||22 = ||(AT )T AT ||2 = ||AAT ||2 , (5.29)

which completes the proof.


5.5. Power Method and Inverse Power Method for Eigenvalues 155

5.5. Power Method and Inverse Power Method


for Eigenvalues
5.5.1. The Power Method
The power method is an iterative algorithm:
Given a square matrix A ∈ Rn×n , the algorithm finds a number λ, which
is the largest eigenvalue of A (in modulus), and its corresponding
eigenvector v.

Assumption. To apply the power method, we assume that A ∈ Rn×n has


• n eigenvalues {λ1 , λ2 , · · · , λn },
• n associated eigenvectors {v1 , v2 , · · · , vn }, which are linearly
independent, and
• exactly one eigenvalue that is largest in magnitude, λ1 :

|λ1 | > |λ2 | ≥ |λ3 | ≥ · · · ≥ |λn |. (5.30)

The power method approximates the largest eigenvalue λ1 and its asso-
ciated eigenvector v1 .

Derivation of Power Iteration

• Since eigenvectors {v1 , v2 , · · · , vn } are linearly independent, any vector


x ∈ R^n can be expressed as
\[ \mathbf{x} = \sum_{j=1}^{n}\beta_j\mathbf{v}_j, \tag{5.31} \]

for some constants {β1 , β2 , · · · , βn }.


• Multiplying both sides of (5.31) by A and A² gives
\[ A\mathbf{x} = A\Big(\sum_{j=1}^{n}\beta_j\mathbf{v}_j\Big) = \sum_{j=1}^{n}\beta_jA\mathbf{v}_j = \sum_{j=1}^{n}\beta_j\lambda_j\mathbf{v}_j, \qquad A^2\mathbf{x} = A\Big(\sum_{j=1}^{n}\beta_j\lambda_j\mathbf{v}_j\Big) = \sum_{j=1}^{n}\beta_j\lambda_j^2\mathbf{v}_j. \tag{5.32} \]
156 Chapter 5. Programming with Linear Algebra

• In general,
\[ A^k\mathbf{x} = \sum_{j=1}^{n}\beta_j\lambda_j^k\mathbf{v}_j, \quad k = 1, 2, \cdots, \tag{5.33} \]
which gives
\[ A^k\mathbf{x} = \lambda_1^k\sum_{j=1}^{n}\beta_j\Big(\frac{\lambda_j}{\lambda_1}\Big)^k\mathbf{v}_j = \lambda_1^k\Big[\beta_1\Big(\frac{\lambda_1}{\lambda_1}\Big)^k\mathbf{v}_1 + \beta_2\Big(\frac{\lambda_2}{\lambda_1}\Big)^k\mathbf{v}_2 + \cdots + \beta_n\Big(\frac{\lambda_n}{\lambda_1}\Big)^k\mathbf{v}_n\Big]. \tag{5.34} \]

• For j = 2, 3, · · · , n, since |λ_j/λ_1| < 1, we have lim_{k→∞} |λ_j/λ_1|^k = 0, and
\[ \lim_{k\to\infty}A^k\mathbf{x} = \lim_{k\to\infty}\lambda_1^k\beta_1\mathbf{v}_1. \tag{5.35} \]

Remark 5.50. The sequence in (5.35) converges to 0 if |λ1 | < 1 and


diverges if |λ1 | > 1, provided that β1 6= 0.
• The entries of Ak x will grow with k if |λ1 | > 1 and will go to 0 if |λ1 | < 1.
• In either case, it is hard to decide the largest eigenvalue λ1 and its
associated eigenvector v1 .
• To take care of that possibility, we scale Ak x in an appropriate
manner to ensure that the limit in (5.35) is finite and nonzero.

Algorithm 5.51. (The Power Iteration) Given x 6= 0:

initialization : x0 = x/||x||∞
for k = 1, 2, · · ·
yk = Axk−1 ; µk = ||yk ||∞ (5.36)
xk = yk /µk
end for

Claim 5.52. Let {xk , µk } be a sequence produced by the power method.


Then,
xk → v1 , µk → |λ1 |, as k → ∞. (5.37)
More precisely, the power method converges as

µk = |λ1 | + O(|λ2 /λ1 |k ). (5.38)


5.5. Power Method and Inverse Power Method for Eigenvalues 157
 
Example 5.53. The matrix A = \(\begin{bmatrix} 5 & -2 & 2 \\ -2 & 3 & -4 \\ 2 & -4 & 3 \end{bmatrix}\) has eigenvalues and corresponding eigenvectors (listed column-wise) as follows:
\[ \mathrm{eig}(A) = \begin{bmatrix} 9 \\ 3 \\ -1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 2 & 0 \\ -1 & 1 & 1 \\ 1 & -1 & 1 \end{bmatrix} \]
Verify that the sequence produced by the power method converges to the
largest eigenvalue and its associated eigenvector.
Solution. The algorithm is implemented in both Matlab and Python.
power_iteration.m
1 A = [5 -2 2; -2 3 -4; 2 -4 3];
2 [V,D] = eig(A);
3 evalues = diag(D)';
4 [~,ind] = sort(evalues,'descend');
5 evalues = evalues(ind)
6 V = V(:,ind); V = V./max(abs(V),[],1)
7

8 x = [1 0 0]';
9 fmt = ['k=%2d: x=[',repmat('%.5f, ',1,numel(x)-1),'%.5f], ',...
10 'mu=%.5f (error=%.7f)\n'];
11

12 for k=1:10
13 y = A*x;
14 [~,ind] = max(abs(y)); mu = y(ind);
15 x =y/mu;
16 fprintf(fmt,k,x,mu,abs(evalues(1)-mu))
17 end

power_iteration.py
1 import numpy as np;
2 np.set_printoptions(suppress=True)
3

4 A = np.array([[5,-2,2],[-2,3,-4],[2,-4,3]])
5 evalues, EVectors = np.linalg.eig(A)
6

7 # Sorting eigenvalues: descend


8 idx = evalues.argsort()[::-1]
9 evalues = evalues[idx]; EVectors = EVectors[:,idx]
10 EVectors /= np.max(abs(EVectors),axis=0) #normalize
11
158 Chapter 5. Programming with Linear Algebra

12 print('evalues=',evalues)
13 print('EVectors=\n',EVectors)
14

15 x = np.array([1,0,0]).T
16 for k in range(10):
17 y = A.dot(x)
18 ind = np.argmax(np.abs(y)); mu = y[ind]
19 x = y/mu
20 print('k=%2d; x=[%.5f, %.5f, %.5f]; mu=%.5f (error=%.7f)'
21 %(k,*x,mu,np.abs(evalues[0]-mu)) );

The results are the same; here is the output from the Matlab code.
Output from power_iteration.m
1 evalues =
2 9.0000 3.0000 -1.0000
3

4 V =
5 1.0000e+00 1.0000e+00 3.9252e-17
6 -1.0000e+00 5.0000e-01 1.0000e+00
7 1.0000e+00 -5.0000e-01 1.0000e+00
8

9 k= 1: x=[1.00000, -0.40000, 0.40000], mu=5.00000 (error=4.0000000)


10 k= 2: x=[1.00000, -0.72727, 0.72727], mu=6.60000 (error=2.4000000)
11 k= 3: x=[1.00000, -0.89655, 0.89655], mu=7.90909 (error=1.0909091)
12 k= 4: x=[1.00000, -0.96386, 0.96386], mu=8.58621 (error=0.4137931)
13 k= 5: x=[1.00000, -0.98776, 0.98776], mu=8.85542 (error=0.1445783)
14 k= 6: x=[1.00000, -0.99590, 0.99590], mu=8.95102 (error=0.0489796)
15 k= 7: x=[1.00000, -0.99863, 0.99863], mu=8.98358 (error=0.0164159)
16 k= 8: x=[1.00000, -0.99954, 0.99954], mu=8.99452 (error=0.0054820)
17 k= 9: x=[1.00000, -0.99985, 0.99985], mu=8.99817 (error=0.0018284)
18 k=10: x=[1.00000, -0.99995, 0.99995], mu=8.99939 (error=0.0006096)

Notice that |9 − µ_k| ≈ (1/3)|9 − µ_{k−1}|, for which |λ_2/λ_1| = 1/3.
5.5. Power Method and Inverse Power Method for Eigenvalues 159

5.5.2. The Inverse Power Method


Some applications require to find an eigenvalue of the matrix A, near a
prescribed value q. The inverse power method is a variant of the Power
method to solve such a problem.
• We begin with the eigenvalues and eigenvectors of (A − qI)−1 . Let

Avi = λi vi , i = 1, 2, · · · , n. (5.39)

• Then it is easy to see that

(A − qI)vi = (λi − q)vi . (5.40)

Thus, we obtain
\[ (A - qI)^{-1}\mathbf{v}_i = \frac{1}{\lambda_i - q}\mathbf{v}_i. \tag{5.41} \]
• That is, when q ∉ {λ_1, λ_2, · · · , λ_n}, the eigenvalues of (A − qI)^{-1} are
\[ \frac{1}{\lambda_1 - q},\ \frac{1}{\lambda_2 - q},\ \cdots,\ \frac{1}{\lambda_n - q}, \tag{5.42} \]
with the same eigenvectors {v1 , v2 , · · · , vn } of A.

Algorithm 5.54. (Inverse Power Method) Applying the power


method to (A − qI)−1 gives the inverse power method. Given x 6= 0:
set : x0 = x/||x||∞
for k = 1, 2, · · ·
yk = (A − qI)−1 xk−1 ; µk = ||yk ||∞
(5.43)
xk = yk /µk
λk = 1/µk + q
end for

Note: All eigenvalues of a square matrix can be found simultaneously


by applying the QR iteration; see §9.5 below.
160 Chapter 5. Programming with Linear Algebra
 
Example 5.55. The matrix A is as in Example 5.53: A = \(\begin{bmatrix} 5 & -2 & 2 \\ -2 & 3 & -4 \\ 2 & -4 & 3 \end{bmatrix}\).
Find the eigenvalue of A nearest to q = 4, using the inverse power
method.
Solution.
inverse_power.m
1 A = [5 -2 2; -2 3 -4; 2 -4 3];
2 [V,D] = eig(A);
3 evalues = diag(D)';
4 [~,ind] = sort(evalues,'descend');
5 evalues = evalues(ind)
6 V = V(:,ind); V = V./max(abs(V),[],1)
7

8 x = [1 0 0]';
9 fmt = ['k=%2d: x = [',repmat('%.5f, ',1,numel(x)-1),'%.5f], ',...
10 'lambda=%.7f (error = %.7f)\n'];
11

12 q = 4; B = inv(A-q*eye(3));
13 for k=1:10
14 y = B*x;
15 [~,ind] = max(abs(y)); mu = y(ind);
16 x =y/mu;
17 lambda = 1/mu + q;
18 fprintf(fmt,k,x,lambda,abs(evalues(2)-lambda))
19 end

inverse_power.py
1 import numpy as np;
2 np.set_printoptions(suppress=True)
3

4 A = np.array([[5,-2,2],[-2,3,-4],[2,-4,3]])
5 evalues, EVectors = np.linalg.eig(A)
6

7 # Sorting eigenvalues: largest to smallest


8 idx = evalues.argsort()[::-1]
9 evalues = evalues[idx]; EVectors = EVectors[:,idx]
10 EVectors /= np.max(abs(EVectors),axis=0) #normalize
11

12 print('evalues=',evalues)
13 print('EVectors=\n',EVectors)
14

15 q = 4; x = np.array([1,0,0]).T
5.5. Power Method and Inverse Power Method for Eigenvalues 161

16 B = np.linalg.inv(A-q*np.identity(3))
17 for k in range(10):
18 y = B.dot(x)
19 ind = np.argmax(np.abs(y)); mu = y[ind]
20 x = y/mu
21 Lambda = 1/mu + q
22 print('k=%2d; x=[%.5f, %.5f, %.5f]; Lambda=%.7f (error=%.7f)'
23 %(k,*x,Lambda,np.abs(evalues[1]-Lambda)) );

Output from inverse_power.py


1 evalues= [ 9. 3. -1.]
2 EVectors=
3 [[-1. -1. -0. ]
4 [ 1. -0.5 1. ]
5 [-1. 0.5 1. ]]
6 k= 0; x=[1.00000, 0.66667, -0.66667]; Lambda=2.3333333 (error=0.6666667)
7 k= 1; x=[1.00000, 0.47059, -0.47059]; Lambda=3.1176471 (error=0.1176471)
8 k= 2; x=[1.00000, 0.50602, -0.50602]; Lambda=2.9759036 (error=0.0240964)
9 k= 3; x=[1.00000, 0.49880, -0.49880]; Lambda=3.0047962 (error=0.0047962)
10 k= 4; x=[1.00000, 0.50024, -0.50024]; Lambda=2.9990398 (error=0.0009602)
11 k= 5; x=[1.00000, 0.49995, -0.49995]; Lambda=3.0001920 (error=0.0001920)
12 k= 6; x=[1.00000, 0.50001, -0.50001]; Lambda=2.9999616 (error=0.0000384)
13 k= 7; x=[1.00000, 0.50000, -0.50000]; Lambda=3.0000077 (error=0.0000077)
14 k= 8; x=[1.00000, 0.50000, -0.50000]; Lambda=2.9999985 (error=0.0000015)
15 k= 9; x=[1.00000, 0.50000, -0.50000]; Lambda=3.0000003 (error=0.0000003)

Note: When q = 4, eigenvalues of (A − qI)^{-1} are {1/5, −1, −1/5}.
• The initial vector: x_0 = [1, 0, 0]^T = (1/3)(v_1 + v_2); see Example 5.53.
• Thus, each iteration must reduce the error by a factor of 5.
When q = 3.1
1 k= 0; x=[1.00000, 0.51282, -0.51282]; Lambda=2.9487179 (error=0.0512821)
2 k= 1; x=[1.00000, 0.49978, -0.49978]; Lambda=3.0008617 (error=0.0008617)
3 k= 2; x=[1.00000, 0.50000, -0.50000]; Lambda=2.9999854 (error=0.0000146)
4 k= 3; x=[1.00000, 0.50000, -0.50000]; Lambda=3.0000002 (error=0.0000002)
5 k= 4; x=[1.00000, 0.50000, -0.50000]; Lambda=3.0000000 (error=0.0000000)

See Exercise 5.8.


162 Chapter 5. Programming with Linear Algebra

Exercises for Chapter 5


 
5.1. Let A = \(\begin{bmatrix} 3 & 1 \\ 4 & 2 \end{bmatrix}\). Write 5A. Is det(5A) = 5 det A?
5.2. Let A = \(\begin{bmatrix} 1 & 1 & -3 \\ 0 & 2 & 8 \\ 2 & 4 & 2 \end{bmatrix}\).

(a) Find det A.


(b) Let U = [0, 1]3 , the unit cube. What can you say about A(U ), the image of U under
the matrix multiplication by A.
 
5.3. Use pencil-and-paper to compute det(B^6), where B = \(\begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 1 \end{bmatrix}\).
Ans: 64
 
5.4. A matrix is not always diagonalizable. Let A = \(\begin{bmatrix} 3 & 1 & 0 \\ 0 & 3 & 1 \\ 0 & 0 & 3 \end{bmatrix}\). Use [P,D] = eig(A) in
Matlab to verify
(a) P does not have its inverse.
(b) AP = P D.
5.5. Show that kxk∞ ≤ kxk2 ≤ kxk1 for all x ∈ Rn .
5.6. The matrix in Example 5.55 has eigenvalues {−6, −3, −1}. We may try to find the
eigenvalue of A nearest to q = −3.1.
(a) Estimate (mathematically) the convergence speed of the inverse power method.
(b) Verify it by implementing the inverse power method, with x0 = [0, 1, 0]T .
 
5.7. Let A = \(\begin{bmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & 0 & -1 \\ 0 & 0 & 4 & -2 \\ 0 & -1 & -2 & 4 \end{bmatrix}\). Use indicated methods to approximate eigenvalues and
their associated eigenvectors of A to within 10^{-12} accuracy.
(a) The power method, the largest eigenvalue.
(b) The inverse power method, an eigenvalue near q = 3.
(c) The inverse power method, the smallest eigenvalue.
5.8. What is the theoretical error reduction rate for the convergence of the inverse power
iteration, when q = 3.1, shown on p.161.
Ans: 59.
6
C HAPTER

Multivariable Calculus

In this chapter, we will learn subjects in multivariable calculus, such as

• The gradient vector


• Optimization: Method of Lagrange multipliers
• The gradient descent method

Contents of Chapter 6
6.1. Multi-Variable Functions and Their Partial Derivatives . . . . . . . . . . . . . . . . . . 164
6.2. Directional Derivatives and the Gradient Vector . . . . . . . . . . . . . . . . . . . . . . 168
6.3. Optimization: Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . 173
6.4. The Gradient Descent Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

163
164 Chapter 6. Multivariable Calculus

6.1. Multi-Variable Functions and Their Partial


Derivatives
6.1.1. Functions of Several Variables
Definition 6.1. A function of two variables, f , is a rule that assigns
each ordered pair of real numbers (x, y) in a set D ⊂ R2 a unique real
number denoted by f (x, y). The set D is called the domain of f and its
range is the set of values that f takes on, that is, {f (x, y) : (x, y) ∈ D}.

Definition 6.2. Let f be a function of two variables, and z = f (x, y).


Then x and y are called independent variables and z is called a de-
pendent variable.

Example 6.3. Let \(f(x, y) = \dfrac{\sqrt{x+y+1}}{x-1}\). Evaluate f(3, 2) and give its domain.
Solution.


Ans: f(3, 2) = √6/2; D = {(x, y) : x + y + 1 ≥ 0, x ≠ 1}
Example 6.4. Find the domain and the range of
\( f(x, y) = \sqrt{9 - x^2 - y^2}. \)
Solution.
6.1. Multi-Variable Functions and Their Partial Derivatives 165

6.1.2. First-order Partial Derivatives


Recall: A function y = f(x) is differentiable at a if
\[ f'(a) = \lim_{h\to 0}\frac{f(a+h) - f(a)}{h} \quad\text{exists.} \]

Figure 6.1: Ordinary derivative f 0 (a) and partial derivatives fx (a, b) and fy (a, b).

fx = ∂f /∂x
Let f be a function of two variables (x, y). Suppose we let only x vary while
keeping y fixed, say y = b . Then g(x) := f (x, b) is a function of a single
variable. If g is differentiable at a, then we call it the partial derivative
of f with respect to x at (a, b) and denoted by fx (a, b).

\[ g'(a) = \lim_{h\to 0}\frac{g(a+h) - g(a)}{h} = \lim_{h\to 0}\frac{f(a+h, b) - f(a, b)}{h} =: f_x(a, b). \tag{6.1} \]
166 Chapter 6. Multivariable Calculus

fy = ∂f /∂y
Similarly, the partial derivative of f with respect to y at (a, b), denoted
by fy (a, b), is obtained keeping x fixed, say x = a , and finding the ordinary
derivative at b of G(y) := f (a, y) :

\[ G'(b) = \lim_{h\to 0}\frac{G(b+h) - G(b)}{h} = \lim_{h\to 0}\frac{f(a, b+h) - f(a, b)}{h} =: f_y(a, b). \tag{6.2} \]

Example 6.5. Find f_x(0, 0), when \(f(x, y) = \sqrt[3]{x^3 + y^3}\).
Solution. Using the definition,
\[ f_x(0, 0) = \lim_{h\to 0}\frac{f(h, 0) - f(0, 0)}{h} \]

Ans: 1
Definition 6.6. If f is a function of two variables, its partial derivatives are the functions f_x = ∂f/∂x and f_y = ∂f/∂y defined by:
\[ f_x(x, y) = \frac{\partial f}{\partial x}(x, y) = \lim_{h\to 0}\frac{f(x+h, y) - f(x, y)}{h} \quad\text{and}\quad f_y(x, y) = \frac{\partial f}{\partial y}(x, y) = \lim_{h\to 0}\frac{f(x, y+h) - f(x, y)}{h}. \tag{6.3} \]
6.1. Multi-Variable Functions and Their Partial Derivatives 167

Observation 6.7. Partial Derivatives


• The partial derivative with respect to x represents the slope of
the tangent lines to the curve that are parallel to the xz-plane (i.e.
in the direction of h1, 0, ·i).
• Similarly, the partial derivative with respect to y represents
the slope of the tangent lines to the curve that are parallel to the
yz-plane (i.e. in the direction of h0, 1, ·i).

Rule for finding Partial Derivatives of z = f (x, y)


• To find fx , regard y as a constant and differentiate f w.r.t. x.
• To find fy , regard x as a constant and differentiate f w.r.t. y.

Example 6.8. If f(x, y) = x^3 + x^2y^3 − 2y^2, find f_x(2, 1), f_y(2, 1), and f_{xy}(2, 1).
Solution.

Ans: fx (2, 1) = 16; fy (2, 1) = 8
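The same partial derivatives can be computed with Matlab's Symbolic Math Toolbox; a sketch for Example 6.8:

syms x y
f   = x^3 + x^2*y^3 - 2*y^2;
fx  = diff(f,x);  fy = diff(f,y);  fxy = diff(fx,y);
subs(fx, [x y],[2 1])     % 16
subs(fy, [x y],[2 1])     % 8
subs(fxy,[x y],[2 1])     % 12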


 xz 
Example 6.9. (Functions of Three Variables). Let f (x, y, z) = sin .
1+y
Find the first partial derivatives of f (x, y, z).
Solution.
168 Chapter 6. Multivariable Calculus

6.2. Directional Derivatives and the Gradient


Vector
Recall: For z = f(x, y), the partial derivatives (f_x, f_y) represent the rates of change of z in the (x, y)-directions, i.e., in the directions of the unit vectors (i, j).

Note: It would be nice to be able to find the slope of the tangent line to a surface S in the direction of an arbitrary unit vector u = ⟨a, b⟩.

Figure 6.2

Definition 6.10. The directional derivative of f at x0 = (x0 , y0 ) in


the direction of a unit vector u = ⟨a, b⟩ is
\[ D_{\mathbf{u}}f(x_0, y_0) = \lim_{h\to 0}\frac{f(x_0 + ha,\ y_0 + hb) - f(x_0, y_0)}{h}, \tag{6.4} \]
if the limit exists.

Note that
f (x0 + ha, y0 + hb) − f (x0 , y0 ) = f (x0 + ha, y0 + hb) − f (x0 , y0 + hb)
+ f (x0 , y0 + hb) − f (x0 , y0 )

Thus
\[ \frac{f(x_0 + ha, y_0 + hb) - f(x_0, y_0)}{h} = \frac{f(x_0 + ha, y_0 + hb) - f(x_0, y_0 + hb)}{ha}\,a + \frac{f(x_0, y_0 + hb) - f(x_0, y_0)}{hb}\,b, \]
which converges to “a fx (x0 , y0 ) + b fy (x0 , y0 )" as h → 0.
6.2. Directional Derivatives and the Gradient Vector 169

Theorem 6.11. If f is a differentiable function of x and y, then f has


a directional derivative in the direction of any unit vector u = ha, bi
and
Du f (x, y) = fx (x, y) a + fy (x, y) b
= hfx (x, y), fy (x, y)i · ha, bi (6.5)
= hfx (x, y), fy (x, y)i · u.

Example 6.12. Let f(x, y) = x^3 + 2xy + y^4. Find the directional derivative D_u f(x, y), when u is the unit vector given by the angle θ = π/4. What is D_u f(2, 3)?
Solution. u = ⟨cos(π/4), sin(π/4)⟩ = ⟨1/√2, 1/√2⟩.

Figure 6.3

Ans: 65√2
170 Chapter 6. Multivariable Calculus

Self-study 6.13. Find the directional derivative of f (x, y) = x + sin(xy) at


the point (1, 0) in the direction given by the angle θ = π/3.
Solution.


Ans: (1 + √3)/2
Example 6.14. (Functions of Three Variables).
If f (x, y, z) = x2 − 2y 2 + z 4 , find the directional derivative of f at (1, 3, 1) in
the direction of v = h2, −2, −1i .
Solution.

Ans: 8
6.2. Directional Derivatives and the Gradient Vector 171

The Gradient Vector


Definition 6.15. Let f be a differentiable function of two variables x
and y. Then the gradient of f is the vector function
$$
  \nabla f(x, y) = ⟨f_x(x, y), f_y(x, y)⟩
  = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j}.  \qquad (6.6)
$$

Example 6.16. If f(x, y) = sin(x) + e^{xy}, find ∇f(x, y) and ∇f(0, 1).
Solution.

Ans: ⟨2, 0⟩

Remark 6.17. With this notation of the gradient vector, we can rewrite
$$
  D_u f(x, y) = \nabla f(x, y) \cdot u = f_x(x, y)\, a + f_y(x, y)\, b,
  \quad\text{where } u = ⟨a, b⟩.  \qquad (6.7)
$$

Example 6.18. Find the directional derivative of f(x, y) = x^2 y^3 − 4y at the
point (2, −1) and in the direction of the vector v = ⟨3, 4⟩.
Solution.

Ans: 4
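
A numerical cross-check of Example 6.18, as a minimal sketch: the gradient is
approximated by central differences and combined with the unit direction as in (6.7).
The helper names are illustrative.
check_directional_derivative.py
import numpy as np

def f(x, y):
    return x**2 * y**3 - 4*y

def grad_f(x, y, h=1e-6):
    # central-difference approximation of (f_x, f_y)
    gx = (f(x+h, y) - f(x-h, y)) / (2*h)
    gy = (f(x, y+h) - f(x, y-h)) / (2*h)
    return np.array([gx, gy])

v = np.array([3.0, 4.0])
u = v / np.linalg.norm(v)          # unit vector
Du = grad_f(2.0, -1.0) @ u         # directional derivative (6.7)
print(Du)                          # approximately 4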

Maximizing the Directional Derivative


Note: Let θ be the angle between ∇f and u. Then
$$
  D_u f = \nabla f \cdot u = |\nabla f|\,|u| \cos θ = |\nabla f| \cos θ,
$$
of which the maximum occurs when θ = 0.

Theorem 6.19. Let f be a differentiable function of two or three variables.
Then
$$
  \max_{u} D_u f(x) = |\nabla f(x)|  \qquad (6.8)
$$
and it occurs when u has the same direction as ∇f(x).

Example 6.20. Let f(x, y) = x e^y.

(a) Find the rate of change of f at P(1, 0) in the direction from P to Q(−1, 2).
(b) In what direction does f have the maximum rate of change? What is
    the maximum rate of change?
Solution.

Ans: (a) 0; (b) √2
Remark 6.21. Let u = ∇f(x)/|∇f(x)|, the unit vector in the gradient direction.
Then
$$
  D_u f(x) = \nabla f(x) \cdot u = \nabla f(x) \cdot \frac{\nabla f(x)}{|\nabla f(x)|} = |\nabla f(x)|.  \qquad (6.9)
$$
This implies that the directional derivative is maximized in the gradient
direction.

Claim 6.22. The gradient direction is the direction where the func-
tion changes fastest, more precisely, increases fastest!

6.3. Optimization: Method of Lagrange Multipliers
Recall: (Claim 6.22) The gradient direction is the direction where
the function changes fastest, more precisely, increases fastest!

Level Curves
Example 6.23. Consider the unit circle, the circle of radius 1:

F (x, y) = x2 + y 2 = 1. (6.10)

What can you say about the gradient of F , ∇F ?


Solution. The curve can be parametrized as
$$
  r(t) = ⟨x(t), y(t)⟩ = ⟨\cos t, \sin t⟩, \quad 0 ≤ t ≤ 2π.  \qquad (6.11)
$$
• Apply the Chain Rule to have
$$
  \frac{d}{dt} F = F_x \frac{dx}{dt} + F_y \frac{dy}{dt} = 0,
$$
  and therefore
$$
  \nabla F \cdot ⟨-\sin t, \cos t⟩ = 0.  \qquad (6.12)
$$
• Note that ⟨−sin t, cos t⟩ = r'(t) is the tangential direction to the unit
  circle. Thus ∇F must be normal to the curve.
• Indeed,
$$
  \nabla F = ⟨2x, 2y⟩,  \qquad (6.13)
$$
  which is normal to the curve and the fastest increasing direction.

Claim 6.24. Given a level curve F(x) = k, the gradient vector ∇F(x)
is normal to the curve and points in the direction of fastest increase.

6.3.1. Optimization Problems with Equality Constraints


We first consider Lagrange's method to solve the problem of the form
$$
  \max_{x} f(x) \quad\text{subj.to}\quad g(x) = c.  \qquad (6.14)
$$

Figure 6.4: The method of Lagrange multipliers in R2 : ∇f // ∇g, at maximum .

Strategy 6.25. (Method of Lagrange multipliers). For the maxi-


mum and minimum values of f (x, y, z) subject to g(x, y, z) = c,
(a) Find all values of (x, y, z) and λ such that

∇f (x, y, z) = λ∇g(x, y, z) and g(x, y, z) = c . (6.15)

(b) Evaluate f at all these points, to find the maximum and mini-
mum.

Example 6.26. A topless rectangular box is made from 12m2 of cardboard.


Find the dimensions of the box that maximizes the volume of the box.
Solution. Maximize V = xyz subj.to 2xz + 2yz + xy = 12.

Ans: 4 (x = y = 2z = 2)
Example 6.27. Find the extreme values of f(x, y) = x^2 + 2y^2 on the circle
x^2 + y^2 = 1.
Solution. ∇f = λ∇g  ⟹  [2x, 4y]^T = λ [2x, 2y]^T. Therefore,
    (1) 2x = 2x λ
    (2) 4y = 2y λ
    (3) x^2 + y^2 = 1
From (1), x = 0 or λ = 1.

Ans: min: f(±1, 0) = 1; max: f(0, ±1) = 2
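
The system in Example 6.27 can also be solved symbolically. A minimal sketch with
sympy (the script name is illustrative): it solves ∇f = λ∇g together with the constraint.
lagrange_circle_sympy.py
import sympy as sym

x, y, lam = sym.symbols('x y lam', real=True)
f = x**2 + 2*y**2
g = x**2 + y**2 - 1            # constraint g(x, y) = 0

eqs = [sym.diff(f, x) - lam*sym.diff(g, x),
       sym.diff(f, y) - lam*sym.diff(g, y),
       g]
sols = sym.solve(eqs, [x, y, lam], dict=True)
for s in sols:
    print(s, ' f =', f.subs(s))    # f = 1 at (±1, 0), f = 2 at (0, ±1)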



Example 6.28. (Continuation of Example 6.27)


Find the extreme values of f (x, y) = x2 + 2y 2 on the disk x2 + y 2 ≤ 1.
Solution. Hint : You may use Lagrange multipliers when x2 + y2 = 1.

Ans: min: f(0, 0) = 0; max: f(0, ±1) = 2

Remark 6.29. The Method of Lagrange Multipliers

• (Geometric Formula) For the optimization problem (6.14):
$$
  \max_{x} f(x) \quad\text{subj.to}\quad g(x) = c,  \qquad (6.16)
$$
  the method finds values of x and λ such that
$$
  \nabla f(x) = λ\nabla g(x) \quad\text{and}\quad g(x) = c.  \qquad (6.17)
$$
• (Interpretation by Calculus) It can be interpreted as follows:
  find the critical points of
$$
  L(x, λ) \stackrel{\text{def}}{=} f(x) - λ\big(g(x) - c\big).  \qquad (6.18)
$$
  Indeed,
$$
  \nabla_x L(x, λ) = \nabla f(x) - λ\nabla g(x), \qquad
  \frac{\partial}{\partial λ} L(x, λ) = g(x) - c.  \qquad (6.19)
$$
  By equating the right-hand sides to zero, we obtain (6.17).

The function L(x, λ) is called the Lagrangian for the problem (6.16).

6.3.2. Optimization Problems with Inequality Constraints

For simplicity, consider
$$
  \min_{x} x^2 \quad\text{subj.to}\quad x ≥ 1.  \qquad (6.20)
$$
Rewriting the constraint as x − 1 ≥ 0, we formulate the Lagrangian:
$$
  L(x, α) = x^2 - α(x - 1).  \qquad (6.21)
$$
Now, consider
$$
  \min_{x} \max_{α} L(x, α) \quad\text{subj.to}\quad α ≥ 0.  \qquad (6.22)
$$

Figure 6.5: min_x x^2 subj.to x ≥ 1.

Claim 6.30. The minimization problem (6.20) is equivalent to the min-


imax problem (6.22).

Proof. (1) Let x > 1. ⇒ max_{α≥0} {−α(x − 1)} = 0 and α* = 0. Thus,
    L(x, α) = x^2.  (original objective)
(2) Let x = 1. ⇒ max_{α≥0} {−α(x − 1)} = 0 and α is arbitrary. Thus, again,
    L(x, α) = x^2.  (original objective)
(3) Let x < 1. ⇒ max_{α≥0} {−α(x − 1)} = ∞. However, min_x won't make this
happen! (min_x is fighting max_α.) That is, when x < 1, the objective L(x, α)
becomes huge as α grows; then, min_x will push x ↗ 1 or increase it to become
x ≥ 1. In other words, min_x forces max_α to behave, so the constraint will
be satisfied.

Remark 6.31. A Formal Argument for the Equivalence


Let f(x) = x^2 and f* be the optimal value of the problem (6.20):
$$
  \min_{x} f(x) \quad\text{subj.to}\quad x ≥ 1.  \qquad (6.23)
$$
Then it is clear to see
$$
  x ≥ 1 \;\Rightarrow\; L(x, α) = f(x) - α(x - 1) ≤ f(x) \;\Rightarrow\; \max_{α≥0} L(x, α) = f(x);
  \qquad
  x \not\ge 1 \;\Rightarrow\; \max_{α≥0} L(x, α) = ∞.  \qquad (6.24)
$$
Thus
$$
  \min_{x} \max_{α≥0} L(x, α) = \min_{x≥1} f(x) = f^*,  \qquad (6.25)
$$
where min_x does not require x to satisfy x ≥ 1.

The above analysis implies that the (original) minimization problem (6.23)
is equivalent to the minimax problem.

Recall: The minimax problem (6.22), which is equivalent to the (original)
primal problem:
$$
  \min_{x} \max_{α} L(x, α) \quad\text{subj.to}\quad α ≥ 0, \qquad\text{(Primal)} \qquad (6.26)
$$
where
$$
  L(x, α) = x^2 - α(x - 1).
$$

Definition 6.32. The dual problem of (6.26) is formulated by swapping
min_x and max_α as follows:
$$
  \max_{α} \min_{x} L(x, α) \quad\text{subj.to}\quad α ≥ 0. \qquad\text{(Dual)} \qquad (6.27)
$$
In the maximin problem, the term min_x L(x, α) is called the Lagrange
dual function and the Lagrange multiplier α is also called the dual
variable.

How to solve it. For the Lagrange dual function min_x L(x, α), the minimum
occurs where the gradient is equal to zero:
$$
  \frac{d}{dx} L(x, α) = 2x - α = 0 \;\Rightarrow\; x = \frac{α}{2}.  \qquad (6.28)
$$
Plugging this into L(x, α), we have
$$
  L(x, α) = \Big(\frac{α}{2}\Big)^2 - α\Big(\frac{α}{2} - 1\Big) = α - \frac{α^2}{4}.
$$
We can rewrite the dual problem (6.27) as
$$
  \max_{α≥0} \Big[α - \frac{α^2}{4}\Big]. \qquad\text{(Dual)} \qquad (6.29)
$$
⇒ the maximum is 1 when α* = 2 (for the dual problem).

Plug α = α* into (6.28) to get x* = 1. Or, using the Lagrangian objective,
we have
$$
  L(x, α^*) = x^2 - 2(x - 1) = (x - 1)^2 + 1.  \qquad (6.30)
$$
⇒ the minimum is 1 when x* = 1 (for the primal problem).

Multiple Constraints
Consider the problem of the form
$$
  \max_{x} f(x) \quad\text{subj.to}\quad g(x) = c \text{ and } h(x) = d.  \qquad (6.31)
$$
Then, at extrema we must have
$$
  \nabla f \in \text{Plane}(\nabla g, \nabla h) := \{c_1 \nabla g + c_2 \nabla h\}.  \qquad (6.32)
$$
Thus (6.31) can be solved by finding all values of (x, y, z) and (λ, µ) such
that
$$
  \nabla f(x, y, z) = λ\nabla g(x, y, z) + µ\nabla h(x, y, z), \quad
  g(x, y, z) = c, \quad h(x, y, z) = d.  \qquad (6.33)
$$

Example 6.33. Find the maximum value of the function f(x, y, z) = z
on the curve of intersection of the cone 2x^2 + 2y^2 = z^2 and the plane
x + y + z = 4.
Solution. Letting g = 2x^2 + 2y^2 − z^2 = 0 (4) and h = x + y + z − 4 = 0 (5), we have
$$
  \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
  = λ \begin{bmatrix} 4x \\ 4y \\ -2z \end{bmatrix}
  + µ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
  \;\Longrightarrow\;
  \begin{cases} 0 = 4λx + µ & (1) \\ 0 = 4λy + µ & (2) \\ 1 = -2λz + µ & (3) \end{cases}
$$
From (1) and (2), we conclude x = y; using (4), we have z = ±2x.

Ans: 2
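
The full system of Example 6.33 (two constraints, two multipliers) can also be handed
to a symbolic solver. A sketch with sympy (names illustrative); it should return the
single critical point (1, 1, 2), where f = z = 2.
lagrange_two_constraints_sympy.py
import sympy as sym

x, y, z, lam, mu = sym.symbols('x y z lam mu', real=True)
f = z
g = 2*x**2 + 2*y**2 - z**2       # cone constraint, = 0
h = x + y + z - 4                # plane constraint, = 0

eqs = [sym.diff(f, v) - lam*sym.diff(g, v) - mu*sym.diff(h, v) for v in (x, y, z)]
eqs += [g, h]
sols = sym.solve(eqs, [x, y, z, lam, mu], dict=True)
print([s[z] for s in sols])      # candidate values of f = z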

6.4. The Gradient Descent Method


6.4.1. Introduction to the Gradient Descent Method
Problem 6.34. (Minimization Problem) Let Ω ⊂ Rn , n ≥ 1. Given a
real-valued function f : Ω → R, the general problem of finding the value
that minimizes f is formulated as follows.

$$
  \min_{x \in Ω} f(x).  \qquad (6.34)
$$
In this context, f is the objective function (sometimes referred to as the
loss function or cost function). Ω ⊂ R^n is the domain of the function
(also known as the constraint set).

In this section, we solve the minimization problem (6.34) iteratively as


follows: Given an initial guess x0 ∈ Rn , find successive approximations
xk ∈ Rn of the form

xk+1 = xk + γk pk , k = 0, 1, · · · , (6.35)

where pk is the search direction and γk > 0 is the step length.

Note: The gradient descent method is also known as the steepest


descent method or the Richardson’s method.
• Recall that we would solve the minimization problem (6.34) using
iterative algorithms of the form (6.35).

Derivation of the GD method


• Given x_{k+1} as in (6.35), we have by Taylor's formula: for some ξ,
$$
  f(x_{k+1}) = f(x_k + γ_k p_k)
             = f(x_k) + γ_k f'(x_k) \cdot p_k + \frac{γ_k^2}{2}\, p_k \cdot f''(ξ)\, p_k.  \qquad (6.36)
$$
• Assume that f 00 is bounded. Then

f (xk+1 ) = f (xk ) + γk f 0 (xk ) · pk + O(γk2 ), as γk → 0.

• The Goal: To find pk and γk such that

f (xk+1 ) < f (xk ), (6.37)

which can be achieved if

f 0 (xk ) · pk < 0 (6.38)

and either γk is sufficiently small or f 00 (ξ) is nonnegative.


• Choice: Let f'(x_k) ≠ 0. If we choose
$$
  p_k = -f'(x_k),  \qquad (6.39)
$$
then
$$
  f'(x_k) \cdot p_k = -\|f'(x_k)\|^2 < 0,  \qquad (6.40)
$$
which satisfies (6.38) and therefore (6.37).
• Summary: In the GD method, the search direction is the negative
gradient, the steepest descent direction.

The Gradient Descent Method in 1D


Algorithm 6.35. Consider the minimization problem in 1D:
$$
  \min_{x} f(x), \quad x \in S,  \qquad (6.41)
$$
where S is a closed interval in R. Then its gradient descent method reads
$$
  x_{k+1} = x_k - γ\, f'(x_k).  \qquad (6.42)
$$

Picking the step length γ: Assume that the step length is chosen to be
independent of the iteration index k, although one can play with other choices as
well. The question is how to select γ in order to make the best gain of the method.
To turn the right-hand side of (6.42) into a more manageable form, we invoke
Taylor's Theorem:¹
$$
  f(x + t) = f(x) + t f'(x) + \int_x^{x+t} (x + t - s)\, f''(s)\, ds.  \qquad (6.43)
$$
Assuming that |f''(s)| ≤ L, we have
$$
  f(x + t) ≤ f(x) + t f'(x) + \frac{t^2}{2} L.
$$
Now, letting x = x_k and t = −γ f'(x_k) reads
$$
  f(x_{k+1}) = f(x_k - γ f'(x_k))
             ≤ f(x_k) - γ f'(x_k) f'(x_k) + \frac{1}{2} L [γ f'(x_k)]^2
             = f(x_k) - [f'(x_k)]^2 \Big(γ - \frac{L}{2} γ^2\Big).  \qquad (6.44)
$$
The gain (learning) from the method occurs when
$$
  γ - \frac{L}{2} γ^2 > 0 \;\Rightarrow\; 0 < γ < \frac{2}{L},  \qquad (6.45)
$$
and it will be best when γ − (L/2)γ² is maximal. This happens at the point
$$
  γ = \frac{1}{L}.  \qquad (6.46)
$$
¹ Taylor's Theorem with integral remainder: Suppose f ∈ C^{n+1}[a, b] and x_0 ∈ [a, b]. Then, for every
x ∈ [a, b], f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k + R_n(x), where
R_n(x) = \frac{1}{n!} \int_{x_0}^{x} (x - s)^n f^{(n+1)}(s)\, ds.

Thus an effective gradient descent method (6.42) can be written as
$$
  x_{k+1} = x_k - γ f'(x_k) = x_k - \frac{1}{L} f'(x_k) = x_k - \frac{1}{\max |f''(x)|} f'(x_k).  \qquad (6.47)
$$
Furthermore, it follows from (6.44) that for γ = 1/L,
$$
  f(x_{k+1}) ≤ f(x_k) - \frac{γ}{2} [f'(x_k)]^2.  \qquad (6.48)
$$
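
A minimal 1D sketch of (6.42) with the step size (6.47). The test function
f(x) = (x − 2)^2 + 1 is illustrative (not from the text); it has f''(x) = 2, so L = 2
and γ = 1/L = 1/2. For such a quadratic, γ = 1/L reaches the minimizer in a single step.
gd_1d_sketch.py
def f(x):  return (x - 2.0)**2 + 1.0
def df(x): return 2.0*(x - 2.0)

L = 2.0                 # max |f''(x)|
gamma = 1.0 / L         # step length (6.46)
x = 10.0                # initial guess
for k in range(100):
    step = gamma * df(x)
    x -= step           # x_{k+1} = x_k - gamma * f'(x_k)
    if abs(step) < 1e-12: break
print(x, f(x))          # converges to the minimizer x = 2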

Remark 6.36. (Convergence of the gradient descent method).
Thus it is obvious that the method defines a sequence of points {x_k} along
which {f(x_k)} decreases.
• If f is bounded from below and the level sets of f are bounded,
  {f(x_k)} converges; so does {x_k}. That is, there is a point x̂ such that
$$
  \lim_{k\to∞} x_k = x̂.  \qquad (6.49)
$$
• Now, we can rewrite (6.48) as
$$
  [f'(x_k)]^2 ≤ 2L\, [f(x_k) - f(x_{k+1})].  \qquad (6.50)
$$
  Since f(x_k) − f(x_{k+1}) → 0, also f'(x_k) → 0.
• When f' is continuous, using (6.49) reads
$$
  f'(x̂) = \lim_{k\to∞} f'(x_k) = 0,  \qquad (6.51)
$$
  which implies that the limit x̂ is a critical point.
• The method thus generally finds a critical point, but that could still
  be a local minimum or a saddle point. Which it is cannot be decided
  at this level of analysis.

6.4.2. The Gradient Descent Method in Multi-Dimensions


Example 6.37. (Rosenbrock function). For example, the Rosenbrock
function in the two-dimensional (2D) space is defined as²
$$
  f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2.  \qquad (6.52)
$$

Use the GD method to find the minimizer, starting with x0 = (−1, 2).

Figure 6.6: Plots of the Rosenbrock function f (x, y) = (1 − x)2 + 100 (y − x2 )2 .

rosenbrock_2D_GD.py
1 import numpy as np; import time
2

3 itmax = 10000; tol = 1.e-7; gamma = 1/500


4 x0 = np.array([-1., 2.])
5

6 def rosen(x):
7 return (1.-x[0])**2+100*(x[1]-x[0]**2)**2
8

9 def rosen_grad(x):
10 h = 1.e-5;
11 g1 = ( rosen([x[0]+h,x[1]]) - rosen([x[0]-h,x[1]]) )/(2*h)
12 g2 = ( rosen([x[0],x[1]+h]) - rosen([x[0],x[1]-h]) )/(2*h)
² The Rosenbrock function in 3D is given as f(x, y, z) = [(1 − x)^2 + 100(y − x^2)^2] + [(1 − y)^2 + 100(z − y^2)^2],
which has exactly one minimum at (1, 1, 1). Similarly, one can define the Rosenbrock function in general
N-dimensional spaces, for N ≥ 4, by adding one more component for each enlarged dimension. That is,
f(x) = \sum_{i=1}^{N-1} [(1 − x_i)^2 + 100(x_{i+1} − x_i^2)^2], where x = [x_1, x_2, · · · , x_N] ∈ R^N. See Wikipedia
(https://en.wikipedia.org/wiki/Rosenbrock_function) for details.

13 return np.array([g1,g2])
14

15 # Now, GD iteration begins


16 if __name__ == '__main__':
17 t0 = time.time()
18 x=x0
19 for it in range(itmax):
20 corr = gamma*rosen_grad(x)
21 x = x - corr
22 if np.linalg.norm(corr)<tol: break
23 print('GD Method: it = %d; E-time = %.4f' %(it+1,time.time()-t0))
24 print(x)

Output
1 GD Method: it = 7687; E-time = 0.0521
2 [0.99994416 0.99988809]

The Choice of Step Size and Line Search


Note: The convergence of the gradient descent method can be extremely
sensitive to the choice of step size. It often requires choosing the
step size adaptively: the step size would better be chosen small in
regions of large variability of the gradient, while in regions with small
variability we would like to take it large.

Strategy 6.38. Backtracking Line Search procedures allow one to select
a step size depending on the current iterate and the gradient. In this
procedure, we select an initial (optimistic) step size γ_k and evaluate the
following inequality (known as the sufficient decrease condition):
$$
  f(x_{k+1}) = f(x_k - γ_k \nabla f(x_k)) ≤ f(x_k) - \frac{γ_k}{2} \|\nabla f(x_k)\|^2.  \qquad (6.53)
$$
If this inequality is verified, the current step size is kept. If not, the step
size is divided by 2 (or any number larger than 1) repeatedly until (6.53)
is verified. To get a better understanding, refer to (6.48) on p. 184.

The gradient descent algorithm with backtracking line search then becomes
Algorithm 6.39. (The Gradient Descent Algorithm, with Back-
tracking Line Search).

input: initial guess x_0, step size γ > 0;
for k = 0, 1, 2, · · · do
    initial step size estimate γ_k;
    while (TRUE) do
        if f(x_k − γ_k ∇f(x_k)) ≤ f(x_k) − (γ_k/2) ‖∇f(x_k)‖²
            break;                                               (6.54)
        else γ_k = γ_k/2;
    end while
    x_{k+1} = x_k − γ_k ∇f(x_k);
end for
return x_{k+1};
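
A minimal Python sketch of Algorithm 6.39 (the function and script names are
illustrative; the exact analytic gradient of the Rosenbrock function is used here
instead of the finite-difference gradient of rosenbrock_2D_GD.py).
gd_backtracking_sketch.py
import numpy as np

def gd_backtracking(f, grad, x0, gamma0=1.0, itmax=10000, tol=1e-7):
    """Gradient descent with backtracking line search (Algorithm 6.39)."""
    x = np.asarray(x0, dtype=float)
    for k in range(itmax):
        g = grad(x)
        gamma = gamma0                  # optimistic initial step size
        # backtrack until the sufficient decrease condition (6.53) holds
        while f(x - gamma*g) > f(x) - 0.5*gamma*np.dot(g, g):
            gamma /= 2
        x_new = x - gamma*g
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k+1
        x = x_new
    return x, itmax

# usage with the 2D Rosenbrock function of Example 6.37;
# it slowly approaches the minimizer (1, 1)
rosen = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
rosen_grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                                 200*(x[1] - x[0]**2)])
print(gd_backtracking(rosen, rosen_grad, [-1.0, 2.0]))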

Remark 6.40. Incorporated with


• either a line search
• or partial updates,
the gradient descent method is the major computational algorithm
for various machine learning tasks.

Note: The gradient descent method with partial updates is called the
stochastic gradient descent (SGD) method.

6.4.3. The Gradient Descent Method for Positive Definite Linear Systems

Definition 6.41. A matrix A = (a_ij) ∈ R^{n×n} is said to be positive
definite if
$$
  x^T A x = \sum_{i,j=1}^{n} x_i\, a_{ij}\, x_j > 0, \quad \forall\, x \in R^n,\; x ≠ 0.  \qquad (6.55)
$$

Theorem 6.42. Let A ∈ R^{n×n} be symmetric. Then A is positive definite
if and only if all eigenvalues of A are positive.

Remark 6.43. Let A ∈ R^{n×n} be symmetric positive definite and consider
$$
  Ax = b.  \qquad (6.56)
$$
Then the algebraic system admits a unique solution x ∈ R^n, which is
equivalently characterized by
$$
  x = \arg\min_{η \in R^n} f(η), \qquad f(η) = \frac{1}{2}\, η \cdot Aη - b \cdot η.  \qquad (6.57)
$$

For the algebraic system (6.56), Krylov subspace methods update the
iterates as follows.
Given an initial guess x_0 ∈ R^n, find successive approximations x_k ∈ R^n of
the form
$$
  x_{k+1} = x_k + α_k p_k, \quad k = 0, 1, · · · ,  \qquad (6.58)
$$
where p_k is the search direction and α_k > 0 is the step length.

• Different methods differ in the choice of the search direction and the
step length.
• In this subsection, we focus on the gradient descent method.
• For other Krylov subspace methods, see e.g. [1, 7].

Derivation of the GD Method for (6.57)

• We denote the gradient and Hessian of f by f' and f'', respectively:
$$
  f'(η) = Aη - b, \qquad f''(η) = A.  \qquad (6.59)
$$
• Given x_{k+1} as in (6.58), we have by Taylor's formula
$$
  f(x_{k+1}) = f(x_k + α_k p_k)
             = f(x_k) + α_k f'(x_k) \cdot p_k + \frac{α_k^2}{2}\, p_k \cdot f''(ξ)\, p_k
             = f(x_k) + α_k f'(x_k) \cdot p_k + \frac{α_k^2}{2}\, p_k \cdot A p_k.  \qquad (6.60)
$$
• Since A is bounded,
$$
  f(x_{k+1}) = f(x_k) + α_k f'(x_k) \cdot p_k + O(α_k^2), \quad\text{as } α_k → 0.
$$
• The goal is to find p_k and α_k such that
$$
  f(x_{k+1}) < f(x_k),  \qquad (6.61)
$$
  which can be achieved if
$$
  f'(x_k) \cdot p_k < 0  \qquad (6.62)
$$
  and either α_k is sufficiently small or A = f''(ξ) is nonnegative.
• Choice: When f'(x_k) ≠ 0, (6.62) holds true if we choose
$$
  p_k = -f'(x_k) = b - A x_k =: r_k.  \qquad (6.63)
$$
  That is, the search direction is the negative gradient, the steepest
  descent direction.

Optimal step length

We may determine α_k such that
$$
  f(x_k + α_k p_k) = \min_{α} f(x_k + α p_k),  \qquad (6.64)
$$
in which case α_k is said to be optimal.

If α_k is optimal, then
$$
  0 = \frac{d}{dα} f(x_k + α p_k)\Big|_{α=α_k} = f'(x_k + α_k p_k) \cdot p_k
    = (A(x_k + α_k p_k) - b) \cdot p_k
    = (A x_k - b) \cdot p_k + α_k\, p_k \cdot A p_k.  \qquad (6.65)
$$
So,
$$
  α_k = \frac{r_k \cdot p_k}{p_k \cdot A p_k}.  \qquad (6.66)
$$

Algorithm 6.44. (GD Algorithm)

    Select x_0, ε;
    r_0 = b − A x_0;
    Do k = 0, 1, · · ·
        α_k = ‖r_k‖² / (r_k · A r_k);        (GD1)
        x_{k+1} = x_k + α_k r_k;             (GD2)      (6.67)
        r_{k+1} = r_k − α_k A r_k;           (GD3)
        if ‖r_{k+1}‖ < ε ‖r_0‖, stop;
    End Do

Note: The equation in (GD3) is equivalent to
$$
  r_{k+1} = b - A x_{k+1};  \qquad (6.68)
$$
see Exercise 6.4, p.192.
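
A minimal numpy sketch of Algorithm 6.44 is given below. The small SPD test system is
made up for illustration only; Exercise 6.4 asks you to apply the algorithm to the
matrix in (6.77).
gd_spd_sketch.py
import numpy as np

def gd_spd(A, b, x0, eps=1e-6, itmax=10000):
    """Gradient (steepest) descent for SPD systems, Algorithm 6.44."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                        # r_0 = b - A x_0
    r0norm = np.linalg.norm(r)
    for k in range(itmax):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)       # (GD1)
        x = x + alpha*r                  # (GD2)
        r = r - alpha*Ar                 # (GD3)
        if np.linalg.norm(r) < eps*r0norm:
            break
    return x, k+1

# illustrative SPD system (not from the text)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, it = gd_spd(A, b, [0.0, 0.0])
print(x, it)     # x is approximately [1/11, 7/11]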



Recall: (Definition 5.45) Let A ∈ R^{n×n} be invertible. Then
$$
  κ(A) ≡ \|A\|\, \|A^{-1}\|  \qquad (6.69)
$$
is called the condition number of A, associated to the matrix norm.

Remark 6.45. When A is symmetric positive definite (SPD), the condition
number becomes
$$
  κ(A) = \frac{λ_{\max}}{λ_{\min}}, \quad (A : \text{SPD})  \qquad (6.70)
$$
where λ_{\min} and λ_{\max} are the minimum and the maximum eigenvalues of
A, respectively.

Theorem 6.46. (Convergence of the GD method): The GD method
converges, satisfying
$$
  \|x - x_k\| ≤ \Big(1 - \frac{1}{κ(A)}\Big)^k \|x - x_0\|.  \qquad (6.71)
$$
Thus, the number of iterations required to reduce the error by a factor of
ε is on the order of the condition number of A:
$$
  k ≥ κ(A) \log\frac{1}{ε}.  \qquad (6.72)
$$

Exercises for Chapter 6

6.1. Find the partial derivatives of the functions.
     (a) z = y cos(xy)                  (c) w = ln(x + 2y + 3z)
     (b) f(u, v) = (uv − v^3)^2         (d) u = sin(x_1^2 + x_2^2 + · · · + x_n^2)
     Ans: (d) ∂u/∂x_i = 2x_i · cos(x_1^2 + x_2^2 + · · · + x_n^2)
6.2. Use Lagrange multipliers to find extreme values of the function subject to the given
     constraint.
     (a) f(x, y) = xy;             x^2 + 4y^2 = 2
     (b) f(x, y, z) = x + y + 2z;  x^2 + y^2 + z^2 = 6
     Ans: (b) max: f(1, 1, 2) = 6; min: f(−1, −1, −2) = −6
6.3. Use the method of Lagrange multipliers to solve the problem.
$$
  \min_{x_1, x_2}\; x_1^2 + x_2^2, \quad\text{subj.to}\quad x_1 ≥ 1,\; x_2 ≥ 2.  \qquad (6.73)
$$
     Hint: You may start with the Lagrangian
$$
  L(x_1, x_2, α_1, α_2) = x_1^2 + x_2^2 - α_1(x_1 - 1) - α_2(x_2 - 2), \quad α_1, α_2 ≥ 0,  \qquad (6.74)
$$
     and consider the dual problem max_{α≥0} min_x L(x, α), where α = (α_1, α_2) and x = (x_1, x_2).
     Then
$$
  \nabla_x L(x, α) = \begin{bmatrix} 2x_1 - α_1 \\ 2x_2 - α_2 \end{bmatrix} = 0
  \;\Rightarrow\;
  \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} α_1/2 \\ α_2/2 \end{bmatrix}.  \qquad (6.75)
$$
     Ans: 5
6.4. When the boundary-value problem
$$
  -u_{xx} = -2, \quad 0 < x < 4, \qquad u_x(0) = 0, \quad u(4) = 16  \qquad (6.76)
$$
     is discretized by the second-order finite difference method with h = 1, the algebraic
     system reads Ax = b, where
$$
  A = \begin{bmatrix} 2 & -2 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{bmatrix},
  \qquad
  b = \begin{bmatrix} -2 \\ -2 \\ -2 \\ 14 \end{bmatrix}  \qquad (6.77)
$$
     and the exact solution is x = [0, 1, 4, 9]^T.
     (a) Find the condition number of A.
     (b) Prove that (GD3) in the GD algorithm, Algorithm 6.44, is equivalent to (6.68).
         Why do we consider such a manipulation?
     (c) Implement the GD algorithm to find a numerical solution in 6-digit accuracy.
CHAPTER 7
Least-Squares and Regression Analysis

Contents of Chapter 7
7.1. The Least-Squares Problem ................................................. 194
7.2. Regression Analysis ........................................................ 198
7.3. Scene Analysis with Noisy Data: Weighted Least-Squares and RANSAC ......... 204
Exercises for Chapter 7 ......................................................... 209

7.1. The Least-Squares Problem


Definition 7.1. For a given dataset {(xi , yi )}, let a continuous function
p(x) be constructed.
(a) p is an interpolation if it passes (interpolates) all the data points.
(b) p is an approximation if it approximates (represents) the data
points.
Dataset, in Maple
1 with(LinearAlgebra): with(CurveFitting):
2 n := 100: roll := rand(-n..n):
3 m := 10: xy := Matrix(m, 2):
4 for i to m do
5 xy[i, 1] := i;
6 xy[i, 2] := i + roll()/n;
7 end do:
8 plot(xy,color=red, style=point, symbol=solidbox, symbolsize=20);

Note: Interpolation may be too oscillatory to be useful.


Furthermore, it may not be defined.

The Least-Squares (LS) Problem


Note: Let A is an m × n matrix. Then Ax = b may have no solution,
particularly when m > n. In real-world,
• m  n, where m represents the number of data points and n denotes
the dimension of the points
• Need to find a best solution for Ax ≈ b

Definition 7.2. Let A ∈ Rm×n , m ≥ n, and b ∈ Rm . The least-squares


b ∈ Rn which minimizes kAx − bk2 :
problem is to find x

b = arg min kAx − bk2 ,


x
x
or, equivalently, (7.1)
b = arg min kAx − bk22 ,
x
x

where x
b is called a least-squares solution of Ax = b.

The Method of Normal Equations

Theorem 7.3. The set of LS solutions of Ax = b coincides with the
nonempty set of solutions of the normal equations
$$
  A^T A x = A^T b.  \qquad (7.2)
$$

Method of Calculus
Let J(x) = ‖Ax − b‖² = (Ax − b)^T (Ax − b) and x̂ a minimizer of J(x).
• Then we must have
$$
  \nabla_x J(x̂) = \frac{\partial J(x)}{\partial x}\Big|_{x=x̂} = 0.  \qquad (7.3)
$$
• Let's compute the gradient of J:
$$
  \frac{\partial J(x)}{\partial x}
  = \frac{\partial\, (Ax - b)^T (Ax - b)}{\partial x}
  = \frac{\partial\, (x^T A^T A x - 2 x^T A^T b + b^T b)}{\partial x}
  = 2 A^T A x - 2 A^T b.  \qquad (7.4)
$$
• By setting the last expression to zero, we obtain the normal equations.

Remark 7.4. Theorem 7.3 implies that LS solutions of Ax = b are
solutions of the normal equations A^T A x̂ = A^T b.
• When A^T A is not invertible, the normal equations have infinitely many
  solutions, so the LS solution is not unique.
• So, data acquisition is important, to make it invertible.

Theorem 7.5. (Method of Normal Equations) Let A ∈ R^{m×n}, m ≥ n.
The following statements are logically equivalent:
a. The equation Ax = b has a unique LS solution for each b ∈ R^m.
b. The matrix A^T A is invertible.
c. Columns of A are linearly independent.
When these statements hold true, the unique LS solution x̂ is given by
$$
  x̂ = (A^T A)^{-1} A^T b.  \qquad (7.5)
$$

Definition 7.6. A⁺ := (A^T A)^{-1} A^T is called the pseudoinverse of A.

Example 7.7. Describe all least-squares solutions of the equation Ax = b, given
$$
  A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}
  \quad\text{and}\quad
  b = \begin{bmatrix} 1 \\ 3 \\ 8 \\ 2 \end{bmatrix}.
$$
Solution. Let's try to solve the problem with pencil-and-paper.
Solution. Let’s try to solve the problem with pencil-and-paper.

least_squares.m
1 A = [1 1 0; 0 1 0; 0 0 1; 1 0 1];
2 b = [1; 3; 8; 2];
3 x = (A'*A)\(A'*b)

Ans: x = [−4, 4, 7]T
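
For comparison, a numpy counterpart of least_squares.m, as a sketch: it solves the
normal equations (7.2) and also calls numpy's built-in least-squares solver.
least_squares_numpy.py
import numpy as np

A = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)
b = np.array([1, 3, 8, 2], dtype=float)

# normal equations (7.2): A^T A x = A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# built-in least-squares solver
x_ls, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)

print(x_ne)   # [-4.  4.  7.]
print(x_ls)   # [-4.  4.  7.]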



7.2. Regression Analysis


Definition 7.8. Regression analysis is a set of statistical methods
used to estimate relationships between one or more independent
variables and a dependent variable.

7.2.1. Regression line

Figure 7.1: A least-squares regression line.

Definition 7.9. Suppose a set of experimental data points are given as

(x1 , y1 ), (x2 , y2 ), · · · , (xm , ym )

such that the graph is close to a line. We (may and must) determine a
line
y = β0 + β1 x (7.6)
that is as close as possible to the data points. Then this line is called
the least-squares line; it is also called the regression line of y on x
and β0 , β1 are called regression coefficients.

Calculation of Least-Squares Lines

Consider a least-squares (LS) model of the form y = β_0 + β_1 x, for a given
data set {(x_i, y_i) | i = 1, 2, · · · , m}.
• Then, equating predicted y-values to observed y-values,
$$
  β_0 + β_1 x_1 = y_1, \quad β_0 + β_1 x_2 = y_2, \quad \cdots, \quad β_0 + β_1 x_m = y_m.  \qquad (7.7)
$$
• It can be equivalently written as
$$
  Xβ = y,  \qquad (7.8)
$$
  where
$$
  X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix},
  \qquad
  β = \begin{bmatrix} β_0 \\ β_1 \end{bmatrix},
  \qquad
  y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.
$$
  Here we call X the design matrix, β the parameter vector, and y
  the observation vector.
• Thus the LS solution can be determined by solving the normal equations:
$$
  X^T X β = X^T y,  \qquad (7.9)
$$
  provided that X^T X is invertible.
• The normal equations for the regression line read
$$
  \begin{bmatrix} m & \Sigma x_i \\ \Sigma x_i & \Sigma x_i^2 \end{bmatrix} β
  = \begin{bmatrix} \Sigma y_i \\ \Sigma x_i y_i \end{bmatrix}.  \qquad (7.10)
$$

Remark 7.10. (Pointwise construction of the normal equations)
The normal equations for the regression line in (7.10) can be rewritten as
$$
  \sum_{i=1}^{m} \begin{bmatrix} 1 & x_i \\ x_i & x_i^2 \end{bmatrix} β
  = \sum_{i=1}^{m} \begin{bmatrix} y_i \\ x_i y_i \end{bmatrix}.  \qquad (7.11)
$$
• The pointwise construction of the normal equations is convenient when
  either points are first to be searched or weights are applied depending
  on the point location.
• The idea is applicable for other regression models as well.

Example 7.11. Find the equation y = β_0 + β_1 x of the least-squares line that
best fits the given points:
    (−1, 0), (0, 1), (1, 2), (2, 4)
Solution.
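
A quick numerical check of Example 7.11 via the normal equations (7.9), as a sketch
(the script name is illustrative):
regression_line_numpy.py
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0])
ys = np.array([ 0.0, 1.0, 2.0, 4.0])

X = np.column_stack([np.ones_like(xs), xs])   # design matrix (7.8)
beta = np.linalg.solve(X.T @ X, X.T @ ys)     # normal equations (7.9)
print(beta)   # [beta0, beta1] = [1.1, 1.3]; np.polyfit(xs, ys, 1) gives the same line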

7.2.2. Least-squares fitting of other curves

Remark 7.12. Consider a regression model of the form
$$
  y = β_0 + β_1 x + β_2 x^2,
$$
for a given data set {(x_i, y_i) | i = 1, 2, · · · , m}. Then, equating predicted
y-values to observed y-values,
$$
  β_0 + β_1 x_1 + β_2 x_1^2 = y_1, \quad β_0 + β_1 x_2 + β_2 x_2^2 = y_2, \quad \cdots, \quad
  β_0 + β_1 x_m + β_2 x_m^2 = y_m.  \qquad (7.12)
$$
It can be equivalently written as
$$
  Xβ = y,  \qquad (7.13)
$$
where
$$
  X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix},
  \qquad
  β = \begin{bmatrix} β_0 \\ β_1 \\ β_2 \end{bmatrix},
  \qquad
  y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.
$$
Now, it can be solved through the normal equations:
$$
  X^T X β = \begin{bmatrix} \Sigma 1 & \Sigma x_i & \Sigma x_i^2 \\
                            \Sigma x_i & \Sigma x_i^2 & \Sigma x_i^3 \\
                            \Sigma x_i^2 & \Sigma x_i^3 & \Sigma x_i^4 \end{bmatrix} β
  = \begin{bmatrix} \Sigma y_i \\ \Sigma x_i y_i \\ \Sigma x_i^2 y_i \end{bmatrix} = X^T y.  \qquad (7.14)
$$

Self-study 7.13. Find an LS curve of the form y = β_0 + β_1 x + β_2 x^2 that best
fits the given points:
    (0, 1), (1, 1), (1, 2), (2, 3).
Solution.

Ans: y = 1 + 0.5x^2

7.2.3. Nonlinear regression: Linearization

Strategy 7.14. For nonlinear models, a change of variables can be
applied to get a linear model.

    Model                 Change of Variables          Linearization
    y = A + B/x           x̃ = 1/x,   ỹ = y        ⇒   ỹ = A + B x̃
    y = 1/(A + Bx)        x̃ = x,     ỹ = 1/y      ⇒   ỹ = A + B x̃         (7.15)
    y = C e^{Dx}          x̃ = x,     ỹ = ln y     ⇒   ỹ = ln C + D x̃
    y = 1/(A + B ln x)    x̃ = ln x,  ỹ = 1/y      ⇒   ỹ = A + B x̃

The above table contains just a few examples of linearization; for other
nonlinear models, use your imagination and creativity.

Example 7.15. Find the best-fitting curve of the form y = c e^{dx} for the data

    x      y
    0.1    1.9940
    0.2    2.0087
    0.3    1.8770
    0.4    3.5783
    0.5    3.9203
    0.6    4.7617
    0.7    6.7246
    0.8    7.1491
    0.9    9.5777
    1.0   11.5625

Solution. Applying the natural log function (ln) to y = c e^{dx} gives
$$
  \ln y = \ln c + dx.  \qquad (7.16)
$$
Using the change of variables
$$
  Y = \ln y, \quad a_0 = \ln c, \quad a_1 = d, \quad X = x,
$$

the equation (7.16) reads
$$
  Y = a_0 + a_1 X,  \qquad (7.17)
$$
for which one can apply the linear LS procedure.
Linearized regression, in Maple
1 # The transformed data
2 xlny := Matrix(m, 2):
3 for i to m do
4 xlny[i, 1] := xy[i, 1];
5 xlny[i, 2] := ln(xy[i, 2]);
6 end do:
7

8 # The linear LS
9 L := CurveFitting[LeastSquares](xlny, x, curve = b*x + a);
10 0.295704647799999 + 2.1530740654363654 x
11

12 # Back to the original parameters


13 c := exp(0.295704647799999) = 1.344073123
14 d := 2.15307406543637:
15

16 # The desired nonlinear model


17 c*exp(d*x);
18 1.344073123 exp(2.15307406543637 x)
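
The same linearized fit can be done in Python. A sketch with numpy (the data are those
of Example 7.15; the script name is illustrative):
linearized_fit_numpy.py
import numpy as np

x = np.arange(0.1, 1.05, 0.1)
y = np.array([1.9940, 2.0087, 1.8770, 3.5783, 3.9203,
              4.7617, 6.7246, 7.1491, 9.5777, 11.5625])

# linear LS on (x, ln y):  ln y = a0 + a1*x
a1, a0 = np.polyfit(x, np.log(y), 1)
c, d = np.exp(a0), a1
print(c, d)   # approximately 1.3441 and 2.1531, as in the Maple output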

7.3. Scene Analysis with Noisy Data: Weighted Least-Squares and RANSAC
Note: Scene analysis is concerned with the interpretation of acquired
data in terms of predefined models. It consists of 2 subproblems:
1. Finding the best model (classification problem)
2. Computing the best parameter values
(parameter estimation problem)

• Traditional parameter estimation techniques, such as least-


squares (LS), optimize the model to all data points.
– Those techniques are simple averaging methods, based on the
smoothing assumption: There will always be good data points
enough to smooth out any gross deviation.
• However, in many interesting parameter estimation problems, the
smoothing assumption does not hold; that is, the dataset may in-
volve gross errors such as noise.
– Thus, in order to obtain more reliable model parameters, there
must be internal mechanisms to determine which points are
matching to the model (inliers) and which points are false
matches (outliers).

7.3.1. Weighted Least-Squares


Definition 7.16. When certain data points are more important or more
reliable than the others, one may try to compute the coefficient vector
with larger weights on more reliable data points. The weighted least-
squares method is an LS method which involves a weight matrix W ,
often given as a diagonal matrix
W = diag(w1 , w2 , · · · , wm ), (7.18)

which can be decided either manually or automatically.



Algorithm 7.17. (Weighted Least-Squares)


• Given data {(xi , yi )}, 1 ≤ i ≤ m, the best-fitting curve can be found
by solving an over-determined algebraic system (7.8):

Xβ = y. (7.19)

• When a weight matrix is applied, the above system can be written


as
W Xβ = W y. (7.20)

• Thus its weighted normal equations read

X T W Xβ = X T W y. (7.21)

Example 7.18. Given data, find the LS line with and without a weight.
When a weight is applied, weigh the first and the last data point by 1/4.
$$
  xy := \begin{bmatrix}
    1. & 2. & 3. & 4. & 5. & 6. & 7. & 8. & 9. & 10. \\
    5.89 & 1.92 & 2.59 & 4.41 & 4.49 & 6.22 & 7.74 & 7.07 & 9.05 & 5.7
  \end{bmatrix}^T
$$

Solution.
Weighted-LS
1 LS := CurveFitting[LeastSquares](xy, x);
2 2.7639999999999967 + 0.49890909090909125 x
3 WLS := CurveFitting[LeastSquares](xy, x,
4 weight = [1/4,1,1,1,1,1,1,1,1,1/4]);
5 1.0466694879390623 + 0.8019424460431653 x
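
A Python sketch of the weighted normal equations (7.21) for Example 7.18, with the
weights used in the Maple call. Depending on the weighting convention of the software,
the weighted result should be close to the Maple WLS line above.
weighted_ls_numpy.py
import numpy as np

x = np.arange(1.0, 11.0)
y = np.array([5.89, 1.92, 2.59, 4.41, 4.49, 6.22, 7.74, 7.07, 9.05, 5.7])
w = np.array([0.25] + [1.0]*8 + [0.25])       # weights on the data points

X = np.column_stack([np.ones_like(x), x])     # design matrix
W = np.diag(w)

beta_ls  = np.linalg.solve(X.T @ X, X.T @ y)           # ordinary LS
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted LS (7.21)
print(beta_ls)    # approximately [2.764, 0.4989]
print(beta_wls)   # compare with the Maple WLS output above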

7.3.2. RANdom SAmple Consensus (RANSAC)


The random sample consensus (RANSAC) is one of the most powerful
tools for the reconstruction of ground structures from point cloud obser-
vations in many applications. The algorithm utilizes iterative search
techniques for a set of inliers, to find a proper model for given data.

Algorithm 7.19. (RANSAC) (Fischler-Bolles, 1981) [5]


Input: Measurement set X = {xi }, the error tolerance τe , the stopping
threshold η, and the maximum number of iterations N .
1. Select randomly a minimum point set S, required to determine a
hypothesis.
2. Generate a hypothesis p = g(S).
3. Compute the hypothesis consensus set, fitting within the error tol-
erance τe :
C = inlier(X, p, τe )
4. If |C| ≥ γ = η|X|, then re-estimate a hypothesis p = g(C) and stop.
5. Otherwise, repeat steps 1–4 (maximum of N times).

Example 7.20. Let's set a hypothesis for a regression line.
1. Minimum point set S: a set of two points, (x_1, y_1) and (x_2, y_2).
2. Hypothesis p: y = a + bx (⇒ a + bx − y = 0)
$$
  y = b(x - x_1) + y_1 = a + bx \;\Leftarrow\; b = \frac{y_2 - y_1}{x_2 - x_1}, \quad a = y_1 - b x_1.
$$
3. Consensus set C:
$$
  C = \Big\{ (x_i, y_i) \in X \;\Big|\; d = \frac{|a + b x_i - y_i|}{\sqrt{b^2 + 1}} ≤ τ_e \Big\}  \qquad (7.22)
$$
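
A Python sketch of steps 1–3 above, as a numpy counterpart of the Matlab helpers given
in Exercise 7.2 (function names and the toy points are illustrative):
ransac_helpers_sketch.py
import numpy as np

def hypothesis_from_two_points(p1, p2):
    """Line y = a + b*x through two sample points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)
    a = y1 - b*x1
    return a, b

def inliers(X, a, b, tau_e):
    """Consensus set (7.22): points within distance tau_e of the line."""
    d = np.abs(a + b*X[:, 0] - X[:, 1]) / np.sqrt(b**2 + 1)
    return d <= tau_e            # boolean mask, as in (7.23)

# toy usage (made-up points)
X = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 8.0]])
a, b = hypothesis_from_two_points(X[0], X[2])
print(inliers(X, a, b, tau_e=0.5))   # the last point is an outlier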

Note: Implementation of RANSAC


• Step 2: A hypothesis p is the set of model parameters, rather than
the model itself.
• Step 3: The consensus set can be represented more conveniently by
  considering C as an index array. That is,
$$
  C(i) = \begin{cases} 1 & \text{if } x_i \in C \\ 0 & \text{if } x_i \notin C \end{cases}  \qquad (7.23)
$$
  See inlier.m implemented for Exercise 7.2, p. 209.

Remark 7.21. The "inlier" function in Step 3 collects points whose
distance from the model, f(p), is not larger than τ_e.
Thus, the distance can be interpreted as an automatic weighting
mechanism. Indeed, for each point x_i,
$$
  \text{dist}(f(p), x_i) \begin{cases} ≤ τ_e, & \text{then } w_i = 1 \\ > τ_e, & \text{then } w_i = 0 \end{cases}  \qquad (7.24)
$$
Then the re-estimation in Step 4, p = g(C), can be seen as a parameter
estimation p = g(X) with the corresponding weight matrix
W = {w_1, w_2, · · · , w_m}.

Note: The Basic RANSAC Algorithm


• It is an iterative search method for a set of inliers which may pro-
duce presumably accurate model parameters.
• It is simple to implement and efficient. However, it is problematic
and often erroneous.
• The main disadvantage of RANSAC is that RANSAC is unre-
peatable; it may yield different results in each run so that none of
the results can be optimal.

Figure 7.2: The RANSAC for linear-type synthetic datasets.
Panels: (N_in = 200, N_out = 50) and (N_in = 1200, N_out = 300).

Table 7.1: The RANSAC: model fitting y = a_0 + a_1 x. The algorithm runs 1000 times for
each dataset to find the standard deviation of the error: σ(a_0 − â_0) and σ(a_1 − â_1).

    Data    σ(a_0 − â_0)    σ(a_1 − â_1)    E-time (sec)
    1       0.1156          0.0421          0.0156
    2       0.1101          0.0391          0.0348
RANSAC is neither repeatable nor optimal.


In order to overcome the drawback, various variants have been studied
in the literature. For variants, see e.g.,
• Maximum Likelihood Estimation Sample Consensus (MLESAC) [12]
• Progressive Sample Consensus (PROSAC) [3]
• Recursive RANSAC (R-RANSAC) [8]
Nonetheless, RANSAC remains a prevailing algorithm for finding in-
liers.

Exercises for Chapter 7

7.1. Given data

    x_i   0.2   0.4   0.6   0.8   1.    1.2   1.4   1.6   1.8   2.
    y_i   1.88  2.13  1.76  2.78  3.23  3.82  6.13  7.22  6.66  9.07

(a) Plot the data (scattered point plot)


(b) Decide what curve fits the data best.
(c) Use the method of normal equations to find the LS solution.
(d) Plot the curves superposed over the point plot.

7.2. This problem uses the data in Example 7.18, p.205.

(a) Implement the method of normal equations for the least-squares regression to
find the best-fitting line.
(b) The RANSAC, Algorithm 7.19 is implemented for you below. Use the code to
analyze the performance of the RANSAC.
• Set τe = 1, γ = η|X| = 8, and N = 100.
• Run ransac2 100 times to get the minimum, maximum, and average number
of iterations for the RANSAC to find an acceptable hypothesis consensus set.
(c) Plot the best-fitting lines found from (a) and (b), superposed over the data.
get_hypothesis_WLS.m
1 function p = get_hypothesis_WLS(X,C)
2 % Get hypothesis p, with C being used as weights
3 % Output: p = [a,b], where y= a+b*x
4

5 m = size(X,1);
6

7 A = [ones(m,1) X(:,1)];
8 A = A.*C; %A = bsxfun(@times,A,C);
9 r = X(:,2).*C;
10

11 p = ((A'*A)\(A'*r))';

inlier.m
1 function C = inlier(X,p,tau_e)
2 % Input: p=[a,b] s.t. a+b*x-y=0
3

4 m = size(X,1);
5 C = zeros(m,1);
6

7 a = p(1); b=p(2);
8 factor = 1./sqrt(b^2+1);
9 for i=1:m
10 xi = X(i,1); yi = X(i,2);
11 dist = abs(a+b*xi-yi)*factor; %distance from point to line
12 if dist<=tau_e, C(i)=1; end
13 end

ransac2.m
1 function [p,C,iter] = ransac2(X,tau_e,gamma,N)
2 % Input: X = {(x_i,y_i)}
3 % tau_e: the error tolerance
4 % gamma = eta*|X|
5 % N: the maximum number of iterations
6 % Output: p = [a,b], where y= a+b*x
7

8 %%-----------
9 [m,n] = size(X);
10 if n>m, X=X'; [m,n] = size(X); end
11

12 for iter = 1:N


13 % step 1
14 s1 = randi([1 m]); s2 = randi([1 m]);
15 while s1==s2, s2 = randi([1 m]); end
16 S = [X(s1,:);X(s2,:)];
17 % step 2
18 p = get_hypothesis_WLS(S,[1;1]);
19 % step 3
20 C = inlier(X,p,tau_e);
21 % step 4
22 if sum(C)>=gamma
23 p = get_hypothesis_WLS(X,C);
24 break;
25 end
26 end
CHAPTER 8
Python Basics

Contents of Chapter 8
8.1. Why Python? ............................................................... 212
8.2. Python Essentials in 30 Minutes ........................................... 215
8.3. Zeros of a Polynomial in Python ........................................... 221
8.4. Python Classes ............................................................. 225
Exercises for Chapter 8 ......................................................... 232

8.1. Why Python?


Note: A good programming language must be easy to learn and use
and flexible and reliable.

Advantages of Python
Python has the following characteristics.
• Easy to learn and use
• Flexible and reliable
• Extensively used in Data Science
• Handy for Web Development purposes
• Having Vast Libraries support
• Among the fastest-growing programming languages in the tech
industry
Disadvantage of Python
Python is an interpreted and dynamically-typed language. The line-by-
line execution of code, built with a high flexibility, most likely leads to
slow execution. Python scripts are way slow!

Remark 8.1. Speed up Python Programs


• Use numpy and scipy for all mathematical operations.
• Always use built-in functions wherever possible.

• Cython: It is designed as a C-extension for Python, which is


developed for users not familiar with C. A good choice!
• You may create and import your own C/C++/Fortran-modules into
Python. If you extend Python with pieces of compiled modules,
then the resulting code is easily 100× faster than Python scripts.
The Best Choice!

How to call C/C++/Fortran from Python


Functions in C/C++/Fortran can be compiled using the shell script.
Compile-f90-c-cpp
1 #!/usr/bin/bash
2

3 LIB_F90='lib_f90'
4 LIB_GCC='lib_gcc'
5 LIB_GPP='lib_gpp'
6

7 ### Compiling: f90


8 f2py3 -c --f90flags='-O3' -m $LIB_F90 *.f90
9

10 ### Compiling: C (PIC: position-independent code)


11 gcc -fPIC -O3 -shared -o $LIB_GCC.so *.c
12

13 ### Compiling: C++


14 g++ -fPIC -O3 -shared -o $LIB_GPP.so *.cpp

The shared objects (*.so) can be imported to the Python wrap-up.


Python Wrap-up
1 #!/usr/bin/python3
2

3 import numpy as np
4 import ctypes, time
5 from lib_py3 import *
6 from lib_f90 import *
7 lib_gcc = ctypes.CDLL("./lib_gcc.so")
8 lib_gpp = ctypes.CDLL("./lib_gpp.so")
9

10 ### For C/C++ ----------------------------------------------


11 # e.g., lib_gcc.CFUNCTION(double array,double array,int,int)
12 # returns a double value.
13 #-----------------------------------------------------------
14 IN_ddii = [np.ctypeslib.ndpointer(dtype=np.double),
15 np.ctypeslib.ndpointer(dtype=np.double),
16 ctypes.c_int, ctypes.c_int] #input type
17 OUT_d = ctypes.c_double #output type
18

19 lib_gcc.CFUNCTION.argtypes = IN_ddii
20 lib_gcc.CFUNCTION.restype = OUT_d
21

22 result = lib_gcc.CFUNCTION(x,y,n,m)

• The library numpy is designed for a Matlab-like implementation.


• Python can be used as a convenient desktop calculator.
– First, set a startup environment
– Use Python as a desktop calculator

∼/.python_startup.py
1 #.bashrc: export PYTHONSTARTUP=~/.python_startup.py
2 #.cshrc: setenv PYTHONSTARTUP ~/.python_startup.py
3 #---------------------------------------------------
4 print("\t^[[1;33m~/.python_startup.py")
5

6 import numpy as np; import sympy as sym


7 import numpy.linalg as la; import matplotlib.pyplot as plt
8 print("\tnp=numpy; la=numpy.linalg; plt=matplotlib.pyplot; sym=sympy")
9

10 from numpy import zeros,ones


11 print("\tzeros,ones, from numpy")
12

13 import random
14 from sympy import *
15 x,y,z,t = symbols('x,y,z,t');
16 print("\tfrom sympy import *; x,y,z,t = symbols('x,y,z,t')")
17

18 print("\t^[[1;37mTo see details: dir() or dir(np)^[[m")

Figure 8.1: Python startup.



8.2. Python Essentials in 30 Minutes


Key Features of Python
• Python is a simple, readable, open source programming language
which is easy to learn.
• It is an interpreted language, not a compiled language.
• In Python, variables are untyped; i.e., there is no need to define the
data type of a variable while declaring it.
• Python supports object-oriented programming models.
• It is platform-independent and easily extensible and embeddable.
• It has a huge standard library with lots of modules and packages.
• Python is a high level language as it is easy to use because of simple
syntax, powerful because of its rich libraries and extremely versatile.

Programming Features
• Python has no support for pointers.
• Python codes are stored with .py extension.
• Indentation: Python uses indentation to define a block of code.
– A code block (body of a function, loop, etc.) starts with indenta-
tion and ends with the first unindented line.
– The amount of indentation is up to the user, but it must be consis-
tent throughout that block.
• Comments:
– The hash (#) symbol is used to start writing a comment.
– Multi-line comments: Python uses triple quotes, either ”’ or """.

Python Essentials
• Sequence datatypes: list, tuple, string
– [list]: defined using square brackets (and commas)
>>> li = ["abc", 14, 4.34, 23]
– (tuple): defined using parentheses (and commas)
>>> tu = (23, (4,5), ’a’, 4.1, -7)
– "string": defined using quotes (", ’, or """)
>>> st = ’Hello World’
>>> st = "Hello World"
>>> st = """This is a multi-line string
. . . that uses triple quotes."""
• Retrieving elements
>>> li[0]
’abc’
>>> tu[1],tu[2],tu[-2]
((4, 5), ’a’, 4.1)
>>> st[25:36]
’ng\nthat use’
• Slicing
>>> tu[1:4] # be aware
((4, 5), ’a’, 4.1)
• The + and ∗ operators
>>> [1, 2, 3]+[4, 5, 6,7]
[1, 2, 3, 4, 5, 6, 7]
>>> "Hello" + " " + ’World’
Hello World
>>> (1,2,3)*3
(1, 2, 3, 1, 2, 3, 1, 2, 3)

• Reference semantics
>>> a = [1, 2, 3]
>>> b = a
>>> a.append(4)
>>> b
[1, 2, 3, 4]
Be aware with copying lists and numpy arrays!
• numpy, range, and iteration
>>> range(8)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> import numpy as np
>>> for k in range(np.size(li)):
... li[k]
. . . <Enter>
’abc’
14
4.34
23
• numpy array and deepcopy
>>> from copy import deepcopy
>>> A = np.array([1,2,3])
>>> B = A
>>> C = deepcopy(A)
>>> A *= 4
>>> B
array([ 4, 8, 12])
>>> C
array([1, 2, 3])

Frequently used Python Rules


frequently_used_rules.py
1 ## Multi-line statement
2 a = 1 + 2 + 3 + 4 + 5 +\
3 6 + 7 + 8 + 9 + 10
4 b = (1 + 2 + 3 + 4 + 5 +
5 6 + 7 + 8 + 9 + 10) #inside (), [], or {}
6 print(a,b)
7 # Output: 55 55
8

9 ## Multiple statements in a single line using ";"


10 a = 1; b = 2; c = 3
11

12 ## Docstrings in Python
13 def double(num):
14 """Function to double the value"""
15 return 2*num
16 print(double.__doc__)
17 # Output: Function to double the value
18

19 ## Assigning multiple values to multiple variables


20 a, b, c = 1, 2, "Hello"
21 ## Swap
22 b, c = c, b
23 print(a,b,c)
24 # Output: 1 Hello 2
25

26 ## Data types in Python


27 a = 5; b = 2.1
28 print("type of (a,b)", type(a), type(b))
29 # Output: type of (a,b) <class 'int'> <class 'float'>
30

31 ## Python Set: 'set' object is not subscriptable


32 a = {5,2,3,1,4}; b = {1,2,2,3,3,3}
33 print("a=",a,"b=",b)
34 # Output: a= {1, 2, 3, 4, 5} b= {1, 2, 3}

35

36 ## Python Dictionary
37 d = {'key1':'value1', 'Seth':22, 'Alex':21}
38 print(d['key1'],d['Alex'],d['Seth'])
39 # Output: value1 21 22
40

41 ## Output Formatting
42 x = 5.1; y = 10
43 print('x = %d and y = %d' %(x,y))
44 print('x = %f and y = %d' %(x,y))
45 print('x = {} and y = {}'.format(x,y))
46 print('x = {1} and y = {0}'.format(x,y))
47 # Output: x = 5 and y = 10
48 # x = 5.100000 and y = 10
49 # x = 5.1 and y = 10
50 # x = 10 and y = 5.1
51

52 print("x=",x,"y=",y, sep="#",end="&\n")
53 # Output: x=#5.1#y=#10&
54

55 ## Python Interactive Input


56 C = input('Enter any: ')
57 print(C)
58 # Output: Enter any: Starkville
59 # Starkville

Looping and Functions


Example 8.2. Compose a Python function which returns cubes of natural
numbers.
Solution.
get_cubes.py
1 def get_cubes(num):
2 cubes = []
3 for i in range(1,num+1):
4 value = i**3
5 cubes.append(value)
6 return cubes
7

8 if __name__ == '__main__':
9 num = input('Enter a natural number: ')
10 cubes = get_cubes(int(num))
11 print(cubes)

Remark 8.3. get_cubes.py


• Lines 8-11 are added for the function to be called directly. That is,
[Sun Nov.05] python get_cubes.py
Enter a natural number: 6
[1, 8, 27, 64, 125, 216]
• When get_cubes is called from another function, the last four lines
will not be executed.
call_get_cubes.py
1 from get_cubes import *
2

3 cubes = get_cubes(8)
4 print(cubes)

Execusion
1 [Sun Nov.05] python call_get_cubes.py
2 [1, 8, 27, 64, 125, 216, 343, 512]

8.3. Zeros of a Polynomial in Python


In this section, we will implement a Python code for zeros of a polynomial
and compare it with a Matlab code.

Recall: Let's begin with recalling how to find zeros of a polynomial,
presented in §3.6.
• Remark 3.69: When Newton's method is applied for finding an
  approximate zero of P(x), the iteration reads
$$
  x_n = x_{n-1} - \frac{P(x_{n-1})}{P'(x_{n-1})}.  \qquad (8.1)
$$
  Thus both P(x) and P'(x) must be evaluated in each iteration.
• Strategy 3.70: The derivative P'(x) can be evaluated by using
  Horner's method with the same efficiency. Indeed, differentiating (3.90)
$$
  P(x) = (x - x_0)\, Q(x) + P(x_0)
$$
  reads
$$
  P'(x) = Q(x) + (x - x_0)\, Q'(x).  \qquad (8.2)
$$
  Thus
$$
  P'(x_0) = Q(x_0).  \qquad (8.3)
$$
  That is, the evaluation of Q at x_0 becomes the desired quantity P'(x_0).

Example 8.4. (Revisit of Example 3.73, p. 108)


Let P (x) = x4 −4x3 +7x2 −5x−2. Use the Newton’s method and the Horner’s
method to implement a code and find an approximate zero of P near 3.
Solution. First, let’s try to use built-in functions.
zeros_of_poly_built_in.py
1 import numpy as np
2

3 coeff = [1, -4, 7, -5, -2]


4 P = np.poly1d(coeff)
5 Pder = np.polyder(P)
6

7 print(P)
8 print(Pder)
9 print(np.roots(P))
10 print(P(3), Pder(3))

Output
1 4 3 2
2 1 x - 4 x + 7 x - 5 x - 2
3 3 2
4 4 x - 12 x + 14 x - 5
5 [ 2. +0.j 1.1378411+1.52731225j 1.1378411-1.52731225j -0.2756822+0.j ]
6 19 37

Observation 8.5. We will see:


Python programming is as easy and simple as Matlab programming.
• In particular, numpy is developed for Matlab-like implementa-
tion, with enhanced convenience.
• Numpy is used extensively in most of scientific Python packages:
SciPy, Pandas, Matplotlib, scikit-learn, · · ·

Now, we implement a code in Python for Newton-Horner method to find


an approximate zero of P near 3.
Zeros-Polynomials-Newton-Horner.py
1 def horner(A,x0):
2 """ input: A = [a_n,...,a_1,a_0]
3 output: p,d = P(x0),DP(x0) = horner(A,x0) """
4 n = len(A)
5 p = A[0]; d = 0
6

7 for i in range(1,n):
8 d = p + x0*d
9 p = A[i] +x0*p
10 return p,d
11

12 def newton_horner(A,x0,tol,itmax):
13 """ input: A = [a_n,...,a_1,a_0]
14 output: x: P(x)=0 """
15 x=x0
16 for it in range(1,itmax+1):
17 p,d = horner(A,x)
18 h = -p/d;
19 x = x + h;
20 if(abs(h)<tol): break
21 return x,it
22

23 if __name__ == '__main__':
24 coeff = [1, -4, 7, -5, -2]; x0 = 3
25 tol = 10**(-12); itmax = 1000
26 x,it =newton_horner(coeff,x0,tol,itmax)
27 print("newton_horner: x0=%g; x=%g, in %d iterations" %(x0,x,it))

Execution
1 [Sat Jul.23] python Zeros-Polynomials-Newton-Horner.py
2 newton_horner: x0=3; x=2, in 7 iterations

Note: The above Python code must be compared with the Matlab code
in §3.6:

horner.m
1 function [p,d] = horner(A,x0)
2 % input: A = [a_0,a_1,...,a_n]
3 % output: p=P(x0), d=P'(x0)
4

5 n = size(A(:),1);
6 p = A(n); d=0;
7

8 for i = n-1:-1:1
9 d = p + x0*d;
10 p = A(i) +x0*p;
11 end

newton_horner.m
1 function [x,it] = newton_horner(A,x0,tol,itmax)
2 % input: A = [a_0,a_1,...,a_n]; x0: initial for P(x)=0
3 % outpue: x: P(x)=0
4

5 x = x0;
6 for it=1:itmax
7 [p,d] = horner(A,x);
8 h = -p/d;
9 x = x + h;
10 if(abs(h)<tol), break; end
11 end

Call_newton_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 tol = 10^-12; itmax=1000;
4 [x,it] = newton_horner(a,x0,tol,itmax);
5 fprintf(" newton_horner: x0=%g; x=%g, in %d iterations\n",x0,x,it)
6 Result: newton_horner: x0=3; x=2, in 7 iterations

8.4. Python Classes

Remark 8.6. Object-Oriented Programming (OOP)


Classes are a key concept in the object-oriented programming.
Classes provide a means of bundling data and functionality to-
gether.
• A class is a user-defined template or prototype from which real-
world objects are created.
• The major merit of using classes is on the sharing mechanism
between functions/methods and objects.
– Initialization and the sharing boundaries must be declared
clearly and conveniently.
• A class tells us
– what data an object should have,
– what are the initial/default values of the data, and
– what methods are associated with the object to take actions on
the objects using their data.
• An object is an instance of a class, and creating an object from a
class is called instantiation.

In the following, we would build a simple class, as Dr. Xu did in [14, Ap-
pendix B.5]; you will learn how to initiate, refine, and use classes.

Initiation of a Class
Polynomial_01.py
1 class Polynomial():
2 """A class of polynomials"""
3

4 def __init__(self,coefficient):
5 """Initialize coefficient attribute of a polynomial."""
6 self.coeff = coefficient
7

8 def degree(self):
9 """Find the degree of a polynomial"""
10 return len(self.coeff)-1
11

12 if __name__ == '__main__':
13 p2 = Polynomial([1,2,3])
14 print(p2.coeff) # a variable; output: [1, 2, 3]
15 print(p2.degree()) # a method; output: 2

• Lines 1-2: define a class called Polynomial with a docstring.


– The parentheses in the class definition are empty because we cre-
ate this class from scratch.
• Lines 4-10: define two functions, __init__() and degree(). A function
in a class is called a method.
– The __init__() method is a special method for initialization;
it is called the __init__() constructor.
– The self Parameter and Its Sharing
* The self parameter is required and must come first before the
other parameters in each method.
* The variable self.coeff (prefixed with self) is available to
every method and is accessible by any objects created from
the class. (Variables prefixed with self are called attributes.)
* We do not need to provide arguments for self.
• Line 13: The line p2 = Polynomial([1,2,3]) creates an object p2 (a
polynomial x2 + 2x + 3), by passing the coefficient list [1,2,3].
– When Python reads this line, it calls the method __init__() in the
class Polynomial and creates the object named p2 that represents
this particular polynomial x2 + 2x + 3.

Refinement of the Polynomial class


Polynomial_02.py
1 class Polynomial():
2 """A class of polynomials"""
3

4 count = 0 #Polynomial.count
5

6 def __init__(self):
7 """Initialize coefficient attribute of a polynomial."""
8 self.coeff = [1]
9 Polynomial.count += 1
10

11 def __del__(self):
12 """Delete a polynomial object"""
13 Polynomial.count -= 1
14

15 def degree(self):
16 """Find the degree of a polynomial"""
17 return len(self.coeff)-1
18

19 def evaluate(self,x):
20 """Evaluate a polynomial."""
21 n = self.degree(); eval = []
22 for xi in x:
23 p = self.coeff[0] #Horner's method
24 for k in range(1,n+1): p = self.coeff[k]+ xi*p
25 eval.append(p)
26 return eval
27

28 if __name__ == '__main__':
29 poly1 = Polynomial()
30 print('poly1, default coefficients:', poly1.coeff)
31 poly1.coeff = [1,2,-3]
32 print('poly1, coefficients after reset:', poly1.coeff)
33 print('poly1, degree:', poly1.degree())
34

35 poly2 = Polynomial(); poly2.coeff = [1,2,3,4,-5]


36 print('poly2, coefficients after reset:', poly2.coeff)
37 print('poly2, degree:', poly2.degree())
38

39 print('number of created polynomials:', Polynomial.count)


40 del poly1
41 print('number of polynomials after a deletion:', Polynomial.count)
42 print('poly2.evaluate([-1,0,1,2]):',poly2.evaluate([-1,0,1,2]))

• Line 4: (Global Variable) The variable count is a class attribute of


Polynomial.
– It belongs to the class but not a particular object.
– All objects of the class share this same variable
(Polynomial.count).
• Line 8: (Initialization) Initializes the class attribute self.coeff.
– Every object or class attribute in a class needs an initial value.
– One can set a default value for an object attribute in the
__init__() constructor; and we do not have to include a param-
eter for that attribute. See Lines 29 and 35.
• Lines 11-13: (Deletion of Objects) Define the __del__() method in
the class for the deletion of objects. See Line 40.
– del is a built-in function which deletes variables and objects.
• Lines 19-28: (Add Methods) Define another method called evaluate,
which uses the Horner’s method. See Example 8.4, p.222.

Output
1 poly1, default coefficients: [1]
2 poly1, coefficients after reset: [1, 2, -3]
3 poly1, degree: 2
4 poly2, coefficients after reset: [1, 2, 3, 4, -5]
5 poly2, degree: 4
6 number of created polynomials: 2
7 number of polynomials after a deletion: 1
8 poly2.evaluate([-1,0,1,2]): [-7, -5, 5, 47]

Inheritance
Note: If we want to write a class that is just a specialized version of
another class, we do not need to write the class from scratch.
• We call the specialized class a child class and the other general
class a parent class.
• The child class can inherit all the attributes and methods form the
parent class.
– It can also define its own special attributes and methods or even
overrides methods of the parent class.
Classes can import functions implemented earlier, to define methods.
Classes.py
1 from util_Poly import *
2

3 class Polynomial():
4 """A class of polynomials"""
5

6 def __init__(self,coefficient):
7 """Initialize coefficient attribute of a polynomial."""
8 self.coeff = coefficient
9

10 def degree(self):
11 """Find the degree of a polynomial"""
12 return len(self.coeff)-1
13

14 class Quadratic(Polynomial):
15 """A class of quadratic polynomial"""
16

17 def __init__(self,coefficient):
18 """Initialize the coefficient attributes ."""
19 super().__init__(coefficient)
20 self.power_decrease = 1
21

22 def roots(self):
23 return roots_Quad(self.coeff,self.power_decrease)
24

25 def degree(self):
26 return 2

• Line 1: Imports functions implemented earlier.


• Line 14: We must include the name of the parent class in the paren-
theses of the definition of the child class (to indicate the parent-child
relation for inheritance).
• Line 19: The super() function is to give an child object all the at-
tributes defined in the parent class.
• Line 20: An additional child class attribute self.power_decrease is
initialized.
• Lines 22-23: define a new method called roots, reusing a function
implemented earlier.
• Lines 25-26: The method degree() overrides the parent’s method.

util_Poly.py
1 def roots_Quad(coeff,power_decrease):
2 a,b,c = coeff
3 if power_decrease != 1:
4 a,c = c,a
5 discriminant = b**2-4*a*c
6 r1 = (-b+discriminant**0.5)/(2*a)
7 r2 = (-b-discriminant**0.5)/(2*a)
8 return [r1,r2]

call_Quadratic.py
1 from Classes import *
2

3 quad1 = Quadratic([2,-3,1])
4 print('quad1, roots:',quad1.roots())
5 quad1.power_decrease = 0
6 print('roots when power_decrease = 0:',quad1.roots())

Output
1 quad1, roots: [1.0, 0.5]
2 roots when power_decrease = 0: [2.0, 1.0]

Final Remarks on Python Implementation


• A proper modularization must precede implementation, as for
other programming languages.
• Classes are used quite frequently.
– You do not have to use classes for small projects.
• Try to use classes smartly.
Quite often, they add unnecessary complications and
their methods are hardly applicable directly for other projects.
– You may implement stand-alone functions to import.
– This strategy enhances reusability of functions.
For example, the function roots_Quad defined in util_Poly.py
(page 230) can be used directly for other projects.
– Afterwards, you will get your own utility functions; using
them, you can complete various programming tasks effectively.

Exercises for Chapter 8

You should use Python for the following problems.


8.1. Use nested for loops to assign entries of a 5 × 5 matrix A such that A[i, j] = ij.
8.2. The variable d is initially equal to 1. Use a while loop to keep dividing d by 2 until
d < 10^−6.

(a) Determine how many divisions are made.


(b) Verify your result by algebraic derivation.

Note: A while loop has not been considered in the lecture. However, you can figure it out
easily by yourself.
8.3. Write a function that takes as input a list of values and returns the largest value. Do
this without using the Python max() function; you should combine a for loop and an
if statement.

(a) Produce a random list of size 10-20 to verify your function.

8.4. Let P4(x) = 2x^4 − 5x^3 − 11x^2 + 20x + 10. Solve the following.

(a) Plot P4 over the interval [−3, 4].


(b) Find all zeros of P4 , modifying Zeros-Polynomials-Newton-Horner.py, p.222.
(c) Add markers for the zeros to the plot.
(d) Find all roots of P4′(x) = 0.
(e) Add markers for the zeros of P4′ to the plot.

Hint : For plotting, you may import: “import matplotlib.pyplot as plt” then use
plt.plot(). You will see the Python plotting is quite similar to Matlab plotting.
Chapter 9

Vector Spaces and Orthogonality

Contents of Chapter 9
9.1. Subspaces of Rn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
9.2. Orthogonal Sets and Orthogonal Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.3. Orthogonal Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.4. The Gram-Schmidt Process and QR Factorization . . . . . . . . . . . . . . . . . . . . . 248
9.5. QR Iteration for Finding Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257


9.1. Subspaces of Rn
Definition 9.1. A subspace of Rn is any set H in Rn that has three
properties:
a) The zero vector is in H.
b) For each u and v in H, the sum u + v is in H.
c) For each u in H and each scalar c, the vector cu is in H.
That is, H is closed under linear combinations.

Remark 9.2. Rn , with the standard addition and scalar multiplication,


is a vector space.

Example 9.3.

1. A line through the origin in R2 is a subspace of R2 .


2. Any plane through the origin in R3 is a subspace of R3 .

Figure 9.1: Span{v1 , v2 } as a plane through the origin.

3. Let v1 , v2 , · · · , vp ∈ Rn . Then Span{v1 , v2 , · · · , vp } is a subspace of Rn .


4. Every spanned set in Rn is a subspace.

Given a matrix A ∈ Rm×n , we may consider the subspace spanned by the


column vectors of A.
Definition 9.4. Let A be an m × n matrix.
The column space of A is the set (Col A) of all linear combinations of
columns of A. That is, if A = [a1 a2 · · · an ], then

Col A = {u | u = c1 a1 + c2 a2 + · · · + cn an }, (9.1)

where c1 , c2 , · · · , cn are scalars. Col A is a subspace of Rm .


   
Example 9.5. Let A = [ 1 −3 −4; −4 6 −2; −3 7 6 ] and b = (3, 3, −4)^T. Determine whether
b is in the column space of A, Col A.
Solution. Clue: (1) b ∈ Col A
⇔ (2) b is a linear combination of the columns of A
⇔ (3) Ax = b is consistent
⇔ (4) the system with augmented matrix [A b] has a solution

Definition 9.6. Let A be an m × n matrix. The null space of A, Nul A,


is the set of all solutions of the homogeneous system Ax = 0.

Theorem 9.7. Nul A is a subspace of Rn .

Proof.

Basis for a Subspace


Definition 9.8. A basis for a subspace H in Rn is a set of vectors that
1. is linearly independent, and
2. spans H.

Remark 9.9.
1. { (1, 0)^T, (1, 2)^T } is a basis for R^2.
2. Let e1 = (1, 0, · · · , 0)^T, e2 = (0, 1, 0, · · · , 0)^T, · · · , en = (0, · · · , 0, 1)^T. Then {e1 , e2 , · · · , en } is called
the standard basis for R^n.

Example 9.10. Find a basis for the column space of the matrix
 
B = [ 1 0 −3 5 0; 0 1 2 −1 0; 0 0 0 0 1; 0 0 0 0 0 ].
Solution. Observation: b3 = −3b1 + 2b2 and b4 = 5b1 − b2 .

Theorem 9.11. In general, non-pivot columns are linear combinations


of preceding pivot columns. Thus the pivot columns of a matrix A form
a basis for Col A.

Example 9.12. Find bases for the column space and the null space of the
matrix
 
A = [ −3 6 −1 1; 1 −2 2 3; 2 −4 5 8 ].

Solution. A ∼ [ 1 −2 0 −1; 0 0 1 2; 0 0 0 0 ]

Ans: a basis for Col A is {a1 , a3 }

Theorem 9.13. A basis for Nul A can be obtained from the parametric
vector form of solutions of Ax = 0. That is, suppose that the solutions of
Ax = 0 reads
x = x1 u1 + x2 u2 + · · · + xk uk ,
where x1 , x2 , · · · , xk correspond to free variables. Then, a basis for Nul A
is {u1 , u2 , · · · , uk }.

Theorem 9.14. (Rank Theorem) Let A ∈ Rm×n . Then

dim Col A + dim Nul A = rank A + nullity A = n


= (the number of columns in A)

Here, “dim Nul A” is called the nullity of A: nullity A



9.2. Orthogonal Sets and Orthogonal Matrix


Definition 9.15. A set of vectors S = {u1 , u2 , · · · , up } in Rn is said
to be an orthogonal set if each pair of distinct vectors from the set is
orthogonal. That is,
ui •uj = 0, for i ≠ j.

If the vectors in S are nonzero, then S is linearly independent and therefore


forms a basis for the subspace spanned by S.

Definition 9.16. An orthogonal basis for a subspace W of Rn is a


basis for W that is also an orthogonal set.

The following theorem shows one of the reasons why orthogonality is a useful


property in vector spaces and matrix algebra.
Theorem 9.17. Let {u1 , u2 , · · · , up } be an orthogonal basis for a sub-
space W of Rn . For each y in W , the weights in the linear combination

y = c1 u1 + c2 u2 + · · · + cp up (9.2)
are given by
cj = (y•uj)/(uj •uj), (j = 1, 2, · · · , p). (9.3)

Proof. y•uj = (c1 u1 + c2 u2 + · · · + cp up )•uj = cj uj •uj , from which we can


conclude (9.3).
Example 9.18. Consider a set of vectors S = {u1 , u2 , u3 }, where
     
u1 = (1, −2, 1)^T, u2 = (0, 1, 2)^T, and u3 = (−5, −2, 1)^T. (a) Is S orthogonal? (b) Express the
vector y = [11, 0, −5]^T as a linear combination of the vectors in S.
Solution.

Ans: y = u1 − 2u2 − 2u3 .



An Orthogonal Projection
Note: Given a nonzero vector u in Rn , consider the problem of decompos-
ing a vector y ∈ Rn into sum of two vectors, one a multiple of u and the
other orthogonal to u. Let
y = ŷ + z,   ŷ // u and z ⊥ u.

Let ŷ = αu. Then
0 = z•u = (y − αu)•u = y•u − αu•u.

Thus α = y•u/u•u.

Figure 9.2: Orthogonal projection: y = ŷ + z.

Definition 9.19. Given a nonzero vector u in Rn , for y ∈ Rn , let

y = ŷ + z,   ŷ // u and z ⊥ u. (9.4)

Then
ŷ = αu = (y•u)/(u•u) u,   z = y − ŷ. (9.5)

The vector ŷ is called the orthogonal projection of y onto u, and z is
called the component of y orthogonal to u. Let L = Span{u}. Then
we denote
ŷ = (y•u)/(u•u) u = projL y, (9.6)
which is called the orthogonal projection of y onto L.
Example 9.20. Let y = (7, 6)^T and u = (4, 2)^T.

(a) Find the orthogonal projection of y onto u.


(b) Write y as the sum of two orthogonal vectors, one in L = Span{u} and
one orthogonal to u.
(c) Find the distance from y to L.
Solution.

Figure 9.3: The orthogonal projection of y


onto L = Span{u}.

Orthonormal Sets
Definition 9.21. A set {u1 , u2 , · · · , up } is an orthonormal set, if it is
an orthogonal set of unit vectors. If W is the subspace spanned by such a
set, then {u1 , u2 , · · · , up } is an orthonormal basis for W , since the set
is automatically linearly independent.
   
Example 9.22. In Example 9.18, p. 238, we know v1 = (1, −2, 1)^T, v2 = (0, 1, 2)^T, and
v3 = (−5, −2, 1)^T form an orthogonal basis for R^3. Find the corresponding
orthonormal basis.
Solution.

Theorem 9.23. An m × n matrix U has orthonormal columns if and


only if U T U = I.

Proof. To simplify notation, we suppose that U has only three columns:


U = [u1 u2 u3 ], ui ∈ Rm . Then
   
U^T U = [u1^T; u2^T; u3^T] [u1 u2 u3] = [ u1^T u1  u1^T u2  u1^T u3;  u2^T u1  u2^T u2  u2^T u3;  u3^T u1  u3^T u2  u3^T u3 ].

Thus, U has orthonormal columns ⇔ U^T U = I3, the 3 × 3 identity matrix. The proof of the general case is essentially the same.

Theorem 9.24. Let U be an m × n matrix with orthonormal columns,


and let x, y ∈ Rn . Then
(a) kU xk = kxk (length preservation)
(b) (U x)•(U y) = x•y (dot product preservation)
(c) (U x)•(U y) = 0 ⇔ x•y = 0 (orthogonality preservation)

Proof.

Theorems 9.23 and 9.24 are particularly useful when applied to square ma-
trices.
Definition 9.25. An orthogonal matrix is a square matrix U such
that U T = U −1 , i.e.,

U ∈ Rn×n and U T U = I. (9.7)

Let’s generate a random orthogonal matrix and test it.

orthogonal_matrix.m Output
1 n = 4; 1 U =
2 2 -0.5332 0.4892 0.6519 0.2267
3 [Q,~] = qr(rand(n)); 3 -0.5928 -0.7162 0.1668 -0.3284
4 U = Q; 4 -0.0831 0.4507 -0.0991 -0.8833
5 5 -0.5978 0.2112 -0.7331 0.2462
6 disp("U ="); disp(U) 6 U'*U =
7 disp("U'*U ="); disp(U'*U) 7 1.0000 -0.0000 0 -0.0000
8 8 -0.0000 1.0000 0.0000 0.0000
9 x = rand([n,1]); 9 0 0.0000 1.0000 -0.0000
10 fprintf("\nx' ="); disp(x') 10 -0.0000 0.0000 -0.0000 1.0000
11 fprintf("||x||_2 =");disp(norm(x,2)) 11 x' = 0.4218 0.9157 0.7922 0.9595
12 fprintf("||U*x||_2=");disp(norm(U*x,2)) 12 ||x||_2 = 1.6015
13 ||U*x||_2= 1.6015

9.3. Orthogonal Projections


Definition 9.26. Let W be a subspace of Rn .

• A vector z ∈ Rn is said to be orthogonal to the subspace W


if z•w = 0 for all w ∈ W .
• The set of all vectors z that are orthogonal to W is called the or-
thogonal complement of W and is denoted by W ⊥ (and read as
“W perpendicular" or simply “W perp"). That is,

W ⊥ = {z | z•w = 0, ∀ w ∈ W }. (9.8)

Example 9.27. Let W be a plane


through the origin in R3 , and let L be
the line through the origin and per-
pendicular to W .
If z ∈ L and w ∈ W , then
z•w = 0.
See Figure 9.4.
In fact, L consists of all vectors that
are orthogonal to the w’s in W , and
W consists of all vectors orthogonal
to the z’s in L. That is,
L = W ⊥ and W = L⊥ . Figure 9.4: A plane and line through the ori-
gin as orthogonal complements.

Remark 9.28. Let W be a subspace of Rn .


1. A vector x is in W ⊥ ⇔ x is orthogonal to every vector in a set that
spans W .
2. W ⊥ is a subspace of Rn .

Recall: (Definition 9.19, § 9.2) Given a nonzero vector u in Rn , for y ∈


Rn , let
y = ŷ + z,   ŷ // u and z ⊥ u. (9.9)
Then
ŷ = αu = (y•u)/(u•u) u,   z = y − ŷ. (9.10)
The vector ŷ is called the orthogonal projection of y onto u, and z is called
the component of y orthogonal to u. Let L = Span{u}. Then we denote
ŷ = (y•u)/(u•u) u = projL y, (9.11)
which is called the orthogonal projection of y onto L.

We generalize this orthogonal projection to subspaces.


Theorem 9.29. (The Orthogonal Decomposition Theorem) Let W
be a subspace of Rn . Then each y ∈ Rn can be written uniquely in the
form
y = ŷ + z, (9.12)
where ŷ ∈ W and z ∈ W⊥. In fact, if {u1 , u2 , · · · , up } is an orthogonal
basis for W , then
ŷ = projW y = (y•u1)/(u1 •u1) u1 + (y•u2)/(u2 •u2) u2 + · · · + (y•up)/(up •up) up ,
z = y − ŷ. (9.13)

Figure 9.5: Orthogonal projection of y onto W .


     
Example 9.30. Let u1 = (2, 5, −1)^T, u2 = (−2, 1, 1)^T, and y = (1, 2, 3)^T. Observe that
{u1 , u2 } is an orthogonal basis for W = Span{u1 , u2 }.

(a) Write y as the sum of a vector in W and a vector orthogonal to W .


(b) Find the distance from y to W .
Solution. y = ŷ + z ⇒ ŷ = (y•u1)/(u1 •u1) u1 + (y•u2)/(u2 •u2) u2 and z = y − ŷ.

Figure 9.6: A geometric interpretation of the orthogonal projection.
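As a numerical sanity check of (9.13), the computation in Example 9.30 can be reproduced with a few lines of NumPy. This is only an illustrative sketch; the variable names are chosen here and are not part of the text.

import numpy as np

u1 = np.array([2., 5., -1.]); u2 = np.array([-2., 1., 1.])
y  = np.array([1., 2., 3.])

# orthogonal projection of y onto W = Span{u1, u2}, by (9.13)
yhat = (y @ u1)/(u1 @ u1)*u1 + (y @ u2)/(u2 @ u2)*u2
z = y - yhat                      # component of y orthogonal to W

print(yhat, z)                    # [-0.4  2.   0.2] [1.4 0.  2.8]
print(np.linalg.norm(z))          # distance from y to W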

Remark 9.31. (Properties of Orthogonal Decomposition)


Let y = ŷ + z, where ŷ ∈ W and z ∈ W⊥. Then
1. ŷ is called the orthogonal projection of y onto W (= projW y).
2. ŷ is the closest point to y in W
(in the sense ‖y − ŷ‖ ≤ ‖y − v‖, for all v ∈ W ).
3. ŷ is called the best approximation to y by elements of W .
4. If y ∈ W , then projW y = y.

Proof. 2. For an arbitrary v ∈ W , y − v = (y − ŷ) + (ŷ − v), where (ŷ − v) ∈ W .
Thus, by the Pythagorean theorem,

‖y − v‖^2 = ‖y − ŷ‖^2 + ‖ŷ − v‖^2,

which implies that ‖y − v‖ ≥ ‖y − ŷ‖.
Self-study 9.32. Find the closest point to y in the subspace Span{u1 , u2 }
and hence find the distance from y to W .
y = (3, −1, 1, 13)^T,  u1 = (1, −2, −1, 2)^T,  u2 = (−4, 1, 0, 3)^T
Solution.

Theorem 9.33. If {u1 , u2 , · · · , up } is an orthonormal basis for a sub-


space W of Rn , then

projW y = (y•u1 ) u1 + (y•u2 ) u2 + · · · + (y•up ) up . (9.14)

If U = [u1 u2 · · · up ], then

projW y = U U T y, for all y ∈ Rn . (9.15)

Thus the orthogonal projection can be viewed as a matrix transforma-


tion.

Proof. Notice that


(y•u1 ) u1 + (y•u2 ) u2 + · · · + (y•up ) up
= (uT1 y) u1 + (uT2 y) u2 + · · · + (uTp y) up
= U (U^T y).

Example 9.34. Let y = (7, 9)^T, u1 = (1/√10, −3/√10)^T, and W = Span{u1 }.

(a) Let U be the 2 × 1 matrix whose only column is u1 . Compute U T U and


UUT.
(b) Compute projW y = (y•u1 ) u1 and U U T y.
Solution.

   
Ans: (a) U^T U = [1],  U U^T = (1/10)[ 1 −3; −3 9 ]   (b) projW y = U U^T y = (−2, 6)^T
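The matrix form (9.15) is easy to verify numerically. A minimal NumPy sketch for Example 9.34, for illustration only:

import numpy as np

u1 = np.array([1., -3.])/np.sqrt(10)
U  = u1.reshape(2, 1)             # the 2x1 matrix whose only column is u1
y  = np.array([7., 9.])

print(U.T @ U)                    # [[1.]], i.e., U^T U = I
print(U @ U.T)                    # (1/10)*[[1,-3],[-3,9]]
print((y @ u1)*u1, U @ U.T @ y)   # both give proj_W y = [-2.  6.]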

9.4. The Gram-Schmidt Process and QR Factorization
Note: The Gram-Schmidt process is an algorithm to produce an
orthogonal or orthonormal basis for any nonzero subspace of Rn .
   
Example 9.35. Let W = Span{x1 , x2 }, where x1 = (3, 6, 0)^T and x2 = (1, 2, 2)^T. Find
an orthogonal basis for W .
Main idea: Orthogonal projection
{ x1 , x2 } ⇒ { x1 , x2 = αx1 + v2 } ⇒ { v1 = x1 , v2 = x2 − αx1 },
where x1 •v2 = 0. Then W = Span{x1 , x2 } = Span{v1 , v2 }.
Solution.

Figure 9.7: Construction of an orthogonal basis {v1 , v2 }.

Theorem 9.36. (The Gram-Schmidt Process) Given a basis


{x1 , x2 , · · · , xp } for a nonzero subspace W of Rn , define

v1 = x1
v2 = x2 − (x2 •v1)/(v1 •v1) v1
v3 = x3 − (x3 •v1)/(v1 •v1) v1 − (x3 •v2)/(v2 •v2) v2 (9.16)
 ⋮
vp = xp − (xp •v1)/(v1 •v1) v1 − (xp •v2)/(v2 •v2) v2 − · · · − (xp •vp−1)/(vp−1 •vp−1) vp−1
Then {v1 , v2 , · · · , vp } is an orthogonal basis for W . In addition,

Span{x1 , x2 , · · · , xk } = Span{v1 , v2 , · · · , vk }, for 1 ≤ k ≤ p. (9.17)

Remark 9.37. For the result of the Gram-Schmidt process, define
uk = vk /‖vk ‖, for 1 ≤ k ≤ p. (9.18)
Then {u1 , u2 , · · · , up } is an orthonormal basis for W . In practice, it is
often implemented with the normalized Gram-Schmidt process.
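Below is a minimal Python sketch of the (normalized) Gram-Schmidt process of Theorem 9.36 and Remark 9.37, checked on the data of Example 9.35. The function name and the column-wise storage are choices made here for illustration, and the sketch assumes the input columns are linearly independent.

import numpy as np

def gram_schmidt(X):
    """Columns of X form a basis {x1,...,xp}. Returns (V, U):
       V: orthogonal basis as in (9.16); U: orthonormal basis as in (9.18)."""
    X = np.asarray(X, dtype=float)
    V = np.zeros_like(X)
    for k in range(X.shape[1]):
        v = X[:, k].copy()
        for j in range(k):   # subtract the projections onto v1, ..., v_{k-1}
            v -= (X[:, k] @ V[:, j]) / (V[:, j] @ V[:, j]) * V[:, j]
        V[:, k] = v
    U = V / np.linalg.norm(V, axis=0)     # normalize each column
    return V, U

# Example 9.35: x1 = (3,6,0)^T, x2 = (1,2,2)^T, stored as columns
V, U = gram_schmidt([[3., 1.], [6., 2.], [0., 2.]])
print(V)            # columns: v1 = (3,6,0)^T, v2 = (0,0,2)^T
print(U.T @ U)      # approximately the identity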

Example 9.38. Find an orthonormal basis for W = Span{x1 , x2 , x3 }, where


     
x1 = (1, 0, −1, 1)^T, x2 = (−2, 2, 1, 0)^T, and x3 = (0, 1, −1, 1)^T.
Solution.

QR Factorization of Matrices

Theorem 9.39. (The QR Factorization) If A is an m × n matrix with


linearly independent columns, then A can be factored as

A = QR, (9.19)

where
• Q is an m × n matrix whose columns are orthonormal.
• R is an n × n upper triangular invertible matrix
with positive entries on its diagonal.

Proof. The columns of A form a basis {x1 , x2 , · · · , xn } for W = Col A.

1. Construct an orthonormal basis {u1 , u2 , · · · , un } for W (the Gram-


Schmidt process). Set
def
Q == [u1 u2 · · · un ]. (9.20)

2. (Expression) Since Span{x1 , x2 , · · · , xk } = Span{u1 , u2 , · · · , uk }, 1 ≤


k ≤ n, there are constants r1k , r2k , · · · , rkk such that

xk = r1k u1 + r2k u2 + · · · + rkk uk + 0 · uk+1 + · · · + 0 · un . (9.21)

We may assume that rkk > 0. (If rkk < 0, multiply both rkk and uk by −1.)
3. Let r k = [r1k , r2k , · · · , rkk , 0, · · · , 0]T . Then

xk = Qr k (9.22)

4. Define
def
R == [r 1 r 2 · · · r n ]. (9.23)

Then we see A = [x1 x2 · · · xn ] = [Qr 1 Qr 2 · · · Qr n ] = QR.



The QR Factorization is summarized as follows.


Algorithm 9.40. (QR Factorization) Let A = [x1 x2 · · · xn ].
• Apply the Gram-Schmidt process to obtain an orthonormal basis
{u1 , u2 , · · · , un }.
• Then, as in (9.21),
x1 = (u1 •x1 )u1
x2 = (u1 •x2 )u1 + (u2 •x2 )u2
x3 = (u1 •x3 )u1 + (u2 •x3 )u2 + (u3 •x3 )u3 (9.24)
 ⋮
xn = Σ_{j=1}^{n} (uj •xn ) uj .

• Thus
A = [x1 x2 · · · xn ] = QR (9.25)
implies that
Q = [u1 u2 · · · un ],
R = [ u1•x1  u1•x2  u1•x3  · · ·  u1•xn ;
       0     u2•x2  u2•x3  · · ·  u2•xn ;
       0      0     u3•x3  · · ·  u3•xn ;
       ⋮      ⋮      ⋮     ⋱      ⋮   ;
       0      0      0     · · ·  un•xn ]  = Q^T A. (9.26)

• In practice, the coefficients rij = ui •xj , i < j, can be saved during


the (normalized) Gram-Schmidt process.

Example 9.41. Find the QR factorization for A = [ 4 −1; 3 2 ].
Solution.

   
Ans: Q = [ 0.8 −0.6; 0.6 0.8 ],  R = [ 5 0.4; 0 2.2 ]
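Example 9.41 can be cross-checked with NumPy's built-in QR routine. Note that np.linalg.qr does not enforce positive diagonal entries of R, so its factors may differ from Algorithm 9.40 by sign flips; the sketch below (a verification aid, not part of the text) corrects the signs, assuming R is invertible.

import numpy as np

A = np.array([[4., -1.], [3., 2.]])
Q, R = np.linalg.qr(A)                 # signs may differ from Algorithm 9.40
S = np.diag(np.sign(np.diag(R)))       # flip signs so that diag(R) > 0
Q, R = Q @ S, S @ R

print(Q)          # [[ 0.8 -0.6], [ 0.6  0.8]]
print(R)          # [[ 5.   0.4], [ 0.   2.2]]
print(Q.T @ A)    # equals R, cf. (9.26)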

Alternative Calculations of Least-Squares Solutions


Recall: (Theorem 7.5, p.197) Let A ∈ Rm×n , m ≥ n and rank (A) = n.
Then the equation Ax = b has a unique LS solution for each b ∈ Rm :

AT Ax = AT b ⇒ b = (AT A)−1 AT b,
x

which is the Method of Normal Equations. The matrix

A+ := (AT A)−1 AT (9.27)

is called the pseudoinverse of A.

Theorem 9.42. Given an m × n matrix A with linearly independent


columns, let A = QR be a QR factorization of A, as in Algorithm 9.40.
Then, for each b ∈ Rm , the equation Ax = b has a unique LS solution,
given by
x̂ = R^(−1) Q^T b. (9.28)

Proof. Let A = QR. Then the pseudoinverse of A:

(A^T A)^(−1) A^T = ((QR)^T QR)^(−1) (QR)^T = (R^T Q^T Q R)^(−1) R^T Q^T
               = R^(−1) (R^T)^(−1) R^T Q^T = R^(−1) Q^T, (9.29)

which completes the proof.
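A short NumPy sketch of the QR-based least-squares solution (9.28), on a small random problem; the data below are illustrative and not from the text.

import numpy as np
rng = np.random.default_rng(0)

A = rng.standard_normal((6, 3))        # tall matrix, (almost surely) independent columns
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is 6x3, R is 3x3
xhat = np.linalg.solve(R, Q.T @ b)     # LS solution, x = R^{-1} Q^T b

# agrees with the normal-equations solution (A^T A) x = A^T b
print(np.allclose(xhat, np.linalg.solve(A.T @ A, A.T @ b)))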


Self-study 9.43. Find the LS solution of Ax = b for
   
A = [ 1 3 5; 1 1 0; 1 1 2; 1 3 3 ] and b = (3, 5, 7, −3)^T, where
A = QR = [ 1/2 1/2 1/2; 1/2 −1/2 −1/2; 1/2 −1/2 1/2; 1/2 1/2 −1/2 ] [ 2 4 5; 0 2 3; 0 0 2 ]

Solution.

Ans: Q^T b = (6, −6, 4) and x̂ = (10, −6, 2)

9.5. QR Iteration for Finding Eigenvalues

Algorithm 9.44. (QR Iteration) Let A ∈ Rn×n .


set A0 = A and U0 = I
for k = 1, 2, · · · do
(a) Ak−1 = Qk Rk ; % QR factorization
(b) Ak = Rk Qk ; (9.30)
(c) Uk = Uk−1 Qk ; % Update transformation matrix
end for
set T := A∞ and U := U∞

Remark 9.45. It follows from (a) and (b) of Algorithm 9.44 that

Ak = Rk Qk = QTk Ak−1 Qk , (9.31)

and therefore
Ak = Rk Qk = Qk^T Ak−1 Qk = Qk^T Qk−1^T Ak−2 Qk−1 Qk = · · ·
   = Qk^T Qk−1^T · · · Q1^T A0 (Q1 Q2 · · · Qk),  with Uk = Q1 Q2 · · · Qk. (9.32)

The above converges to

T = U^T A U. (9.33)

Claim 9.46.
• Algorithm 9.44 produces an upper triangular matrix T , with its
diagonals being eigenvalues of A, and an orthogonal matrix U such
that
A = U T U^T, (9.34)
which is called the Schur decomposition of A.
• If A is symmetric, then T becomes a diagonal matrix of eigenvalues
of A and U is the collection of corresponding eigenvectors.
   
Example 9.47. Let A = [ 3 1 3; 1 6 4; 6 7 8 ] and B = [ 4 −1 1; −1 3 −2; 1 −2 3 ]. Apply the QR
algorithm, Algorithm 9.44, to find their Schur decompositions.
Solution. You will solve this example once more implementing the QR
iteration algorithm in Python; see Exercise 9.7.
qr_iteration.m
1 function [T,U,iter] = qr_iteration(A)
2 % It produces the Schur decomposition: A = U*T*U^T
3 % T: upper triangular, with diagonals being eigenvalues of A
4 % U: orthogonal
5 % Once A is symmetric,
6 % T becomes diagonal && U contains eigenvectors of A
7

8 T = A; U = eye(size(A));
9

10 % for stopping
11 D0 = diag(T); change = 1;
12 tol = 10^-15; iter=0;
13

14 %%-----------------
15 while change>tol
16 [Q,R] = qr(T);
17 T = R*Q;
18 U = U*Q;
19

20 % for stopping
21 iter= iter+1;
22 D=diag(T); change=norm(D-D0); D0=D;
23 %if iter<=8, fprintf('A_%d =\n',iter); disp(T); end
24 end

We may call it as
call_qr_iteration.m
1 A =[3 1 3; 1 6 4; 6 7 8];
2 [T1,U1,iter1] = qr_iteration(A)
3 U1*T1*U1'
4 [V1,D1] = eig(A)
5

6 B =[4 -1 1; -1 3 -2; 1 -2 3];


7 [T2,U2,iter2] = qr_iteration(B)
8 U2*T2*U2'
9 [V2,D2] = eig(B)

A B
1 T1 = 1 T2 =
2 13.8343 1.0429 -4.0732 2 6.0000 -0.0000 0.0000
3 0.0000 3.3996 0.5668 3 -0.0000 3.0000 -0.0000
4 0.0000 -0.0000 -0.2339 4 0.0000 -0.0000 1.0000
5 U1 = 5 U2 =
6 0.2759 -0.5783 -0.7677 6 0.5774 0.8165 -0.0000
7 0.4648 0.7794 -0.4201 7 -0.5774 0.4082 0.7071
8 0.8414 -0.2409 0.4838 8 0.5774 -0.4082 0.7071
9 iter1 = 9 iter2 =
10 26 10 28
11 ans = 11 ans =
12 3.0000 1.0000 3.0000 12 4.0000 -1.0000 1.0000
13 1.0000 6.0000 4.0000 13 -1.0000 3.0000 -2.0000
14 6.0000 7.0000 8.0000 14 1.0000 -2.0000 3.0000
15 15

16 %---- [V1,D1] = eig(A) 16 %---- [V2,D2] = eig(B)


17 V1 = 17 V2 =
18 -0.2759 -0.5630 0.6029 18 -0.0000 0.8165 0.5774
19 -0.4648 -0.3805 -0.7293 19 0.7071 0.4082 -0.5774
20 -0.8414 0.7337 0.3234 20 0.7071 -0.4082 0.5774
21 D1 = 21 D2 =
22 13.8343 0 0 22 1.0000 0 0
23 0 -0.2339 0 23 0 3.0000 0
24 0 0 3.3996 24 0 0 6.0000

Convergence Check

1 % A =[3 1 3; 1 6 4; 6 7 8]; 1 % B =[4 -1 1; -1 3 -2; 1 -2 3];


2 A_1 = 2 A_1 =
3 12.0652 4.4464 4.2538 3 5.0000 -1.3765 -0.3244
4 3.3054 5.0728 0.9038 4 -1.3765 3.8421 0.6699
5 0.2644 0.0192 -0.1380 5 -0.3244 0.6699 1.1579
6 A_2 = 6 A_2 =
7 13.6436 2.0123 -4.1057 7 5.6667 -0.9406 0.0640
8 0.9772 3.5927 0.1704 8 -0.9406 3.3226 -0.1580
9 0.0050 -0.0063 -0.2362 9 0.0640 -0.1580 1.0108
10 A_3 = 10 A_3 =
11 13.8038 1.2893 4.0854 11 5.9091 -0.5141 -0.0112
12 0.2459 3.4300 -0.4708 12 -0.5141 3.0899 0.0454
13 0.0001 -0.0005 -0.2338 13 -0.0112 0.0454 1.0010
14 A_4 = 14 A_4 =
15 13.8279 1.1035 -4.0765 15 5.9767 -0.2631 0.0019
16 0.0607 3.4060 0.5430 16 -0.2631 3.0232 -0.0145
17 0.0000 -0.0000 -0.2339 17 0.0019 -0.0145 1.0001
18 A_5 = 18 A_5 =
19 13.8328 1.0578 4.0740 19 5.9942 -0.1323 -0.0003
20 0.0149 3.4011 -0.5609 20 -0.1323 3.0058 0.0048
21 0.0000 -0.0000 -0.2339 21 -0.0003 0.0048 1.0000
22 A_6 = 22 A_6 =
23 13.8339 1.0465 -4.0734 23 5.9985 -0.0663 0.0001
24 0.0037 3.3999 0.5653 24 -0.0663 3.0015 -0.0016
25 0.0000 -0.0000 -0.2339 25 0.0001 -0.0016 1.0000
26 A_7 = 26 A_7 =
27 13.8342 1.0438 4.0733 27 5.9996 -0.0331 -0.0000
28 0.0009 3.3997 -0.5664 28 -0.0331 3.0004 0.0005
29 0.0000 -0.0000 -0.2339 29 -0.0000 0.0005 1.0000
30 A_8 = 30 A_8 =
31 13.8343 1.0431 -4.0732 31 5.9999 -0.0166 0.0000
32 0.0002 3.3996 0.5667 32 -0.0166 3.0001 -0.0002
33 0.0000 -0.0000 -0.2339 33 0.0000 -0.0002 1.0000

Exercises for Chapter 9

9.1. Suppose y is orthogonal to u and v. Prove that y is orthogonal to every w in Span{u, v}.
       
9.2. Let u1 = (3, −3, 0)^T, u2 = (2, 2, −1)^T, u3 = (1, 1, 4)^T, and x = (5, −3, 1)^T.

(a) Check if {u1 , u2 , u3 } is an orthogonal basis for R3 .


(b) Express x as a linear combination of {u1 , u2 , u3 }.
Ans: x = (4/3)u1 + (1/3)u2 + (1/3)u3
9.3. Let U and V be n × n orthogonal matrices. Prove that U V is an orthogonal matrix.
Hint : See Definition 9.25, where U −1 = U T ⇔ U T U = I.
9.4. Find the best approximation to z by vectors of the form c1 v1 + c2 v2 .
           
(a) z = (3, −1, 1, 13)^T, v1 = (1, −2, −1, 2)^T, v2 = (−4, 1, 0, 3)^T
(b) z = (3, −7, 2, 3)^T, v1 = (2, −1, −3, 1)^T, v2 = (1, 1, 0, −1)^T
Ans: (a) ẑ = 3v1 + v2
 
9.5. Find an orthogonal basis for the column space of the matrix
[ −1 6 6; 3 −8 3; 1 −2 6; 1 −4 −3 ]
Ans: v3 = (1, 1, −3, 1)
 
9.6. Implement a code for the problem. Let A = [ −10 13 7 −11; 2 1 −5 3; −6 3 13 −3; 16 −16 −2 5; 2 1 −5 −7 ]

(a) Use the Gram-Schmidt process to produce an orthogonal basis for the column
space of A.
(b) Use Algorithm 9.40 to produce a QR factorization of A.
(c) Apply the QR iteration to find eigenvalues of A(1:4,1:4).
Ans: (a) v4 = (0, 5, 0, 0, −5)
9.7. Solve Example 9.47 by implementing the QR iteration algorithm in Python; you may
use qr_iteration.m, p.254.
Chapter 10

Introduction to Machine Learning

In this chapter, you will learn:


• What machine learning (ML) is
• Popular ML classifiers
• Scikit-Learn: A Python ML library
• A machine learning modelcode

This chapter is a brief introduction to ML; I hope it will be useful.

Contents of Chapter 10
10.1.What is Machine Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
10.2.Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.3.Popular Machine Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.4.Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
10.5.Scikit-Learn: A Python Machine Learning Library . . . . . . . . . . . . . . . . . . . . . 292
A Machine Learning Modelcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Exercises for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301


10.1. What is Machine Learning?


The Three Tasks (T3)
Most real-world problems are expressed as
f (x) = y, (10.1)

where f is an operation, x denotes the input, and y is the output.

1. Known: (f, x) ⇒ y: simple to get


2. Known: (f, y) ⇒ x: solve the equation (10.1) (10.2)
3. Known: (x, y) ⇒ f : approximated; Machine Learning

Definition 10.1. Machine Learning (ML)

• ML algorithms are algorithms that can learn from data (input)


and produce functions/models (output).
• Machine learning is the science of getting machines to act, without
functions/models being explicitly programmed to do so.

Example 10.2. There are different types of ML:


• Supervised learning: e.g., classification, regression ⇐ Labeled data
• Unsupervised learning: e.g., clustering ⇐ No labels
• Reinforcement learning: e.g., chess engine ⇐ Reward system
The most popular type is supervised learning.

A Belief in Machine Learning

The average is correct (at least, acceptable)


• “Guess The Weight of the Ox” Competition
– Questioner: Francis Galton – a cousin of Charles Darwin
– A county fair, Plymouth, England, 1906 (reported by Galton in Nature, 1907)
– The annual West of England Fat Stock and Poultry Exhibition
• 800 people gathered: mean=1197 lbs; real=1198 lbs

Supervised Learning
Assumption. Given a data set {(xi , yi )}, where yi are labels,
there exists a relation f : X → Y .
Supervised learning:
(
Given : A training data {(xi , yi ) | i = 1, · · · , N }
(10.3)
Find : fb : X → Y , a good approximation to f

Figure 10.1: Supervised learning and prediction.

Figure 10.2: Classification and regression.



Unsupervised Learning
Note:
• In supervised learning, we know the right answer beforehand
when we train our model, and in reinforcement learning, we de-
fine a measure of reward for particular actions by the agent.
• In unsupervised learning, however, we are dealing with unla-
beled data or data of unknown structure. Using unsupervised learn-
ing techniques, we are able to explore the structure of our data
to extract meaningful information, without the guidance of a known
outcome variable or reward function.
• Clustering is an exploratory data analysis technique that allows
us to organize a pile of information into meaningful subgroups
(clusters) without having any prior knowledge of their group mem-
berships.

Figure 10.3: Clustering.



Why is ML not Always Simple?


Major Issues in ML
1. Overfitting: Fitting training data too tightly
• Difficulties: Accuracy drops significantly for test data
• Remedies:
– More training data (often, impossible)
– Early stopping; feature selection
– Regularization; ensembling (multiple classifiers)

2. Curse of Dimensionality: The feature space becomes increas-


ingly sparse for an increasing number of dimensions (of a fixed-
size training dataset)
• Difficulties: Larger error, more computation time;
Data points appear equidistant from all the others
• Remedies
– More training data (often, impossible)
– Dimensionality reduction (e.g., Feature selection, PCA)

3. Multiple Local Minima Problem


Training often involves minimizing an objective function.

• Difficulties: Larger error, unrepeatable


• Remedies
– Gaussian sailing; regularization
– Careful access to the data (e.g., mini-batch)

4. Interpretability:
Although ML has come very far, researchers still don’t know exactly
how some algorithms (e.g., deep nets) work.
• If we don’t know how training nets actually work, how do we make
any real progress?
5. One-Shot Learning:
We still haven’t been able to achieve one-shot learning. Traditional
gradient-based networks need a huge amount of data, and are
often in the form of extensive iterative training.
• Instead, we should find a way to enable neural networks to
learn, using just a few examples.

10.2. Binary Classifiers


General Remarks
A binary classifier is a function which can decide whether or not an
input vector belongs to a specific class (e.g., spam/ham).
• Binary classification often refers to those classification tasks that
have two class labels. (two-class classification)
• It is a type of linear classifier, i.e. a classification algorithm that
makes its predictions based on a linear predictor function.
Examples: Perceptron [10], Adaline, Logistic Regression, Support Vec-
tor Machine [4]

Binary classifiers are artificial neurons.

Note: Neurons are interconnected nerve cells, involved in the pro-


cessing and transmitting of chemical and electrical signals. Such a
nerve cell can be described as a simple logic gate with binary outputs;
• multiple signals arrive at the dendrites,
• they are integrated into the cell body,
• and if the accumulated signal exceeds a certain threshold, an output
signal is generated that will be passed on by the axon.

Figure 10.4: A schematic description of a neuron.



Definition 10.3. Let {(x(i) , y (i) )} be labeled data, with x(i) ∈ Rd and
y (i) ∈ {0, 1}. A binary classifier finds a hyperplane in Rd that sepa-
rates data points X = {x^(i)} into two classes:

Figure 10.5: A binary classifier, finding a hyperplane in Rd .

Observation 10.4. Binary Classifiers


• The labels {0, 1} are chosen for simplicity.
• A hyperplane can be formulated by a normal vector w ∈ Rd and a
shift (bias) w0 :
z = w T x + w0 . (10.4)
• To learn w and w0 , you may formulate a cost function to minimize,
such as the Sum of Squared Errors (SSE):
J(w) = (1/2) Σ_i ( y^(i) − φ(w^T x^(i) + w0) )^2, (10.5)

where φ(z) is an activation function.



10.2.1. The Perceptron Algorithm


The perceptron is a binary classifier of supervised learning.
• 1957: Perceptron algorithm is invented by Frank Rosenblatt, Cornell
Aeronautical Laboratory
– Built on work of Hebbs (1949)
– Improved by Widrow-Hoff (1960): Adaline

• 1960: Perceptron Mark 1 Computer – hardware implementation


• 1970’s: Learning methods for two-layer neural networks

Definition 10.5. We can pose the perceptron as a binary classifier,


in which we refer to our two classes as 1 (positive class) and −1 (negative
class) for simplicity.
• Input values: x = (x1 , x2 , · · · , xd )T
• Weight vector: w = (w1 , w2 , · · · , wd )T
• Net input: z = w1 x1 + w2 x2 + · · · + wd xd
• Activation function: φ(z), defined by
φ(z) = 1 if z ≥ θ, and φ(z) = −1 otherwise, (10.6)

where θ is a threshold.

For simplicity, we can bring the threshold θ in (10.6) to the left side of
the equation; define a weight-zero as w0 = −θ and reformulate as
φ(z) = 1 if z ≥ 0, and φ(z) = −1 otherwise, where z = w^T x = w0 + w1 x1 + · · · + wd xd . (10.7)

In the ML literature, the variable w0 is called the bias.


The equation w0 + w1 x1 + · · · + wd xd = 0 represents a hyperplane
in Rd , while w0 decides the intercept.

The Perceptron Learning Rule


The whole idea behind the Rosenblatt’s thresholded perceptron model
is to use a reductionist approach to mimic how a single neuron in the brain
works: it either fires or it doesn’t.
Algorithm 10.6. Rosenblatt’s Initial Perceptron Rule
1. Initialize the weights to 0 or small random numbers.
2. For each training sample x(i) ,
(a) Compute the output value yb(i) (:= φ(wT x(i) )).
(b) Update the weights.

The update of the weight vector w can be more formally written as:

w = w + ∆w, ∆w = η (y (i) − yb(i) ) x(i) ,


(10.8)
w0 = w0 + ∆w0 , ∆w0 = η (y (i) − yb(i) ),

where η is the learning rate, 0 < η < 1, y (i) is the true class label of the
i-th training sample, and yb(i) denotes the predicted class label.

Remark 10.7. A simple thought experiment for the perceptron


learning rule:
• Let the perceptron predict the class label correctly. Then y^(i) − ŷ^(i) = 0
so that the weights remain unchanged.
• Let the perceptron make a wrong prediction. Then
∆wj = η (y^(i) − ŷ^(i)) xj^(i) = ±2 η xj^(i)

so that the weight wj is pushed towards the direction of the positive


or negative target class, respectively.

Perceptron for Iris Dataset


perceptron.py
1 import numpy as np
2

3 class Perceptron():
4 def __init__(self, xdim, epoch=10, learning_rate=0.01):
5 self.epoch = epoch
6 self.learning_rate = learning_rate
7 self.weights = np.zeros(xdim + 1)
8

9 def activate(self, x):


10 net_input = np.dot(x,self.weights[1:])+self.weights[0]
11 return 1 if (net_input > 0) else 0
12

13 def fit(self, Xtrain, ytrain):


14 for k in range(self.epoch):
15 for x, y in zip(Xtrain, ytrain):
16 yhat = self.activate(x)
17 self.weights[1:] += self.learning_rate*(y-yhat)*x
18 self.weights[0] += self.learning_rate*(y-yhat)
19

20 def predict(self, Xtest):


21 yhat=[]
22 #for x in Xtest: yhat.append(self.activate(x))
23 [yhat.append(self.activate(x)) for x in Xtest]
24 return yhat
25

26 def score(self, Xtest, ytest):


27 count=0;
28 for x, y in zip(Xtest, ytest):
29 if self.activate(x)==y: count+=1
30 return count/len(ytest)
31

32 #-----------------------------------------------------
33 def fit_and_fig(self, Xtrain, ytrain):
34 wgts_all = []
35 for k in range(self.epoch):
36 for x, y in zip(Xtrain, ytrain):
37 yhat = self.activate(x)
38 self.weights[1:] += self.learning_rate*(y-yhat)*x
39 self.weights[0] += self.learning_rate*(y-yhat)
40 if k==0: wgts_all.append(list(self.weights))
41 return np.array(wgts_all)

Iris_perceptron.py
1 import numpy as np; import matplotlib.pyplot as plt
2 from sklearn.model_selection import train_test_split
3 from sklearn import datasets; #print(dir(datasets))
4 np.set_printoptions(suppress=True)
5 from perceptron import Perceptron
6

7 #-----------------------------------------------------------
8 data_read = datasets.load_iris(); #print(data_read.keys())
9 X = data_read.data;
10 y = data_read.target
11 targets = data_read.target_names; features = data_read.feature_names
12

13 N,d = X.shape; nclass=len(set(y));


14 print('N,d,nclass=',N,d,nclass)
15

16 #---- Take 2 classes in 2D ---------------------------------


17 X2 = X[y<=1]; y2 = y[y<=1];
18 X2 = X2[:,[0,2]]
19

20 #---- Train and Test ---------------------------------------


21 Xtrain, Xtest, ytrain, ytest = train_test_split(X2, y2,
22 random_state=None, train_size=0.7e0)
23 clf = Perceptron(X2.shape[1],epoch=2)
24 #clf.fit(Xtrain, ytrain);
25 wgts_all = clf.fit_and_fig(Xtrain, ytrain);
26 accuracy = clf.score(Xtest, ytest); print('accuracy =', accuracy)
27 #yhat = clf.predict(Xtest);

Figure 10.6: A part of Iris data (left) and the convergence of Perceptron iteration (right).

10.2.2. Adaline: ADAptive LInear NEuron


• (Widrow & Hoff, 1960)
• Weights are updated based on linear activation: e.g.,

φ(wT x) = wT x.

That is, φ is the identity function.


• Adaline algorithm is particularly interesting because it illustrates the
key concept of defining and minimizing continuous cost functions,
which will lay the groundwork for understanding more ad-
vanced machine learning algorithms for classification, such as lo-
gistic regression and support vector machines, as well as regression
models.
• Continuous cost functions allow the ML optimization to incorporate
advanced mathematical techniques such as calculus.

Figure 10.7: Perceptron vs. Adaline



Algorithm 10.8. Adaline Learning:


Given a dataset {(x(i) , y (i) ) | i = 1, 2, · · · , N }, learn
the weights w and bias b = w0 :
• Activation function: φ(z) = z (i.e., identity activation)
• Cost function: the SSE
J(w, b) = (1/2) Σ_{i=1}^{N} ( y^(i) − φ(z^(i)) )^2, (10.9)

where z (i) = wT x(i) + b and φ = I, the identity.

The dominant algorithm for the minimization of the cost function is the
Gradient Descent Method.

Algorithm 10.9. The Gradient Descent Method uses −∇J for the
search direction (update direction):

w = w + ∆w = w − η∇w J (w, b),


(10.10)
b = b + ∆b = b − η∇b J (w, b),

where η > 0 is the step length (learning rate).

Computation of ∇J for Adaline :


The partial derivatives of the cost function J w.r.t. wj and b read
∂J(w, b)/∂wj = − Σ_i ( y^(i) − φ(z^(i)) ) xj^(i),
∂J(w, b)/∂b = − Σ_i ( y^(i) − φ(z^(i)) ). (10.11)

Thus, with φ = I,
∆w = −η ∇w J(w, b) = η Σ_i ( y^(i) − φ(z^(i)) ) x^(i),
∆b = −η ∇b J(w, b) = η Σ_i ( y^(i) − φ(z^(i)) ). (10.12)

You will modify perceptron.py for Adaline; an implementation issue is


considered in Exercise 10.2, p.301.
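As a preview, one full-batch update of (10.12) can be written in a few vectorized lines. This is only a sketch of the gradient step, with names chosen here for illustration; it is not the complete Adaline class developed in Exercise 10.2.

import numpy as np

def adaline_step(w, b, X, y, eta=0.01):
    """One full-batch gradient-descent step for Adaline (phi = identity).
       X: (N,d) samples, y: (N,) labels, w: (d,) weights, b: scalar bias."""
    z = X @ w + b                     # net inputs z^(i)
    errors = y - z                    # y^(i) - phi(z^(i))
    w = w + eta * (X.T @ errors)      # Delta w, cf. (10.12)
    b = b + eta * errors.sum()        # Delta b, cf. (10.12)
    cost = 0.5 * (errors**2).sum()    # SSE cost (10.9)
    return w, b, cost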

Hyperparameters
Definition 10.10. In ML, a hyperparameter is a parameter whose
value is set before the learning process begins. Thus it is an algorithmic
parameter. Examples are
• The learning rate (η)
• The number of maximum epochs/iterations (n_iter)

Figure 10.8: Well-chosen learning rate vs. a large learning rate

Note: There are effective searching schemes to set the learning rate η
automatically.

Multi-class Classification

Figure 10.9: Classification for three classes.

One-versus-all (one-versus-rest) classification


Learning: learn 3 classifiers

• − vs {◦, +} ⇒ weights w−
• + vs {◦, −} ⇒ weights w+
• ◦ vs {+, −} ⇒ weights w◦

Prediction: for a new data sample x,
ŷ = arg max_{i ∈ {−,+,◦}} φ(wi^T x).

Figure 10.10: Three weights: w− , w+ , and w◦ .

OvA (OvR) is readily applicable for classification of general n classes, n ≥ 2.
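A rough sketch of the OvA prediction step, assuming the class-wise weight vectors have already been learned; the names below are illustrative.

import numpy as np

def ova_predict(W, b, x, phi=lambda z: z):
    """W: (n_classes, d) array whose rows are the one-vs-all weight vectors;
       b: (n_classes,) biases. Returns the index of the largest activation."""
    return int(np.argmax(phi(W @ x + b)))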



10.3. Popular Machine Learning Classifiers

Remark 10.11. (The standard logistic sigmoid function)


σ(x) = 1/(1 + e^(−x)) = e^x/(1 + e^x) (10.13)
• The standard logistic function is the solution of the simple first-
order non-linear ordinary differential equation
dy/dx = y(1 − y),   y(0) = 1/2. (10.14)
• It can be verified easily as
σ′(x) = [e^x (1 + e^x) − e^x · e^x]/(1 + e^x)^2 = e^x/(1 + e^x)^2 = σ(x)(1 − σ(x)). (10.15)

• σ′ is even: σ′(−x) = σ′(x).

• Rotational symmetry about (0, 1/2):
σ(x) + σ(−x) = 1/(1 + e^(−x)) + 1/(1 + e^x) = (2 + e^x + e^(−x))/(2 + e^x + e^(−x)) ≡ 1. (10.16)
• ∫ σ(x) dx = ∫ e^x/(1 + e^x) dx = ln(1 + e^x), which is known as the softplus
function in artificial neural networks. It is a smooth approximation
of the rectifier (an activation function) defined as
f(x) = x+ = max(x, 0). (10.17)

Figure 10.11: Popular activation functions: (left) The standard logistic sigmoid func-
tion and (right) the rectifier and softplus function.
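The identities in Remark 10.11 are easy to spot-check numerically; a small illustrative sketch:

import numpy as np

def sigmoid(x): return 1.0/(1.0 + np.exp(-x))

x = np.linspace(-5., 5., 101)
h = 1e-6
dsig = (sigmoid(x + h) - sigmoid(x - h))/(2*h)          # centered difference

print(np.allclose(dsig, sigmoid(x)*(1 - sigmoid(x))))   # (10.15): True
print(np.allclose(sigmoid(x) + sigmoid(-x), 1.0))       # (10.16): True
print(np.max(np.abs(np.log(1 + np.exp(x)) - np.maximum(x, 0))))  # softplus vs rectifier: ln 2, at x = 0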

10.3.1. Logistic Regression


Logistic regression is a probabilistic model.
• Logistic regression maximizes the likelihood of the parameter
w; in realization, it is similar to Adaline.
• The only difference is the activation function (the sigmoid function),
as illustrated in the figure:

Figure 10.12: Adaline vs. Logistic regression.

• The prediction (the output of the sigmoid function) is interpreted as


the probability of a particular sample belonging to class 1,

φ(z) = p(y = 1|x; w), (10.18)

given its features x parameterized by the weights w, z = wT x + b.



Derivation of the Logistic Cost Function


• Assume that the individual samples in our dataset are independent
of one another. Then we can define the likelihood L as
L(w) = P(y|x; w) = Π_{i=1}^{N} P(y^(i)|x^(i); w)
     = Π_{i=1}^{N} ( φ(z^(i)) )^{y^(i)} ( 1 − φ(z^(i)) )^{1−y^(i)}, (10.19)

where z^(i) = w^T x^(i) + b.


• In practice, it is easier to maximize the (natural) log of this equation,
which is called the log-likelihood function:
ℓ(w) = ln(L(w)) = Σ_{i=1}^{N} [ y^(i) ln(φ(z^(i))) + (1 − y^(i)) ln(1 − φ(z^(i))) ]. (10.20)

Algorithm 10.12. Logistic Regression Learning:


From data {(x(i) , y (i) )}, learn the weights w and bias b, with
• Activation function: φ(z) = σ(z), the logistic sigmoid function
• Cost function: The likelihood is maximized.
Based on the log-likelihood, we define the logistic cost function to
be minimized:
J(w, b) = Σ_i [ −y^(i) ln(φ(z^(i))) − (1 − y^(i)) ln(1 − φ(z^(i))) ], (10.21)
where z^(i) = w^T x^(i) + b.



Computation of ∇J for Logistic Regression :


Let’s start by calculating the partial derivative of the logistic cost function
with respect to the j–th weight, wj :
∂J(w, b)/∂wj = Σ_i [ −y^(i) · 1/φ(z^(i)) + (1 − y^(i)) · 1/(1 − φ(z^(i))) ] ∂φ(z^(i))/∂wj , (10.22)

where, using z^(i) = w^T x^(i) + b and (10.15),

∂φ(z^(i))/∂wj = φ′(z^(i)) ∂z^(i)/∂wj = φ(z^(i)) (1 − φ(z^(i))) xj^(i).

Thus, it follows from the above and (10.22) that

∂J(w, b)/∂wj = Σ_i [ −y^(i) (1 − φ(z^(i))) + (1 − y^(i)) φ(z^(i)) ] xj^(i)
             = − Σ_i [ y^(i) − φ(z^(i)) ] xj^(i)

and therefore
∇w J(w, b) = − Σ_i [ y^(i) − φ(z^(i)) ] x^(i). (10.23)

Similarly, one can get
∇b J(w, b) = − Σ_i [ y^(i) − φ(z^(i)) ]. (10.24)

Algorithm 10.13. Gradient descent learning for Logistic Regression is


formulated as
w := w + ∆w, b := b + ∆b, (10.25)
where η > 0 is the step length (learning rate) and
∆w = −η ∇w J(w, b) = η Σ_i [ y^(i) − φ(z^(i)) ] x^(i),
∆b = −η ∇b J(w, b) = η Σ_i [ y^(i) − φ(z^(i)) ]. (10.26)

Note: The above gradient descent rule for Logistic Regression is of the
same form as that of Adaline; see (10.12) on p. 272. The only difference
is the activation function φ.
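A minimal sketch of one gradient-descent step (10.26) for logistic regression; as noted, only the activation changes relative to Adaline. The names below are illustrative.

import numpy as np

def sigmoid(z): return 1.0/(1.0 + np.exp(-z))

def logistic_step(w, b, X, y, eta=0.05):
    """One full-batch step for logistic regression; labels y in {0,1}."""
    phi = sigmoid(X @ w + b)               # phi(z^(i))
    errors = y - phi                       # y^(i) - phi(z^(i))
    w = w + eta * (X.T @ errors)           # Delta w, cf. (10.26)
    b = b + eta * errors.sum()             # Delta b, cf. (10.26)
    cost = -(y*np.log(phi) + (1 - y)*np.log(1 - phi)).sum()   # (10.21)
    return w, b, cost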

10.3.2. Support Vector Machine


• Support vector machine (SVM), developed in 1995 by Cortes-
Vapnik [4], can be considered as an extension of the Percep-
tron/Adaline, which maximizes the margin.
• The rationale behind having decision boundaries with large margins
is that they tend to have a lower generalization error, whereas
models with small margins are more prone to overfitting.

Figure 10.13: Linear support vector machine.

To find an optimal hyperplane that maximizes the margin, let’s begin with
considering the positive and negative hyperplanes that are parallel to the
decision boundary:
w0 + wT x+ = 1,
(10.27)
w0 + wT x− = −1.
where w = [w1 , w2 , · · · , wd ]T . If we subtract those two linear equations from
each other, then we have
w · (x+ − x− ) = 2
and therefore
(w/‖w‖) · (x+ − x−) = 2/‖w‖. (10.28)

Note: w = [w1 , w2 , · · · , wd ]T is a normal vector to the decision boundary


(a hyperplane) so that the left side of (10.28) is the distance between the
positive and negative hyperplanes.

Maximizing the distance (margin) is equivalent to minimizing its reciprocal
(1/2)‖w‖, or minimizing (1/2)‖w‖^2.

Problem 10.14. The linear SVM is formulated as
min_{w, w0} (1/2)‖w‖^2, subject to
w0 + w^T x^(i) ≥ 1 if y^(i) = 1,
w0 + w^T x^(i) ≤ −1 if y^(i) = −1. (10.29)

The minimization problem in (10.29) can be solved by the method of La-


grange multipliers; See Appendices A.1, A.2, and A.3.

Remark 10.15. The constraints in Problem 10.14 can be written as

y (i) (w0 + wT x(i) ) − 1 ≥ 0, ∀ i. (10.30)

• The beauty of linear SVM is that if the data is linearly separable,


there is a unique global minimum value.
• An ideal SVM analysis should produce a hyperplane that completely
separates the vectors (cases) into two non-overlapping classes.
• However, perfect separation may not be possible, or it may result
in a model with so many cases that the model does not classify cor-
rectly.
• There are variations of the SVM:
– soft-margin classification, for noisy data
– nonlinear SVMs, kernel methods
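In practice, (10.29) and its soft-margin/kernel variants are rarely solved by hand; scikit-learn provides them through sklearn.svm.SVC. A rough usage sketch on the Iris data (the hyperparameter values are illustrative):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = SVC(kernel='linear', C=1.0)     # linear soft-margin SVM
clf.fit(Xtrain, ytrain)
print('accuracy:', clf.score(Xtest, ytest))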

10.3.3. k-Nearest Neighbors


The k-nearest neighbor (k-NN) classifier is a typical example of a
lazy learner.
• It is called lazy not because of its apparent simplicity, but because it
doesn’t learn a discriminative function from the training data,
but memorizes the training dataset instead.
• Analysis of the training data is delayed until a query is made to
the system.

Algorithm 10.16. (k-NN algorithm). The algorithm itself is fairly


straightforward and can be summarized by the following steps:
1. Choose the number k and a distance metric.
2. For the new sample, find the k-nearest neighbors.
3. Assign the class label by majority vote.

Figure 10.14: Illustration for how a new data point (?) is assigned the triangle class label,
based on majority voting, when k = 5.

k-NN: pros and cons


• Since it is memory-based, the classifier immediately adapts as we
collect new training data.
• (Prediction Cost) The computational complexity for classify-
ing new samples grows linearly with the number of samples in the
training dataset in the worst-case scenario.a
• Furthermore, we can’t discard training samples since no training
step is involved. Thus, storage space can become a challenge if we
are working with large datasets.
a
J. H. Friedman, J. L. Bentley, and R.A. Finkel (1977). An Algorithm for Finding Best Matches in
Logarithmic Expected Time, ACM transactions on Mathematical Software (TOMS), 3, no. 3, pp. 209–
226. The algorithm in the article is called the KD-tree.

k-NN: what to choose k and a distance metric?


• The right choice of k is crucial to find a good balance between
overfitting and underfitting.
(For sklearn.neighbors.KNeighborsClassifier, default n_neighbors = 5.)
• We also choose a distance metric that is appropriate for the features
in the dataset. (e.g., the simple Euclidean distance, along with data
standardization)
• Alternatively, we can choose the Minkowski distance:
d(x, z) = ‖x − z‖p := ( Σ_{i=1}^{m} |xi − zi |^p )^(1/p). (10.31)

(For sklearn.neighbors.KNeighborsClassifier, default p = 2.)
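A short scikit-learn sketch of k-NN with the (default) Minkowski metric, including the data standardization mentioned above; the dataset and split are chosen only for illustration.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(Xtrain)         # standardize: k-NN is distance-based
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(sc.transform(Xtrain), ytrain)
print('accuracy:', knn.score(sc.transform(Xtest), ytest))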



10.4. Neural Networks


Recall: The Perceptron (or, Adaline, Logistic Regression) is the simplest
artificial neuron that makes decisions by weighting up evidence.

Figure 10.15: A simplest artificial neuron.

Complex Neural Networks


• Obviously, a simple artificial neuron is not a complete model of human
decision-making!
• However, they can be used as building blocks for more complex neu-
ral networks.

Figure 10.16: A complex neural network.



10.4.1. A Simple Network to Classify Hand-written Digits: MNIST Dataset
• The problem of recognizing hand-written digits has two components:
segmentation and classification.

=⇒
Figure 10.17: Segmentation.

• We’ll focus on algorithmic components for the classification of individ-


ual digits.

MNIST dataset :
A modified subset of two datasets collected by NIST (US National Insti-
tute of Standards and Technology):
• Its first part contains 60,000 images (for training)
• The second part is 10,000 images (for test), each of which is in 28 × 28
grayscale pixels

A Simple Neural Network

Figure 10.18: A sigmoid network having a single hidden layer.



What the Neural Network Will Do: An Interpretation


• Let’s concentrate on the first output neuron, the one that is trying
to decide whether or not the input digit is a 0.
• It does this by weighing up evidence from the hidden layer of neurons.

• What are those hidden neurons doing?


– Let’s suppose for the sake of argument that the first neuron
in the hidden layer may detect whether or not the input image
contains

It can do this by heavily weighting the corresponding pixels,


and lightly weighting the other pixels.
– Similarly, let’s suppose that the second, third, and fourth
neurons in the hidden layer detect whether or not the input
image contains

– These four parts together make up a 0 image:

– Thus, if all four of these hidden neurons are firing, then we can
conclude that the digit is a 0.

Remark 10.17. Explainable AI (XAI)


• The above is a common interpretation for neural networks.
• It hardly helps end users understand XAI
• Neural networks may have been designed ineffectively.

Learning with Gradient Descent

• Data set {(x(i) , y(i) )}, i = 1, 2, · · · , N


(e.g., if an image x(k) depicts a 2, then y(k) = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0)T .)
• Cost function
C(W, B) = (1/(2N)) Σ_i ‖y^(i) − a(x^(i))‖^2, (10.32)

where W denotes the collection of all weights in the network, B all the
biases, and a(x(i) ) is the vector of outputs from the network when x(i)
is input.
• Gradient descent method
[W; B] ← [W; B] + [∆W; ∆B], (10.33)
where [∆W; ∆B] = −η [∇W C; ∇B C].

Note: To compute the gradient ∇C, we need to compute the gradients


∇C_{x^(i)} separately for each training input, x^(i), and then average them:
∇C = (1/N) Σ_i ∇C_{x^(i)}. (10.34)

Unfortunately, when the number of training inputs is very large, it


can take a long time, and learning thus occurs slowly. An idea called
stochastic gradient descent can be used to speed up learning.

Stochastic Gradient Descent


The idea is to estimate the gradient ∇C by computing ∇Cx(i) for a small
sample of randomly chosen training inputs. By averaging over this
small sample, it turns out that we can quickly get a good estimate of
the true gradient ∇C; this helps speed up gradient descent, and thus
learning.

• Pick out a small number of randomly chosen training inputs (m ≪ N):
x̃^(1), x̃^(2), · · · , x̃^(m),

which we refer to as a mini-batch.


• Average the ∇C_{x̃^(k)} to approximate the gradient ∇C. That is,
(1/m) Σ_{k=1}^{m} ∇C_{x̃^(k)} ≈ ∇C := (1/N) Σ_i ∇C_{x^(i)}. (10.35)

• For classification of hand-written digits for the MNIST dataset, you


may choose: batch_size = 10.

Note: In practice, you can implement the stochastic gradient descent as


follows. For an epoch,
• Shuffle the dataset
• For each m samples (selected from the beginning), update (W , B)
using the approximate gradient (10.35).
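A rough, framework-agnostic sketch of that shuffling/mini-batch loop; grad_fn is a placeholder for the averaged-gradient computation (10.35) of whatever model is being trained, so the names here are assumptions made only for illustration.

import numpy as np

def sgd(params, data, grad_fn, eta=3.0, batch_size=10, epochs=30):
    """Generic mini-batch SGD. data: list of (x, y) pairs;
       grad_fn(params, batch) must return the gradient averaged over the batch."""
    data = list(data)
    for epoch in range(epochs):
        np.random.shuffle(data)                               # shuffle the dataset
        for k in range(0, len(data), batch_size):
            batch = data[k:k+batch_size]                      # a mini-batch of size m
            params = params - eta * grad_fn(params, batch)    # update (W, B)
    return params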

10.4.2. Implementation for MNIST Digits Dataset [9]


network.py
1 """
2 network.py (by Michael Nielsen)
3 ~~~~~~~~~~
4 A module to implement the stochastic gradient descent learning
5 algorithm for a feedforward neural network. Gradients are calculated
6 using backpropagation. """
7 #### Libraries
8 # Standard library
9 import random
10 # Third-party libraries
11 import numpy as np
12

13 class Network(object):
14 def __init__(self, sizes):
15 """The list ``sizes`` contains the number of neurons in the
16 respective layers of the network. For example, if the list
17 was [2, 3, 1] then it would be a three-layer network, with the
18 first layer containing 2 neurons, the second layer 3 neurons,
19 and the third layer 1 neuron. """
20

21 self.num_layers = len(sizes)
22 self.sizes = sizes
23 self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
24 self.weights = [np.random.randn(y, x)
25 for x, y in zip(sizes[:-1], sizes[1:])]
26

27 def feedforward(self, a):


28 """Return the output of the network if ``a`` is input."""
29 for b, w in zip(self.biases, self.weights):
30 a = sigmoid(np.dot(w, a)+b)
31 return a
32

33 def SGD(self, training_data, epochs, mini_batch_size, eta,


34 test_data=None):
35 """Train the neural network using mini-batch stochastic
36 gradient descent. The ``training_data`` is a list of tuples
37 ``(x, y)`` representing the training inputs and the desired
38 outputs. """
39

40 if test_data: n_test = len(test_data)


41 n = len(training_data)
42 for j in range(epochs):
43 random.shuffle(training_data)
44 mini_batches = [
45 training_data[k:k+mini_batch_size]
46 for k in range(0, n, mini_batch_size)]

47 for mini_batch in mini_batches:


48 self.update_mini_batch(mini_batch, eta)
49 if test_data:
50 print("Epoch {0}: {1} / {2}".format(
51 j, self.evaluate(test_data), n_test))
52 else:
53 print("Epoch {0} complete".format(j))
54

55 def update_mini_batch(self, mini_batch, eta):


56 """Update the network's weights and biases by applying
57 gradient descent using backpropagation to a single mini batch.
58 The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
59 is the learning rate."""
60 nabla_b = [np.zeros(b.shape) for b in self.biases]
61 nabla_w = [np.zeros(w.shape) for w in self.weights]
62 for x, y in mini_batch:
63 delta_nabla_b, delta_nabla_w = self.backprop(x, y)
64 nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
65 nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
66 self.weights = [w-(eta/len(mini_batch))*nw
67 for w, nw in zip(self.weights, nabla_w)]
68 self.biases = [b-(eta/len(mini_batch))*nb
69 for b, nb in zip(self.biases, nabla_b)]
70

71 def backprop(self, x, y):


72 """Return a tuple ``(nabla_b, nabla_w)`` representing the
73 gradient for the cost function C_x. ``nabla_b`` and
74 ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
75 to ``self.biases`` and ``self.weights``."""
76 nabla_b = [np.zeros(b.shape) for b in self.biases]
77 nabla_w = [np.zeros(w.shape) for w in self.weights]
78 # feedforward
79 activation = x
80 activations = [x] #list to store all the activations, layer by layer
81 zs = [] # list to store all the z vectors, layer by layer
82 for b, w in zip(self.biases, self.weights):
83 z = np.dot(w, activation)+b
84 zs.append(z)
85 activation = sigmoid(z)
86 activations.append(activation)
87 # backward pass
88 delta = self.cost_derivative(activations[-1], y) * \
89 sigmoid_prime(zs[-1])
90 nabla_b[-1] = delta
91 nabla_w[-1] = np.dot(delta, activations[-2].transpose())
92

93 for l in range(2, self.num_layers):



94 z = zs[-l]
95 sp = sigmoid_prime(z)
96 delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
97 nabla_b[-l] = delta
98 nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
99 return (nabla_b, nabla_w)
100

101 def evaluate(self, test_data):


102 test_results = [(np.argmax(self.feedforward(x)), y)
103 for (x, y) in test_data]
104 return sum(int(x == y) for (x, y) in test_results)
105

106 def cost_derivative(self, output_activations, y):


107 """Return the vector of partial derivatives \partial C_x /
108 \partial a for the output activations."""
109 return (output_activations-y)
110

111 #### Miscellaneous functions


112 def sigmoid(z):
113 return 1.0/(1.0+np.exp(-z))
114

115 def sigmoid_prime(z):


116 return sigmoid(z)*(1-sigmoid(z))

The code is executed using


Run_network.py
1 import mnist_loader
2 training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
3

4 import network
5 n_neurons = 20
6 net = network.Network([784 , n_neurons, 10])
7

8 n_epochs, batch_size, eta = 30, 10, 3.0


9 net.SGD(training_data , n_epochs, batch_size, eta, test_data = test_data)

len(training_data)=50000, len(validation_data)=10000, len(test_data)=10000



Validation Accuracy
1 Epoch 0: 9006 / 10000
2 Epoch 1: 9128 / 10000
3 Epoch 2: 9202 / 10000
4 Epoch 3: 9188 / 10000
5 Epoch 4: 9249 / 10000
6 ...
7 Epoch 25: 9356 / 10000
8 Epoch 26: 9388 / 10000
9 Epoch 27: 9407 / 10000
10 Epoch 28: 9410 / 10000
11 Epoch 29: 9428 / 10000

Accuracy Comparisons
• scikit-learn’s SVM classifier using the default settings: 9435/10000
• A well-tuned SVM: ≈98.5%
• Well-designed Convolutional NN (CNN):
9979/10000 (only 21 missed!)

Note: For well-designed neural networks, the performance is close


to human-equivalent, and is arguably better, since quite a few of
the MNIST images are difficult even for humans to recognize with confi-
dence, e.g.,

Figure 10.19: MNIST images difficult even for humans to recognize.

XAI ≈ (Neural Network Design)


• The above neural network converges slowly. Why?
• Can we design a new (explainable) form of neural network
– which converges in 2-3 epochs
– to a model showing a better accuracy ?

10.5. Scikit-Learn: A Python Machine Learning Library
Scikit-learn is one of the most useful and robust libraries for ma-
chine learning.
• It provides various tools for machine learning and statistical
modeling, including
– preprocessing,
– classification, regression, clustering, and
– ensemble methods.
• This library is built upon Numpy, SciPy, Matplotlib, and Pandas.

• Prerequisites: The following are required/recommended before we start using scikit-learn.
– Python 3
– Numpy, SciPy, Matplotlib
– Pandas (data analysis)
– Seaborn (visualization)
• Scikit-Learn Installation: For example (on Ubuntu),
pip install -U scikit-learn
sudo apt-get install python3-sklearn python3-sklearn-lib

Five Main Steps, in Machine Learning

1. Selection of features
2. Choosing a performance metric
3. Choosing a classifier and optimization algorithm
4. Evaluating the performance of the model
5. Tuning the algorithm

In practice:
• Each algorithm has its own characteristics and is based on certain assumptions.
• No Free Lunch Theorem: No single classifier works best across all possible scenarios.
• Best Model: It is always recommended that you compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem.

Why Scikit-Learn?
• Nice documentation and usability
• The library covers most machine-learning tasks:
– Preprocessing modules
– Algorithms
– Analysis tools
• Robust Model: Given a dataset, you may
(a) Compare algorithms
(b) Build an ensemble model
• Scikit-learn scales to most data problems

⇒ Easy-to-use, convenient, and powerful enough



A Simple Example Code


iris_sklearn.py
1 #------------------------------------------------------
2 # Load Data
3 #------------------------------------------------------
4 from sklearn import datasets
5 # dir(datasets); load_iris, load_digits, load_breast_cancer, load_wine, ...
6

7 iris = datasets.load_iris()
8

9 feature_names = iris.feature_names
10 target_names = iris.target_names
11 print("## feature names:", feature_names)
12 print("## target names :", target_names)
13 print("## set(iris.target):", set(iris.target))
14

15 #------------------------------------------------------
16 # Create "model instances"
17 #------------------------------------------------------
18 from sklearn.linear_model import LogisticRegression
19 from sklearn.neighbors import KNeighborsClassifier
20

21 LR = LogisticRegression(max_iter = 1000)
22 KNN = KNeighborsClassifier(n_neighbors = 5)
23

24 #------------------------------------------------------
25 # Split, Train, and Test
26 #------------------------------------------------------
27 import numpy as np
28 from sklearn.model_selection import train_test_split
29

30 X = iris.data; y = iris.target
31 iter = 100; Acc = np.zeros([iter,2])
32

33 for i in range(iter):
34 X_train, X_test, y_train, y_test = train_test_split(
35 X, y, test_size=0.3, random_state=i, stratify=y)
36 LR.fit(X_train, y_train); Acc[i,0] = LR.score(X_test, y_test)
37 KNN.fit(X_train, y_train); Acc[i,1] = KNN.score(X_test, y_test)
38

39 acc_mean = np.mean(Acc,axis=0)
40 acc_std = np.std(Acc,axis=0)
41 print('## iris.Accuracy.LR : %.4f +- %.4f' %(acc_mean[0],acc_std[0]))
42 print('## iris.Accuracy.KNN: %.4f +- %.4f' %(acc_mean[1],acc_std[1]))
43

44 #------------------------------------------------------
45 # New Sample ---> Predict
46 #------------------------------------------------------
47 sample = [[5, 3, 2, 4],[4, 3, 3, 6]];
48 print('## New sample =',sample)
49

50 predL = LR.predict(sample); predK = KNN.predict(sample)


51 print(" ## sample.LR.predict :",target_names[predL])
52 print(" ## sample.KNN.predict:",target_names[predK])

Output
1 ## feature names: ['sepal length (cm)', 'sepal width (cm)',
2 'petal length (cm)', 'petal width (cm)']
3 ## target names : ['setosa' 'versicolor' 'virginica']
4 ## set(iris.target): {0, 1, 2}
5 ## iris.Accuracy.LR : 0.9631 +- 0.0240
6 ## iris.Accuracy.KNN: 0.9658 +- 0.0202
7 ## New sample = [[5, 3, 2, 4], [4, 3, 3, 6]]
8 ## sample.LR.predict : ['setosa' 'virginica']
9 ## sample.KNN.predict: ['versicolor' 'virginica']

In Scikit-Learn, particularly with ensembling, you can finish most machine learning tasks conveniently and easily.

A Machine Learning Modelcode: Scikit-Learn Comparisons and Ensembling

In machine learning, you can write a code easily and effectively using the following modelcode. It is also useful for algorithm comparisons and ensembling. You may download
https://skim.math.msstate.edu/LectureNotes/data/Machine-Learning-Modelcode.PY.tar.
Machine_Learning_Model.py
1 import numpy as np; import pandas as pd; import time
2 import seaborn as sbn; import matplotlib.pyplot as plt
3 from sklearn.model_selection import train_test_split
4 from sklearn import datasets
5 np.set_printoptions(suppress=True)
6

7 #=====================================================================
8 # Upload a Dataset: print(dir(datasets))
9 # load_iris, load_wine, load_breast_cancer, ...
10 #=====================================================================
11 data_read = datasets.load_iris(); #print(data_read.keys())
12

13 X = data_read.data
14 y = data_read.target
15 dataname = data_read.filename
16 targets = data_read.target_names
17 features = data_read.feature_names
18

19 #---------------------------------------------------------------------
20 # SETTING
21 #---------------------------------------------------------------------
22 N,d = X.shape; nclass=len(set(y));
23 print('DATA: N, d, nclass =',N,d,nclass)
24 rtrain = 0.7e0; run = 50; CompEnsm = 2;
25

26 def multi_run(clf,X,y,rtrain,run):
27 t0 = time.time(); acc = np.zeros([run,1])
28 for it in range(run):
29 Xtrain, Xtest, ytrain, ytest = train_test_split(
30 X, y, train_size=rtrain, random_state=it, stratify = y)
31 clf.fit(Xtrain, ytrain);
32 acc[it] = clf.score(Xtest, ytest)
33 etime = time.time()-t0
34 return np.mean(acc)*100, np.std(acc)*100, etime # accmean,acc_std,etime

35

36 #=====================================================================
37 # My Classifier
38 #=====================================================================
39 from myclf import * # My Classifier = MyCLF()
40 if 'MyCLF' in locals():
41 accmean, acc_std, etime = multi_run(MyCLF(mode=1),X,y,rtrain,run)
42

43 print('%s: MyCLF() : Acc.(mean,std) = (%.2f,%.2f)%%; E-time= %.5f'


44 %(dataname,accmean,acc_std,etime/run))
45

46 #=====================================================================
47 # Scikit-learn Classifiers, for Comparisions && Ensembling
48 #=====================================================================
49 if CompEnsm >= 1:
50 exec(open("sklearn_classifiers.py").read())

myclf.py
1 import numpy as np
2 from sklearn.base import BaseEstimator, ClassifierMixin
3 from sklearn.tree import DecisionTreeClassifier
4

5 class MyCLF(BaseEstimator, ClassifierMixin): #a child class


6 def __init__(self, mode=0, learning_rate=0.01):
7 self.mode = mode
8 self.learning_rate = learning_rate
9 self.clf = DecisionTreeClassifier(max_depth=5)
10 if self.mode==1: print('MyCLF() = %s' %(self.clf))
11

12 def fit(self, X, y):


13 self.clf.fit(X, y)
14

15 def predict(self, X):


16 return self.clf.predict(X)
17

18 def score(self, X, y):


19 return self.clf.score(X, y)

Note: Replace DecisionTreeClassifier() with your own classifier.
• The classifier must be implemented as a child class if it is to be used in ensembling.

sklearn_classifiers.py
1 #=====================================================================
2 # Required: X, y, multi_run [dataname, rtrain, run, CompEnsm]
3 #=====================================================================
4 from sklearn.preprocessing import StandardScaler
5 from sklearn.datasets import make_moons, make_circles, make_classification
6 from sklearn.neural_network import MLPClassifier
7 from sklearn.neighbors import KNeighborsClassifier
8 from sklearn.linear_model import LogisticRegression
9 from sklearn.svm import SVC
10 from sklearn.gaussian_process import GaussianProcessClassifier
11 from sklearn.gaussian_process.kernels import RBF
12 from sklearn.tree import DecisionTreeClassifier
13 from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
14 from sklearn.naive_bayes import GaussianNB
15 from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
16 from sklearn.ensemble import VotingClassifier
17

18 #-----------------------------------------------
19 classifiers = [
20 LogisticRegression(max_iter = 1000),
21 KNeighborsClassifier(5),
22 SVC(kernel="linear", C=0.5),
23 SVC(gamma=2, C=1),
24 RandomForestClassifier(max_depth=5, n_estimators=50, max_features=1),
25 MLPClassifier(hidden_layer_sizes=[100], activation='logistic',
26 alpha=0.5, max_iter=1000),
27 AdaBoostClassifier(),
28 GaussianNB(),
29 QuadraticDiscriminantAnalysis(),
30 GaussianProcessClassifier(),
31 ]
32 names = [
33 "Logistic-Regr",
34 "KNeighbors-5 ",
35 "SVC-Linear ",
36 "SVC-RBF ",
37 "Random-Forest",
38 "MLPClassifier",
39 "AdaBoost ",
40 "Naive-Bayes ",
41 "QDA ",
42 "Gaussian-Proc",
43 ]

44 #-----------------------------------------------
45 if dataname is None: dataname = 'No-dataname';
46 if run is None: run = 50;
47 if rtrain is None: rtrain = 0.7e0;
48 if CompEnsm is None: CompEnsm = 2;
49

50 #=====================================================================
51 print('====== Comparision: Scikit-learn Classifiers =================')
52 #=====================================================================
53 import os;
54 acc_max=0; Acc_CLF = np.zeros([len(classifiers),1]);
55

56 for k, (name, clf) in enumerate(zip(names, classifiers)):


57 accmean, acc_std, etime = multi_run(clf,X,y,rtrain,run)
58

59 Acc_CLF[k] = accmean
60 if accmean>acc_max: acc_max,algname = accmean,name
61 print('%s: %s: Acc.(mean,std) = (%.2f,%.2f)%%; E-time= %.5f'
62 %(os.path.basename(dataname),name,accmean,acc_std,etime/run))
63 print('--------------------------------------------------------------')
64 print('sklearn classifiers Acc: (mean,max) = (%.2f,%.2f)%%; Best = %s'
65 %(np.mean(Acc_CLF),acc_max,algname))
66

67 if CompEnsm <2: quit()


68 #=====================================================================
69 print('====== Ensembling: SKlearn Classifiers =======================')
70 #=====================================================================
71 names = [x.rstrip() for x in names]
72 popped_clf = []
73 popped_clf.append(names.pop(9)); classifiers.pop(9); #Gaussian Proc
74 popped_clf.append(names.pop(7)); classifiers.pop(7); #Naive Bayes
75 popped_clf.append(names.pop(6)); classifiers.pop(6); #AdaBoost
76 popped_clf.append(names.pop(4)); classifiers.pop(4); #Random Forest
77 popped_clf.append(names.pop(0)); classifiers.pop(0); #Logistic Regr
78 #print('popped_clf=',popped_clf[::-1])
79

80 CLFs = [(name, clf) for name, clf in zip(names, classifiers)]


81 #if 'MyCLF' in locals(): CLFs += [('MyCLF',MyCLF())]
82 EnCLF = VotingClassifier(estimators=CLFs, voting='hard')
83 accmean, acc_std, etime = multi_run(EnCLF,X,y,rtrain,run)
84

85 print('EnCLF =',[lis[0] for lis in CLFs])


86 print('%s: Ensemble CLFs: Acc.(mean,std) = (%.2f,%.2f)%%; E-time= %.5f'
87 %(os.path.basename(dataname),accmean,acc_std,etime/run))

Output
1 DATA: N, d, nclass = 150 4 3
2 MyCLF() = DecisionTreeClassifier(max_depth=5)
3 iris.csv: MyCLF() : Acc.(mean,std) = (94.53,3.12)%; E-time= 0.00074
4 ====== Comparision: Scikit-learn Classifiers =================
5 iris.csv: Logistic-Regr: Acc.(mean,std) = (96.13,2.62)%; E-time= 0.01035
6 iris.csv: KNeighbors-5 : Acc.(mean,std) = (96.49,1.99)%; E-time= 0.00176
7 iris.csv: SVC-Linear : Acc.(mean,std) = (97.60,2.26)%; E-time= 0.00085
8 iris.csv: SVC-RBF : Acc.(mean,std) = (96.62,2.10)%; E-time= 0.00101
9 iris.csv: Random-Forest: Acc.(mean,std) = (94.84,3.16)%; E-time= 0.03647
10 iris.csv: MLPClassifier: Acc.(mean,std) = (98.58,1.32)%; E-time= 0.20549
11 iris.csv: AdaBoost : Acc.(mean,std) = (94.40,2.64)%; E-time= 0.04119
12 iris.csv: Naive-Bayes : Acc.(mean,std) = (95.11,3.20)%; E-time= 0.00090
13 iris.csv: QDA : Acc.(mean,std) = (97.64,2.06)%; E-time= 0.00085
14 iris.csv: Gaussian-Proc: Acc.(mean,std) = (95.64,2.63)%; E-time= 0.16151
15 --------------------------------------------------------------
16 sklearn classifiers Acc: (mean,max) = (96.31,98.58)%; Best = MLPClassifier
17 ====== Ensembling: SKlearn Classifiers =======================
18 EnCLF = ['KNeighbors-5', 'SVC-Linear', 'SVC-RBF', 'MLPClassifier', 'QDA']
19 iris.csv: Ensemble CLFs: Acc.(mean,std) = (97.60,1.98)%; E-time= 0.22272

Ensembling: You may stack the best classifier together with a few strong siblings among the other options.

Exercises for Chapter 10

10.1. Machine Learning Modelcode

(a) Search the database to get at least five datasets.


(You may try “print(dir(datasets))”.)
(b) Run the Machine Learning Modelcode, p. 296, to compare the performances of 10
selected classifiers.

10.2. Modify perceptron.py, p. 269, to get a code for Adaline.

• For a given training dataset, Adaline converges to a unique set of weights, while the Perceptron does not.
• Note that the correction terms are accumulated from all data points in each iteration. As a consequence, the learning rate η may have to be chosen smaller as the number of points increases.
Implementation: In order to overcome the problem, you may scale the correction terms by the number of data points.
– Redefine the cost function (10.9):
$$ \mathcal{J}(w, b) = \frac{1}{2N}\sum_{i=1}^{N}\big(y^{(i)} - \phi(z^{(i)})\big)^{2}, \qquad (10.36) $$
where $z^{(i)} = w^{T}x^{(i)} + b$ and $\phi = I$, the identity.
– Then the correction terms in (10.12) become correspondingly
$$ \Delta w = \eta\,\frac{1}{N}\sum_{i}\big(y^{(i)} - \phi(z^{(i)})\big)\,x^{(i)}, \qquad \Delta b = \eta\,\frac{1}{N}\sum_{i}\big(y^{(i)} - \phi(z^{(i)})\big). \qquad (10.37) $$
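For the implementation, the scaled correction terms in (10.37) can be computed in a fully vectorized way. The following is a minimal sketch only; the names X, y, w, b, and eta are placeholders and are not taken from perceptron.py.

adaline_step_sketch.py (hypothetical)
import numpy as np

def adaline_step(X, y, w, b, eta):
    """One full-batch Adaline update, scaling the corrections by N as in (10.37).
    X: (N,d) data, y: (N,) targets, w: (d,) weights, b: scalar bias."""
    N = X.shape[0]
    z = X.dot(w) + b                    # z^(i) = w^T x^(i) + b, with phi = identity
    errors = y - z                      # y^(i) - phi(z^(i))
    w = w + eta * X.T.dot(errors) / N   # Delta w in (10.37)
    b = b + eta * errors.sum() / N      # Delta b in (10.37)
    return w, b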
CHAPTER 11

Principal Component Analysis

Contents of Chapter 11
11.1. Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
11.2. Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
11.3. Applications of the SVD to LS Problems . . . . . . . . . . . . . . . . . . . . . 321
Exercises for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329


11.1. Principal Component Analysis


Definition 11.1. Principal component analysis (PCA) is the process of computing and using the principal components to perform a change of basis on the data, sometimes with only the first few principal components and ignoring the rest.

The PCA, in a Nutshell

• The PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called the principal components.
• The orthogonal axes of the new subspace can be interpreted as the directions of maximum variance, given the constraint that the new feature axes are orthogonal to each other:

Figure 11.1: Principal components.

• It can be shown that the principal directions are eigenvectors of the data’s covariance matrix.
• The PCA directions are highly sensitive to data scaling, and we need to standardize the features prior to PCA, particularly when the features were measured on different scales and we want to assign equal importance to all features.

11.1.1. The covariance matrix


Definition 11.2. Variance measures the variation of a single random
variable, whereas covariance is a measure of the joint variability of two
random variables. Let the random variable pair (x, y) take on the values
{(xi , yi ) | i = 1, 2, · · · , n}, with equal probabilities pi = 1/n. Then
• The formula for the variance of x is given by
$$ \sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad (11.1) $$
where $\bar{x}$ is the mean of the x values.
• The covariance $\sigma(x, y)$ of two random variables x and y is given by
$$ \mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \qquad (11.2) $$

Remark 11.3. In reality, data are saved in a matrix X ∈ Rn×d :


• each of the n rows represents a different data point, and
• each of the d columns gives a particular kind of feature.
Thus, d describes the dimension of data points and also can be considered
as the number of random variables.

Definition 11.4. The covariance matrix of a data matrix X ∈ Rn×d is


a square matrix C ∈ Rd×d , whose (i, j)-entry is the covariance of the i-th
column and the j-th column of X. That is,

C = [Cij ] ∈ Rd×d , Cij = cov(Xi , Xj ). (11.3)

Example 11.5. Let $\widehat{X}$ be the data X with the column-wise mean subtracted: $\widehat{X} = X - E[X]$. Then the covariance matrix of X reads
$$ C = \frac{1}{n}\widehat{X}^{T}\widehat{X} = \frac{1}{n}(X - E[X])^{T}(X - E[X]), \qquad (11.4) $$
for which the scaling factor 1/n is often ignored in reality.
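A quick numerical check of (11.4); this is a sketch only, using a small random data matrix and numpy's np.cov (with rowvar=False and bias=True) as the reference.

import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)          # n=100 points in d=3 dimensions
Xhat = X - X.mean(axis=0)           # subtract the column-wise mean
C = Xhat.T @ Xhat / X.shape[0]      # C = (1/n) Xhat^T Xhat, as in (11.4)
print(np.allclose(C, np.cov(X, rowvar=False, bias=True)))   # True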

Example 11.6. Generate synthetic data X in 2D to find its covariance matrix and principal directions.
Solution.
util_Covariance.py
1 import numpy as np
2

3 # Generate data
4 def generate_data(n):
5 # Normally distributed around the origin
6 x = np.random.normal(0,1, n)
7 y = np.random.normal(0,1, n)
8 S = np.vstack((x, y)).T
9 # Transform
10 sx, sy = 1, 3;
11 Scale = np.array([[sx, 0], [0, sy]])
12 theta = 0.25*np.pi; c,s = np.cos(theta), np.sin(theta)
13 Rot = np.array([[c, -s], [s, c]]).T #T, due to right multiplication
14

15 return S.dot(Scale).dot(Rot) +[5,2]


16

17 # Covariance
18 def cov(x, y):
19 xbar, ybar = x.mean(), y.mean()
20 return np.sum((x - xbar)*(y - ybar))/len(x)
21

22 # Covariance matrix
23 def cov_matrix(X):
24 return np.array([[cov(X[:,0], X[:,0]), cov(X[:,0], X[:,1])], \
25 [cov(X[:,1], X[:,0]), cov(X[:,1], X[:,1])]])
Covariance.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3 from util_Covariance import *
4

5 # Generate data
6 n = 200
7 X = generate_data(n)
8 print('Generated data: X.shape =', X.shape)
9

10 # Covariance matrix
11 C = cov_matrix(X)

12 print('C:\n',C)
13

14 # Principal directions
15 eVal, eVec = np.linalg.eig(C)
16 xbar,ybar = np.mean(X,0)
17 print('eVal:\n',eVal); print('eVec:\n',eVec)
18 print('np.mean(X, 0) =',xbar,ybar)
19

20 # Plotting
21 plt.style.use('ggplot')
22 plt.scatter(X[:, 0],X[:, 1],c='#00a0c0',s=10)
23 plt.axis('equal');
24 plt.title('Generated Data')
25 plt.savefig('py-data-generated.png')
26

27 for e, v in zip(eVal, eVec.T):


28 plt.plot([0,2*np.sqrt(e)*v[0]]+xbar,\
29 [0,2*np.sqrt(e)*v[1]]+ybar, 'r-', lw=2)
30 plt.title('Principal Directions')
31 plt.savefig('py-data-principal-directions.png')
32 plt.show()

Figure 11.2: Synthetic data and its principal directions (right).



Output
1 Generated data: X.shape = (200, 2)
2 C:
3 [[ 5.10038723 -4.15289232]
4 [-4.15289232 4.986776 ]]
5 eVal:
6 [9.19686242 0.89030081]
7 eVec:
8 [[ 0.71192601 0.70225448]
9 [-0.70225448 0.71192601]]
10 np.mean(X, 0) = 4.986291809096116 2.1696690114181947

Observation 11.7. Covariance Matrix.
• Symmetry: The covariance matrix C is symmetric so that it is diagonalizable. (See §5.2.2, p.144.) That is,
$$ C = U D U^{-1}, \qquad (11.5) $$
where D is a diagonal matrix of the eigenvalues of C and U is the matrix of the corresponding eigenvectors of C such that $U^T U = I$. (Such a square matrix U is called an orthogonal matrix.)
• Principal directions: The principal directions are eigenvectors of the data’s covariance matrix.
• Minimum volume enclosing ellipsoid (MVEE): The PCA can be viewed as fitting a d-dimensional ellipsoid to the data, where each axis of the ellipsoid represents one of the principal directions.
  – If some axis of the ellipsoid is small, then the variance along that axis is also small.

11.1.2. Computation of principal components

• Consider a data matrix X ∈ Rn×d :


– each of the n rows represents a different data point,
– each of the d columns gives a particular kind of feature, and
– each column has zero empirical mean (e.g., after standardization).
• Our goal is to find an orthogonal weight matrix W ∈ R^{d×d} such that
$$ Z = XW, \qquad (11.6) $$
where Z ∈ R^{n×d} is called the score matrix. The columns of Z represent the principal components of X.

First weight vector w_1: the first column of W.
In order to maximize the variance of z_1, the first weight vector w_1 should satisfy
$$ w_1 = \arg\max_{\|w\|=1}\|z_1\|^2 = \arg\max_{\|w\|=1}\|Xw\|^2 = \arg\max_{\|w\|=1} w^T X^T X w = \arg\max_{w\ne 0}\frac{w^T X^T X w}{w^T w}, \qquad (11.7) $$
where the quantity to be maximized can be recognized as a Rayleigh quotient.

Theorem 11.8. For a positive semidefinite matrix (such as X^T X), the maximum of the Rayleigh quotient is the same as the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector, i.e.,
$$ w_1 = \arg\max_{w\ne 0}\frac{w^T X^T X w}{w^T w} = \frac{v_1}{\|v_1\|}, \qquad (X^T X)\,v_1 = \lambda_1 v_1, \qquad (11.8) $$
where λ_1 is the largest eigenvalue of X^T X ∈ R^{d×d}.

Example 11.9. With w_1 found, the first principal component of a data vector x^{(i)}, the i-th row of X, is then given as the score $z_1^{(i)} = x^{(i)}\cdot w_1$.

Further weight vectors w_k:
The k-th weight vector can be found by (1) subtracting the first (k − 1) principal components from X:
$$ \widehat{X}_k := X - \sum_{i=1}^{k-1} X w_i w_i^T, \qquad (11.9) $$
and then (2) finding the weight vector which extracts the maximum variance from this new data matrix:
$$ w_k = \arg\max_{\|w\|=1} \|\widehat{X}_k w\|^2, \qquad (11.10) $$
which turns out to give the remaining eigenvectors of X^T X.

Remark 11.10. The principal components transformation can also be associated with the singular value decomposition (SVD) of X:
$$ X = U\Sigma V^T, \qquad (11.11) $$
where
U : n × d orthogonal (the left singular vectors of X),
Σ : d × d diagonal (the singular values of X),
V : d × d orthogonal (the right singular vectors of X).
• The matrix Σ explicitly reads
$$ \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_d), \qquad (11.12) $$
where σ_1 ≥ σ_2 ≥ ··· ≥ σ_d ≥ 0.
• In terms of this factorization, the matrix X^T X reads
$$ X^T X = (U\Sigma V^T)^T U\Sigma V^T = V\Sigma U^T U\Sigma V^T = V\Sigma^2 V^T. \qquad (11.13) $$
• Comparing with the eigenvector factorization of X^T X, we conclude
  – the right singular vectors V ≅ the eigenvectors of X^T X, so V ≅ W;
  – (the square of the singular values of X) = (the eigenvalues of X^T X), so σ_j^2 = λ_j, j = 1, 2, ···, d.

Summary 11.11. (Computation of Principal Components)
1. Compute the singular value decomposition (SVD) of X:
$$ X = U\Sigma V^T. \qquad (11.14) $$
2. Set
$$ W = V. \qquad (11.15) $$
Then the score matrix, the set of principal components, is
$$ Z = XW = XV = U\Sigma V^T V = U\Sigma = [\sigma_1 u_1\,|\,\sigma_2 u_2\,|\,\cdots\,|\,\sigma_d u_d]. \qquad (11.16) $$

* The SVD will be discussed in §11.2.
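A minimal numpy sketch of Summary 11.11, assuming the data matrix X has already been centered column-wise; the variable names are illustrative only.

import numpy as np

np.random.seed(1)
X = np.random.randn(200, 4)
X = X - X.mean(axis=0)                              # center the data column-wise

U, s, VT = np.linalg.svd(X, full_matrices=False)    # X = U Sigma V^T
W = VT.T                                            # (11.15): W = V
Z = X @ W                                           # (11.16): score matrix Z = XW
print(np.allclose(Z, U * s))                        # Z = U Sigma = [sigma_1 u_1 | ...]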

11.1.3. Dimensionality reduction: Data compression

• The transformation Z = XW maps a data vector x^{(i)} ∈ R^d to a new space of d variables which are now uncorrelated.
• However, not all the principal components need to be kept.
• Keeping only the first k principal components, produced by using only the first k eigenvectors of X^T X (k ≪ d), gives the truncated score matrix:
$$ Z_k := XW_k = U\Sigma V^T W_k = U\Sigma_k, \qquad (11.17) $$
where Z_k ∈ R^{n×k}, W_k ∈ R^{d×k}, and
$$ \Sigma_k := \mathrm{diag}(\sigma_1, \cdots, \sigma_k, 0, \cdots, 0). \qquad (11.18) $$
• It follows from (11.17) that the corresponding truncated data matrix reads
$$ X_k = Z_k W_k^T = U\Sigma_k W_k^T = U\Sigma_k W^T = U\Sigma_k V^T. \qquad (11.19) $$

Questions. How can we choose k? And is the difference ‖X − X_k‖ (that we truncated) small?

Claim 11.12. It follows from (11.11) and (11.19) that
$$ \|X - X_k\|_2 = \|U\Sigma V^T - U\Sigma_k V^T\|_2 = \|U(\Sigma - \Sigma_k)V^T\|_2 = \|\Sigma - \Sigma_k\|_2 = \sigma_{k+1}, \qquad (11.20) $$
where ‖·‖_2 is the induced matrix L^2-norm.

Remark 11.13. Efficient algorithms exist to compute the SVD of X


without having to form the matrix X T X, so computing the SVD is now
the standard way to carry out the PCA. See [6, 13].

Image Compression
• Dyadic Decomposition: The data matrix X ∈ R^{m×n} is expressed as a sum of rank-1 matrices:
$$ X = U\Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T, \qquad (11.21) $$
where V = [v_1, ···, v_n] and U = [u_1, ···, u_n].
• Approximation: X can be approximated as
$$ X \approx X_k := U\Sigma_k V^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \qquad (11.22) $$
which is closest to X among matrices of rank ≤ k, and ‖X − X_k‖_2 = σ_{k+1}.
• It only takes n·k + m·k = (m + n)·k words to store [v_1, v_2, ···, v_k] and [σ_1 u_1, σ_2 u_2, ···, σ_k u_k], from which we can reconstruct X_k.
• We use X_k as our compressed images, stored using (m + n)·k words.

A Matlab code to demonstrate the SVD compression of images:


peppers_compress.m
1 img = imread('Peppers.png'); [m,n,d]=size(img);
2 [U,S,V] = svd(reshape(im2double(img),m,[]));
3 %%---- select k <= p=min(m,n)
4 k = 20;
5 img_k = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
6 img_k = reshape(img_k,m,n,d);
7 figure, imshow(img_k)
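A rough Python counterpart of peppers_compress.m may look as follows. It is a sketch only, assuming an RGB image file 'Peppers.png' is available; the filename and the choice k = 20 are placeholders.

peppers_compress_sketch.py (hypothetical)
import numpy as np
import matplotlib.pyplot as plt

img = plt.imread('Peppers.png')[:, :, :3].astype(float)    # m x n x 3, values in [0,1]
m, n, d = img.shape
U, s, VT = np.linalg.svd(img.reshape(m, -1), full_matrices=False)

k = 20                                   # number of singular values kept
img_k = (U[:, :k] * s[:k]) @ VT[:k, :]   # rank-k approximation X_k
img_k = np.clip(img_k.reshape(m, n, d), 0, 1)
plt.imshow(img_k); plt.axis('off'); plt.show()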

The “Peppers” image is of size [270, 270, 3]; reshaped, it is a matrix in R^{270×810}.

Figure: Image compression using k singular values: Original (k = 270), k = 1, k = 10, k = 20, k = 50, k = 100.

Figure: Peppers: singular values and compression quality (PSNR vs. k).

$$ \mathrm{PSNR\ (dB)} = \begin{cases} 13.7 & \text{when } k = 1,\\ 20.4 & \text{when } k = 10,\\ 23.7 & \text{when } k = 20,\\ 29.0 & \text{when } k = 50,\\ 32.6 & \text{when } k = 100,\\ 37.5 & \text{when } k = 150, \end{cases} $$

where PSNR is the “Peak Signal-to-Noise Ratio.”

Peppers Storage: It requires (m + n)·k words. For example, when k = 50,
$$ (m + n)\cdot k = (270 + 810)\cdot 50 = 54{,}000, \qquad (11.23) $$
which is approximately a quarter of the full storage space, 270 × 270 × 3 = 218,700.



11.2. Singular Value Decomposition


Here we will deal with the SVD in detail.

Theorem 11.14. (SVD Theorem). Let A ∈ R^{m×n} with m ≥ n. Then we can write
$$ A = U\Sigma V^{T}, \qquad (11.24) $$
where U ∈ R^{m×n} satisfies U^T U = I, V ∈ R^{n×n} satisfies V^T V = I, and Σ = diag(σ_1, σ_2, ···, σ_n), where
$$ \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0. $$

Remark 11.15. The matrices are illustrated pictorially as
$$ A = U\,\Sigma\,V^{T}, \qquad (11.25) $$
where
U : m × n orthogonal (the left singular vectors of A),
Σ : n × n diagonal (the singular values of A),
V : n × n orthogonal (the right singular vectors of A).
• For some r ≤ n, the singular values may satisfy
$$ \underbrace{\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r}_{\text{nonzero singular values}} > \sigma_{r+1} = \cdots = \sigma_n = 0. \qquad (11.26) $$
In this case, rank(A) = r.
• If m < n, the SVD is defined by considering A^T.

Proof. (of Theorem 11.14) Use induction on m and n: we assume that the SVD exists for (m − 1) × (n − 1) matrices, and prove it for m × n. We assume A ≠ 0; otherwise we can take Σ = 0 and let U and V be arbitrary orthogonal matrices.
• The basic step occurs when n = 1 (m ≥ n). We let A = UΣV^T with
$$ U = A/\|A\|_2, \qquad \Sigma = \|A\|_2, \qquad V = 1. $$
• For the induction step, choose v so that
$$ \|v\|_2 = 1 \quad \text{and} \quad \|A\|_2 = \|Av\|_2 > 0. $$
• Let $u = \dfrac{Av}{\|Av\|_2}$, which is a unit vector. Choose $\tilde U$, $\tilde V$ such that
$$ U = [u\ \tilde U] \in \mathbb{R}^{m\times n} \quad \text{and} \quad V = [v\ \tilde V] \in \mathbb{R}^{n\times n} $$
are orthogonal.
• Now, we write
$$ U^T A V = \begin{bmatrix} u^T \\ \tilde U^T \end{bmatrix} \cdot A \cdot [v\ \tilde V] = \begin{bmatrix} u^T A v & u^T A \tilde V \\ \tilde U^T A v & \tilde U^T A \tilde V \end{bmatrix}. $$
Since
$$ u^T A v = \frac{(Av)^T(Av)}{\|Av\|_2} = \frac{\|Av\|_2^2}{\|Av\|_2} = \|Av\|_2 = \|A\|_2 \equiv \sigma, \qquad \tilde U^T A v = \tilde U^T u\,\|Av\|_2 = 0, $$
we have
$$ U^T A V = \begin{bmatrix} \sigma & 0 \\ 0 & U_1\Sigma_1 V_1^T \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & U_1 \end{bmatrix} \begin{bmatrix} \sigma & 0 \\ 0 & \Sigma_1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & V_1 \end{bmatrix}^T, $$
or equivalently
$$ A = \left( U \begin{bmatrix} 1 & 0 \\ 0 & U_1 \end{bmatrix} \right) \begin{bmatrix} \sigma & 0 \\ 0 & \Sigma_1 \end{bmatrix} \left( V \begin{bmatrix} 1 & 0 \\ 0 & V_1 \end{bmatrix} \right)^T. \qquad (11.27) $$
Equation (11.27) is our desired decomposition.



11.2.1. Algebraic interpretation of the SVD

Let rank(A) = r and let the SVD of A be A = UΣV^T, with
$$ U = [u_1\ u_2\ \cdots\ u_n], \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_n), \qquad V = [v_1\ v_2\ \cdots\ v_n], $$
and let σ_r be the smallest positive singular value. Since
$$ A = U\Sigma V^T \iff AV = U\Sigma V^T V = U\Sigma, $$
we have
$$ AV = A[v_1\ v_2\ \cdots\ v_n] = [Av_1\ Av_2\ \cdots\ Av_n] = [u_1 \cdots u_r \cdots u_n]\;\mathrm{diag}(\sigma_1, \cdots, \sigma_r, 0, \cdots, 0) = [\sigma_1 u_1\ \cdots\ \sigma_r u_r\ 0\ \cdots\ 0]. \qquad (11.28) $$
Therefore,
$$ A = U\Sigma V^T \;\Longleftrightarrow\; \begin{cases} Av_j = \sigma_j u_j, & j = 1, 2, \cdots, r, \\ Av_j = 0, & j = r+1, \cdots, n. \end{cases} \qquad (11.29) $$
Similarly, starting from A^T = VΣU^T,
$$ A^T = V\Sigma U^T \;\Longleftrightarrow\; \begin{cases} A^T u_j = \sigma_j v_j, & j = 1, 2, \cdots, r, \\ A^T u_j = 0, & j = r+1, \cdots, n. \end{cases} \qquad (11.30) $$

Summary 11.16. It follows from (11.29) and (11.30) that
• (v_j, σ_j^2), j = 1, 2, ···, r, are eigenvector-eigenvalue pairs of A^T A:
$$ A^T A\, v_j = A^T(\sigma_j u_j) = \sigma_j^2 v_j, \qquad j = 1, 2, \cdots, r. \qquad (11.31) $$
So, the singular values play the role of eigenvalues.
• Similarly, we have
$$ A A^T u_j = A(\sigma_j v_j) = \sigma_j^2 u_j, \qquad j = 1, 2, \cdots, r. \qquad (11.32) $$
• Equation (11.31) gives how to find the singular values {σ_j} and the right singular vectors V, while (11.29) shows a way to compute the left singular vectors U.
• (Dyadic decomposition) The matrix A ∈ R^{m×n} can be expressed as
$$ A = \sum_{j=1}^{n} \sigma_j u_j v_j^T. \qquad (11.33) $$
When rank(A) = r ≤ n,
$$ A = \sum_{j=1}^{r} \sigma_j u_j v_j^T. \qquad (11.34) $$
This property has been utilized for various approximations and applications, e.g., by dropping singular vectors corresponding to small singular values.

11.2.2. Computation of the SVD

For A ∈ R^{m×n}, the procedure is as follows.
1. Form A^T A (A^T A is the covariance matrix of A).
2. Find the eigen-decomposition of A^T A by an orthogonalization process, i.e., with Λ = diag(λ_1, ···, λ_n),
$$ A^T A = V\Lambda V^T, $$
where V = [v_1 ··· v_n] is orthogonal, i.e., V^T V = I.
3. Sort the eigenvalues according to their magnitude and let
$$ \sigma_j = \sqrt{\lambda_j}, \qquad j = 1, 2, \cdots, n. $$
4. Form the U matrix as follows:
$$ u_j = \frac{1}{\sigma_j} A v_j, \qquad j = 1, 2, \cdots, r. $$
If necessary, pick up the remaining columns of U so it is orthogonal. (These additional columns must be in Null(AA^T).)
5. Then
$$ A = U\Sigma V^T = [u_1 \cdots u_r \cdots u_n]\ \mathrm{diag}(\sigma_1, \cdots, \sigma_r, 0, \cdots, 0)\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}. $$

Lemma 11.17. Let A ∈ Rn×n be symmetric. Then (a) all the eigenvalues
of A are real and (b) eigenvectors corresponding to distinct eigenvalues
are orthogonal.
Example 11.18. Find the SVD for $A = \begin{bmatrix} 1 & 2 \\ -2 & 1 \\ 3 & 2 \end{bmatrix}$.
Solution.
1. $A^T A = \begin{bmatrix} 14 & 6 \\ 6 & 9 \end{bmatrix}$.
2. Solving $\det(A^T A - \lambda I) = 0$ gives the eigenvalues of $A^T A$,
$$ \lambda_1 = 18 \quad \text{and} \quad \lambda_2 = 5, $$
of which the corresponding eigenvectors are
$$ \tilde v_1 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}, \quad \tilde v_2 = \begin{bmatrix} -2 \\ 3 \end{bmatrix} \;\Longrightarrow\; V = \begin{bmatrix} \tfrac{3}{\sqrt{13}} & -\tfrac{2}{\sqrt{13}} \\ \tfrac{2}{\sqrt{13}} & \tfrac{3}{\sqrt{13}} \end{bmatrix}. $$
3. $\sigma_1 = \sqrt{\lambda_1} = \sqrt{18} = 3\sqrt{2}$ and $\sigma_2 = \sqrt{\lambda_2} = \sqrt{5}$. So
$$ \Sigma = \begin{bmatrix} \sqrt{18} & 0 \\ 0 & \sqrt{5} \end{bmatrix}. $$
4. $$ u_1 = \frac{1}{\sigma_1}Av_1 = \frac{1}{\sqrt{18}}\frac{1}{\sqrt{13}}\begin{bmatrix} 7 \\ -4 \\ 13 \end{bmatrix} = \begin{bmatrix} \tfrac{7}{\sqrt{234}} \\ -\tfrac{4}{\sqrt{234}} \\ \tfrac{13}{\sqrt{234}} \end{bmatrix}, \qquad u_2 = \frac{1}{\sigma_2}Av_2 = \frac{1}{\sqrt{5}}\frac{1}{\sqrt{13}}\begin{bmatrix} 4 \\ 7 \\ 0 \end{bmatrix} = \begin{bmatrix} \tfrac{4}{\sqrt{65}} \\ \tfrac{7}{\sqrt{65}} \\ 0 \end{bmatrix}. $$
5. $$ A = U\Sigma V^T = \begin{bmatrix} \tfrac{7}{\sqrt{234}} & \tfrac{4}{\sqrt{65}} \\ -\tfrac{4}{\sqrt{234}} & \tfrac{7}{\sqrt{65}} \\ \tfrac{13}{\sqrt{234}} & 0 \end{bmatrix} \begin{bmatrix} \sqrt{18} & 0 \\ 0 & \sqrt{5} \end{bmatrix} \begin{bmatrix} \tfrac{3}{\sqrt{13}} & \tfrac{2}{\sqrt{13}} \\ -\tfrac{2}{\sqrt{13}} & \tfrac{3}{\sqrt{13}} \end{bmatrix}. $$
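The hand computation can be checked numerically; a minimal sketch (numpy may return singular vectors with flipped signs, so only the singular values and the reconstructed product are compared):

import numpy as np

A = np.array([[1., 2.], [-2., 1.], [3., 2.]])
U, s, VT = np.linalg.svd(A, full_matrices=False)
print(s)                                       # approx [4.2426, 2.2361] = [3*sqrt(2), sqrt(5)]
print(np.allclose(U @ np.diag(s) @ VT, A))     # True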

11.3. Applications of the SVD to LS Problems

Recall (Definition 7.2, p. 195): Let A ∈ R^{m×n}, m ≥ n, and b ∈ R^m. The least-squares problem is to find $\hat{x} \in \mathbb{R}^n$ which minimizes ‖Ax − b‖_2:
$$ \hat{x} = \arg\min_x \|Ax - b\|_2, \quad \text{or, equivalently,} \quad \hat{x} = \arg\min_x \|Ax - b\|_2^2, \qquad (11.35) $$
where $\hat{x}$ is called a least-squares solution of Ax = b.

Note: When A^T A is invertible, the equation Ax = b has a unique LS solution for each b ∈ R^m (Theorem 7.5). It can be solved by the method of normal equations; the unique LS solution $\hat{x}$ is given by
$$ \hat{x} = (A^T A)^{-1} A^T b. \qquad (11.36) $$

Recall (Definition 7.6, p. 197): (A^T A)^{-1} A^T is called the pseudoinverse of A. Let A = UΣV^T be the SVD of A. Then
$$ (A^T A)^{-1} A^T = V\Sigma^{-1} U^T \overset{\mathrm{def}}{=} A^{+}. \qquad (11.37) $$

Example 11.19. Find the pseudoinverse of $A = \begin{bmatrix} 1 & 2 \\ -2 & 1 \\ 3 & 2 \end{bmatrix}$.
Solution. From Example 11.18, p. 320, we have
$$ A = U\Sigma V^T = \begin{bmatrix} \tfrac{7}{\sqrt{234}} & \tfrac{4}{\sqrt{65}} \\ -\tfrac{4}{\sqrt{234}} & \tfrac{7}{\sqrt{65}} \\ \tfrac{13}{\sqrt{234}} & 0 \end{bmatrix} \begin{bmatrix} \sqrt{18} & 0 \\ 0 & \sqrt{5} \end{bmatrix} \begin{bmatrix} \tfrac{3}{\sqrt{13}} & \tfrac{2}{\sqrt{13}} \\ -\tfrac{2}{\sqrt{13}} & \tfrac{3}{\sqrt{13}} \end{bmatrix}. $$
Thus,
$$ A^{+} = V\Sigma^{-1}U^T = \begin{bmatrix} \tfrac{3}{\sqrt{13}} & -\tfrac{2}{\sqrt{13}} \\ \tfrac{2}{\sqrt{13}} & \tfrac{3}{\sqrt{13}} \end{bmatrix} \begin{bmatrix} \tfrac{1}{\sqrt{18}} & 0 \\ 0 & \tfrac{1}{\sqrt{5}} \end{bmatrix} \begin{bmatrix} \tfrac{7}{\sqrt{234}} & -\tfrac{4}{\sqrt{234}} & \tfrac{13}{\sqrt{234}} \\ \tfrac{4}{\sqrt{65}} & \tfrac{7}{\sqrt{65}} & 0 \end{bmatrix} = \begin{bmatrix} -\tfrac{1}{30} & -\tfrac{4}{15} & \tfrac{1}{6} \\ \tfrac{11}{45} & \tfrac{13}{45} & \tfrac{1}{9} \end{bmatrix}. $$

Question. What if A^T A is not invertible? And even when it is invertible, what if the hypothesis space is either too big or too small?

Solving LS Problems by the SVD

Let A ∈ R^{m×n}, m > n, with rank(A) = k ≤ n.
• Suppose that the SVD of A is given, that is, A = UΣV^T.
• Since U and V are ℓ²-norm preserving, we have
$$ \|Ax - b\| = \|U\Sigma V^T x - b\| = \|\Sigma V^T x - U^T b\|. \qquad (11.38) $$
• Define z = V^T x and c = U^T b. Then
$$ \|Ax - b\| = \Big(\sum_{i=1}^{k}(\sigma_i z_i - c_i)^2 + \sum_{i=k+1}^{n} c_i^2\Big)^{1/2}. \qquad (11.39) $$
• Thus the norm is minimized when z is chosen with
$$ z_i = \begin{cases} c_i/\sigma_i, & \text{when } i \le k, \\ \text{arbitrary}, & \text{otherwise}. \end{cases} \qquad (11.40) $$
• After determining z, one can find the solution as
$$ \hat{x} = Vz. \qquad (11.41) $$
Then the least-squares error reads
$$ \min_x \|Ax - b\| = \Big(\sum_{i=k+1}^{n} c_i^2\Big)^{1/2}. \qquad (11.42) $$

Strategy 11.20. When z is obtained as in (11.40), it is better to choose zero for the “arbitrary” part:
$$ z = [c_1/\sigma_1, c_2/\sigma_2, \cdots, c_k/\sigma_k, 0, \cdots, 0]^T. \qquad (11.43) $$
In this case, z can be written as
$$ z = \Sigma_k^{+} c = \Sigma_k^{+} U^T b, \qquad (11.44) $$
where
$$ \Sigma_k^{+} = \mathrm{diag}(1/\sigma_1, 1/\sigma_2, \cdots, 1/\sigma_k, 0, \cdots, 0). \qquad (11.45) $$
Thus the corresponding LS solution reads
$$ \hat{x} = Vz = V\Sigma_k^{+} U^T b. \qquad (11.46) $$
Note that $\hat{x}$ involves no components of the null space of A; $\hat{x}$ is unique in this sense.

Remark 11.21.
• When rank(A) = k = n: It is easy to see that
$$ V\Sigma_k^{+} U^T = V\Sigma^{-1} U^T, \qquad (11.47) $$
which is the pseudoinverse of A.
• When rank(A) = k < n: A^T A is not invertible. However,
$$ A_k^{+} := V\Sigma_k^{+} U^T \qquad (11.48) $$
plays the role of the pseudoinverse of A. Thus we will call it the k-th pseudoinverse of A.

Note: For some LS applications, although rank(A) = n, the k-th pseudoinverse $A_k^{+}$, with a small k < n, may give more reliable solutions.
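A minimal numpy sketch of the k-th pseudoinverse solution (11.46); A, b, and k are placeholders, and when k = n the result should agree with np.linalg.pinv(A) @ b.

import numpy as np

def svd_ls_solve(A, b, k):
    """LS solution x = V Sigma_k^+ U^T b, keeping only the k largest singular values."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b                       # c = U^T b
    z = np.zeros_like(s)
    z[:k] = c[:k] / s[:k]             # z_i = c_i / sigma_i for i <= k, 0 otherwise
    return VT.T @ z                   # x = V z

A = np.random.randn(50, 6); b = np.random.randn(50)
print(np.allclose(svd_ls_solve(A, b, 6), np.linalg.pinv(A) @ b))   # True when k = n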

Example 11.22. Generate a synthetic dataset in 2D to find least-squares


solutions, using
(a) the method of normal equations and
(b) the SVD with various numbers of principal components.

Solution. Here we implement a Matlab code. You will redo it in Python;


see Exercise 11.2.
util.m
1 classdef util,
2 methods(Static)
3 %---------------------------------------
4 function data = get_data(npt,bx,sigma)
5 data = zeros(npt,2);
6 data(:,1) = rand(npt,1)*bx;
7 data(:,2) = max(bx/3,2*data(:,1)-bx);
8

9 r = randn(npt,1)*sigma; theta = randn(npt,1)*pi;


10 noise = r.*[cos(theta),sin(theta)];
11 data = data+noise;
12 end % indentation is not required, but an extra 'end' is.
13 %---------------------------------------
14 function mysave(gcf,filename)
15 exportgraphics(gcf,filename,'Resolution',100)
16 fprintf('saved: %s\n',filename)
17 end
18 %---------------------------------------
19 function A = get_A(data,n)
20 npt = size(data,1);
21 A = ones(npt,n);
22 for j=2:n
23 A(:,j) = A(:,j-1).*data;
24 end
25 end
26 %---------------------------------------
27 function Y = predict_Y(X,coeff,S_mean,S_std)
28 n = numel(coeff);
29 if nargin==2, S_mean=zeros(1,n); S_std=ones(1,n); end
30 A = util.get_A(X(:),n);
31 Y = ((A-S_mean)./S_std)*coeff;
32 end
33 end,end

Note: In Matlab, you can save multiple functions in a file, using


classdef and methods(Static).
• The functions will be called as class_name.function_name().
• Lines 12, 17, 25, 32: The extra ‘end’ is required for Matlab to distin-
guish functions without ambiguity.
– You may put the extra ‘end’ also for stand-alone functions.
• Line 29: A Matlab function can be implemented so that you may call
the function without some arguments using default arguments.
• Line 30: See how to call a class function from another function.
pca_regression.m
1 function [sol_PCA,S_mean,S_std] = pca_regression(A,b,npc)
2 % input: npc = the number of principal components
3

4 %% Standardization
5 %%---------------------------------------------
6 S_mean = mean(A); S_std = std(A);
7 if S_std(1)==0, S_std(1)=1/S_mean(1); S_mean(1)=0; end
8 AS = (A-S_mean)./S_std;
9

10 %% SVD regression, using the pseudoinverse


11 %%---------------------------------------------
12 [U,S,V] = svd(AS,'econ');
13 S1 = diag(S); % a column vector
14 C1 = zeros(size(S1));
15 C1(1:npc) = 1./S1(1:npc);
16 C = diag(C1); % a matrix
17

18 sol_PCA = V*C*U'*b;
19 end

Note: The standardization variables are included in output to be used


for the prediction.
• Line 7: Note that A(:,1)=1 so that its std must be 0.
• Lines 13 and 16: The function diag() toggles between a column vector
and a diagonal matrix.
• Line 19: The function puts an extra ‘end’ at the end.

Regression_Analysis.m
1 clear all; close all;
2

3 %%-----------------------------------------------------
4 %% Setting
5 %%-----------------------------------------------------
6 regen_data = 0; %==1, regenerate the synthetic data
7 poly_n = 9;
8 npt=300; bx=5.0; sigma=0.50; %for synthetic data
9 datafile = 'synthetic-data.txt';
10

11 %%-----------------------------------------------------
12 %% Data: Generation and Read
13 %%-----------------------------------------------------
14 if regen_data || ~isfile(datafile)
15 DATA = util.get_data(npt,bx,sigma);
16 writematrix(DATA, datafile);
17 fprintf('%s: re-generated.\n',datafile)
18 end
19 DATA = readmatrix(datafile,"Delimiter",",");
20

21 %%-----------------------------------------------------
22 %% The system: A x = b
23 %%-----------------------------------------------------
24 A = util.get_A(DATA(:,1),poly_n+1);
25 b = DATA(:,2);
26

27 %%-----------------------------------------------------
28 %% Method of Normal Equations
29 %%-----------------------------------------------------
30 sol_NE = (A'*A)\(A'*b);
31 figure,
32 plot(DATA(:,1),DATA(:,2),'k.','MarkerSize',8);
33 axis tight; hold on
34 yticks(1:5); ax = gca; ax.FontSize=13; %ax.GridAlpha=0.25
35 title(sprintf('Synthetic Data: npt = %d',npt),'fontsize',13)
36 util.mysave(gcf,'data-synthetic.png');
37 x=linspace(min(DATA(:,1)),max(DATA(:,1)),51);
38 plot(x,util.predict_Y(x,sol_NE),'r-','linewidth',2);
39 Pn = ['P_',int2str(poly_n)];
40 legend('data',Pn, 'location','best','fontsize',13)
41 TITLE0=sprintf('Method of NE: npt = %d',npt);
42 title(TITLE0,'fontsize',13)
43 hold off

44 util.mysave(gcf,'data-synthetic-sol-NE.png');
45

46 %%-----------------------------------------------------
47 %% PCA Regression
48 %%-----------------------------------------------------
49 for npc=1:size(A,2);
50 [sol_PCA,S_mean,S_std] = pca_regression(A,b,npc);
51 figure,
52 plot(DATA(:,1),DATA(:,2),'k.','MarkerSize',8);
53 axis tight; hold on
54 yticks(1:5); ax = gca; ax.FontSize=13; %ax.GridAlpha=0.25
55 x=linspace(min(DATA(:,1)),max(DATA(:,1)),51);
56 plot(x,util.predict_Y(x,sol_PCA,S_mean,S_std),'r-','linewidth',2);
57 Pn = ['P_',int2str(poly_n)];
58 legend('data',Pn, 'location','best','fontsize',13)
59 TITLE0=sprintf('Method of PC: npc = %d',npc);
60 title(TITLE0,'fontsize',13)
61 hold off
62 savefile = sprintf('data-sol-PCA-npc-%02d.png',npc);
63 util.mysave(gcf,savefile);
64 end

Note: Regression_Analysis is the main function. The code is simple; the


complication is due to plotting.
• Lines 6, 14-19: Data is read from a datafile.
– Setting regen_data = 1 will regenerate the datafile.

Figure 11.3: The synthetic data and the LS solution P9 (x), overfitted.

Figure 11.4: PCA regression of the data, with various numbers of principal components.
The best regression is achieved when npc = 3.

Exercises for Chapter 11

11.1. Download wine.data from the UCI database:


https://archive.ics.uci.edu/ml/machine-learning-databases/wine/
The data is extensively used in the Machine Learning community. The first column
of the data is the label and the others are features of three different kinds of wines.

(a) Add lines to the code given, to verify (11.20), p.312. For example, set k = 5.
Wine_data.py
1 import numpy as np
2 from numpy import diag,dot
3 from scipy.linalg import svd,norm
4 import matplotlib.pyplot as plt
5

6 data = np.loadtxt('wine.data', delimiter=',')


7 X = data[:,1:]; y = data[:,0]
8

9 #-----------------------------------------------
10 # Standardization
11 #-----------------------------------------------
12 X_mean, X_std = np.mean(X,axis=0), np.std(X,axis=0)
13 XS = (X - X_mean)/X_std
14

15 #-----------------------------------------------
16 # SVD
17 #-----------------------------------------------
18 U, s, VT = svd(XS)
19 if U.shape[0]==U.shape[1]:
20 U = U[:,:len(s)] # cut the nonnecessary
21 Sigma = diag(s) # transform to a matrix
22 print('U:',U.shape, 'Sigma:',Sigma.shape, 'VT:',VT.shape)

Note:

• Line 12: np.mean and np.std are applied, with the option axis=0, to get
the quantities column-by-column vertically. Thus X_mean and X_std are row
vectors.
• Line 18: In Python, svd produces [U, s, VT], where VT = V T . If you would
like to get V , then V = VT.T.

11.2. Implement the code in Example 11.22, in Python.

(a) Report your complete code.


(b) Attached figures as in Figures 11.3 and 11.4.

Clue: The major reason that a class is used in the Matlab code in Example 11.22 is
to combine multiple functions to be saved in a file. In Python, you do not have to use
a class to save multiple functions in a file. You may start with the following.
util.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3

4 def get_data(npt,bx,sigma):
5 data = np.zeros([npt,2]);
6 data[:,0] = np.random.uniform(0,1,npt)*bx;
7 data[:,1] = np.maximum(bx/3,2*data[:,0]-bx);
8 r = np.random.normal(0,1,npt)*sigma;
9 theta = np.random.normal(0,1,npt)*np.pi;
10 noise = np.column_stack((r*np.cos(theta),r*np.sin(theta)));
11 data += noise;
12 return data
13

14 def mysave(filename):
15 plt.savefig(filename,bbox_inches='tight')
16 print('saved:',filename)
17

18 # Add other functions

Regression_Analysis.py
1 import numpy as np
2 import numpy.linalg as la
3 import matplotlib.pyplot as plt
4 from os.path import exists
5 import util
6

7 ##-----------------------------------------------------
8 ## Setting
9 ##-----------------------------------------------------
10 regen_data = 1; #==1, regenerate the synthetic data
11 poly_n = 9;
12 npt=300; bx=5.0; sigma=0.50; #for synthetic data
13 datafile = 'synthetic-data.txt';
14 plt.style.use('ggplot')
15

16 ##-----------------------------------------------------
17 ## Data: Generation and Read
18 ##-----------------------------------------------------
19 if regen_data or not exists(datafile):
20 DATA = util.get_data(npt,bx,sigma);
21 np.savetxt(datafile,DATA,delimiter=',');

22 print('%s: re-generated.' %(datafile))


23

24 DATA = np.loadtxt(datafile, delimiter=',')


25

26 plt.figure() # initiate a new plot


27 plt.scatter(DATA[:,0],DATA[:,1],s=8,c='k')
28 plt.title('Synthetic Data: npc = '+ str(npt))
29 util.mysave('data-synthetic-py.png')
30 #plt.show()
31

32 ##-----------------------------------------------------
33 ## The system: A x = b
34 ##-----------------------------------------------------

Note: The semicolons (;) are neither necessary nor harmful in Python; they are included from copy-and-paste of the Matlab lines. The ggplot style emulates “ggplot”, a popular plotting package for R. When Regression_Analysis.py is executed, you will have a saved image:

Figure 11.5: data-synthetic-py.png


APPENDIX A

Appendices

Contents of Chapter A
A.1. Optimization: Primal and Dual Problems . . . . . . . . . . . . . . . . . . . . . 334
A.2. Weak Duality, Strong Duality, and Complementary Slackness . . . . . . . . . 338
A.3. Geometric Interpretation of Duality . . . . . . . . . . . . . . . . . . . . . . . . 342
A.4. Rank-One Matrices and Structure Tensors . . . . . . . . . . . . . . . . . . . . 349
A.5. Boundary-Effects in Convolution Functions in Matlab and Python SciPy . . . 353
A.6. From Python, Call C, C++, and Fortran . . . . . . . . . . . . . . . . . . . . . . 357


A.1. Optimization: Primal and Dual Problems


A.1.1. The Lagrangian

Problem A.1. Consider a general optimization problem of the form
$$ \begin{array}{ll} \min_{x} & f(x) \\ \text{subj.to} & h_i(x) \le 0, \quad i = 1, \cdots, m \\ & q_j(x) = 0, \quad j = 1, \cdots, p \end{array} \qquad \text{(Primal)} \qquad (A.1.1) $$
We define its Lagrangian L : R^n × R^m × R^p → R as
$$ L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{m}\alpha_i h_i(x) + \sum_{j=1}^{p}\beta_j q_j(x) = f(x) + \alpha\cdot h(x) + \beta\cdot q(x), \qquad (A.1.2) $$
where α = (α_1, α_2, ···, α_m) ≥ 0 and β = (β_1, β_2, ···, β_p) are Lagrange multipliers.

Definition A.2. The set of points that satisfy the constraints,
$$ \mathcal{C} \overset{\mathrm{def}}{=} \{x \in \mathbb{R}^n \mid h(x) \le 0 \ \text{and}\ q(x) = 0\}, \qquad (A.1.3) $$
is called the feasible set.

Lemma A.3. For each x in the feasible set C,
$$ f(x) = \max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta), \qquad x \in \mathcal{C}. \qquad (A.1.4) $$
The maximum is attained iff α satisfies
$$ \alpha_i h_i(x) = 0, \qquad i = 1, \cdots, m. \qquad (A.1.5) $$

Proof. When x ∈ C, we have h(x) ≤ 0 and q(x) = 0, and therefore
$$ L(x, \alpha, \beta) = f(x) + \alpha\cdot h(x) + \beta\cdot q(x) \le f(x). $$
Clearly, the last inequality becomes an equality iff (A.1.5) holds.



Remark A.4. Recall L(x, α, β) = f(x) + α·h(x) + β·q(x) and C = {x | h(x) ≤ 0 and q(x) = 0}. It is not difficult to see that
$$ \max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta) = \infty, \qquad x \notin \mathcal{C}. \qquad (A.1.6) $$

Theorem A.5. Let f^* be the optimal value of the primal problem (A.1.1):
$$ f^* = \min_{x\in \mathcal{C}} f(x). $$
Then f^* satisfies
$$ f^* = \min_{x}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta). \qquad (A.1.7) $$
Note: The minimum in (A.1.7) does not require x in C.

Proof. For x ∈ C, it follows from (A.1.4) that
$$ f^* = \min_{x\in \mathcal{C}} f(x) = \min_{x\in \mathcal{C}}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta). \qquad (A.1.8) $$
When x ∉ C, since (A.1.6) holds, we have
$$ \min_{x\notin \mathcal{C}}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta) = \infty. \qquad (A.1.9) $$
The assertion (A.1.7) follows from (A.1.8) and (A.1.9).

Summary A.6. Primal Problem
The primal problem (A.1.1) is equivalent to the minimax problem
$$ \min_{x}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta), \qquad \text{(Primal)} \qquad (A.1.10) $$
where
$$ L(x, \alpha, \beta) = f(x) + \alpha\cdot h(x) + \beta\cdot q(x). $$
Here the minimum does not require x in the feasible set C.

A.1.2. Lagrange Dual Problem

Given a Lagrangian L(x, α, β), we define its Lagrange dual function as
$$ g(\alpha, \beta) \overset{\mathrm{def}}{=} \min_{x} L(x, \alpha, \beta) = \min_{x}\big\{ f(x) + \alpha\cdot h(x) + \beta\cdot q(x) \big\}. \qquad (A.1.11) $$

Claim A.7. Lower Bound Property
$$ g(\alpha, \beta) \le f^*, \qquad \text{for } \alpha \ge 0. \qquad (A.1.12) $$
Proof. Let α ≥ 0. Then for x ∈ C,
$$ f(x) \ge L(x, \alpha, \beta) \ge \min_{x} L(x, \alpha, \beta) = g(\alpha, \beta). $$
Minimizing over all feasible points x gives f^* ≥ g(α, β).

Definition A.8. Given the primal problem (A.1.1), we define its Lagrange dual problem as
$$ \begin{array}{ll} \max_{\alpha,\beta} & \min_{x} L(x, \alpha, \beta) \\ \text{subj.to} & \alpha \ge 0 \end{array} \qquad \text{(Dual)} \qquad (A.1.13) $$
Thus the dual problem is a maximin problem.

Remark A.9. It is clear from the definition that the optimal value of the dual problem, denoted g^*, satisfies
$$ g^* = \max_{\alpha\ge 0,\,\beta}\,\min_{x} L(x, \alpha, \beta). \qquad (A.1.14) $$

Even if the primal problem is not convex, the dual problem is always convex (actually, concave).

Theorem A.10. The dual problem (A.1.13) is a convex optimization problem. Thus it is easy to optimize.
Proof. From the definition,
$$ g(\alpha, \beta) = \min_{x} L(x, \alpha, \beta) = \min_{x}\big\{ f(x) + \alpha\cdot h(x) + \beta\cdot q(x) \big\}, $$
which can be viewed as a pointwise infimum of affine functions of α and β. Thus it is concave. Hence the dual problem is a concave maximization problem, which is a convex optimization problem.

Summary A.11. Given the optimization problem (A.1.1):
• It is equivalent to the minimax problem
$$ \min_{x}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta), \qquad \text{(Primal)} \qquad (A.1.15) $$
where the Lagrangian is defined as
$$ L(x, \alpha, \beta) = f(x) + \alpha\cdot h(x) + \beta\cdot q(x). \qquad (A.1.16) $$
• Its dual problem is a maximin problem
$$ \max_{\alpha\ge 0,\,\beta}\,\min_{x} L(x, \alpha, \beta), \qquad \text{(Dual)} \qquad (A.1.17) $$
and the dual function is defined as
$$ g(\alpha, \beta) = \min_{x} L(x, \alpha, \beta). \qquad (A.1.18) $$
• The Lagrangian and Duality
  – The Lagrangian is a lower bound of the objective function:
$$ f(x) \ge L(x, \alpha, \beta), \qquad \text{for } x \in \mathcal{C},\ \alpha \ge 0. \qquad (A.1.19) $$
  – The dual function is a lower bound of the primal optimal:
$$ g(\alpha, \beta) \le f^*. \qquad (A.1.20) $$
  – The dual problem is a convex optimization problem.



A.2. Weak Duality, Strong Duality, and Complementary Slackness

Recall: For an optimization problem of the form
$$ \begin{array}{ll} \min_{x} & f(x) \\ \text{subj.to} & h_i(x) \le 0, \quad i = 1, \cdots, m \\ & q_j(x) = 0, \quad j = 1, \cdots, p \end{array} \qquad \text{(Primal)} \qquad (A.2.1) $$
the Lagrangian L : R^n × R^m × R^p → R is defined as
$$ L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{m}\alpha_i h_i(x) + \sum_{j=1}^{p}\beta_j q_j(x) = f(x) + \alpha\cdot h(x) + \beta\cdot q(x), \qquad (A.2.2) $$
where α = (α_1, α_2, ···, α_m) ≥ 0 and β = (β_1, β_2, ···, β_p) are Lagrange multipliers.

• The problem (A.2.1) is equivalent to the minimax problem
$$ \min_{x}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta). \qquad \text{(Primal)} \qquad (A.2.3) $$
• Its dual problem is a maximin problem
$$ \max_{\alpha\ge 0,\,\beta}\,\min_{x} L(x, \alpha, \beta), \qquad \text{(Dual)} \qquad (A.2.4) $$
and the dual function is defined as
$$ g(\alpha, \beta) = \min_{x} L(x, \alpha, \beta). \qquad (A.2.5) $$

A.2.1. Weak Duality

Theorem A.12. The dual problem yields a lower bound for the primal problem. That is, the minimax f^* is greater than or equal to the maximin g^*:
$$ f^* = \min_{x}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta) \;\ge\; \max_{\alpha\ge 0,\,\beta}\,\min_{x} L(x, \alpha, \beta) = g^*. \qquad (A.2.6) $$

Proof. Let x^* be the minimizer, the primal optimal. Then
$$ L(x, \alpha, \beta) \ge \min_{x} L(x, \alpha, \beta) = L(x^*, \alpha, \beta), \qquad \forall\, x,\ \alpha \ge 0,\ \beta. $$
Let (α^*, β^*) be the maximizer, the dual optimal. Then
$$ L(x, \alpha^*, \beta^*) = \max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta) \ge L(x, \alpha, \beta), \qquad \forall\, x,\ \alpha \ge 0,\ \beta. $$
It follows from the two inequalities that for all x, α ≥ 0, β,
$$ L(x, \alpha^*, \beta^*) = \max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta) \ge \min_{x} L(x, \alpha, \beta) = L(x^*, \alpha, \beta). \qquad (A.2.7) $$
Notice that the left side depends on x, while the right side is a function of (α, β). The inequality holds true for all x, α ≥ 0, β.
⇒ We may take the minimum over x on the left side and the maximum over (α ≥ 0, β) on the right side, to conclude (A.2.6).

Definition A.13. Weak and Strong Duality
(a) It always holds true that f^* ≥ g^*, which is called weak duality.
(b) In some problems, we actually have f^* = g^*, which is called strong duality.

A.2.2. Strong Duality

Theorem A.14. Slater’s Theorem
If the primal is a convex problem, and there exists at least one strictly feasible $\tilde{x}$ satisfying Slater’s condition:
$$ h(\tilde{x}) < 0 \quad \text{and} \quad q(\tilde{x}) = 0, \qquad (A.2.8) $$
then strong duality holds.

A concept closely related to strong duality is the duality gap.

Definition A.15. Given a primal feasible x and a dual feasible (α, β), the quantity
$$ f(x) - g(\alpha, \beta) = f(x) - \min_{x} L(x, \alpha, \beta) \qquad (A.2.9) $$
is called the duality gap.

From the weak duality, we have
$$ f(x) - g(\alpha, \beta) \ge f^* - g^* \ge 0. $$
Furthermore, we state a necessary and sufficient condition for the duality gap to equal 0.

Proposition A.16. For x and (α, β), the duality gap equals 0 iff
(a) x is the primal optimal solution,
(b) (α, β) is the dual optimal solution, and
(c) strong duality holds.
Proof. From the definitions and the weak duality, we have
$$ f(x) \ge f^* \ge g^* \ge g(\alpha, \beta). $$
The duality gap equals 0 iff the three inequalities become equalities.

A.2.3. Complementary Slackness

Assume that strong duality holds, x^* is the primal optimal, and (α^*, β^*) is the dual optimal. Then
$$ f(x^*) = g(\alpha^*, \beta^*) \overset{\mathrm{def}}{=} \min_{x} L(x, \alpha^*, \beta^*) = \min_{x}\Big\{ f(x) + \sum_{i=1}^{m}\alpha_i^* h_i(x) + \sum_{j=1}^{p}\beta_j^* q_j(x) \Big\} \le f(x^*) + \sum_{i=1}^{m}\alpha_i^* h_i(x^*) + \sum_{j=1}^{p}\beta_j^* q_j(x^*) \le f(x^*), \qquad (A.2.10) $$
hence the two inequalities hold with equality.
• The primal optimum x^* minimizes L(x, α^*, β^*).
• The complementary slackness holds:
$$ \alpha_i^* h_i(x^*) = 0, \qquad \text{for all } i = 1, \cdots, m, \qquad (A.2.11) $$
which implies that
$$ \alpha_i^* > 0 \;\Rightarrow\; h_i(x^*) = 0, \qquad h_i(x^*) < 0 \;\Rightarrow\; \alpha_i^* = 0. \qquad (A.2.12) $$

Note: Complementary slackness says that


• If a dual variable is greater than zero (slack/loose), then the corre-
sponding primal constraint must be an equality (tight.)
• If the primal constraint is slack, then the corresponding dual vari-
able is tight (or zero).

Remark A.17. Complementary slackness is key to designing


primal-dual algorithms. The basic idea is
1. Start with a feasible dual solution α.
2. Attempt to find primal feasible x such that (x, α) satisfy complemen-
tary slackness.
3. If Step 2 succeeded, we are done; otherwise the misfit on x gives a
way to modify α. Repeat.

A.3. Geometric Interpretation of Duality

Recall: For an optimization problem of the form
$$ \begin{array}{ll} \min_{x} & f(x) \\ \text{subj.to} & h_i(x) \le 0, \quad i = 1, \cdots, m \\ & q_j(x) = 0, \quad j = 1, \cdots, p \end{array} \qquad \text{(Primal)} \qquad (A.3.1) $$
the Lagrangian L : R^n × R^m × R^p → R is defined as
$$ L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{m}\alpha_i h_i(x) + \sum_{j=1}^{p}\beta_j q_j(x) = f(x) + \alpha\cdot h(x) + \beta\cdot q(x), \qquad (A.3.2) $$
where α = (α_1, α_2, ···, α_m) ≥ 0 and β = (β_1, β_2, ···, β_p) are Lagrange multipliers.

• The problem (A.3.1) is equivalent to the minimax problem
$$ \min_{x}\,\max_{\alpha\ge 0,\,\beta} L(x, \alpha, \beta). \qquad \text{(Primal)} \qquad (A.3.3) $$
• Its dual problem is a maximin problem
$$ \max_{\alpha\ge 0,\,\beta}\,\min_{x} L(x, \alpha, \beta), \qquad \text{(Dual)} \qquad (A.3.4) $$
and the dual function is defined as
$$ g(\alpha, \beta) = \min_{x} L(x, \alpha, \beta). \qquad (A.3.5) $$

Definition A.18. Given a primal problem (A.2.1), we define its epigraph (supergraph) as
$$ \mathcal{A} = \{(r, s, t) \mid h(x) \le r,\ q(x) = s,\ f(x) \le t,\ \text{for some } x \in \mathbb{R}^n\}. \qquad (A.3.6) $$

Geometric-interpretation A.19. Here is the geometric interpretation of several key values.
(a) f^* is the lowest projection of A onto the t-axis:
$$ f^* = \min\{t \mid h(x) \le 0,\ q(x) = 0,\ f(x) \le t\} = \min\{t \mid (0, 0, t) \in \mathcal{A}\}. \qquad (A.3.7) $$
(b) g(α, β) is the intersection of the t-axis and a hyperplane with normal vector (α, β, 1):
$$ g(\alpha, \beta) \overset{\mathrm{def}}{=} \min_{x}\big\{ f(x) + \alpha\cdot h(x) + \beta\cdot q(x) \big\} = \min\big\{ (\alpha, \beta, 1)^T (r, s, t) \mid (r, s, t) \in \mathcal{A} \big\}. \qquad (A.3.8) $$
This is referred to as a nonvertical supporting hyperplane, because the last component of the normal vector is nonzero (it is 1).
(c) g^* is the highest intersection of the t-axis and all nonvertical supporting hyperplanes of A. Notice that α ≥ 0 holds true for each nonvertical supporting hyperplane of A.

From the geometric interpretation of f^* and g^*, we actually have an equivalent geometric statement of strong duality:

Theorem A.20. Strong duality holds iff there exists a nonvertical supporting hyperplane of A passing through (0, 0, f^*).

Proof. From weak duality f^* ≥ g^*, the intersection of the t-axis and a nonvertical supporting hyperplane cannot exceed (0, 0, f^*). Strong duality holds, i.e., f^* = g^*, iff (0, 0, f^*) is exactly the highest intersection, meaning that there exists a nonvertical supporting hyperplane of A passing through (0, 0, f^*).

Example A.21. Solve a simple inequality-constrained convex problem:
$$ \begin{array}{ll} \min_{x} & x^2 + 1 \\ \text{subj.to} & x \ge 1. \end{array} \qquad (A.3.9) $$
Solution. A code is implemented to draw a figure, shown at the end of the solution.
• Lagrangian: The inequality constraint can be written as −x + 1 ≤ 0. Thus the Lagrangian reads
$$ L(x, \alpha) = x^2 + 1 + \alpha(-x + 1) = x^2 - \alpha x + \alpha + 1 = \Big(x - \frac{\alpha}{2}\Big)^2 - \frac{\alpha^2}{4} + \alpha + 1, \qquad (A.3.10) $$
and therefore the dual function reads (when x = α/2)
$$ g(\alpha) = \min_{x} L(x, \alpha) = -\frac{\alpha^2}{4} + \alpha + 1. \qquad (A.3.11) $$

Remark A.22. The Solution of $\min_x L(x, \alpha)$
◦ We may obtain it by applying a calculus technique:
$$ \frac{\partial}{\partial x} L(x, \alpha) = 2x - \alpha = 0, \qquad (A.3.12) $$
and therefore x = α/2 and (A.3.11) follows. Equation (A.3.12) is one of the Karush-Kuhn-Tucker (KKT) conditions, the first-order necessary conditions, which defines the relationship between the primal variable (x) and the dual variable (α).
◦ Using the KKT condition, (A.3.11) defines the dual function g(α) as a function of the dual variable (α).
◦ The dual function g(α) is concave, while the Lagrangian is an affine function of α.

• Epigraph: For the convex problem (A.3.9), its epigraph is defined as
$$ \mathcal{A} = \{(r, t) \mid -x + 1 \le r,\ x^2 + 1 \le t,\ \text{for } x \in \mathbb{R}\}. \qquad (A.3.13) $$
To find the edge of the epigraph, we replace the inequalities with equalities:
$$ -x + 1 = r, \qquad x^2 + 1 = t, \qquad (A.3.14) $$
and define t as a function of r:
$$ t = x^2 + 1 = (-r + 1)^2 + 1. \qquad (A.3.15) $$
See Figure A.1, where the shaded region is the epigraph of the problem.

Figure A.1: The epigraph of the convex problem (A.3.9), the shaded region, and strong duality.

The Primal Optimal: For a feasible point x,
$$ -x + 1 \le 0 \;\Rightarrow\; r = -x + 1 \le 0. $$
Thus the left side of the t-axis in A corresponds to the feasible set; it follows from (A.3.15) that
$$ f^* = \min\{t \mid (0, t) \in \mathcal{A}\} = 2. \qquad (A.3.16) $$



• Nonvertical Supporting Hyperplanes: For the convex problem, it follows from the Geometric-interpretation (A.3.8) that
$$ g(\alpha) = \min_{(r,t)\in\mathcal{A}}\{\alpha r + t\}. \qquad (A.3.17) $$
For each (r, t), the above reads
$$ \alpha r + t = g(\alpha) = -\frac{\alpha^2}{4} + \alpha + 1, $$
where (A.3.11) is used. Thus we can define a family of nonvertical supporting hyperplanes as
$$ t = -\alpha r - \frac{\alpha^2}{4} + \alpha + 1, \qquad (A.3.18) $$
which is a line in the (r, t)-coordinates for a fixed α. Figure A.1 depicts two of the lines: α = 0 and α = 2.
• Strong Duality: Note that on the t-axis (r = 0), (A.3.18) reads
$$ t = -\frac{\alpha^2}{4} + \alpha + 1 = -\frac{1}{4}(\alpha - 2)^2 + 2, \qquad (A.3.19) $$
of which the maximum is g^* = 2, attained when α = 2. Thus we can conclude
$$ f^* = g^* = 2; \qquad (A.3.20) $$
strong duality holds for the convex problem.



duality_convex.py
1 import numpy as np
2 from matplotlib import pyplot as plt
3

4 # Convex: min f(x), s.t. x >= 1 (i.e., -x+1 <= 0)


5 def f(x): return x**2+1
6 def g(r,alpha): return -alpha*r+(-alpha**2/4+alpha+1)
7

8 #--- Epigraph: t= f(r)


9 #------------------------------------------------
10 r = np.linspace(-3,4,100); x = -r+1
11 t = f(x); mint = t.min(); maxt = t.max()
12

13 plt.fill_between(r,t,maxt,color='cyan',alpha=0.25)
14 plt.plot(r,t,color='cyan')
15 plt.xlabel(r'$r$',fontsize=15); plt.ylabel(r'$t$',fontsize=15)
16 plt.text(-1,12,r'$\cal A$',fontsize=16)
17 plt.plot([0,0],[mint-3,maxt],color='black',ls='-') # t-axis
18 plt.yticks(np.arange(-2,maxt,2)); plt.tight_layout()
19

20 #--- Two Supporting hyperplanes


21 #------------------------------------------------
22 r = np.linspace(-2.5,2,2)
23 plt.plot(r,g(r,2),color='blue',ls='-')
24 plt.plot(r,g(r,0),color='blue',ls='--')
25

26 #--- Add Texts


27 #------------------------------------------------
28 p=2.1
29 plt.text(p,g(p,0),r'$\alpha=0$',fontsize=14)
30 plt.text(p,g(p,2),r'$\alpha=2$',fontsize=14) # the optimal
31 plt.plot(0,2,'r*',markersize=8)
32 plt.text(0.1,1.9,r'$f^*=g^*=2$',fontsize=14)
33

34 plt.savefig('png-duality-example.png',bbox_inches='tight')
35 plt.show()
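
As a numerical cross-check of (A.3.20), one may also solve the primal and the dual problems directly (a sketch, independent of duality_convex.py; the setup below is an assumption of this sketch, using scipy.optimize):

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: x[0]**2 + 1                        # primal objective
    g = lambda a: -a**2/4 + a + 1                    # dual function (A.3.11)

    # primal: minimize f subject to -x + 1 <= 0, i.e., x >= 1
    res_p = minimize(f, x0=[3.0],
                     constraints={'type': 'ineq', 'fun': lambda x: x[0] - 1.0})
    # dual: maximize g over alpha >= 0 (i.e., minimize -g)
    res_d = minimize(lambda a: -g(a[0]), x0=[0.5], bounds=[(0.0, None)])

    print('f* =', res_p.fun)        # ~2.0, attained at x = 1
    print('g* =', -res_d.fun)       # ~2.0, attained at alpha = 2

Both optimizations return the value 2, in agreement with the strong-duality conclusion above.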

Example A.23. Solve the following nonconvex problem


        min_x   x⁴ − 50x² + 25x
        subj.to x ≥ −2.                (A.3.21)

Solution. For the nonconvex problem, a code is implemented similar to


duality_convex.py on p.347.

Figure A.2: The nonconvex problem: (left) The graph of y = f (x) and (right) the epigraph
and weak duality.

• The Lagrangian of the problem (A.3.21) reads

L(x, α) = x⁴ − 50x² + 25x + α(−x − 2);        (A.3.22)

its epigraph is defined as



A = {(r, t) | −x − 2 ≤ r, x⁴ − 50x² + 25x ≤ t, for some x},        (A.3.23)

which is shown as the cyan-colored region in Figure A.2.


• The primal optimal f ∗ is obtained by projecting the negative side of the
epigraph (r ≤ 0) to the t-axis and taking the minimum, f ∗ ≈ −501.6.
• The dual optimal g ∗ is computed as the highest intersection of the t-axis
and all nonvertical supporting hyperplanes of A, g ∗ ≈ −675.
• For the nonconvex problem, there does not exist a supporting hyperplane
of A passing through (0, f ∗ ), thus strong duality does not hold.
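
A brute-force numerical sketch (not the original figure code; the grids below are arbitrary) that recovers both optima of this example:

    # duality_nonconvex_check.py -- a sketch, not part of the provided codes
    import numpy as np

    f = lambda x: x**4 - 50*x**2 + 25*x
    L = lambda x, a: f(x) + a*(-x - 2)               # Lagrangian (A.3.22)

    x = np.linspace(-8, 8, 40001)                    # fine grid in x
    alpha = np.linspace(0, 60, 601)                  # dual variable grid, alpha >= 0

    # primal optimal: minimize f over the feasible set x >= -2
    f_star = f(x[x >= -2]).min()

    # dual function g(alpha) = min_x L(x,alpha); dual optimal g* = max_alpha g(alpha)
    g = np.array([L(x, a).min() for a in alpha])
    g_star, a_star = g.max(), alpha[g.argmax()]

    print('f* ~= %.1f' % f_star)                            # about -501.6
    print('g* ~= %.1f at alpha = %g' % (g_star, a_star))    # about -675, at alpha = 25

The clear gap between the two printed values is exactly the duality gap visible in Figure A.2.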

A.4. Rank-One Matrices and Structure Tensors


Rank-one Matrices
Definition A.24. A rank-one matrix is a matrix with rank equal to
one.

Theorem A.25. Every rank-1 matrix A ∈ R^{m×n} can be written as an outer product

        A = uv^T,    u ∈ R^m,  v ∈ R^n.        (A.4.1)

Theorem A.26. Let A = uv^T be a rank-1 matrix. Then

        ||A||_2 = ||A||_F = ||u||_2 ||v||_2.        (A.4.2)

Proof. Using the definitions of the norms, we have

    ||A||_2 ≡ max_{x≠0} ||Ax||_2 / ||x||_2 = max_{||x||_2=1} ||Ax||_2 = max_{||x||_2=1} ||uv^T x||_2
            = max_{||x||_2=1} ||u||_2 · |v^T x| = ||u||_2 ||v||_2,
                                                                        (A.4.3)
    ||A||_F = sqrt(tr(AA^T)) = sqrt(tr(uv^T vu^T)) = sqrt(||u||_2² ||v||_2²) = ||u||_2 ||v||_2,

which completes the proof.
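
A quick NumPy check of (A.4.2), a sketch using arbitrary random vectors:

    import numpy as np

    u = np.random.randn(5)                  # any u in R^m
    v = np.random.randn(3)                  # any v in R^n
    A = np.outer(u, v)                      # rank-one matrix A = u v^T

    two_norm = np.linalg.norm(A, 2)         # spectral norm ||A||_2
    fro_norm = np.linalg.norm(A, 'fro')     # Frobenius norm ||A||_F
    product  = np.linalg.norm(u) * np.linalg.norm(v)
    print(two_norm, fro_norm, product)      # the three values agree up to round-off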


Example A.27. Let u = [6, 3, −6]^T and v = [1, −2, 2]^T.

(a) Form A = uv^T.    (b) Find ||A||_2.



Structure Tensor
Definition A.28. The structure tensor is a matrix derived from the
gradient of a function f (x), x ∈ Rn : it is defined as

S(x) = (∇f )(∇f )T (x), (A.4.4)

which describes the distribution of the gradient of f in a neighborhood


of the point x. The structure tensor is often used in image processing
and computer vision.

Claim A.29. Structure Tensor in Two Variables x = (x, y) ∈ R2


When x = (x, y) ∈ R², the structure tensor in (A.4.4) reads

        S(x) = (∇f)(∇f)^T(x) = [fx², fx·fy; fx·fy, fy²](x).        (A.4.5)

(a) The matrix S is symmetric and positive semidefinite.
(b) ||S||_2 = ||∇f||_2², which is the maximum eigenvalue of S.
(c) det S = 0, which implies that the number 0 is an eigenvalue of S.
    The two eigenvalue-eigenvector pairs (λi, vi), i = 1, 2, are

        λ1 = ||∇f||_2²,  v1 = ∇f;        λ2 = 0,  v2 = [−fy, fx]^T.        (A.4.6)

Proof.
(a) The matrix S is clearly symmetric. Let v ∈ R². Then
        v^T S v = v^T(∇f)(∇f)^T v = |(∇f)^T v|² ≥ 0,        (A.4.7)
    which proves that S is positive semidefinite.
(b) It follows from (A.4.2). Since S is symmetric, ||S||_2 must be the maximum
    eigenvalue of S. (See Theorem 5.44 (f).)
(c) det S = fx²·fy² − (fx·fy)² = 0 ⇒ S is not invertible ⇒ an eigenvalue of S
    must be 0.
(d) It is not difficult to check that S·vi = λi·vi, i = 1, 2.
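
A quick NumPy check of (A.4.6), a sketch with an arbitrary sample gradient:

    import numpy as np

    fx, fy = 1.5, -0.8                          # a sample gradient (fx, fy) at some point x
    grad = np.array([fx, fy])
    S = np.outer(grad, grad)                    # structure tensor (A.4.5)

    print(np.linalg.eigh(S)[0])                 # eigenvalues: [0, fx**2 + fy**2]
    print(S @ grad - (fx**2 + fy**2)*grad)      # ~0: grad f is an eigenvector for lambda_1
    print(S @ np.array([-fy, fx]))              # ~0: [-fy, fx] is an eigenvector for lambda_2 = 0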

Structure Tensor, Applied to Color Image Processing


• Let I(x) = (r, g, b)(x) be a color image defined on a rectangular region
Ω, x = (x, y) ∈ Ω ⊂ R2 .
• Then

        ∇I(x) = [(r, g, b)x ; (r, g, b)y](x),        (A.4.8)

  and therefore the structure tensor of I reads

        SI(x) = (∇I)(∇I)^T(x) = [(r, g, b)x ; (r, g, b)y] [(r, g, b)x^T, (r, g, b)y^T](x)
              = [rx² + gx² + bx²,  rx·ry + gx·gy + bx·by;
                 rx·ry + gx·gy + bx·by,  ry² + gy² + by²](x).        (A.4.9)

Claim A.30. For the structure tensor SI in (A.4.9):


(a) We have
        ||SI||_2 = ||∇I||_2² = λmax(SI).        (A.4.10)

(b) Rewrite SI as
        SI = [Jxx, Jxy; Jxy, Jyy].        (A.4.11)
    Then
        λmax(SI) = λ1 = ( Jxx + Jyy + sqrt((Jxx − Jyy)² + 4·Jxy²) ) / 2,        (A.4.12)
    and its corresponding eigenvector reads
        v1 = [Jyy − λ1; −Jxy],        (A.4.13)

which is the edge normal direction.

Proof. Here are hints.


(a) See Theorem 5.44 (f) and Theorem 5.49.
(b) Use the quadratic formula to solve det (SI − λI) = 0 for λ.
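
A quick NumPy check of (A.4.12)-(A.4.13), a sketch with arbitrary sample entries:

    import numpy as np

    Jxx, Jxy, Jyy = 2.0, 0.7, 1.1                    # sample entries of S_I in (A.4.11)
    S = np.array([[Jxx, Jxy], [Jxy, Jyy]])

    lam1 = (Jxx + Jyy + np.sqrt((Jxx - Jyy)**2 + 4*Jxy**2)) / 2    # (A.4.12)
    v1   = np.array([Jyy - lam1, -Jxy])                            # (A.4.13)

    print(lam1, np.linalg.eigh(S)[0].max())          # the two values of lambda_max agree
    print(S @ v1 - lam1*v1)                          # ~0: v1 is an eigenvector for lambda_1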

Matlab-code A.31. Structure Tensor


structure_tensor.m
1 function [grad,theta] = structure_tensor(u)
2 % [grad,theta] = structure_tensor(u)
3 % This program uses the Structure Tensor (ST) method to compute
4 % the Sobel gradient magnitude, |grad(u)|, and
5 % the edge normal angle, theta.
6

7 %%--- Sobel Derivatives ---------------------------------


8 [ux,uy] = sobel_derivatives(u);
9

10 %%--- Jacobian: J = (grad u)*(grad u)' ------------------


11 Jxx = dot(ux,ux,3); Jxy = dot(ux,uy,3); Jyy = dot(uy,uy,3);
12

13 %%--- The first Eigenvalue of J and direction (e1,v1) ---


14 D = sqrt((Jxx-Jyy).^2 + 4*Jxy.^2);
15 e1 = (Jxx+Jyy+D)/2;
16 grad = sqrt(e1); % sqrt(e1) = magnitude
17 theta = atan2(-Jxy,Jyy-e1); % v1 = (Jyy-e1,-Jxy)

sobel_derivatives.m
1 function [ux,uy] = sobel_derivatives(u)
2 % Usage: [ux,uy] = sobel_derivatives(u);
3 % It produces Sobel derivatives, using conv2(u,C,'valid').
4

5 %%--- initialization -------------------------


6 if isa(u,'uint8'), u = im2double(u); end
7 [m,n,d] = size(u);
8 ux = zeros(m,n,d); uy = zeros(m,n,d);
9 C = [1 2 1; 0 0 0; -1 -2 -1];
10

11 %%--- conv2, with 'valid' --------------------


12 for k = 1:d
13 ux(2:end-1,2:end-1,k) = conv2(u(:,:,k),C, 'valid');
14 uy(2:end-1,2:end-1,k) = conv2(u(:,:,k),C','valid');
15 end
16

17 %%--- expand, up to the boundary -------------


18 ux(1,:,:) = ux(2,:,:); ux(end,:,:) = ux(end-1,:,:);
19 ux(:,1,:) = ux(:,2,:); ux(:,end,:) = ux(:,end-1,:);
20 uy(1,:,:) = uy(2,:,:); uy(end,:,:) = uy(end-1,:,:);
21 uy(:,1,:) = uy(:,2,:); uy(:,end,:) = uy(:,end-1,:);

A.5. Boundary-Effects in Convolution Functions in Matlab and Python SciPy
Note: You should first understand that there are 3 different modes for
the computation of convolution in Matlab and Python.
• full: At points where the given two arrays overlap partially, the
arrays are padded with zeros to get the convolution there.
• valid: The output is calculated only at positions where the arrays
overlap completely. This mode does not use zero padding at all.
• same: It crops the middle part out of the ‘full’ mode so that its size
is the same as the size of the first array (Matlab) or the larger array
(SciPy).
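
A quick 1-D illustration of the three modes (a sketch; the arrays below are arbitrary). With an input of length n = 7 and a kernel of length k = 3, the output lengths are n+k-1 = 9 (full), n = 7 (same), and n-k+1 = 5 (valid):

    import numpy as np
    from scipy import signal

    u = np.arange(1., 8.)             # input, length 7
    w = np.array([1., 0., -1.])       # kernel, length 3

    for mode in ('full', 'same', 'valid'):
        y = signal.convolve(u, w, mode=mode)
        print(mode, len(y), y)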

Observation A.32. Boundary-Effects in Convolutions


Post-processing is required if the user wants to suppress boundary effects,
to get a convolution result having the same size as the input, or both.

Example A.33. Let f (x, y) = sin(πx) cos(πy).


Find the Sobel derivative ∂f/∂x by convolving with the filter

        C = [1, 0, −1; 2, 0, −2; 1, 0, −1].
Observe boundary effects and try to suppress them.
Solution.
• Codes are implemented in
– Matlab: conv2
– SciPy: scipy.signal.convolve2d
for the 3 different modes.
• Around the boundary, values are expanded (copied outward) from the “valid” result.
• The convolution results are plotted to compare.

matlab_conv2_boundary.m
1 %%--- Initial setting --------------------
2 n = 21; h = 1/(n-1);
3 x = linspace(0,1,n); [X,Y] = meshgrid(x);
4

5 F = sin(pi*X).*cos(pi*Y); % the function


6 Fx = pi*cos(pi*X).*cos(pi*Y); % the true derivative
7

8 %%--- conv2: Sobel derivative ------------


9 C = [1 0 -1; 2 0 -2; 1 0 -1] /(8*h);
10 Fx_full = conv2(F,C,'full');
11 Fx_same = conv2(F,C,'same');
12 Fx_valid = conv2(F,C,'valid');
13

14 %%--- Expansion of conv2(valid) ----------


15 Fx_expand = zeros(size(F));
16 Fx_expand(2:end-1,2:end-1) = Fx_valid;
17 Fx_expand(1,:) = Fx_expand(2,:); Fx_expand(end,:) = Fx_expand(end-1,:);
18 Fx_expand(:,1) = Fx_expand(:,2); Fx_expand(:,end) = Fx_expand(:,end-1);

scipy_convolve_boundary.py
1 import numpy as np; import scipy.signal
2 import matplotlib.pyplot as plt
3 from matplotlib import cm; from math import pi
4

5 ##--- Initial setting --------------------


6 n = 21; h = 1/(n-1);
7 x = np.linspace(0,1,n); X,Y = np.meshgrid(x,x);
8

9 F = np.sin(pi*X)*np.cos(pi*Y); # the function


10 Fx = pi*np.cos(pi*X)*np.cos(pi*Y); # the true derivative
11

12 ##--- conv2: Sobel derivative ------------


13 C = np.array([[1,0,-1], [2,0,-2], [1,0,-1]]) /(8*h);
14 Fx_full = scipy.signal.convolve2d(F,C,mode='full');
15 Fx_same = scipy.signal.convolve2d(F,C,mode='same');
16 Fx_valid = scipy.signal.convolve2d(F,C,mode='valid');
17

18 ##--- Expansion of convolve2d(valid) -----


19 Fx_expand = np.zeros(F.shape);
20 Fx_expand[1:-1,1:-1] = Fx_valid;
21 Fx_expand[0,:] = Fx_expand[1,:]; Fx_expand[-1,:] = Fx_expand[-2,:];
22 Fx_expand[:,0] = Fx_expand[:,1]; Fx_expand[:,-1] = Fx_expand[:,-2];

Figure A.3: Matlab convolutions.



Figure A.4: SciPy convolutions.



A.6. From Python, Call C, C++, and Fortran


Note: A good programming language must be easy to learn and use, and flexible and reliable.

Python
• Advantages
– Easy to learn and use
– Flexible and reliable
– Extensively used in Data Science
– Handy for Web Development purposes
– Vast library support
– Among the fastest-growing programming languages
in the tech industry, machine learning, and AI
• Disadvantage
– Slow!!

Strategy A.34. Speed up Python Programs


• Use numpy and scipy for all mathematical operations.
• Always use built-in functions wherever possible.

• Cython: It is designed for writing C extensions for Python and is aimed at
  users not familiar with C. A Good Choice!

• Create and import modules/functions in C, C++, or Fortran


– Easily more than 100× faster than Python scripts
– The Best Choice!!
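
A small illustration of the first two points (a sketch; absolute timings depend on the machine):

    import numpy as np, time

    n = 1_000_000
    x = np.arange(n, dtype=float); y = x + 0.1

    t0 = time.time(); s1 = sum(x[i]*y[i] for i in range(n)); t_loop = time.time() - t0
    t0 = time.time(); s2 = np.dot(x, y);                     t_dot  = time.time() - t0
    print('loop: %.3f s,  np.dot: %.5f s' % (t_loop, t_dot))  # np.dot is typically 100x+ faster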

Python Extension

Example A.35. Functions are implemented in (Python, F90, C, C++)


and called from Python. Let x, y ∈ Rn .
(V) Vectorized version:
    for j=1:m
        dotp = dot(x,y);
    end

(S) Scalar-loop version:
    for j=1:m
        dotp = 0;
        for i=1:n
            dotp = dotp + x(i)*y(i);
        end
    end

f90
test_f90.f90
1 subroutine test_f90_v(x,y,m,dotp)
2 implicit none
3 real(kind=8), intent(in) :: x(:), y(:)
4 real(kind=8), intent(out) :: dotp
5 integer :: m,j
6

7 do j=1,m
8 dotp = dot_product(x,y)
9 enddo
10 end
11

12 subroutine test_f90_s(x,y,m,dotp)
13 implicit none
14 real(kind=8), intent(in) :: x(:), y(:)
15 real(kind=8), intent(out) :: dotp
16 integer :: n,m,i,j
17

18 n =size(x)
19 do j=1,m
20 dotp=0
21 do i=1,n
22 dotp = dotp+x(i)*y(i)
23 enddo
24 enddo
25 end

C++
The numeric library is included for vector operations.
test_gpp.cpp
1 #include <iostream>
2 #include <vector>
3 #include <numeric>
4 using namespace std;
5 typedef double VTYPE;
6

7 extern "C" // required when using C++ compiler


8 VTYPE test_gpp_v(VTYPE*x,VTYPE*y, int n, int m)
9 {
10 int i,j;
11 VTYPE dotp;
12

13 for(j=0;j<m;j++){
14 dotp = inner_product(x, x+n, y, 0.0);
15 }
16 return dotp;
17 }
18

19 extern "C" // required when using C++ compiler


20 VTYPE test_gpp_s(VTYPE*x,VTYPE*y, int n, int m)
21 {
22 int i,j;
23 VTYPE dotp;
24

25 for(j=0;j<m;j++){dotp=0.;
26 for(i=0;i<n;i++){
27 dotp += x[i]*y[i];
28 }
29 }
30 return dotp;
31 }

Python
test_py3.py
1 import numpy as np
2

3 def test_py3_v(x,y,m):
4 for j in range(m):
5 dotp = np.dot(x,y)
6 return dotp
7

8 def test_py3_s(x,y,m):
9 n = len(x)
10 for j in range(m):
11 dotp = 0;
12 for i in range(n):
13 dotp +=x[i]*y[i]
14 return dotp

Compiling
Modules in f90, C, and C++ are compiled by executing the shell script.
Compile-f90-c-cpp
1 #!/usr/bin/bash
2

3 LIB_F90='lib_f90'
4 LIB_GCC='lib_gcc'
5 LIB_GPP='lib_gpp'
6

7 ### Compiling: f90


8 f2py3 -c --f90flags='-O3' -m $LIB_F90 *.f90
9

10 ### Compiling: C (PIC: position-independent code)


11 gcc -fPIC -O3 -shared -o $LIB_GCC.so *.c
12

13 ### Compiling: C++


14 g++ -fPIC -O3 -shared -o $LIB_GPP.so *.cpp

Python Wrap-up
An executable Python wrap-up is implemented as follows.
Python_calls_F90_GCC.py
1 #!/usr/bin/python3
2

3 import numpy as np
4 import ctypes, time
5 from test_py3 import *
6 from lib_f90 import *
7 lib_gcc = ctypes.CDLL("./lib_gcc.so")
8 lib_gpp = ctypes.CDLL("./lib_gpp.so")
9

10 n=100; m=1000000
11 #n=1000; m=1000000
12

13 x = np.arange(0.,n,1); y = x+0.1;
14

15 print('--------------------------------------------------')
16 print('Speed test: (dot-product: n=%d), m=%d times' %(n,m))
17 print('--------------------------------------------------')
18

19 ### Python ################################


20 t0 = time.time(); result = test_py3_v(x,y,m)
21 print('test_py3_v: e-time = %.4f; result = %.2f' %(time.time()-t0,result))
22

23 t0 = time.time(); result = test_py3_s(x,y,m)


24 print('test_py3_s: e-time = %.4f; result = %.2f\n' %(time.time()-t0,result))
25

26 ### Fortran ###############################


27 t0 = time.time(); result = test_f90_v(x,y,m)
28 print('test_f90_v: e-time = %.4f; result = %.2f' %(time.time()-t0,result))
29

30 t0 = time.time(); result = test_f90_s(x,y,m)


31 print('test_f90_s: e-time = %.4f; result = %.2f\n' %(time.time()-t0,result))
32

33 ### C #####################################
34 lib_gcc.test_gcc_s.argtypes = [np.ctypeslib.ndpointer(dtype=np.double),
35 np.ctypeslib.ndpointer(dtype=np.double),
36 ctypes.c_int,ctypes.c_int] #input type
37 lib_gcc.test_gcc_s.restype = ctypes.c_double #output type
38

39 t0 = time.time(); result = lib_gcc.test_gcc_s(x,y,n,m)


40 print('test_gcc_s: e-time = %.4f; result = %.2f\n' %(time.time()-t0,result))
41

42 ### C++ ###################################


43 lib_gpp.test_gpp_v.argtypes = [np.ctypeslib.ndpointer(dtype=np.double),
44 np.ctypeslib.ndpointer(dtype=np.double),
45 ctypes.c_int,ctypes.c_int] #input type
46 lib_gpp.test_gpp_v.restype = ctypes.c_double #output type
47

48 t0 = time.time(); result = lib_gpp.test_gpp_v(x,y,n,m)


49 print('test_gpp_v: e-time = %.4f; result = %.2f' %(time.time()-t0,result))
50

51 lib_gpp.test_gpp_s.argtypes = [np.ctypeslib.ndpointer(dtype=np.double),
52 np.ctypeslib.ndpointer(dtype=np.double),
53 ctypes.c_int,ctypes.c_int] #input type
54 lib_gpp.test_gpp_s.restype = ctypes.c_double #output type
55

56 t0 = time.time(); result = lib_gpp.test_gpp_s(x,y,n,m)


57 print('test_gpp_s: e-time = %.4f; result = %.2f\n' %(time.time()-t0,result))

Performance Comparison
A Linux OS is used with an Intel Core i7-10750H CPU @ 2.60GHz.
n=100, m=1000000 ⇒ 200M flops
1 --------------------------------------------------
2 Speed test: (dot-product: n=100), m=1000000 times
3 --------------------------------------------------
4 test_py3_v: e-time = 0.7672; result = 328845.00
5 test_py3_s: e-time = 18.2175; result = 328845.00
6

7 test_f90_v: e-time = 0.0543; result = 328845.00


8 test_f90_s: e-time = 0.0530; result = 328845.00
9

10 test_gcc_s: e-time = 0.0603; result = 328845.00


11

12 test_gpp_v: e-time = 0.0600; result = 328845.00


13 test_gpp_s: e-time = 0.0612; result = 328845.00

n=1000, m=1000000 ⇒ 2B flops


1 --------------------------------------------------
2 Speed test: (dot-product: n=1000), m=1000000 times
3 --------------------------------------------------
4 test_py3_v: e-time = 0.8984; result = 332883450.00
5 test_py3_s: e-time = 201.4086; result = 332883450.00
6

7 test_f90_v: e-time = 0.8331; result = 332883450.00


8 test_f90_s: e-time = 0.8318; result = 332883450.00
9

10 test_gcc_s: e-time = 0.8575; result = 332883450.00


11

12 test_gpp_v: e-time = 0.8638; result = 332883450.00


13 test_gpp_s: e-time = 0.8623; result = 332883450.00

n=100, m=10000000 ⇒ 2B flops


1 --------------------------------------------------
2 Speed test: (dot-product: n=100), m=10000000 times
3 --------------------------------------------------
4 test_py3_v: e-time = 7.8289; result = 328845.00
5 test_py3_s: e-time = 195.0932; result = 328845.00
6

7 test_f90_v: e-time = 0.5656; result = 328845.00


8 test_f90_s: e-time = 0.5456; result = 328845.00
9

10 test_gcc_s: e-time = 0.6090; result = 328845.00


11

12 test_gpp_v: e-time = 0.6055; result = 328845.00


13 test_gpp_s: e-time = 0.6089; result = 328845.00

Summary A.36. Python Calls C, C++, and Fortran


• Compiled modules are 200+ times faster than Python Scripts.
• Compiled modules are yet faster than Python Built-in’s.
• Fortran is about 5 ∼ 10 % faster than C/C++.

Innovative projects often require completely new code:

• You may search for and download public-domain functions, which you do not
  have to re-implement in Python.
• If a function needs a long script, you should try C, C++, or Fortran.
APPENDIX P

Projects

Contents of Chapter P
P.1. Project: Canny Edge Detection Algorithm for Color Images . . . . . . . . . . . . . . . . 366
P.2. Project: Text Extraction from Images, PDF Files, and Speech Data . . . . . . . . . . . 380


P.1. Project: Canny Edge Detection Algorithm for Color Images

Through the project, we will build an edge detection (ED) algorithm which can
sketch the edges of objects on color images.

What are Images?


• A rectangular array of pixels
• In the RGB representation: an image is a function
      u : Ω → R^3_+ := {(r, g, b) : r, g, b ≥ 0}
• In practice,
      u : Ω ⊂ N^2 → N^3_[0,255],
  due to sampling & quantization
• u = u(m, n, d), for d = 1 or 3
• rgb2gray formula:
0.299 ∗ R + 0.587 ∗ G + 0.114 ∗ B

• Most ED algorithms are developed for grayscale images.


(Color images must be transformed to a grayscale image.)
• Edge detection algorithms consist of a few steps. For example, the
Canny edge detection algorithm has five steps [2] (Canny, 1986):
1. Noise reduction (image blur)
2. Gradient calculation
3. Edge thinning (non-maximum suppression)
4. Double threshold
5. Edge tracking by hysteresis
We will learn Canny’s algorithm, implementing every step in detail.

“edge”: A Built-in Function


Edge_Detection.m
1 close all; clear all
2

3 global beta delta Dt nmax level


4 beta = 0.05; delta = 0.2; Dt = 0.25; nmax = 10; %TV denoising
5 level = 1;
6

7 if exist('OCTAVE_VERSION','builtin'), pkg load image; end


8 %----------------------------------------------
9 TheImage = 'DATA/Lena256.png';
10 %TheImage = 'DATA/synthetic-Checkerboard.png';
11 [Filepath,Name,Ext] = fileparts(TheImage);
12 v0 = im2double(imread(TheImage));
13 [m,n,d] = size(v0);
14 if level>=1, figure, imshow(v0); end
15 fprintf('%s: size=(%d,%d,%d)\n',TheImage,m,n,d)
16

17 %--- Use built-in: "edge" ---------------------


18 vg = rgb2gray(v0);
19 E = edge(vg,'Canny');
20 if level>=1, figure,imshow(E); end
21 if level>=2, imwrite(vg,strcat(Name,'_Gray.png'));
22 imwrite(E,strcat(Name,'_Gray-Builtin-Edge.png')); end
23

24 %----------------------------------------------
25 %--- New Trial: Color Edge Detection ----------
26 %----------------------------------------------
27 ES = color_edge(v0,Name);

Figure P.1: Edge detection, using the built-in function “edge”, which is not perfect but
somewhat acceptable.

A Critical Issue on Edge Detection Algorithms

Observation P.1. When color images are transformed to grayscale,


it is occasionally the case that edges
• either lose their strength
• or even disappear.

Figure P.2: Canny edge detection for color images: A synthetic image produced by
transform_to_gray.m, its grayscale, and the result of the built-in function “edge”.
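
A minimal numerical illustration of Observation P.1, using the rgb2gray formula above (the two colors are chosen only for illustration):

    # pure red and a particular green map to nearly the same gray value,
    # so the edge between them nearly vanishes after the grayscale conversion
    gray = lambda r, g, b: 0.299*r + 0.587*g + 0.114*b
    print(gray(255, 0, 0), gray(0, 130, 0))     # ~76.2 vs ~76.3: nearly identical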

Project Objectives: The project will develop an edge detection algorithm,
which is less problematic (and so more effective) for color images.
• We will design an effective algorithm as good as (or better than) the
well-tuned built-in function (edge).
• For your convenience and success, you will be provided with a
modelcode, saved in Edge-Detection.MM.tar.

A Modelcode, for Color Edge Detection


color_edge.m
1 function ES = color_edge(v0,Name)
2 % function ES = color_edge(v0,Name)
3 % Edge detection for color images,
4 % by dealing with gradients separately for RGB
5

6 global beta delta Dt nmax level


7

8 [m,n,d] = size(v0);
9 vs = zeros(m,n,d); grad = zeros(m,n,d); theta = zeros(m,n,d);
10 TH = zeros(m,n); ES = zeros(m,n);
11

12 %--- Steps 1 & 2: Channel-by-Channel ---------------


13 for k=1:d
14 %u = imgaussfilt(v0(:,:,k),2); % Step 1
15 u = tv_denoising(v0(:,:,k)); % Step 1: Image Denoising/Blur
16 vs(:,:,k) = u;
17 [grad(:,:,k),theta(:,:,k)] = sobel_grad(u); % Step 2: Grad Calculation
18 end
19

20 %--- Combine for (1D) Gradient Intensity -----------


21 [E0,I] = max(grad,[],3); E0 = E0/max(E0(:));
22 for i=1:m, for j=1:n, TH(i,j) = theta(i,j,I(i,j)); end,end
23 if level>=2,
24 imwrite(vs,strcat(Name,'_TV_denoised.png'));
25 imwrite(E0,strcat(Name,'_New-Sobel-Grad.png'));
26 end
27

28 %--- Step 3: Edge Thinning -------------------------


29 E1 = non_max_suppression(E0,TH);
30 if level>=2,
31 imwrite(E1,strcat(Name,'_Sobel-Grad-Supressed.png'));
32 end
33

34 %--- Step 4: Double Threshold ----------------------


35 highRatio = 0.09; lowRatio = 0.3; % Set them appropriately!!
36 [strong,weak] = double_threshold(E1,highRatio,lowRatio);
37

38 %--- Step 5: Edge Tracking by Hysteresis -----------


39 % YOU WILL DO THIS
40

41 %--- Step 6: (Extra Step) Edge-Trim ----------------


42 % YOU WILL DO THIS

P.1.1. Noise Reduction: Image Blur


This step can be done by applying the Gaussian filter imgaussfilt.
• It is a simple averaging algorithm and blurs the whole image.
⇒ It can make edge strength weaker; we will try another method.

Let v0 be an observed (noisy) image defined on Ω ⊂ R². Consider the
evolutionary total variation (TV) model [11]:

        u_t − ∇·( ∇u/|∇u| ) = β(v0 − u),    (TV)        (P.1.1)
where the left-side is the negation of the mean curvature and β denotes
a constraint parameter, a Lagrange multiplier.
• The TV model tends to converge to a piecewise constant image. Such
a phenomenon is called the staircasing effect. ⇒ The TV model
can be used for both noise reduction and edge sharpening.

Numerical Discretization
For the time-stepping procedure, we simply employ the explicit method,
the forward Euler method:
        (u^{n+1} − u^n)/∆t − ∇·( ∇u^n/|∇u^n| ) = β(v0 − u^n),    u^0 = v0,        (P.1.2)

which equivalently reads

        u^{n+1} = u^n + ∆t [ β(v0 − u^n) − A u^n ],    u^0 = v0,        (P.1.3)

where
        A u^n ≈ −∇·( ∇u^n/|∇u^n| ) = −( u^n_x/|∇u^n| )_x − ( u^n_y/|∇u^n| )_y.

Remark P.2. The TV Denoising


• The TV model, (P.1.3), requires to set ∆t small enough for stability.
• It can reduce noise effectively in a few iterations.
• Implementation details can be found from tv_denoising.m and
curvaturePG.m.
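
For readers working in Python, a minimal NumPy sketch of one explicit step (P.1.3) may look as follows. It is not the implementation of tv_denoising.m; in particular, it assumes that the global parameter delta serves to regularize |∇u|, and the default values mirror those set in Edge_Detection.m:

    import numpy as np

    def tv_step(u, v0, beta=0.05, dt=0.25, delta=0.2):
        uy, ux = np.gradient(u)                        # central differences
        mag = np.sqrt(ux**2 + uy**2 + delta**2)        # regularized |grad u|
        px, py = ux/mag, uy/mag
        div = np.gradient(px, axis=1) + np.gradient(py, axis=0)   # div( grad u / |grad u| )
        return u + dt*(beta*(v0 - u) + div)            # u^{n+1} in (P.1.3)

Applying tv_step repeatedly (nmax times, starting from u = v0) gives a rough Python counterpart of the Matlab routine.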

test_denoising.m
1 close all; clear all
2

3 global beta delta Dt nmax level


4 beta = 0.05; delta = 0.2; Dt = 0.25; nmax = 10; %TV denoising
5 level = 2;
6

7 if exist('OCTAVE_VERSION','builtin'), pkg load image; end


8 %----------------------------------------------
9 TheImage = 'DATA/Lena256.png';
10 v0 = im2double(imread(TheImage));
11 [m,n,d] = size(v0);
12

13 ug = zeros(m,n,d); ut = zeros(m,n,d);
14 for k=1:d
15 ut(:,:,k) = tv_denoising(v0(:,:,k));
16 ug(:,:,k) = imgaussfilt(v0(:,:,k),2); % sigma=2
17 end
18

19 imwrite(ut,'Lena256_test-TV_denoised.png')
20 imwrite(ug,'Lena256_test-Gaussian-filter.png')

Figure P.3: Step 1: Image denoising or image blur. The original Lena (left), the TV-
denoised image (middle), and the Gaussian-filtered image (right).

P.1.2. Gradient Calculation: Sobel Gradient


Note: Edges correspond to a change of pixels’ intensity. To detect it,
the easiest way is to apply filters that highlight the intensity change
in both directions: horizontal (x) and vertical (y).

Algorithm P.3. Sobel gradient


• The image gradient, ∇u = (ux , uy ), is calculated as the convolution
of the image (u) and the Sobel kernels (Kx , Ky ).
   
        Kx = [−1, 0, 1; −2, 0, 2; −1, 0, 1],    Ky = [1, 2, 1; 0, 0, 0; −1, −2, −1].        (P.1.4)

That is,
ux = conv(u, Kx ), uy = conv(u, Ky ). (P.1.5)

• Then the magnitude G and the slope θ of the gradient are calculated as follows:

        G = sqrt(ux² + uy²),    θ = arctan(uy/ux) = atan2(uy, ux).        (P.1.6)
See sobel_grad.m on the following page.

Note: The Gradient Magnitude & atan2


• The gradient magnitude is saved in E0.
(See Line 21 of color_edge.m on p.369.)
– It is normalized for its maximum value to be 1, for the purpose
of figuring.
• >> help atan2
atan2(Y,X) is the four quadrant arctangent of the
elements of X and Y such that -pi <= atan2(Y,X) <= pi.
...
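
For comparison, a SciPy sketch of (P.1.5)-(P.1.6) may look as follows (a sketch only; the function name is arbitrary, and 'same'-mode zero padding is used, so values along the image boundary differ from those produced by sobel_grad.m):

    import numpy as np
    from scipy.signal import convolve2d

    def sobel_grad_py(u):               # u: a single-channel (grayscale) image
        Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])      # (P.1.4)
        Ky = np.array([[ 1, 2, 1], [ 0, 0, 0], [-1, -2, -1]])
        ux = convolve2d(u, Kx, mode='same')                      # (P.1.5)
        uy = convolve2d(u, Ky, mode='same')
        grad  = np.sqrt(ux**2 + uy**2)                           # magnitude G
        theta = np.arctan2(uy, ux)                               # edge normal angle
        return grad, theta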

sobel_grad.m
1 function [grad,theta] = sobel_grad(u)
2 % [grad,theta] = sobel_grad(u)
3 % It computes the Sobel gradient magnitude, |grad(u)|,
4 % and edge normal angle, theta.
5

6 [m,n,d]=size(u);
7 grad = zeros(m,n); theta = zeros(m,n);
8

9 %%--------------------------------------------------
10 for q=1:n
11 qm=max(q-1,1); qp=min(q+1,n);
12 for p=1:m
13 pm=max(p-1,1); pp=min(p+1,m);
14 ux = u(pp,qm)-u(pm,qm) +2.*(u(pp,q)-u(pm,q)) +u(pp,qp)-u(pm,qp);
15 uy = u(pm,qp)-u(pm,qm) +2.*(u(p,qp)-u(p,qm)) +u(pp,qp)-u(pp,qm);
16 grad(p,q) = sqrt(ux^2 + uy^2);
17 theta(p,q) = atan2(uy,ux);
18 end
19 end

Figure P.4: The gradient magnitudes of (R,G,B) components of Lena.



Note: Through the project, we may get the single-component gradi-


ent by simply taking the maximum of (R,G,B)-gradients. See Line
21 of color_edge.m, p.369.

Figure P.5: Step 2: The maximum of (R,G,B) gradients, for the Lena image.

Figure P.6: The color checkerboard image in Figure P.2 (left) and the maximum of its
(R,G,B)-gradients (right).

Now, more reliable edges can be detected!



P.1.3. Edge Thinning: Non-maximum Suppression


As one can see from the figures above, some of the edges are thick and others
are thin.
• The goal of edge thinning is to mitigate the thick edges.

Algorithm P.4. Non-maximum Suppression


• The algorithm goes through the gradient intensity matrix and
• finds the pixels with the maximum value in the edge normal di-
rections.
The principle is simple!
non_max_suppression.m
1 function Z = non_max_suppression(E0,TH)
2 % function Z = non_max_suppression(E0,TH)
3

4 [m,n] = size(E0);
5 Z = zeros(m,n);
6

7 TH(TH<0) = TH(TH<0)+pi;
8 R = mod(floor((TH+pi/8)/(pi/4)),4); % region=0,1,2,3
9

10 for i=2:m-1, for j=2:n-1


11 if R(i,j)==0, % around 0 degrees
12 a=E0(i-1,j); b=E0(i+1,j);
13 elseif R(i,j)==1, % around 45 degrees
14 a=E0(i-1,j-1); b=E0(i+1,j+1);
15 elseif R(i,j)==2, % around 90 degrees
16 a=E0(i,j-1); b=E0(i,j+1);
17 else, % around 135 degrees
18 a=E0(i-1,j+1); b=E0(i+1,j-1);
19 end
20 if E0(i,j)>=max(a,b), Z(i,j) = E0(i,j); end
21 end,end

Figure P.7: Step 3: Non-maximum suppression. The Sobel gradient in Figure P.5 (left) and
the non-maximum suppressed (right).

P.1.4. Double Threshold


The double threshold step aims at identifying 3 kinds of pixels:
strong, weak, and non-relevant:
• Strong pixels: pixels of high intensities
⇒ They surely contribute to the final edge.
• Weak pixels: pixels of mid-range intensities
⇒ Not to be considered as non-relevant for the edge detection.
• Other pixels are considered as non-relevant for the edge.

Note: Implementation of Double Threshold


0. Set high and low thresholds.
1. High threshold is used to identify the strong pixels.
2. Low threshold is used to identify the non-relevant pixels.
3. All pixels having intensity between both thresholds are flagged as
weak.
4. The Hysteresis mechanism (Step 5) will help us identify the ones
that could be considered as strong and the ones that are considered
as non-relevant.

In-Reality P.5. highThreshold & lowThreshold


It is often the case that you should assign two ratios to set the
highThreshold and lowThreshold.
• For example:

highRatio = 0.09; lowRatio = 0.3;


highThreshold = highRatio*max(E1(:));
lowThreshold = lowRatio *highThreshold;

where E1 is the non-maximum suppressed gradient intensity.


Note: In order to detect weak edges more effectively, one can employ
dynamic (variable) thresholds. (⇒ It requires some more research.)

double_threshold.m
1 function [strong,weak] = double_threshold(E1,highRatio,lowRatio)
2

3 highThreshold = highRatio*max(E1(:));
4 lowThreshold = lowRatio *highThreshold;
5

6 strong = (E1>=highThreshold);
7 weak = (E1<highThreshold).*(E1>=lowThreshold);

Figure P.8: Step 4: Double threshold. The strong pixels (left), the weak pixels (middle),
and the combined (right).

Remark P.6. Step 4: Double Threshold


• You should select highRatio and lowRatio appropriately.
• You may have to eliminate isolated pixels from strong pixels, which
can be carried out as a post-processing.

P.1.5. Edge Tracking by Hysteresis

Algorithm P.7. The Edge Tracking Rule


The hysteresis consists of transforming a weak pixel into a strong one
⇐⇒ at least one of its 8 surrounding pixels is a strong one.

You will implement a function for edge tracking.



What to Do


1. Download the modelcode: Edge-Detection.MM.tar.
2. Complete Steps 4 & 5 with color_edge.m.
• For Step 4, the major work must be to set appropriately two param-
eters: highRatio and lowRatio.
• For Step 5, you have to implement a function named hysteresis
which realizes the edge tracking rule in Algorithm P.7.
3. Post-processing (Step 6): Implement a trimming function in order
to eliminate isolated edge pixels (or short edge segments of length, e.g., ≤ 3).

4. Run the resulting code of yours to get the edges.


5. Download another image, run your code, and tune it to get more
reliable edges. For tuning:
• You may try to use imgaussfilt with various σ, rather than the TV-
denoising. See Line 15 of color_edge.m, p.369.
• Try various combinations of highRatio and lowRatio.

6. Extra Credit:
(a) Analysis: For noise reduction, you can employ either the TV-
denoising model or the builtin function imgaussfilt(Image,σ)
with various choices of σ. Analyze effects of different choices of
parameters and functions on edge detection.
(b) New Idea for the Gradient Intensity. We chose the maximum
of (R,G,B)-gradients; see Line 21 of color_edge.m. Do you have
any idea better than that?

Report what you have done, including newly-implemented M-


files, choices of parameters and functions, and images and their
edges.

P.2. Project: Text Extraction from Images, PDF Files, and Speech Data
In text extraction applications, the core technology is Optical Character
Recognition (OCR).
• Its primary function is to extract text from images.
• Nowadays, using advanced machine learning algorithms, OCR can identify and
convert image text into audio files, for easy listening.

There exist powerful text-extraction software packages (with 98+% accuracy).
• However, most of them are not freely or conveniently available.
• We will develop two text extraction algorithms, one for image data and
the other for speech data.

Project Objectives: To develop two separate Python programs:


pdfim2text & speech2text.
1. PDF-Image to Text (pdfim2text)
• Input: (an image) or (a pdf file)
– A PDF may include images.
– When a PDF is generated by scanning, each page is an image.
• Core Task: Extract all texts; convert texts to audio data.
2. Speech to Text+Speech (speech2text)
• Input: speech data from (microphone) or (a wave file)
• Core Task: Extract texts; play the extracted texts.

An Example
pdfim2text
1 #!/usr/bin/python
2

3 import pytesseract
4 from pdf2image import convert_from_path
5 from PIL import Image
6 from gtts import gTTS
7 from playsound import playsound
8 import os, pathlib, glob
9 from termcolor import colored
10

11 def takeInput():
12 pmode = 0;
13 IN = input("Enter a pdf or an image: ")
14 if os.path.isfile(IN):
15 path_stem = pathlib.Path(IN).stem
16 path_ext = pathlib.Path(IN).suffix
17 if path_ext.lower() == '.pdf': pmode=1
18 else:
19 exit()
20 return IN, path_stem, pmode
21

22 def pdf2txt(IN):
23 # you have to complete the function appropriately
24 return 'Aha, it is a pdf file.\
25 For pdf2txt, you may save the text here without return.'
26

27 def im2txt(IN):
28 # you have to complete the function appropriately
29 return 'Now, it is an image.\
30 For im2txt, try to return the text to play'
31

32 if __name__ == '__main__':
33 IN, path_stem, pmode = takeInput() #pmode=0:image; pmode=1:pdf
34 if pmode:
35 txt = pdf2txt(IN)
36 else:
37 txt = im2txt(IN)
38

39 audio = gTTS(text=txt, lang="en", slow=False);


40 WAV = '0000-' + path_stem + '-text.wav';
41 audio.save(WAV); print(colored('Text: saved to <%s>' %(WAV),'yellow'))
42 playsound(WAV); os.remove(WAV)

What to Do
First download https://skim.math.msstate.edu/LectureNotes/data/Image-
Speech-Text-Processing.PY.tar. Untar it to see the file pdfim2text and
example codes in a subdirectory example-code.
1. Complete pdfim2text appropriately.
• You may find clues from example-code/pdf2txt.py
2. Implement speech2text from scratch.
• You may get hints from speech_mic2wave.py and image2text.py
in the directory example-code.
Try to put all functions into a single file for each command, which en-
hances portability of the commands.

Report
• Work in a directory, of which the name begins with your last name.
• Use the three-page project document as a data file for pdfim2text.
• zip or tar your work directory and submit via email.
• Write a report to explain what you have done, including images and
wave files; upload it to Canvas.
Bibliography
[1] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout,
    R. Pozo, C. Romine, and H. van der Vorst, Templates for the solution of linear
    systems: Building blocks for iterative methods, SIAM, Philadelphia, 1994. The
    postscript file is free to download from http://www.netlib.org/templates/ along
    with source codes.

[2] J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern
    Analysis and Machine Intelligence, PAMI-8 (1986), pp. 679–698.

[3] O. Chum and J. Matas, Matching with PROSAC - progressive sample consensus, in
    2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
    (CVPR'05), vol. 1, IEEE, 2005, pp. 220–226.

[4] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning, 20 (1995),
    pp. 273–297.

[5] M. Fischler and R. Bolles, Random sample consensus: A paradigm for model
    fitting with applications to image analysis and automated cartography, Communica-
    tions of the ACM, 24 (1981), pp. 381–395.

[6] B. Grosser and B. Lang, An O(n²) algorithm for the bidiagonal SVD, Lin. Alg. Appl.,
    358 (2003), pp. 45–70.

[7] C. Kelly, Iterative methods for linear and nonlinear equations, SIAM, Philadelphia,
    1995.

[8] P. C. Niedfeldt and R. W. Beard, Recursive RANSAC: multiple signal estimation
    with outliers, IFAC Proceedings Volumes, 46 (2013), pp. 430–435.

[9] M. Nielsen, Neural networks and deep learning. (The online book can be found at
    http://neuralnetworksanddeeplearning.com), 2013.

[10] F. Rosenblatt, The Perceptron: A probabilistic model for information storage and
     organization in the brain, Psychological Review, 65 (1958), pp. 386–408.

[11] L. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal
     algorithms, Physica D, 60 (1992), pp. 259–268.

[12] P. H. Torr and A. Zisserman, MLESAC: A new robust estimator with application
     to estimating image geometry, Computer Vision and Image Understanding, 78 (2000),
     pp. 138–156.

[13] P. R. Willems, B. Lang, and C. Vömel, Computing the bidiagonal SVD using
     multiple relatively robust representations, SIAM Journal on Matrix Analysis and Ap-
     plications, 28 (2006), pp. 907–926.

[14] S. Xu, An Introduction to Scientific Computing with MATLAB and Python Tutorials,
     CRC Press, Boca Raton, FL, 2022.
Index
(n + 1)-point difference formula, 94 boundary effects, 353
:, Python slicing, 216 boundary-value problem, 192
:, in Matlab, 15 break, 23
_ _init_ _() constructor, 226
call_get_cubes.py, 220
absolute error, 47 Canny edge detection algorithm, 366
activation function, 266 cardinal functions, 87
activation functions, popular, 275 Cauchy-Schwarz inequality, 154
Adaline, 272, 276 chain rule, 74
adaptive step size, 186 change of basis, 304
affine function, 337 change of variables, 76, 202
algorithm, 47 change-of-base formula, 61
algorithmic design, 4 characteristic equation, 142
algorithmic parameter, 273 characteristic polynomial, 142
analog signals, 38 charpoly, 143
anonymous function, 25 child class, 229
anonymous_function.m, 25 circle.m, 32
approximation, 194 class, 225
area_closed_curve.m, 33, 62 Classes.py, 229
artificial neurons, 265 classification problem, 204
atan2, 372 closest point, 246
attributes, 226 clustering, 262
audio files, 380 CNN, 291
augmented matrix, 115, 119 code block, 215
average slope, 109 coding, iii, 2
average speed, 66 coding vs. programming, 5
coefficient matrix, 115
backbone of programming, 8 coefficients, 104
backtracking line search, 186 cofactor, 137
backward phase, 122 cofactor expansion, 137
backward-difference, 93 color image, 351
basic variables, 121 color image processing, 351
basis, 77, 236 color_edge.m, 369
basis function, 76 column space, 235
best approximation, 246 common logarithm, 60
bias, 267 Compile-f90-c-cpp, 360
big Oh, 50, 51 complementary slackness, 341
binary classifier, 265, 266 complex number system, 35


computer programming, iii, 2, 8 DFT, 39


computer vision, 350 dft.m, 40
concave, 344 diagonalizable, 145, 308
concave maximization problem, 337 diagonalization theorem, 146
condition number, 153, 191 diagonalization.m, 147
conditionally stable, 47 difference formula, (n + 1)-points, 94
consensus set, 206 difference formula, five-point, 96
consistent system, 115 difference formula, three-point, 95
constraint set, 181 difference formula, two-point, 93
continue, 24 difference quotient, 66
contour, 36 differentiable, 165
contour, in Matlab, 18 differentiate, 72
conv2, 353 differentiation rules, 72
convergence of Newton’s method, 101 digital signals, 38
convergence of order α, 48 directional derivative, 168
converges absolutely, 79 discrete Fourier transform, 38, 39
convex optimization problem, 337 discrete_Fourier.m, 40
convex problem, 344 discrete_Fourier_inverse.m, 40
convolution, 353, 372 distance, 149
correction term, 99 diverges, 79
cost function, 266 doc, in Matlab, 18
covariance, 305 domain, 164, 181
covariance matrix, 304–306, 308, 319 dot product, 13, 148
Covariance.py, 306 dot product preservation, 242
critical point, 184 dot, in Matlab, 13
critical points, 176 double threshold, 377
csvwrite, 32 double_threshold.m, 378
curse of dimensionality, 263 dual function, 337, 338, 342, 344
curvaturePG.m, 370 dual problem, 179, 336
cython, 212, 357 dual variable, 179
duality gap, 340
daspect, 32 duality_convex.py, 347
data matrix, 309 dyadic decomposition, 312, 318
debugging, 8 dynamic (variable) thresholds, 377
deepcopy, 217
default value, 228 e, 58
deletion of objects, 228 e, as a limit, 58
dependent variable, 164 e_limit.m, 58
derivative, 71 echelon form, 120
derivative_rules.m, 74 edge, 367, 368
design matrix, 199 edge detection, 366
desktop calculator, 214 edge normal direction, 351
det, 138 edge normal directions, 375
determinant, 136, 137 edge sharpening, 370
determinant.m, 138 edge thinning, 375

edge tracking, 378 Fourier transform, 38


edge tracking rule, 378 fplot, in Matlab, 18
Edge_Detection.m, 367 free variable, 123
effective programming, 9 free variables, 121
eig, 143, 162 free_fall.m, 67
eigenvalue, 141, 142 frequency increment, 39
eigenvalues.m, 143 frequency resolution, 39
eigenvector, 141 frequently_used_rules.py, 218
elementary row operations, 117, 139 Frobenius norm, 152
ellipsoid, 308 fsurf, in Matlab, 18
ensembling, 295, 298 function, 53
epigraph, 342, 345 function of two variables, 164
equivalent system, 114 fundamental questions, two, 118
Euclidean norm, 151 Fundamental Theorem of Algebra, 104
Euler’s identity, 38, 109
Euler’s number, 58 Galileo’s law of free-fall, 66
general solution, 123, 125
Euler, Leonhard, 58
get_cubes.py, 220
eulers_identity.m, 109
get_hypothesis_WLS.m, 209
Excel, 39
ggplot, 331
existence, 118
global variable, 228
explainable ai, 285
golden ratio, 26
explicit method, 370
gradient, 171, 189
exponential function, 55
gradient descent algorithm, 187
exponential growth of error, 47
gradient descent method, 181, 184, 188,
exponential regression, 56
272, 286
eye, in Matlab, 16
gradient magnitude, 372
fast Fourier transform, 39 Gram-Schmidt process, 248, 250
fastest increasing direction, 173 Gram-Schmidt process, normalized, 249
feasible set, 334 Green’s Theorem, 32
FFT, 39 Guess The Weight of the Ox Competition,
Fibonacci sequence, 27 260
Fibonacci_sequence.m, 27 help, in Matlab, 7, 18
fig_plot.m, 17 Hessian, 189
fimplicit, 109 horizontal line test, 54
finite difference method, 93 Horner’s method, 105, 228
first-order necessary conditions, 344 horner, in Python, 223
five-point difference formula, 96 horner.m, 107, 224
fmesh, in Matlab, 18 hyperparameter, 273
folding frequency, 43 hyperplane, 266, 267
for loop, 21 hypothesis, 206
forward Euler method, 370 hypothesis space, 322
forward phase, 122, 125 hysteresis, 366, 378
forward-difference, 93
four essential components, 12, 19 identity function, 271

IDFT, 39 iterative algorithm, 155


idft.m, 40
imag, imaginaty part, 37 k-nearest neighbor, 281
image blur, 370, 371 k-NN, 281
image compression, 312 Karush-Kuhn-Tucker conditions, 344
image denoising, 371 KD-tree, 282
image processing, 350 KKT conditions, 344
image texts, 380 Kronecker delta, 87
imaginary part, 37 Krylov subspace methods, 188
imaginary unit, 35
imgaussfilt, 370, 379 Lagrange dual function, 179, 336
inconclusive, 79 Lagrange dual problem, 336
inconsistent system, 115 Lagrange form of interpolating polyno-
indentation, 215 mial, 87
independent variable, 164 Lagrange interpolating polynomial, 87
induced matrix 2-norm, 154 Lagrange multiplier, 174, 370
induced matrix norm, 152 Lagrange multipliers, 334, 338, 342
infinity-norm, 151 Lagrange polynomial, 93
information engineering, iii Lagrange_interpol.py, 89
inheritance, 229 Lagrangian, 176, 177, 334, 337, 338, 342,
initialization, 19, 225 344
inlier.m, 210 lazy learner, 281
inliers, 204 learning rate, 268, 272
inner product, 148, 150 least-squares line, 198
instance, 225 least-squares problem, 195, 321
instantaneous speed, 66 least-squares solution, 195, 252, 321
instantiation, 225 least_squares.m, 197
intercept, 267 left singular vectors, 310, 315
interpol_error.py, 92 left-hand limit, 70
interpolation, 194 length, 149
Interpolation Error Theorem, 90, 94 length preservation, 242
interpretability, 264 level curve, 173
interval of convergence, 83 likelihood, 277
inverse discrete Fourier transform, 39 linear algebra basics, 113
inverse Fourier transform, 38 linear combination, 234
inverse function, 53, 54 linear convergence, 48
inverse power method, 159 linear dependence, 126
inverse_matrix.m, 130 linear equation, 114
inverse_power.m, 160 linear growth of error, 47
inverse_power.py, 160 linear independence, 126
invertible matrix, 129 linear SVM, 280
invertible matrix theorem, 132, 140, 142 linear system, 114
Iris_perceptron.py, 270 linear_equations_rref.m, 119
iris_sklearn.py, 294 linearity rule, 74
iteration, 19 linearization, 202

linearly dependent, 126 mini-batch, 287


linearly independent, 126 minimax problem, 177, 178, 335, 337, 338,
linspace, in Matlab, 18 342
list, in Python, 216 Minimization Problem, 181
little oh, 50, 51 minimum point set, 206
load, 32 minimum volume enclosing ellipsoid, 308
localization of roots, 104 Minkowski distance, 282
log-likelihood function, 277 MLESAC, 208
logarithmic function, 59 MNIST data, 284
logistic cost function, 277 mod, 24
Logistic Regression, 277 modelcode, 296, 368
logistic regression, 276 modularization, 8, 231
lower bound property, 336, 337 module, 8
LS problem, 195, 321 modulo, 24
monomial basis, 77, 78
M-file, 6 multi-class classification, 274
machine learning, 53, 260, 380 multi-line comments, 215
machine learning algorithm, 260 multiple local minima problem, 264
machine learning modelcode, 296 multiple output, 26
Machine_Learning_Model.py, 296 MVEE, 308
Maclaurin series, 82 myclf.py, 297
majority vote, 298 mysort.m, 10
mathematical analysis, 4 mysqrt.m, 102
Matlab, 12, 353
matlab_conv2_boundary.m, 354 natural logarithm, 60
matrix 2-norm, 154 nested loop, 22
matrix equation, 117 nested multiplication, 105
matrix norm, 152 network.py, 288
matrix transformation, 247 neuron, 265
matrix-matrix multiplication, 16 Newton’s method, 99
matrix-vector multiplication, 15, 148 Newton-Raphson method, 99
maximin problem, 179, 336–338, 342 newton_horner, in Python, 223
maximum useful frequency, 43 newton_horner.m, 108, 224
maximum-norm, 151 No Free Lunch Theorem, 293
mean curvature, 370 noise reduction, 370
Mean Value Theorem, 85 non-maximum suppression, 375
mesh, 36 non_max_suppression.m, 375
mesh, in Matlab, 18 nonconvex problem, 348
method of Lagrange multipliers, 280 nonlinear regression, 202
method of normal equations, 197, 209, nonlinear SVMs, 280
252, 321 nonsingular matrix, 129
method, in Python class, 226 nonvertical supporting hyperplane, 343,
microphone, 380 346
midpoint formula, 110 norm, 149, 151
midpoint formula for f 00 , 96 normal, 173

normal equations, 196, 201 outliers, 204


normal matrix, 152 overfitting, 263
normal vector, 280
np.set_printoptions, 157 p-norms, 151
null space, 235 parameter estimation problem, 204
nullity, 237 parameter vector, 199
numeric library, 359 parametric description, 123
numerical approximation, 32 parametric vector form, 133
numerical differentiation, 93 parent class, 229
numpy, 25, 212, 222, 357 partial derivative, 166
numpy.loadtxt, 331 PCA, 304
numpy.savetxt, 331 pca_regression.m, 325
Nyquist criterion, 43 pdfim2text, 380–382
peppers_compress.m, 313
object-oriented programming, 225 perceptron, 267
objective function, 181 perceptron.py, 269
observation vector, 199 pivot column, 121, 126
OCR, 380 pivot position, 121
Octave, 25 plot, in Matlab, 17
Octave, how to import symbolic package, polynomial approximation, 81
36 polynomial interpolation, 53
Octave, how to know if Octave is runng- Polynomial Interpolation Error Theorem,
ing, 36 90, 95
one-shot learning, 264 Polynomial Interpolation Theorem, 86
one-to-one function, 54 polynomial of degree n, 104
one-versus-all, 274 Polynomial_01.py, 226
one-versus-rest, 274 Polynomial_02.py, 227
OOP, 225 population of the world, 56
operator 2-norm, 154 population.m, 57
operator norm, 152 positive definite, 188
optical character recognition, 380 positive semidefinite, 350
optimal step length, 190 positive semidefinite matrix, 309
optimization problem, 334, 338, 342 post-processing, 353
orthogonal, 150, 243 power iteration, 155
orthogonal basis, 238, 248, 249 power method, 155
orthogonal complement, 243 power rule, 74
orthogonal decomposition theorem, 244 power series, 78
orthogonal matrix, 242, 308 power-of-2 restriction, 39
orthogonal projection, 239, 244, 246 power_iteration.m, 157
orthogonal set, 238 power_iteration.py, 157
orthogonal_matrix.m, 242 principal component analysis, 304
orthogonality preservation, 242 principal components, 311
orthonormal basis, 241, 248, 249 principal directions, 304, 306, 308
orthonormal set, 241 probabilistic model, 276
outer product, 349 product rule, 74

programming, iii, 2, 8 rectifier, 275


PROSAC, 208 reduced echelon form, 120
pseudocode, 47 reduced row echelon form, 119
pseudoinverse, 197, 252, 321, 323 REF, 120
pseudoinverse, the k-th, 323 reference semantics, in Python, 217
PSNR, 314 region, 30
Pythagorean theorem, 150, 246 regression analysis, 53, 198
Python, 212, 353 regression coefficients, 198
Python essentials, 215 regression line, 198
Python script, 363 Regression_Analysis.m, 326
Python wrap-up, 213, 361 Regression_Analysis.py, 330
Python_calls_F90_GCC.py, 361 relative error, 47
python_startup.py, 214 Remainder Theorem, 105
remarks on Python implementation, 231
QR factorization, 250, 252 repetition, 6, 19
QR factorization algorithm, 251 research and development, 76
QR factorization for least-squares prob- retrieving elements, in Python, 216
lem, 252 reusability, 5, 6, 231
QR iteration, 253 reverse, 53
qr_iteration.m, 254 rgb2gray formula, 366
quadratic convergence, 48 Richardson’s method, 181
quadratic formula, 35, 351 right singular vectors, 310, 315
quantization, 366 right-hand slope, 70
quotient rule, 74 Rosenbrock function, 185
rosenbrock_2D_GD.py, 185
R-RANSAC, 208 rotational symmetry, 275
R&D, 76 row equivalent, 117
radius of convergence, 80 row reduced echelon form, 120
random orthogonal matrix, 242 row reduction algorithm, 122
random sample consensus, 206 RREF, 120
range, 164 rref, 119
range, in Python, 217 Run_network.py, 290
rank theorem, 237
rank-1 matrix, 349 sampling, 366
rank-one matrix, 349 save multiple functions in a file, 325
RANSAC, 206 saveas, 32, 33
ransac2.m, 210 scalar multiplication, 234
ratio test, 79, 83 scatter plot, 56
Rayleigh quotient, 309 scene analysis, 204
readmatrix, 33 Schur decomposition, 253
real part, 37 Scikit-learn, 292
real, real part, 37 SciPy, 353
real-valued solution, 35, 37 scipy, 212, 357
real_imaginary_parts.m, 37 scipy.signal.convolve2d, 353
real_STFT.m, 62 scipy_convolve_boundary.py, 354

score matrix, 309 stable, 47


search direction, 181, 188, 272 staircasing effect, 370
secant_lines_abs_x2_minus_1.m, 70 stand-alone functions, 231
second-derivative midpoint formula, 96 standard basis, 236
self, 226 standard basis for Rn , 77
sequence_sqrt2.m, 49 standard unit vectors, 77
SGD, 187 Starkville, 63
shared objects, 213 stdt.m, 45
sharing boundaries, 225 steepest descent direction, 189
short-time Fourier transform, 44 steepest descent method, 181
short_time_DFT.m, 45 step length, 181, 188, 272, 278
signal_DFT.m, 41 STFT, 44
similar, 144 stft2.m, 46
similarity transformation, 144 stochastic gradient descent, 187, 287
sinc function, 83 string, in Python, 216
singular value decomposition, 310, 311, strong duality, 339, 346
315 structure tensor, 349, 350
singular values, 310, 315 structure_tensor.m, 352
sklearn.ensemble.VotingClassifier, 298 submatrix, 137
sklearn_classifiers.py, 298 subordinate 2-norm, 154
Slater’s condition, 340 subordinate norm, 152
Slater’s Theorem, 340 subspace, 234
slicing, in Python, 216 sufficient decrease condition, 186
slope of the curve, 69 Sum of Squared Errors, 266
smoothing assumption, 204 super-convergence, 102
Sobel derivative, 353 supergraph, 342
Sobel gradient, 372 superlinear convergence, 48
Sobel kernels, 372 supervised learning, 261
sobel_derivatives.m, 352 support vector machine, 279
sobel_grad.m, 372, 373 surf, in Matlab, 18
soft-margin classification, 280 SVD, 310
softplus function, 275 SVD theorem, 315
solution, 114 SVD, algebraic interpretation, 317
solution set, 114 symbolic computation, 67
SortArray.m, 11 symmetric, 308
span, 128 symmetric positive definite, 188
spectrogram, 44 synthetic division, 105
speech2text, 380 system, 53
Speed up Python Programs, 212, 357 system of linear equations, 113, 114
sqrt, 103
square root, 102 tangent line, 66, 69, 100
squareroot_Q.m, 3 Taylor polynomial of order n, 83
squaresum.m, 6 Taylor series, 81, 82, 97
SSE, 266 Taylor series, commonly used, 83
stability, 370 Taylor’s formula, 182

Taylor’s series formula, 97 unit vector, 149, 241


Taylor’s Theorem, 84, 96 unstable, 47
Taylor’s Theorem with integral remain- unsupervised learning, 262
der, 183 update direction, 272
Taylor’s Theorem with Lagrange Remain- upper triangular matrix, 253
der, 84, 96 util.m, 324
Taylor’s Theorem, Alternative Form of, 97 util.py, 330
taylor, in Matlab, 83 util_Covariance.py, 306
Term-by-Term Differentiation, 80 util_Poly.py, 230
term-by-term integration, 80
test_denoising.m, 371 variance, 305
test_f90.f90, 358 vector norm, 151
test_gpp.cpp, 359 vector space, 234
test_py3.py, 360 visualize_complex_solution.m, 36
three tasks, 260 volume scaling factor, 136
three-point difference formula, 95 VotingClassifier, 298
three-point formulas, 95 weak duality, 339
time-stepping procedure, 370 Weierstrass Approximation Theorem, 86
total variation, 370 weight matrix, 204
trace, 152 weighted least-squares method, 204
training data, 261 weighted normal equations, 205
transform_to_gray.m, 368 while loop, 20
transpose, 131 why, in Matlab, 16
truncated data matrix, 311 win_cos.m, 45
truncated score matrix, 311 Wine_data.py, 329
tuple, in Python, 216 writematrix, 33
TV, 370
tv_denoising.m, 370 x-intercept, 100
two-class classification, 265 XAI, 285
two-point difference formula, 93
zero padding, 353
Ubuntu, 292 zero vector, 234
unique inverse, 129 Zeros-Polynomials-Newton-Horner.py,
uniqueness, 118 223
unit circle, 173 zeros_of_poly_built_in.py, 222
