Mark H. Holmes

Introduction to Scientific Computing and Data Analysis

Second Edition

Texts in Computational Science and Engineering, Volume 13

Series Editors
Timothy J. Barth, NASA Ames Research Center, National Aeronautics Space
Division, Moffett Field, CA, USA
Michael Griebel, Institut für Numerische Simulation, Universität Bonn, Bonn,
Germany
David E. Keyes, New York, NY, USA
Risto M. Nieminen, School of Science & Technology, Aalto University, Aalto,
Finland
Dirk Roose, Department of Computer Science, Katholieke Universiteit Leuven,
Leuven, Belgium
Tamar Schlick, Department of Chemistry, Courant Institute of Mathematical
Sciences, New York University, New York, NY, USA
This series contains graduate and undergraduate textbooks on topics described by
the term “computational science and engineering”. This includes theoretical aspects
of scientific computing such as mathematical modeling, optimization methods,
discretization techniques, multiscale approaches, fast solution algorithms, parallelization, and visualization methods as well as the application of these approaches
throughout the disciplines of biology, chemistry, physics, engineering, earth sciences,
and economics.
Mark H. Holmes

Introduction to Scientific
Computing and Data
Analysis
Second Edition
Mark H. Holmes
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
Troy, NY, USA

ISSN 1611-0994 ISSN 2197-179X (electronic)


Texts in Computational Science and Engineering
ISBN 978-3-031-22429-4 ISBN 978-3-031-22430-0 (eBook)
https://doi.org/10.1007/978-3-031-22430-0

Mathematics Subject Classification: 65-XX, 65Dxx, 65Fxx, 65Hxx, 65Lxx, 68T09

1st edition: © Springer International Publishing Switzerland 2016


2nd edition: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer
Nature Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The objective of this text is easy to state: it is to investigate ways to use a
computer to solve various mathematical problems. One of the challenges for those
learning this material is that it involves a nonlinear combination of mathematical
analysis and nitty-gritty computer programming. Texts vary considerably in how
they balance these two aspects of the subject. You can see this in the brief history
of the subject given in Figure 1 (which is an example of what is called an ngram
plot). According to this plot, the earlier books concentrated more on the analysis
(theory). In the early 1970s this changed and there was more of an emphasis on
methods (which generally means much less theory), and these continue to dominate
the area today. However, the 1980s saw the advent of scientific computing books,
which combine theory and programming, and you can see a subsequent decline in
the other two types of books when this occurred. This text falls within this latter
group.
There are two important threads running through the text. One concerns understanding the mathematical problem that is being solved. As an example, when using Newton's method to solve f(x) = 0, the usual statement is that it will work if you guess a starting value close to the solution. It is important to know how to determine good starting points and, perhaps even more importantly, how to determine whether the problem being solved even has a solution. Consequently, when deriving Newton's method, and others like it, an effort is made to explain how to fairly easily answer these questions.

[Figure 1: an ngram plot with curves labeled Numerical Analysis, Numerical Methods, and Scientific Computing; horizontal axis: Year (1950–2010); vertical axis: Percentage.]

Figure 1 Historical record according to Google. The values are the number of instances that the expression appeared in a published book in the respective year, expressed as a percentage for that year, times 10^5 [Michel et al., 2011].
The second theme is the importance in scientific computing of having a solid grasp
of the theory underlying the methods being used. A computer has the unfortunate
ability to produce answers even if the methods used to find the solution are completely
wrong. Consequently, it is essential to have an understanding of how the method
works, and how the error in the computation depends on the method being used.
Needless to say, it is also important to be able to code these methods, and in the
process be able to adapt them to the particular problem being solved. There is consid-
erable room for interpretation on what this means. To explain, in terms of computing
languages, the current favorites are MATLAB and Python. Using the commands they
provide, a text such as this one becomes more of a user’s manual, reducing the entire
book down to a few commands. For example, with MATLAB, this book (as well as
most others in this area) can be replaced with the following commands:
Chapter 1: eps
Chapter 2: fzero(@f,x0)
Chapter 3: A\b
Chapter 4: eig(A)
Chapter 5: polyfit(x,y,n)
Chapter 6: integral(@f,a,b)
Chapter 7: ode45(@f,tspan,y0)
Chapter 8: fminsearch(@fun,x0)
Chapter 9: svd(A)
Certainly this statement qualifies as hyperbole, and as an example, Chapters 4 and 5
should probably have two commands listed. The other extreme is to write all of the
methods from scratch, something that was expected of students in the early days of
computing. In the end, the level of coding depends on what the learning outcomes
are for the course, and the background and computing prerequisites required for the
course.
Many of the topics included are typical of what are found in an upper division
scientific computing course. There are also notable additions. This includes material
related to data analysis, as well as variational methods and derivative-free minimiza-
tion methods. Moreover, there are differences related to emphasis. An example here
concerns the preeminent role matrix factorizations play in numerical linear algebra,
and this is made evident in the development of the material.
The coverage of any particular topic is not exhaustive, but intended to introduce
the basic ideas. For this reason, numerous references are provided for those who
might be interested in further study, and many of these are from the current research
literature. To quantify this statement, a code was written that reads the tex.bbl file
containing the references for this text, and then uses MATLAB to plot the number as a
function of the year published. The result is Figure 2, and it shows that approximately half of the references were published in the last ten years.

[Figure 2: a plot of the number of references per year, 1950–2020, annotated with Median = 2006 and Mean = 2002.]

Figure 2 Number of references in this book, after 1950, as a function of the year they were published.

By the way, in terms of
data generation and plotting, Figure 1 was produced by writing a code which reads
the html source code for the ngram webpage and then uses MATLAB to produce the
plot.
The MATLAB codes used to produce almost every figure, and table with numerical
output, in this text are available from the author’s website as well as from Springer-
Link. In other words, the MATLAB codes for all of the methods considered, and the
examples used, are available. These can be used as a learning tool. This also goes to
the importance in computational-based research, and education, of providing open
source to guarantee the correctness and reproducibility of the work. Some interesting
comments on this can be found in Morin et al. [2012] and Peng [2011].
The prerequisites depend on which chapters are covered, but the typical two-year
lower division mathematics program (consisting of calculus, matrix algebra, and
differential equations) should be sufficient for the entire text. However, one topic
plays an oversized role in this subject, and this is Taylor's theorem. This also tends to be the topic that students had the most trouble with in calculus. For this reason,
an appendix is included that reviews some of the more pertinent aspects of Taylor’s
theorem. It should also be pointed out that there are numerous theorems in the text,
as well as an outline of the proof for many of them. These should be read with care
because they contain information that is useful when testing the code that implements
the respective method (i.e., they provide one of the essential ways we will have to
make sure the computed results are actually correct).
I would like to thank the reviewers of an early draft of the book, who made
several very constructive suggestions to improve the text. Also, as usual, I would like
to thank those who developed and have maintained TeXShop, a free and very good
TeX previewer.

Troy, NY, USA Mark H. Holmes


January 2016
Preface to Second Edition

One of the principal reasons for this edition is to include material necessary for an
upper division course in computational linear algebra. So, least squares is covered
more extensively, along with new material covering Householder QR, sparse matrix
methods, preconditioning, and Markov chains. The computational implications of the
Gershgorin, Perron-Frobenius, and Eckart-Young theorems have also been expanded.
It is assumed in the development that a previous course in linear algebra has been
taken, but as a refresher some of the more pertinent facts from such a course are
given in an appendix.
Other significant changes include a reorganization and expansion of the exer-
cises. Also, the material on cubic splines, particularly related to data analysis, has
been expanded (e.g., Section 6.4), and the presentation of Gaussian quadrature has
been modified. There is also a new section providing an introduction to data-based
modeling and dynamic modes.
As with the first edition, the material is developed and presented, independent
of the language used for computing. So, there are no computer codes in the text.
Rather, the procedures are written in a generic format, such as Newton’s method on
page 41. There are, however, a few exercises where example MATLAB commands
are given indicating how the problem can be done (e.g., the commands on page 198
for inputting, and displaying, a grayscale image). All of the codes used in the compu-
tational examples in the text are available from the author’s GitHub repository
(https://github.com/HolmesRPI/IntroSciComp2nd). Other files for the book, such
as an errata page and the answers for selected exercises, are also there.

Troy, NY, USA Mark H. Holmes


August 2022

Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Preface to Second Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction to Scientific Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Unexpected Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Floating-Point Number System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Normal Floats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Non-Normal Floats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.4 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.5 Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.6 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Arbitrary-Precision Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Explaining, and Possibly Fixing, the Unexpected Results . . . . . . . 12
1.5 Error and Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5.1 Over-computing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Multicore Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Solving A Nonlinear Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1 The Problem to Solve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.1 Convergence Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.1 Picking x0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.2 Order of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.3 Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.5 Convergence Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.1 Convergence Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


2.5 Other Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


2.5.1 Is Newton’s Method Really Newton’s Method? . . . . . . . 54
3 Matrix Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 Finding L and U . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.1 What Matrices Have a LU Factorization? . . . . . . . . . . . . . 77
3.3 Implementing a LU Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3.1 Pivoting Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.2 LU and Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . 80
3.4 LU Method: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.1 Flops and Multicore Processors . . . . . . . . . . . . . . . . . . . . . 87
3.5 Vector and Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5.1 Matrix Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.6 Error and Residual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.6.1 Correct Significant Digits . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.6.2 The Condition Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.6.3 A Heuristic for the Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.7 Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.7.1 Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.8 Tri-Diagonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.9 Sparse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.10 Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.11 Some Additional Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.11.1 Yogi Berra and Perturbation Theory . . . . . . . . . . . . . . . . . 111
3.11.2 Fixing an Ill-Conditioned Matrix . . . . . . . . . . . . . . . . . . . . 112
3.11.3 Insightful Observations about the Condition
Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.11.4 Faster than LU? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.11.5 Historical Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4 Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.1 Review of Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.2 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.2.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.3 Extensions of the Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.3.1 Inverse Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.3.2 Rayleigh Quotient Iteration . . . . . . . . . . . . . . . . . . . . . . . . 145
4.4 Calculating Multiple Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.4.1 Orthogonal Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.4.2 Regular and Modified Gram-Schmidt . . . . . . . . . . . . . . . . 151
4.4.3 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.4.4 The QR Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.4.5 QR Method versus Orthogonal Iteration . . . . . . . . . . . . . . 157
4.4.6 Are the Computed Values Correct? . . . . . . . . . . . . . . . . . . 158

4.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161


4.5.1 Natural Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.5.2 Graphs and Adjacency Matrices . . . . . . . . . . . . . . . . . . . . 163
4.5.3 Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.6 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.6.1 Derivation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.6.2 Interpretations of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.6.3 Summary of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.6.4 Consequences of a SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.6.5 Computing a SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.6.6 Low-Rank Approximations . . . . . . . . . . . . . . . . . . . . . . . . 179
4.6.7 Application: Image Compression . . . . . . . . . . . . . . . . . . . . 180
5 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.1 Information from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.2.1 Direct Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.2.2 Lagrange Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.2.3 Runge’s Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
5.3 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
5.4 Piecewise Cubic Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
5.4.1 Cubic B-splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
5.5 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
5.5.1 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . 226
5.5.2 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . 228
5.5.3 Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
5.5.4 Chebyshev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
5.5.5 Chebyshev versus Cubic Splines . . . . . . . . . . . . . . . . . . . . 237
5.5.6 Other Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.6 Questions and Additional Comments . . . . . . . . . . . . . . . . . . . . . . . . 239
6 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.1 The Definition from Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
6.1.1 Midpoint Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.2 Methods Based on Polynomial Interpolation . . . . . . . . . . . . . . . . . 261
6.2.1 Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.2.2 Simpson’s Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.2.3 Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.2.4 Other Interpolation Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.3 Romberg Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.3.1 Computing Using Romberg . . . . . . . . . . . . . . . . . . . . . . . . 273
6.4 Methods for Discrete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
6.5 Methods Based on Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
6.5.1 1-Point Gaussian Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
6.5.2 2-Point Gaussian Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

6.5.3 m-Point Gaussian Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281


6.5.4 Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
6.5.5 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
6.6 Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
6.7 Other Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
6.8 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
7 Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
7.1 Solution of an IVP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
7.2 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
7.2.1 Higher Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.2.2 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.3 IVP Methods Using Numerical Differentiation . . . . . . . . . . . . . . . 311
7.3.1 The Five Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.3.2 Additional Difference Methods . . . . . . . . . . . . . . . . . . . . . 318
7.3.3 Error and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
7.3.4 Computing the Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
7.4 IVP Methods Using Numerical Integration . . . . . . . . . . . . . . . . . . . 324
7.5 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
7.5.1 RK2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
7.5.2 RK4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
7.5.3 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
7.5.4 RK-n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
7.6 Solving Systems of IVPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
7.6.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
7.6.2 Simple Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
7.6.3 Component Approach and Symplectic Methods . . . . . . . 339
7.7 Some Additional Questions and Ideas . . . . . . . . . . . . . . . . . . . . . . . 342
7.7.1 RK4 Order Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
8 Optimization: Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
8.1 Model Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
8.2 Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
8.2.1 Vertical Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
8.3 Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
8.3.1 Two Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
8.3.2 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
8.3.3 QR Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
8.3.4 Moore-Penrose Pseudo-Inverse . . . . . . . . . . . . . . . . . . . . . 370
8.3.5 Overdetermined System . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
8.3.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
8.3.7 Over-fitting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
8.4 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
8.4.1 Finding Q and R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
8.4.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
8.4.3 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384

8.5 Other Error Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384


8.6 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
8.7 Fitting IVPs to Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
8.7.1 Logistic Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
8.7.2 FitzHugh-Nagumo Equations . . . . . . . . . . . . . . . . . . . . . . . 394
8.7.3 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
9 Optimization: Descent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
9.2 Descent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
9.2.1 Descent Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.3 Solving Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
9.3.1 Basic Descent Algorithm for Ax = b . . . . . . . . . . . . . . . . 416
9.3.2 Method of Steepest Descents for Ax = b . . . . . . . . . . . . . 418
9.3.3 Conjugate Gradient Method for Ax = b . . . . . . . . . . . . . . 420
9.3.4 Large Sparse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
9.3.5 Rate of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
9.3.6 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
9.3.7 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
9.4 General Nonlinear Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
9.4.1 Descent Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
9.4.2 Line Search Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
9.4.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
9.5 Levenberg-Marquardt Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
9.5.1 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
9.6 Minimization Without Differentiation . . . . . . . . . . . . . . . . . . . . . . . 449
9.7 Variational Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
9.7.1 Example: Minimum Potential Energy . . . . . . . . . . . . . . . . 453
9.7.2 Example: Brachistochrone Problem . . . . . . . . . . . . . . . . . 455
9.7.3 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
10 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
10.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
10.2.1 Example: Word Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
10.2.2 Derivation of Principal Component Approximation . . . . 480
10.2.3 Summary of the PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
10.2.4 Scaling Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.2.5 Geometry and Data Approximation . . . . . . . . . . . . . . . . . . 488
10.2.6 Application: Crime Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
10.2.7 Practical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
10.2.8 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
10.3 Independent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 495
10.3.1 Derivation of the Method . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10.3.2 Reduced Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
10.3.3 Contrast Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500

10.3.4 Summary of ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505


10.3.5 Application: Image Steganography . . . . . . . . . . . . . . . . . . 506
10.4 Data Based Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
10.4.1 Dynamic Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
10.4.2 Example: COVID-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
10.4.3 Propagation Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
10.4.4 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518

A Taylor’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529


A.1 Useful Examples for x Near Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
A.2 Order Symbol and Truncation Error . . . . . . . . . . . . . . . . . . . . . . . . . 532
B Vector and Matrix Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
Chapter 1
Introduction to Scientific Computing

This chapter provides a brief introduction to the floating-point number system used in most scientific and engineering applications. A few examples are given in the next section illustrating some of the challenges of using finite precision arithmetic, but it is worth quoting Donald Knuth to get things started.
If you are unfamiliar with him, he was instrumental in the development of
the analysis of algorithms, and is the creator of TeX. Anyway, here are the
relevant quotes [Knuth, 1997]:
“We don’t know how much of the computer’s answers to believe. Novice com-
puter users solve this problem by implicitly trusting in the computer as an
infallible authority; they tend to believe that all digits of a printed answer
are significant. Disillusioned computer users have just the opposite approach;
they are constantly afraid that their answers are almost meaningless”.
“every well-rounded programmer ought to have a knowledge of what goes on
during the elementary steps of floating-point arithmetic. This subject is not
at all as trivial as most people think, and it involves a surprising amount of
interesting information”.
One of the objectives in what follows is to help you from becoming disil-
lusioned by identifying where problems can occur, and to also provide an
appreciation for the difficulty of floating-point computation.

1.1 Unexpected Results

What follows are examples where the computed results are not what is
expected. The reason for the problem is the same for each example. Namely,
the finite precision arithmetic used by the computer generates errors that are
significant enough that they affect the final result. The calculations to follow
use double-precision arithmetic, and this is explained in Section 1.2.

Example 1.1

Consider adding a series of numbers from largest to smallest,

$$S(n) = 1 + \frac{1}{2} + \cdots + \frac{1}{n-1} + \frac{1}{n}, \qquad (1.1)$$

and the same series added from smallest to largest,

$$s(n) = \frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{2} + 1. \qquad (1.2)$$
If one calculates s(n) and S(n), and then calculates the difference S(n) −
s(n), the values given in Table 1.1 are obtained. It is evident that for larger
values of n, the two sums differ. In other words, using a computer, addition
does not necessarily satisfy one of the fundamental properties of arithmetic,
which is that addition is associative. So, given that both s(n) and S(n) are
likely incorrect, a question relevant to this text is, is it possible to determine
which one is closer to the exact result? ∎
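To make this experiment concrete, here is a minimal MATLAB sketch (my illustration, not the book's code; the book's own codes are available from the author's site) that reproduces the comparison behind Table 1.1 for one value of n:

    % Harmonic partial sum added large-to-small (S) and small-to-large (s).
    n = 1e6;
    S = 0;
    for k = 1:n          % largest terms first
        S = S + 1/k;
    end
    s = 0;
    for k = n:-1:1       % smallest terms first
        s = s + 1/k;
    end
    S - s                % nonzero: floating-point addition is not associative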

Example 1.2

Consider the function

$$y = (x - 1)^8. \qquad (1.3)$$

Expanding this you get the polynomial

$$y = x^8 - 8x^7 + 28x^6 - 56x^5 + 70x^4 - 56x^3 + 28x^2 - 8x + 1. \qquad (1.4)$$

The expressions in (1.4) and (1.3) are equal and, given a value of x, either
one can be used to evaluate the function. However, when evaluating
them with a computer they do not necessarily produce the same values and
this is shown in Figure 1.1. In the upper graph, where 0.9 ≤ x ≤ 1.1, they
do appear to agree. However, looking at the plots over the somewhat smaller

interval 0.98 ≤ x ≤ 1.02, which is given in the lower graph, the two expressions definitely do not agree. The situation is even worse than the fact that the graphs differ. First, according to (1.3), y is never negative but according to the computer (1.4) violates this condition. Second, according to (1.3), y is symmetric about x = 1 but the computer claims (1.4) is not. ∎

        n           S(n) − s(n)
       10            0
      100           −8.88e−16
    1,000            2.66e−15
   10,000           −3.73e−14
  100,000           −7.28e−14
1,000,000           −7.83e−13

Table 1.1 Difference in partial sums for the harmonic series considered in Example 1.1. Note that 8.9e−16 = 8.9 × 10^{−16}.

[Figure 1.1: a two-panel plot of (1.3) and (1.4) near x = 1, with curves labeled Using (1.3) and Using (1.4); upper panel over 0.9 ≤ x ≤ 1.1, lower panel over 0.98 ≤ x ≤ 1.02; vertical axis y, horizontal axis x.]

Figure 1.1 Plots of (1.3) and (1.4). Upper graph: the interval is 0.9 ≤ x ≤ 1.1, and the two functions are so close that the curves are indistinguishable. Lower graph: the interval is 0.98 ≤ x ≤ 1.02, and now they are not so close.
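A quick way to see this yourself is the following MATLAB sketch (my illustration), which evaluates both forms near x = 1:

    % Evaluate (x-1)^8 directly and via its expanded form near x = 1.
    x = linspace(0.98, 1.02, 1001);
    y1 = (x - 1).^8;
    y2 = x.^8 - 8*x.^7 + 28*x.^6 - 56*x.^5 + 70*x.^4 ...
         - 56*x.^3 + 28*x.^2 - 8*x + 1;
    plot(x, y1, x, y2)
    min(y2)              % negative, even though (x-1)^8 is never negative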
Example 1.3

As a third example, consider the function

$$y = \frac{\sqrt{16 + k} - 4}{k}. \qquad (1.5)$$

This is plotted in Figure 1.2. According to l'Hospital's rule,

$$\lim_{k \to 0} y = \frac{1}{8}.$$

The computer agrees with this result for k down to about 10^{−12}, but for smaller values of k there is a problem. First, the function starts to oscillate and then, after k drops a little below 10^{−14}, the computer decides that y = 0. It is also worth pointing out that ratios as in (1.5) arise in formulas for the numerical evaluation of derivatives, and this is the subject of Section 7.2. In particular, (1.5) comes from an approximation used to evaluate f′(0), where
f(x) = √(16 + x). Finally, it needs to be mentioned that the curve shown in the figure contains a plotting error, and this is corrected in Figure 1.8. ∎

[Figure 1.2: plot of (1.5); vertical axis y with ticks at 1/16, 1/8, and 3/16; horizontal axis k from 10^{−18} to 10^{−2} on a log scale.]

Figure 1.2 Plot of (1.5).
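To see the cancellation, and one standard way around it (my illustration; not necessarily the fix the book develops later), note that rationalizing gives the algebraically identical form y = 1/(√(16 + k) + 4), which avoids subtracting nearly equal numbers:

    % Evaluate (sqrt(16+k)-4)/k directly and in rationalized form.
    k = 10.^(-18:0.5:-2);
    y_direct = (sqrt(16 + k) - 4) ./ k;    % suffers from cancellation
    y_safe   = 1 ./ (sqrt(16 + k) + 4);    % algebraically equal to (1.5)
    semilogx(k, y_direct, k, y_safe)
    % y_direct collapses to 0 below about k = 1e-14; y_safe stays near 1/8.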

Example 1.4

The final example concerns evaluating trigonometric functions, specifically sin(x). As you should recall, if n is an integer, then $\sin(\frac{\pi}{4} + 2\pi n) = \frac{1}{2}\sqrt{2}$. Using double precision one finds that if $n = 2^{52} + 2$ then $\sin(\frac{\pi}{4} + 2\pi n) \approx -1$, while if $n = 2^{52} + 3$, then $\sin(\frac{\pi}{4} + 2\pi n) \approx 0$. Clearly, neither is close to the correct value. To investigate this further, the relative error

$$\left|\frac{\sin(\frac{\pi}{4} + 2\pi n) - \frac{1}{2}\sqrt{2}}{\frac{1}{2}\sqrt{2}}\right| \qquad (1.6)$$

is plotted in Figure 1.3 as a function of n. It shows that overall the error increases with n, eventually getting close to one. It also means, for example, that if you want to evaluate sin(x) correct to, say, 12 digits then you will need to require |x| ≤ 15,000. ∎

[Figure 1.3: log–log plot of the relative error (1.6) versus n, for n from 10^0 to 10^15; the error rises from about 10^{−16} toward 10^0.]

Figure 1.3 The relative error (1.6) when computing $y = \sin(\frac{\pi}{4} + 2\pi n)$.
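A two-line check of the claim above, as a MATLAB sketch (my illustration):

    % Relative error of sin(pi/4 + 2*pi*n) in double precision.
    n = 2^52 + 2;
    abs(sin(pi/4 + 2*pi*n) - sqrt(2)/2) / (sqrt(2)/2)
    % The result is O(1): the argument is so large that the spacing of the
    % floats exceeds the period 2*pi, so the computed value is meaningless.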

1.2 Floating-Point Number System

The problems illustrated in the above examples are minor compared to the
difficulties that arose in the early days of computing. It was not unusual to
get irreproducible results, in the sense that two different computers would
calculate different answers to the same formula. To help prevent this, a set
of standards was established that computer manufacturers were expected to
comply with. The one of interest here concerns the floating-point system, and
it is based on the IEEE-754 standard set in 1985. It consists of normal floats
(described below), along with zero, ± Inf, and NaN.

1.2.1 Normal Floats

Normal (or normalized) floating-point numbers are real numbers that the
computer has the exact value for. The form they are written in is determined
by the binary nature of computer systems. Specifically, they have the form

$$x_f = (\pm)\, m \times 2^E, \qquad (1.7)$$

where

$$m = 1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}}. \qquad (1.8)$$
In this representation m, E, and the b_i's have the following properties:
• m: This is the mantissa. The b_i's are either zero or one, and for this reason 1 ≤ m < 2. In fact, the largest value of the mantissa is m = 2 − ε, where ε = 1/2^{N−1} (see Exercise 1.32). The number ε is known as machine epsilon and it plays a critical role in determining the spacing of floating-point numbers.
• E: This is the exponent and it is an integer that satisfies E_m ≤ E ≤ E_M. For double precision, −1022 ≤ E ≤ 1023. In general, according to the IEEE requirements, E_m = −E_M + 1 and E_M = 2^{M−1} − 1, where M is a positive integer.
As defined, a floating-point system requires specification of the two integers
N and M , and some of the standard choices are listed in Table 1.2. The one of
particular importance for scientific computing is double precision, for which
N = 53 and M = 11.
Other formats are used, and they are usually connected to specific appli-
cations. An example is bfloat16, which was developed by the Google Brain
project for their machine learning algorithms. These numbers have the
same storage requirements as half-precision floating-point numbers but have
N = M = 8. This means the exponents for bfloat16 cover the same range as
single-precision floating-point numbers but with fewer mantissa values.

Precision   N     M     E_m      E_M      x_m             x_M           ε             Decimal Digits
Half        11    5     −14      15       6 × 10^{−5}     7 × 10^4      10^{−3}       3
Single      24    8     −126     127      10^{−38}        3 × 10^38     10^{−7}       7
Double      53    11    −1022    1023     2 × 10^{−308}   2 × 10^308    2 × 10^{−16}  16
Quad        113   15    −16382   16383    3 × 10^{−4932}  10^4932       10^{−34}      34

Table 1.2 Values for various binary normalized floating-point systems specified by IEEE-754 and its extensions. The values for the smallest positive x_m, largest positive x_M, and machine epsilon ε are given to one significant digit. Similarly, the number of decimal digits is approximate and is determined from the expression N log_{10} 2.

Example 1.5

1. Is x = 3 a floating-point number?
Answer: Yes, because 3 = 2 + 1 = (1 + 1/2) × 2. So, E = 1, b_1 = 1, and the other b_i's are zero.

2. Is x = −10 a floating-point number?
Answer: Yes, because −10 = −8 − 2 = −(1 + 1/2^2) × 2^3. In this case, E = 3, b_2 = 1, and the other b_i's are zero.

3. Examples of numbers that are not floating-point numbers are 1/10, √2, and π. ∎
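One way to check such claims, as a MATLAB sketch (my illustration): the built-in log2, called with two outputs, factors a double as x = f × 2^e with 0.5 ≤ |f| < 1, which is easily converted to the normalized form (1.7):

    [f, e] = log2(3)     % f = 0.75,   e = 2: so 3 = 1.5 * 2^1 in the form (1.7)
    [f, e] = log2(-10)   % f = -0.625, e = 4: so -10 = -(1.25) * 2^3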

Example: Floats in 1 ≤ x < 2

In this case, E = 0 and (1.7) takes the form

$$x_f = 1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}}. \qquad (1.9)$$

The endpoint x = 1 is obtained when all of the b_i's are zero. The next largest float occurs when b_{N−1} = 1 and all of the other b_i's are zero. In other words, the next largest float is x_f = 1 + ε, where ε = 1/2^{N−1}. As stated earlier, the number ε is known as machine epsilon. In the case of double precision,

$$\varepsilon = 2^{-52} \approx 2.2 \times 10^{-16}. \qquad (1.10)$$

This number should be memorized as it will appear frequently throughout this textbook.

The next largest float occurs when b_{N−2} = 1 and all of the other b_i's are zero, which gives us x_f = 1 + 1/2^{N−2} = 1 + 2ε. This pattern continues, and one ends up concluding that the floats in 1 ≤ x < 2 are equally spaced, a distance ε apart. In fact, it is possible to show that the largest value of (1.9) is a distance ε from 2 (see Exercise 1.32(d)). Therefore, the floating-point numbers in 1 ≤ x ≤ 2 are equally spaced a distance ε apart. ∎
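In MATLAB the value in (1.10) is available as the built-in eps, so this spacing is easy to verify (a sketch, my illustration):

    eps              % 2^(-52), about 2.2e-16
    (1 + eps) - 1    % returns eps: 1 + eps is the float immediately after 1
    2 - eps          % the largest float below 2, a distance eps from 2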

Example: Floats in 2^n ≤ x ≤ 2^{n+1}

The floats in the interval 2^n ≤ x < 2^{n+1} have the form

$$x_f = \left(1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}}\right) \times 2^n.$$

So, looking at (1.9) one concludes that the floats in this interval are those in 1 ≤ x < 2, but just multiplied by 2^n. This means the floats in the closed interval 2^n ≤ x ≤ 2^{n+1} are evenly spaced but now the spacing is 2^n ε, as illustrated in Figure 1.4. ∎
A conclusion coming from Figure 1.4 is that for large values of n the distance between the floats can be huge. For example, take n = 100. For double precision, between 2^100 ≈ 10^30 and 2^101 ≈ 2 × 10^30 they are a distance of ε × 2^100 ≈ 2.8 × 10^14 apart. The fact that they are so far apart can cause problems, and a particular example of this is considered in Section 1.2.6.
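MATLAB's eps accepts an argument: eps(x) returns the gap between |x| and the next larger float, so these spacings can be checked directly (my illustration):

    eps(1)       % 2^(-52): spacing of the floats in [1, 2]
    eps(0.75)    % 2^(-53): the spacing is eps/2 in [1/2, 1]
    eps(2^100)   % 2^48, about 2.8e14: spacing in [2^100, 2^101]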

Integers

The smallest positive integer that is not included in the floating-point system
is 2^N + 1. In other words, all nonzero integers satisfying |x| ≤ 2^N are normal-
ized floating-point numbers. An important question is whether the computer
recognizes them as integers. This is needed in programming because integers
are used as counters in a for loop as well as the indices of a vector or matrix.

Figure 1.4 The floating-point numbers in the interval 2^n ≤ x ≤ 2^{n+1} are equally spaced, being a distance 2^n ε apart. So, they are a distance ε/2 apart for 2^{−1} ≤ x ≤ 1, ε apart for 1 ≤ x ≤ 2, and 2ε apart for 2 ≤ x ≤ 2^2.

Figure 1.5 The floating-point numbers just to the left and right of x = 1. The red dashed
lines are located halfway between the floats, and any real number between them is rounded
to the floating-point number in that subinterval.

Most computer systems have a way to treat integers as integers, where addi-
tion and subtraction are done exactly as long as the integers are not too big.
For example, for languages such as C and FORTRAN you use a type decla-
ration at the beginning of the program to identify a variable as an integer.
Languages such as MATLAB, Python, and Julia use dynamic typing, which
means they determine the variable type on the fly. This is more convenient,
but it does add to the computing time.
One last comment concerns the biggest and smallest positive normal floats. In particular, for double precision, x_M = (2 − ε) × 2^1023 ≈ 2 × 10^308 is the largest positive normal float, and x_m = 2^{−1022} ≈ 2 × 10^{−308} is the smallest positive normal float.

1.2.2 Rounding

Assuming x is a real number satisfying x_m ≤ |x| ≤ x_M, then in the computer this is rounded to a normal float x_f, and the relative error satisfies

$$\frac{|x - x_f|}{|x|} \le \frac{\varepsilon}{2}.$$

To do this it uses a “round-to-nearest” rule, which means x_f is the closest float to x. To illustrate, the floats just to the left of x = 1 are a distance ε/2 apart, and those just to the right are a distance ε apart. This is shown in Figure 1.5. So, any number in Region II, which corresponds to 1 − ε/4 < x < 1 + ε/2, is rounded to x_f = 1. Similarly, any number in Region III, which corresponds to 1 + ε/2 < x < 1 + 3ε/2, is rounded to x_f = 1 + ε. In the case of a tie, a “round-to-even” rule is used, where the nearest float with an even least significant digit is used.

Example 1.6

Evaluate (1 + ε/4) − 1 using floating-point arithmetic.

Answer: Since 1 < 1 + ε/4 < 1 + ε/2, the number 1 + ε/4 is rounded to 1 (see Figure 1.5). So, the answer is zero. ∎
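This is easy to try, as a MATLAB sketch (my illustration):

    (1 + eps/4) - 1      % 0: 1 + eps/4 lies in Region II and rounds to 1
    (1 + 3*eps/4) - 1    % eps: 1 + 3*eps/4 lies in Region III, rounds to 1 + eps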

1.2.3 Non-Normal Floats

The floating-point number system is required to provide a floating-point number to any algebraic expression of the form x_f ± y_f, x_f ∗ y_f, or x_f/y_f. So, in
addition to the normal floats, the following are also included.

Zero

It is not possible to represent zero using (1.7), and so it must be included as a special case. Correspondingly, there is an interval −x_0 < x < x_0 where any number in this interval is rounded to x_f = 0. If subnormals are not used (these are explained below), then x_0 = x_m/2, and if they are used, then x_0 = x_tiny/2.

Inf and NaN

Positive numbers larger than x_M are either rounded to x_M, if close enough, or assigned the value Inf. The latter is a situation known as positive overflow. A similar situation occurs for very negative numbers, something called negative overflow, and it produces the value −Inf.
For those situations when the calculated value is ill-defined, such as 0/0,
the floating-point system assigns it a value of NaN (Not a Number).
Because ±Inf and NaN are floating-point numbers you can use them in
mathematical expressions. For example, the computer will be able to evaluate 4 ∗ Inf, sin(Inf), e^Inf, Inf ∗ NaN, etc. You can probably guess what they
evaluate to, but if not then try them using MATLAB, Julia, or Python.
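For instance (a MATLAB sketch, my illustration):

    1e309       % Inf: positive overflow, since this exceeds x_M (about 2e308)
    4*Inf       % Inf
    exp(-Inf)   % 0
    sin(Inf)    % NaN: the value is ill-defined
    Inf*NaN     % NaN: NaN propagates through arithmetic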

Subnormals

These are a relatively small set of numbers from x_tiny = 2^{E_m − N + 1} up to x_m that are included to help alleviate the problem of rounding a nonzero number to zero (producing what is called gradual underflow). For example, using double precision, x_tiny = 2^{−1074} ≈ 5 × 10^{−324}.
Because they are often implemented in software, computations with sub-
normals can take much longer than with normal floats and this causes prob-
lems for timing in multicore processors. There is also a significant loss of
precision with subnormals. For these, and other, reasons it is not uncommon
in scientific computing to use a flush-to-zero option to prevent the system
from using subnormals.

1.2.4 Significance

Working with numbers that have a fixed number of digits has the poten-
tial to cause complications that you should be aware of. For example, even
though floating-point addition and multiplication are commutative, they are
not necessarily associative or distributive. Example 1.1 is an illustration of
the non-associativity of addition.
Another complication involves significance. To perform addition, a computer must adjust the exponents before adding. For example, suppose the mantissa is fixed to have 4 digits. So, letting → denote the exponent adjusting step,

1.234 × 10^4 + 1.234 × 10^2 → 1.234 × 10^4 + 0.012 × 10^4 = 1.246 × 10^4.

This also means that

1.234 × 10^4 + 1.234 × 10^{−6} → 1.234 × 10^4 + 0.000 × 10^4 = 1.234 × 10^4.

In this example, 1.234 × 10^{−6} is insignificant relative to 1.234 × 10^4 and contributes nothing to the sum. This is why, using double precision, a computer will claim that (10^20 + 1) − 10^20 = 0.
For subtraction the potential complication is the loss of significance, which can arise when the two numbers are very close. For example, again assuming the mantissa is fixed to have 4 digits, (1.2344 − 1.2342)/0.002 → (1.234 − 1.234)/0.002 = 0, whereas the exact value is 0.1.
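Both effects are easy to reproduce in double precision (a MATLAB sketch, my illustration):

    (1e20 + 1) - 1e20    % 0: the 1 is insignificant relative to 1e20
    x = 1 + 1e-13;  y = 1 + 2e-13;
    (y - x) / 1e-13      % not exactly 1: subtracting the nearly equal x and y
                         % leaves only a few significant digits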
Awareness of the complications mentioned above is important, but it is outside the purview of this text to explore them in depth. Those interested in more detail related to floating-point arithmetic should consult Goldberg [1991], Overton [2001], or Muller et al. [2010].

1.2.5 Flops

All numerical algorithms are judged by their accuracy and how long it takes
to compute the answer. As one estimate of the time, an old favorite is to
determine the flop count, where flop is an acronym for floating-point opera-
tion. To use this, it is necessary to have an appreciation of how long various
operations take. In principle these are easy to determine. As an example, to
determine the computing time for an addition one just writes a code where
this is done N times, where N is a large integer, and then divides the total
computing time by N . The outcomes of such tests are shown in Table 1.3,
where the times are scaled by how long it takes to do an addition. Note that
the actual times here are very short, with an addition taking approximately
2 nsec for MATLAB, FORTRAN, and C/C++, and 10 nsec using Python.

Operation                      MATLAB   Python   FORTRAN   C/C++
Addition or Subtraction        1        1        1         1
Multiplication                 1        1        1         1
Division                       2        1        2         2
√x                             6        3        3         12
sin x                          7        3        5         8
ln x                           14       8        7         15
e^x                            12       3        8         8
x^n, for n = 5, 10, or 20      35       5        3         24

Table 1.3 Approximate relative computing times for various floating-point operations in MATLAB (R2022a), Python (v3.11), FORTRAN (gfortran v12.2), and C/C++ (clang v14.0). Note that each column is normalized by the time it takes that language to do an addition.
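The measurement described above takes only a few lines; a rough MATLAB sketch (my illustration; a careful benchmark would need warm-up runs and repetition):

    % Estimate the time per floating-point addition.
    N = 1e7;  x = 1.0;
    tic
    for i = 1:N
        x = x + 1.0;     % the operation being timed
    end
    toc / N              % seconds per addition (includes loop overhead)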

Because of this, even though x = 3/2 might take twice as long to compute as x = 0.5 ∗ 3, it's really not necessary to worry about this (at least in the problems considered in this text).
One of the principal reasons why Python and MATLAB differ in execution
times from FORTRAN, even though the same processor is used, involves
branching. They, like Julia and R, use dynamic typing. This means they
check on the type of numbers being used and then branch to the appropriate
algorithm. This way they are able to carry out integer arithmetic exactly. In
comparison, FORTRAN, like C and C++, is statically typed. This means
the typing is done at the start of the program, and so there is no slowdown
when the commands are issued. This is one of the reasons most large-scale
scientific computing codes are written in a statically typed language.

1.2.6 Functions

Any computer system designed for scientific computing has routines to evaluate well-known or often used functions. This includes elementary functions like √x; transcendental functions like sin(x), e^x, and ln(x); and special functions like erf(x) and J_ν(x). To discuss how these fit into a floating-point system, these will be written in the generic form y = f(x). The ideal goal is that, letting y_f denote the computed value and assuming that x_m ≤ |y| ≤ x_M,

$$\frac{|y - y_f|}{|y|} \le \frac{\varepsilon}{2}.$$

Whether or not this happens depends on the function and the value of x.

Oscillatory functions are some of the more difficult to evaluate accurately,
and an illustration of this is given in Figure 1.3. What this shows is that if
you want an error commensurate with the accuracy associated with double
precision you should probably restrict the domain to |x| < 10⁴. In this case
the relative error is no more than about 10⁻¹². However, if you are willing to
have the value correct to no more than 4 significant digits then the domain
expands to about |x| < 10¹². One of the reasons for the difficulty evaluating
this function is that the distance between the floating-point numbers increases
with x and eventually gets much larger than the period 2π. When this occurs
the computed value is meaningless. Fortunately, the situation for monotonic
functions, such as exp(x), √x, and ln(x), is much better and we can expect
the usable domains to be much larger. Those who want to investigate
some of the challenges related to accurate function evaluation should consult
Muller [2016].
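One way to see the source of the difficulty is to compare the spacing of the
floating-point numbers near x with the period 2π. A minimal sketch in
Python, using math.ulp (available in Python 3.9 and later); the test values
are arbitrary:

    import math

    for x in [1.0e4, 1.0e12, 1.0e17]:
        gap = math.ulp(x)                # distance from x to the next larger float
        print(x, gap, gap > 2*math.pi)   # once the gap exceeds 2*pi, sin(x) is meaningless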

1.3 Arbitrary-Precision Arithmetic

Some applications, such as cryptography, require exact manipulation of
extremely large integers. Because of their length, these integers are not rep-
resentable using double, or even quadruple, precision. This has given rise to
the idea of arbitrary-precision arithmetic, where the limitation is determined
by the available memory for the computer. The price paid for this is that the
computations are slower, with the computing time increasing fairly quickly
as the size of the integers is increased.
As an example of the type of problem arbitrary-precision arithmetic is used
for, there is the Great Internet Mersenne Prime Search (GIMPS). A Mersenne
prime has the form 2ⁿ − 1, and considerable computing resources have been
invested into finding them. The largest one currently known, which took 39
days to compute, has n = 82,589,933, which results in a prime number with
24,862,048 digits [GIMPS, 2022]. Just printing this number, with 3,400 digits
per page, would take more than twelve times the pages in this text.
There are multiple computational challenges finding large prime numbers.
Those interested in the computational, and theoretical, underpinnings of com-
puting primes should consult Crandall and Pomerance [2010].
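As a small illustration, Python's built-in integers are already arbitrary
precision, so exact arithmetic with a (much smaller) Mersenne prime is
immediate; the exponent 521 below is one of the known Mersenne prime
exponents, chosen only so the example runs quickly.

    p = 2**521 - 1         # a Mersenne prime with 157 digits
    print(len(str(p)))     # -> 157
    print(p % 1000003)     # exact remainder; no rounding occurs anywhere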

1.4 Explaining, and Possibly Fixing, the Unexpected Results

The problem identified in Example 1.4 was discussed in Section 1.2.6. What
follows is a discussion related to the other examples that were presented in
Section 1.1.
1.4 Explaining, and Possibly Fixing, the Unexpected Results 13

Example 1.1 (cont’d)

The differences in the two sums are not unexpected when using double-
precision arithmetic. Also, the order of the error is consistent with the accu-
racy obtained for double precision. The question was asked about which sum
might produce the more accurate result. One can argue that it is better to
add from small to big. The reason being that if one starts with the larger
terms, and the sum gets big enough, then the smaller terms are less able
to have an effect on the answer. To check on this, it is necessary to know
the exact value, or at least have an accurate approximate value. This can be
found using what is known as the Euler-Maclaurin formula, from which one
can show that for larger values of n,
        ∑_{k=1}^{n} 1/k = ln(n) + γ + 1/(2n) − 1/(12n²) + 1/(120n⁴) − 1/(252n⁶) + · · · ,

where γ = 0.5772 · · · is Euler's constant. To investigate the accuracy of the
two sums, the errors are shown in Figure 1.6. For smaller values of n the
errors are fairly small, and there is no obvious preference between s and S.
For larger values of n, however, there is a preference, as s(n) generally serves
as a more accurate approximation than S(n). It is also seen that there is
a slow overall increase in the error for both, but this is not unusual when
such a large number of floating-point calculations are involved (see Exercise
1.25).
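The experiment can be sketched as follows in Python, using the Euler-
Maclaurin value as the reference; the value of n is arbitrary, and which order
corresponds to S and which to s follows the discussion above.

    import math

    n = 10**7
    gamma = 0.57721566490153286
    exact = math.log(n) + gamma + 1/(2*n) - 1/(12*n**2) + 1/(120*n**4)

    S = sum(1.0/k for k in range(1, n + 1))    # adding from the largest term down
    s = sum(1.0/k for k in range(n, 0, -1))    # adding from the smallest term up
    print(abs(exact - S)/exact, abs(exact - s)/exact)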
Given the importance of summation in computing, it should not be surpris-
ing that numerous schemes have been devised to produce an accurate sum. A
particularly interesting example is something called compensated summation.
Given the problem of calculating ∑_{i=1}^{n} xᵢ, where xᵢ > 0, the compensated
summation procedure is as follows:

[Figure: log-log plot of the error versus the number of terms (10⁴ to 10⁸) for both S and s; the errors lie roughly between 10⁻¹⁵ and 10⁻¹¹.]

Figure 1.6 The error in computing the partial sum of the harmonic series using (1.1)
and (1.2).
14 1 Introduction to Scientific Computing

    let:   sum = 0 and err = 0
    loop:  for i = 1, 2, 3, · · · , n
               z = xᵢ + err
               q = sum                                        (1.11)
               sum = q + z
               err = z − (sum − q)
           end
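Written out in Python, (1.11) becomes the short function below; the
harmonic-series test at the end is illustrative.

    def comp_sum(x):
        # compensated summation of the terms in x, following (1.11)
        total = 0.0
        err = 0.0
        for xi in x:
            z = xi + err
            q = total
            total = q + z
            err = z - (total - q)    # the piece of z lost in the addition
        return total

    n = 10**6
    print(comp_sum(1.0/k for k in range(1, n + 1)))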

The error in computing s(n) when using this procedure versus just adding
the terms recursively is given in Table 1.4. The improvement in the accuracy
is dramatic. Not only does it give a more accurate value, but the error does
not increase with n, unlike what happens when adding the terms recursively.
Compensated summation is based on estimating the error in a floating-
point addition, and then compensating for this in the calculation. The
sequence of steps illustrating how the method works is given in
Table 1.5. To explain, suppose that the mantissa used by the computer has
four digits and base 10 arithmetic is used. To add a = 0.1234 × 10³ and
b = 0.1234 × 10¹ it is first necessary to express them using the same expo-
nent, and write b = 0.001234 × 10³ (this is the alignment step in Table 1.5).
Given the four digit mantissa, b gets replaced with b = 0.0012 × 10³. Using
the notation in Table 1.5, the digits b₂ = 34 are lost in this shuffle. With
this, sf = 0.1246 × 10³ and sf − a = 0.12 × 10¹. Accordingly, (sf − a) − b =
−0.0034 × 10¹, which means we have recovered the piece missing from the
sum. In Table 1.5, err = b − (sf − a), and this is added back in during the
next iteration step. There are variations on this procedure, and also limita-
tions on its usefulness. Those interested in reading more about this should
consult Demmel and Hida [2004] or Blanchard et al. [2020]. ∎

n        E(n) − c(n)     E(n) − s(n)

10⁴      0               4 × 10⁻¹⁵
10⁵      0               2 × 10⁻¹⁴
10⁶      2 × 10⁻¹⁵       5 × 10⁻¹⁴
10⁷      4 × 10⁻¹⁵       10⁻¹³
10⁸      4 × 10⁻¹⁵       5 × 10⁻¹³
10⁹      0               2 × 10⁻¹²

Table 1.4 Comparison between compensated summation, as given in (1.11), and regular
summation. Note E(n) is the exact result, c(n) is the value using compensated summation,
and s(n) is given in (1.2).
1.4 Explaining, and Possibly Fixing, the Unexpected Results 15

Operation           Comments

(alignment)         mantissas for a and b are aligned for the addition
sf = (a + b)f       due to the fixed number of digits, b₂ is lost
sf − a              a is removed from the sum
(sf − a) − b        in removing b, the part that remains is −b₂

Table 1.5 Steps explaining how the error in floating-point addition is estimated for
compensated summation. Adapted from [Higham, 2002].

Example 1.2 (cont’d)

The first thing to notice is that the values of the function in the lower plot
in Figure 1.2 are close to machine epsilon. The expanded version of the poly-
nomial is required to take values near x = 1 and combine them to produce a
value close to zero. The errors seen here are consistent with arithmetic using
double precision, and the fact that the values are sometimes negative is also
not surprising.
It is natural to ask, given the expanded version of the polynomial (1.4),
whether it is possible to find an algorithm for it that is not so sensitive to
round-off error. There are procedures for the efficient evaluation of a polyno-
mial, and two examples are Horner’s method and Estrin’s method. To explain
how these work, Horner’s method is based on the following observations:

    a₂x² + a₁x + a₀ = a₀ + (a₁ + a₂x)x,
    a₃x³ + a₂x² + a₁x + a₀ = a₀ + (a₁ + (a₂ + a₃x)x)x,
    a₄x⁴ + a₃x³ + a₂x² + a₁x + a₀ = a₀ + (a₁ + (a₂ + (a₃ + a₄x)x)x)x.

Higher order polynomials can be factored in a similar manner, and the result-
ing algorithm for evaluating the nth degree polynomial p(x) = a₀ + a₁x +
· · · + aₙxⁿ is

    let:   p = aₙ
    loop:  for i = 1, 2, 3, · · · , n
               p = aₙ₋ᵢ + p ∗ x                               (1.12)
           end
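In Python, (1.12) takes only a few lines; here the coefficients are stored with
a[0] = a₀ up to a[n] = aₙ, and the test polynomial is an arbitrary choice.

    def horner(a, x):
        # evaluate a[0] + a[1]*x + ... + a[n]*x**n, following (1.12)
        p = a[-1]
        for c in reversed(a[:-1]):
            p = c + p*x
        return p

    print(horner([1.0, -2.0, 1.0], 3.0))    # (x - 1)**2 at x = 3, giving 4.0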

This procedure is said to be optimal because it uses the minimum number
of flops to compute pₙ(x). In particular, it requires 2n flops (n additions and
n multiplications), while the direct method requires about 3n.
Because of the reduced computing cost, Horner's method is often used in
library programs for evaluating polynomials. For example, the library rou-
tines that are used by some computers to evaluate tan(x) and arctan(x) involve
polynomials of degree 15 and 22 [Harrison et al., 1999], and having this done
as quickly as possible is an important consideration. However, because addi-
tions and multiplications take only about 10⁻⁹ sec, the speedup using Horner
is not noticeable unless you are evaluating the polynomial at a huge number
of points. The advantage of using Horner is that it tends to be less sensitive
to round-off error. To illustrate, using Horner to evaluate the polynomial in
(1.4), the curve shown in Figure 1.7 is obtained. It clearly suffers the same
oscillatory behavior the direct method has, which is shown in Figure 1.1.
However, the amplitude of the oscillations is about half of what is obtained
using the direct method. ∎

Example 1.3 (cont’d)

The function considered was

        y = (√(16 + k) − 4)/k,                                  (1.13)
and this is replotted in Figure 1.8. The plot differs from the one given earlier
in two respects. One is that the k interval is smaller to help magnify the
curve in the region where it drops to zero. The second difference concerns
the dashed lines. The plotting command used to make Figure 1.8 connected
the given function values as if y is continuous (which it is). However, the
floating-point evaluation of y is discontinuous, and the dashed lines indicate
where those discontinuous points are located. Why the jumps occur can be
explained using Figure 1.5. The floating-point value jumps as you move from


Figure 1.7 Plot of (1.4) when evaluated using Horner’s method, and using (1.3).

Figure 1.8 Plot of (1.13).

region III to region II, and jumps again when you move into region I. Normally
these jumps are imperceptible in a typical plot, but in the region near where
the drop to zero occurs they are being magnified to the extent that they are
clearly evident in the plot.
It is fairly easy to avoid the indeterminate form at k = 0 by rewriting the
function. For example, rationalizing the numerator you get that
√ √ √
16 + k − 4 16 + k − 4 16 + k + 4
= √
k k 16 + k + 4
1
=√ .
16 + k + 4
With this, letting k → 0 is no longer a problem.
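A quick numerical check of the two forms, in Python, with arbitrarily chosen
small values of k; as discussed below, the first form drops to zero once k is
small enough, while the rationalized form does not.

    import math

    for k in [1e-10, 1e-12, 1e-15]:
        y1 = (math.sqrt(16 + k) - 4)/k      # original form (1.13): suffers cancellation
        y2 = 1/(math.sqrt(16 + k) + 4)      # rationalized form: no cancellation
        print(k, y1, y2)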
A follow-up question is, how does the computer come up with the value
y = 0? The short answer is that for k close to zero, the computer rounds
√(16 + k) to 4 and this means that the numerator is rounded to zero. As seen
in Figure 1.8, the k values being considered here are far from the smallest
positive normalized float xm, so the denominator is not also rounded to zero.
Consequently, the computer ends up with y = 0.
To provide more detail about why, and where, y = 0, there are two factors
contributing to this. One is that 16 + k gets rounded to 16, and the other is
that √(16 + k) gets rounded to 4. To investigate where these occur, note that
for small values of k, √(16 + k) is just a bit larger than 4. The floating-point
numbers the computer has to work with in this region are 4, 4(1 + ε), 4(1 +
2ε), · · · . Given the rounding rule, the computer will claim √(16 + k) = 4 once
k is small enough that √(16 + k) < 4(1 + ε/2). Squaring this and dropping the
very small ε² term you get that k < 16ε. In a similar manner you find that
16 + k gets rounded to 16 for k < 8ε. Based on this, the conclusion is that
if k < 16ε, then the computer will claim that y = 0. It should be mentioned
that if you account for both rounding procedures together then you find that
the drop to zero occurs at about k ≈ 24ε. ∎

1.5 Error and Accuracy

One of the most important words used in this text is error (and it is used
a lot). There are different types of error that we will often make use of. For
example, if xc is a computed value and x is the exact value, then
1. |x − xc| is the error.
2. |x − xc|/|x| is the relative error (assuming x ≠ 0).
Both are used, but the relative error has the advantage that it is based on
a normalized value. To explain, consider the requirement that the computed
solution should satisfy |x − xc | < 0.1 versus satisfying |x − xc |/|x| < 0.1. If
x = 10⁻⁸ then using the error you would accept xc = 10⁻², even though this is
a factor of 10⁶ larger than the solution. With the relative error, you would not
accept xc = 10⁻² but would accept any value that satisfies 0.9 < |xc/x| < 1.1.
The other useful aspect of the relative error is that it has a connection to the
number of correct significant digits, and this is explained later.
The error and the relative error require you to know the exact solution.
For this reason, they will play an important role in the derivation and testing
of the numerical methods, but will have little, if any, role in the algorithm
that is finally produced.
One of the problems of not knowing the error is that it can be difficult to
know when to stop a computation. As an example, consider the problem of
calculating the value of
        s = 8 − ∑_{n=1}^{∞} 7/8ⁿ.
Given that this involves the geometric series, you can show that s = 7. The
value of the series can be computed by first rewriting it as a recurrence
relation as follows: s₀ = 8 and

        sₖ = sₖ₋₁ − 7/8ᵏ,   for k = 1, 2, 3, 4, · · · .         (1.14)
The computed values for sₖ are given in Table 1.6. It is possible to introduce
a measure for the improvement in the value of sₖ seen in this table by using
one of the following:
1. |sₖ − sₖ₋₁| is the iterative error,
2. |sₖ − sₖ₋₁|/|sₖ| is the relative iterative error (assuming sₖ ≠ 0).
The values for these quantities are given in Table 1.6. As with the relative
error, the preference in this text will be to use the relative iterative error
whenever possible given that it is a normalized value.

k    sₖ                   |sₖ − sₖ₋₁|    |sₖ − sₖ₋₁|/|sₖ|    |s − sₖ|/|s|

1    7.125000000000000
2    7.015625000000000    1.1e−01        1.6e−02             2.2e−03
3    7.001953125000000    1.4e−02        2.0e−03             2.8e−04
4    7.000244140625000    1.7e−03        2.4e−04             3.5e−05
5    7.000030517578125    2.1e−04        3.1e−05             4.4e−06
6    7.000003814697266    2.7e−05        3.8e−06             5.4e−07

Table 1.6 Values of sₖ, which is given in (1.14), as they approach the exact value of
s = 7. Also given are the iterative error, the relative iterative error, and the relative error.

Many of the methods in this text involve determining a quantity s by
computing a sequence s₁, s₂, s₃, · · · where sₖ → s. As in the above example,
the question is when to stop computing and accept the current value of sk
as the answer. Typically, an error tolerance tol is specified at the beginning
of the code and the code stops when |(sk − sk−1 )/sk | < tol. For example, if
tol = 10−4 , then in Table 1.6 the code would stop once s5 is computed. As
with many things in computing there is a certain level of uncertainty in this
result. For example, there is always the possibility that the sequence con-
verges so slowly that sk and sk−1 are almost identical yet sk is nowhere near
s. Or, when summing a series, there is the possibility that sk is so large that
the remaining terms to be added, no matter how many there might be, are
insignificant (an example of this is given in Exercise 1.27). Such complica-
tions are the basis for the first quote on page 1.
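As a sketch, this stopping rule applied to the recurrence (1.14) might be
coded as follows in Python; with tol = 10⁻⁴ it stops once s₅ is computed, as
described above.

    tol = 1e-4
    s_old = 8.0                    # s0
    s = s_old - 7.0/8.0            # s1
    k = 1
    while abs((s - s_old)/s) >= tol:
        k += 1
        s_old = s
        s = s_old - 7.0/8.0**k     # the recurrence (1.14)
    print(k, s)                    # -> 5 7.000030517578125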

Correct Significant Digits

You would think that an elemental requirement in computing is that the
computed solution xc has a certain number of correct significant digits. So, for
example, if x = 100 and you wanted a value with at least 3 correct significant
digits, you would accept xc = 100.2 and xc = 99.98, but not accept xc =
100.8. As will be seen below, this idea is more complicated than you might
expect.
We begin with the operation of rounding. Suppose x is rounded to p digits
to obtain x̄. As shown in Exercise 1.30,

        |(x − x̄)/x| < 5 × 10⁻ᵖ.                                (1.15)
So, the exponent in the relative error is determined by the number of digits
used in the rounding. This raises the question whether an inequality as in
(1.15) can be used to determine the number of correct significant digits in
the computed solution xc. For example, if |(x − xc)/x| < 5 × 10⁻ᵖ then is xc
correct to p significant digits? This means that if you round x and xc to p
digits and obtain x̄ and x̄c, respectively, then x̄c = x̄. Based simply on the
sₖ values in Table 1.6, the answer is no, not necessarily. Except for the last
entry, there are p − 1 correct digits.
It turns out that the connections between rounding and correct significant
digits can be surprising. For example, suppose that x = 0.9949 and xc =
0.9951. So, xc is correct to one digit, correct to 3 digits, but it is not correct to
2 digits or to 4 digits. In fact, the difficulty of determining correctly rounded
numbers underlies what is known as “The Table-Maker’s Dilemma”.
The connection between the number of correct significant digits and the
relative error is too useful to be deterred by the unpleasant fact that the
exact connection is a bit convoluted. So, in this text, the following simple
heuristic is used:

Expected Significant Digits. If

        |(x − xc)/x| < 10⁻ᵖ,                                    (1.16)
then xc is expected to be correct to at least p significant digits.
The upper bound 10⁻ᵖ is used in (1.16) rather than 5 × 10⁻ᵖ because it
provides a more accurate value for p. As a case in point, it provides a better
approximation for the correct significant digits in Table 1.6 as well as in
Example 2.4. Also, significant digits refer to the maximum number of correct
significant digits. So, for example, if x = 0.9949 and xc = 0.9951, then p = 3.
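The heuristic is easy to encode; a minimal sketch in Python, assuming x ≠ 0
and xc ≠ x:

    import math

    def expected_digits(x, xc):
        # the largest p with |x - xc|/|x| < 10**(-p), following (1.16)
        rel = abs(x - xc)/abs(x)
        return math.floor(-math.log10(rel))

    print(expected_digits(0.9949, 0.9951))    # -> 3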

1.5.1 Over-computing?

In using a numerical method, the question comes up as to how accurately
to compute the answer. For example, numerical methods are used to solve
problems in mechanics, and one often compares the computed values with
data obtained experimentally. This begs the question, if the data is correct
to only two or three digits, is it really necessary to obtain a numerical solution
that is correct to 15 or 16 significant digits (the limit for double precision)? It
is true that in many situations you do not need the accuracy provided using
double precision, and that is why lower precision types are available (see Table
1.2). The latter are typically used when speed is of particular importance,
such as in graphics cards. The issue that arises in scientific computing is
the prevalence of ill-conditioned problems. These are problems that tend to
magnify small errors. For example, in Chapter 3 it will be seen that when
solving the matrix equation Ax = b it is easily possible that 15 or 16 digits
are needed just to guarantee that the computed solution is correct to one or
two digits. On the other hand, some methods that will be considered actually
try to take advantage of not over-computing the solution. Several examples of
this will arise in Chapter 9 when computing minimizers of nonlinear functions.

1.6 Multicore Computing

Scientific computing is a mathematical subject that is dependent on the cur-
rent state of the hardware being used. As a case in point, most computing
systems now have multicore processors. Having multiple cores can enable a
problem to be split into smaller problems that are then done by the cores in
parallel. For example, to compute ∑_{i=1}^{1000} xᵢ, if you have 10 cores you could
have each core add up 100 of the xᵢ's. In theory this could reduce your com-
puting time by a factor of 10. The tricky part is writing the software that
implements this idea efficiently.
Significant work has been invested in writing computer software that makes
use of multiple cores. A particularly important example is BLAS (basic linear
algebra subprograms). These are low-level routines for using multiple cores
to evaluate αAx + βy and αAB + βC. These expressions are the building
blocks for many of the algorithms used in scientific computing. As an exam-
ple, suppose you need to compute Ax1 , Ax2 , · · · , Axm . This can be done
sequentially or you can form the matrix X = [x1 x2 · · · xm ] and then com-
pute AX. In MATLAB, since it uses BLAS, the matrix version is about
a factor of 6 times faster than computing them sequentially (on a 10 core
system).
Most programmers are not able to write the low-level routines needed to
take advantage of multiple cores. However, MATLAB as well as programs
such as Python and Julia have numerous built-in commands that do this
automatically. As an example of a consequence of this, very few now write
code to solve the matrix equation Ax = b. Instead, if you are using MAT-
LAB, you simply use the command x = A\b. On a 10 core machine, MATLAB
obtains the solution approximately 5 times faster than when using just one
core.
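A sketch of the batching idea using NumPy, whose matrix products call
BLAS; the dimensions are arbitrary, and any speedup depends on the
hardware and the BLAS library installed.

    import numpy as np

    n, m = 2000, 200
    A = np.random.rand(n, n)
    X = np.random.rand(n, m)     # the columns are x1, x2, ..., xm

    Y1 = np.column_stack([A @ X[:, j] for j in range(m)])   # sequential products
    Y2 = A @ X                                              # one matrix-matrix product
    print(np.allclose(Y1, Y2))   # same result; the second form is typically much faster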

Exercises

For an exercise requiring you to compute the solution, you should use double-
precision arithmetic.

1.1. The following are true or false. If true, provide an explanation why it is
true, and if it is false provide an example demonstrating this along with an
explanation why it shows the statement is incorrect. In this problem x is a
real number and xf is its floating-point approximation.
(a) If x = 0, then xf = 0.
(b) If x > y, then xf > yf .
(c) If xf is a floating-point number with 1 ≤ xf ≤ 2, then there is an integer
k so that xf = 1 + kε.
(d) If xf is a floating-point number, then 1/xf is a floating-point number.
(e) If x is a solution of ax = b, then xf is also a solution.
(f) Approximately 25% of the positive floating-point numbers are in the
interval 0 < x < 1.

Section 1.2

1.2. Show that the following are floating-point numbers: (a) 15, (b) −6, (c)
3/2, (d) −5/4.

1.3. By hand, evaluate the given formula using double-precision arithmetic.
Note that the terms must be evaluated in the order indicated by the paren-
theses. Make sure to explain your steps.

(a) (1 − (1/3)ε) − 1,          (d) (1 + (1/4)ε)(1 + (1/3)ε),
(b) (1 − 2⁻⁵⁵) − 1,            (e) (1 − ε/8)/(1 + ε/3),
(c) (1 + 2⁻⁵⁴)⁴,               (f) (1 + (2⁻⁵³ + 2⁻⁵⁴))⁴ − 1.

1.4. Let x = 2ᵐ + 2ⁿ and y = 2ᵐ. Taking m = 20, use the computer to find
the largest integer value of n where you get x − y = 0. Do the same for
m = 0 and m = −20. Using the ratio |(x − y)/x|, explain why you get these
particular values of n.

1.5. Find nonzero numbers for x and y so the computed value of x/y is the
stated result. Also, provide a short explanation why your example does this.

(a) Inf,     (c) 0,
(b) NaN,     (d) 1 (with x ≠ y).

1.6. Compute the following, and provide a plausible explanation for the
answer.

(a) 10 · NaN and 0 · NaN,    (d) Inf − Inf,            (g) 1^Inf,
(b) NaN/NaN,                 (e) 0 · Inf,              (h) e^−Inf and e^Inf,
(c) −6 · Inf,                (f) 0^Inf and (Inf)^0,    (i) sin(Inf).

1.7. Letting

    A = ( a  −1 ),    x = (  1 ),    and    y = ( a ),
        ( 0   3 )         ( −1 )             ( 0 )

compute the following, and provide a plausible explanation for the answer.

(a) x + y when a = Inf,      (c) A² when a = Inf,
(b) Ax when a = NaN,         (d) Ay when a = Inf.

1.8. Let xf and yf be adjacent floating-point numbers. You can assume they
are positive and normal floats.
(a) What is the minimum possible distance between xf and yf ?
(b) What is the maximum possible distance between xf and yf ?
(c) How many double-precision numbers lie between two consecutive single-
precision numbers? You can assume the single-precision numbers are
positive.

1.9. In double precision, (i) what is the distance from 32 to the next largest
floating-point number and (ii) what is the distance from 32 to the next small-
est floating-point number?

1.10. (a) In double precision, explain why the floating-point numbers in the
interval 2⁵² ≤ x ≤ 2⁵³ are exactly the integers in this interval.
(b) In double precision, explain why the only floating-point numbers satis-
fying 2⁵³ ≤ xf ≤ xM are integers.

1.11. For a computer, the epoch is the time and date from which it measures
system time. Typically for UNIX systems, the epoch is midnight on January 1,
1970. Assume in this problem that system time is measured in microseconds.
So, for example, if the system time is 86400000000 then it is midnight on
January 2, 1970, since there are 86400000000 microseconds in a day. Assuming
double precision is used, and every year contains 365 days, on what date will
a computer using UNIX be unable to accurately determine the time of day?
Comment: The epoch and system time are used by UNIX programmers to
have "time_t parties". The more notable being the one held on 2/13/2009.

1.12. (a) Find the largest open interval about x = 16 so all real numbers
from the interval are rounded to xf = 16. That is, find the smallest value
of L and largest value of R with L < 16 < R so any number from the
interval (L, R) is rounded to the floating-point number xf = 16. Assume
double precision is used.
(b) Show that x = 50 is a floating-point number, and then redo part (a) for
x = 50.

1.13. Let rand be a computer command that returns a uniformly distributed
random number from the interval (0, 1). It does this by picking a random
number from a prescribed collection S of floating-point numbers satisfying
0 < xf < 1. The requirements are that the numbers in S are equally spaced,
and the number selected is as likely to be in the interval (0, 1/2) as in the
interval (1/2, 1).
(a) Suppose S consists of all the floating-point numbers that satisfy 0 <
xf < 1. Why would this not satisfy the stated requirements?
(b) What is the largest set of floating-point numbers that can be used for S?
If xf is one of the floats in S, what is the probability that rand picks it?
(c) An often used method is to prescribe a positive integer p, then take S =
{1/2ᵖ, 2/2ᵖ, 3/2ᵖ, · · · , (2ᵖ − 1)/2ᵖ}. Typically p is chosen so that 2ᵖ − 1
is a Mersenne prime (e.g., p = 13, p = 29, or p = 53). Also, if p ≤ 53 then
all the numbers in S are floating-point numbers. What is S when p = 1,
p = 2, p = 3, and p = 53?
(d) Explain why the set S in part (c) satisfies the stated requirements if
p ≤ 53. What is the probability that rand picks any particular number
from S?
Comment: Generating uniformly distributed random numbers is a critical
component of any computer system, and how to do this is continually
being improved [Goualard, 2020; Jacak et al., 2021].

1.14. Using compound interest, if an amount a is invested at an annual
interest rate r, and compounded n times per year, then the amount A at the
end of one year is

        A = a(1 + r/n)ⁿ.
It’s not hard to show that the larger the value of n the larger the value of A.
Assume that a = 100 and the interest rate is 1% (so, r = 0.01). Also assume
there are 365 days in a year. Compute A for the following cases:
(a) Compounding every hour (so, n = 365 ∗ 24).
(b) Compounding every second.
(c) Compounding every millisecond.
(d) Compounding every nanosecond.
(e) Compounding every picosecond.
(f) You should find that the values computed in (d) and (e) are incorrect.
The question is why, that is, what causes the floating-point calculation
to produce an incorrect value? Based on this, given a value of r (with
0 < r < 1), at what value of n would you expect an incorrect result as in
(d) and (e) to be computed?

1.15. Consider the ratio

        R = [n(n − 2)(n − 4) · · · 2] / [(n − 1)(n − 3)(n − 5) · · · 1],

where n is even. It is known that if n = 100 then R ≈ 12.5645 and if n = 400
then R ≈ 25.0820.
(a) The algorithm below will, in theory, compute R. Try it and show that
it works if n = 100 but not if n = 400 (for the latter, the first line must
be changed). Explain why this happens.

    n = 100, T = 1, B = 1
    for i = 2, 4, 6, · · · , n
        T = T ∗ i
    end
    for i = 3, 5, 7, · · · , n − 1
        B = B ∗ i
    end
    R = T/B

(b) How can R be rewritten so it is possible to compute its value when
n = 400? Prove it works by computing the result. Also, compute R for
n = 4,000,000.

1.16. Compute the following. Your answer must contain at least 12 signifi-
cant digits. If you must modify the sum(s) in any way to obtain the answer,
explain what you did and why.

(a) ∑_{k=0}^{1000} eᵏ/(1 + eᵏ),
(b) ∑_{k=0}^{1000} cosh(k)/(1 + sinh(k)),
(c) ∑_{k=0}^{1000} √(3 + eᵏ) − ∑_{n=0}^{1000} √(1 + eⁿ),
(d) (∑_{k=0}^{1000} eᵏ) / (∑_{n=0}^{1000} n eⁿ),
(e) ∑_{k=0}^{1000} ln(eᵏ + 1),
(f) ln(∑_{k=0}^{1000} eᵏ),
(g) ∑_{k=1}^{1000} k¹⁰ [ sin(π(k + 1/k)) − sin(π(k − 1/k)) ].

1.17. Using the quadratic formula, compute the positive solution of the given
equation. If you run into a complication in computing the solution, explain how
you resolved the problem. (a) x² + 10⁸x − 1/4 = 0, (b) 10⁻¹⁶x² + x − 1/4 = 0.

1.18. Let z = √((x² + y²)/2). Assume that x and y are positive.
(a) Show that min{x, y} ≤ z ≤ max{x, y}.
(b) Compute z when x = 10²⁰⁰, y = 10²¹⁰. Your value must satisfy the
inequalities in part (a). If you run into a complication in computing
the value, explain how you resolved the problem.
(c) Redo part (b) using x = 10⁻²⁰⁰, y = 10⁻²¹⁰.

(d) Write down an algorithm that can be used to accurately evaluate z for
any positive normalized floating-point numbers for x and y.

1.19. In some circumstances, the order of multiplication makes a difference.
As an illustration, given n-vectors x, y, and z, then (xyᵀ)z = x(yᵀz).
(a) Prove this for n = 3.
(b) For large n, which is computed faster: (xyᵀ)z or x(yᵀz)?

1.20. Homer Simpson, in the 1998 episode “The Wizard of Evergreen Ter-
race”, claimed he had a counterexample to Fermat’s Last Theorem, and it
was that 398712 + 436512 = 447212 . This exercise considers whether it is pos-
sible to prove numerically that Homer is correct. Note that another (false)
counterexample appeared in the 1995 episode “Treehouse of Horror VI”.
(a) Calculate 3987¹² + 4365¹² − 4472¹². If Homer is right, what should the
answer be?
(b) Calculate (3987¹² + 4365¹²)^(1/12) − 4472. If Homer is right, what should
the answer be?
(c) Calculate (3987¹² + 4365¹²)/4472¹². If Homer is right, what should the
answer be?
(d) Calculate ((3987¹² + 4365¹²)^(1/12))¹² − 4472¹². If Homer is right, what
should the answer be?
(e) One argument that Homer could make is that (c) is the correct result and
(a) and (b) can be ignored because if they are correct then you should
not get a discrepancy between (a) and (d). Explain why double-precision
arithmetic cannot be used to prove whether Homer is right or wrong.
Note: Homer’s blackboard containing the stated formula, along with a
few other gems, can be found in Singh [2013]. It also explains why Homer
appears to have an interest in mathematics and physics.

1.21. This problem considers the consequences of rounding using double
precision. Assume the "round-to-nearest" rule is used, and if there is a tie
then the smaller value is picked (this rule for ties is used to make the problem
easier).
(a) For what real numbers x will the computer claim the inequalities 1 <
x < 2 hold?
(b) For what real numbers x will the computer claim x = 4?
(c) Suppose it is stated that there is a floating-point number xf that is the
exact solution of x² − 2 = 0. Why is this not possible? Also, suppose x̲f
and x̄f are the floats to the left and right of √2, respectively. What does
x̄f − x̲f equal?



Figure 1.9 A plot of f (x) for Exercise 1.23.

Section 1.4

1.22. A way to sum a series of positive numbers is to use a sort-then-add
procedure. You do this as follows: (1) sort the n terms in the series from
smallest to largest and then add the two smallest entries (which then replaces
those two entries), (2) sort the resulting n − 1 terms by size and add the two
smallest entries (which then replaces those two entries), etc.
(a) Work out the steps in the procedure to sum 6 + 4 + 5 + 2 + 1.
(b) In Table 1.4, E(10⁵) = 12.0901461298634. Use the sort-then-add proce-
dure to compute s(10⁵), and compare the resulting error with what is
given in Table 1.4. Also, how does the computing time needed for this
procedure compare to just computing s(10⁵) directly?


1.23. The graph of the function f(x) = (√(1 + x²) − 1)/x² is shown in Figure
1.9 where the values of f(x) were computed using double precision.
(a) Using l’Hospital’s rule, determine limx→0 f (x).
(b) Rewrite f (x) in such a way that the function is defined at x = 0. Evaluate
this function at x = 0.
(c) Why does the computer state that f (x) = 0 for small values of x? Also,
the graph shows that as x decreases to zero the function drops to zero.
Determine where this occurs (approximately).

1.24. The graph of the function f(x) = (eˣ − 1)/x is shown in Figure 1.10.
In this problem, the quadratic Taylor polynomial approximation eˣ ≈ 1 +
x + x²/2, for |x| ≪ 1, is useful.
(a) Using l’Hospital’s rule, determine limx→0 f (x).
(b) Redo part (a) but use the quadratic Taylor polynomial approximation.
(c) Suppose that x is positive. Why does the computer state that f (x) = 0
for small values of x? Approximately, where does the drop to zero occur?
(d) Redo part (c) when x is negative.

1.25. This problem considers the following algorithm:

    x = √2, s = 0
    for i = 1, 2, 3, · · · , N
        s = s + x
    end
    y = |x − s/N|

It is assumed N is a prescribed positive integer.
(a) What is the exact value for y?
(b) If N = 10³, the computed value is y ≈ 10⁻¹⁴, if N = 10⁶, one gets that
y ≈ 10⁻¹¹, and if N = 10⁹, one gets that y ≈ 10⁻⁸. Why is the error
getting worse as N increases? Is there any correlation between the value
of N and the value of y?
(c) Use compensated summation to compute this result and compare the
values with those given in part (b).
1.26. The polynomial pₙ(x) = a₀ + a₁x + · · · + aₙxⁿ can be separated into
the sum of two polynomials, one which contains even powers of x and the
other involving odd powers. This problem explores the computational benefits
of this. To make things simple, you can assume n is even, so n = 2m, where
m is a positive integer.
(a) Setting z = x², find f(z) and g(z) so that pₙ(x) = f(z) + xg(z).
(b) What is the minimum flop count to compute the expression in part (a)?
Also, explain why it is about halfway between the flop count for the
direct method and the count using Horner's method.
(c) Evaluate (1.4) using the formula in part (a), and then plot the values
for 0.98 ≤ x ≤ 1.02 (use 1000 points in this interval). In comparison to
the plot obtained using the direct method, does the reduced flop count
reduce the error in the calculation?

Section 1.5

Figure 1.10 A plot of f (x) for Exercise 1.24.



1.27. This problem concerns computing ∑_{k=1}^{∞} 1/k². The exact value of the
sum is π²/6.
(a) Suppose that the sum is computed using the algorithm given below.
Explain how the stopping condition works.

    s = 1, ss = 0, k = 1
    while s ≠ ss
        k = k + 1
        ss = s
        s = s + 1/k²
    end

(b) Compute the sum using the procedure in part (a) and report its value to
12 digits. At what value of k does the algorithm stop? How many digits
of the computed sum are correct?
(c) Assuming the number of correct digits in your answer in part (b) is m,
compute the sum so the value is correct to, at least, m + 2 digits. Make
sure to explain how you do this. Also, your procedure cannot use the
known value of the sum.

1.28. The computed values of a sequence are s₁ = 10.3, s₂ = 10.03, s₃ =
10.003, s₄ = 10.0003, · · · .
(a) What is the apparent value of lim_{k→∞} sₖ?
(b) If the stopping condition is that the iterative error satisfies |sₖ − sₖ₋₁| <
tol, where tol = 10⁻⁸, at what value of k does the computation stop?
When it stops, how many correct significant digits does sₖ contain?
(c) If the stopping condition is that the relative iterative error satisfies
|sₖ − sₖ₋₁|/|sₖ| < tol, where tol = 10⁻⁸, at what value of k does the
computation stop? When it stops, how many correct significant digits
does sₖ contain?

1.29. The computed values of a sequence are s₁ = 0.0027, s₂ = 0.00207, s₃ =
0.002007, s₄ = 0.0020007, · · · .
(a) What is the apparent value of lim_{k→∞} sₖ?
(b) If the stopping condition is that the iterative error satisfies |sₖ − sₖ₋₁| <
tol, where tol = 10⁻⁸, at what value of k does the computation stop?
When it stops, how many correct significant digits does sₖ contain?
(c) If the stopping condition is that the relative iterative error satisfies
|sₖ − sₖ₋₁|/|sₖ| < tol, where tol = 10⁻⁸, at what value of k does the
computation stop? When it stops, how many correct significant digits
does sₖ contain?

1.30. A positive number can be written as x = d × 10ⁿ, where 1 ≤ d < 10.
Assume that this is written in decimal form as d = d₁.d₂d₃ · · · .
(a) Suppose x is rounded to p digits to obtain x̄. Letting x̄ = d̄ × 10ⁿ, explain
why (ignoring ties)

        d̄ = d₁.d₂d₃ · · · dₚ               if dₚ₊₁ < 5,
        d̄ = d₁.d₂d₃ · · · dₚ + 10⁻ᵖ⁺¹      if dₚ₊₁ > 5.

(b) Show that |(x − x̄)/x| < 5 × 10⁻ᵖ.
Additional Questions

1.31. This problem considers ways to compute xⁿ, where n is a positive
integer.
(a) Compare the total number of flops between computing xⁿ = x ∗ x ∗ · · · ∗
x, and computing

        xⁿ = y ∗ y ∗ · · · ∗ y          if n is even,
        xⁿ = x ∗ y ∗ y ∗ · · · ∗ y      if n is odd,

where y = x². As examples of the last formula, x⁶ = y ∗ y ∗ y, while x⁵ =
x ∗ y ∗ y.
(b) Suppose n = 28. Explain the connection between the floating-point rep-
resentation 28 = (1 + 1/2 + 1/2²) × 2⁴ and the factorization

        x²⁸ = ((x ∗ x² ∗ (x²)²)²)².

What is the minimum number of flops required to determine x²⁸ using
this formula? Note that this procedure is a version of the square-and-
multiply algorithm.
(c) Suppose n = 100, so its floating-point representation is (1 + 1/2 + 1/2⁴) ×
2⁶. Explain how to use the idea in part (b) to calculate x¹⁰⁰. How does
the flop count compare with the two methods in part (a)?
(d) Another approach, assuming x is positive, is to write xⁿ = e^{n ln x}. Based
on the values in Table 1.3, what is the approximate flop time for this?
How does it compare with the flop times found in parts (b) and (c)?
1.32. This problem considers finding the largest value of the mantissa.
(a) What values of the bⱼ's in (1.8) produce the largest value of m?
(b) Assuming x ≠ 1, show that 1/(1 − x) = 1 + x + x² + · · · + xⁿ + xⁿ⁺¹/(1 − x).
(c) Use the result from part (b) to show that the value of m from part (a)
can be written as m = 2 − ε. From this show that the largest value of
the mantissa is m = 1 + Kε, where K = 2^{N−1} − 1.
(d) Use part (c) to explain why the float just to the left of x = 2 is 2 − ε.
Also, explain why the float just to the right of x = 2 is 2(1 + ε).
Another Random Document on
Scribd Without Any Related Topics
properly fitted, the cavesson is carefully put on. The nose-band
should be about three inches above the nostrils; if higher, it would
partly lose its power; if lower it would affect the horse's breathing. It
must not be so tight as to make the horse uneasy.

LONGEING.

This instruction should be begun on a circle from fifteen to twenty


yards in diameter. As horses are usually fed, watered, saddled, and
led from the near side, they are inclined to lead better from that
than the off side. It will therefore generally be found necessary to
give two lessons on the right to one on the left.
The first lesson to be taught the young horse is to go forward.
Until he does this freely nothing else should be required of him.
When he obeys freely, he should occasionally be stopped and
caressed.
If the horse hesitates or stands still when he is ordered to move
on, he should be encouraged, as such hesitation oftener comes from
fear and ignorance as to what is required than from obstinacy or
vice.
The horse should at first be led around the circle at a walk. A man
with a whip (with which at first the horse should not be struck)
should follow at a short distance and show the whip occasionally if
the horse is inclined to hang back; if this does not produce the
desired effect, he should strike the ground in rear of the horse, and
at length touch him lightly with the whip until he obeys.
After the horse begins to move freely at a walk the man holding
the longeing-rein should gently urge him to the trot, gradually
lengthening the rein so that it may be scarcely felt, and should go
round the circle at an active pace nearly opposite the horse's
shoulder so as to keep him out and press him forward. If the horse
takes kindly to this lesson, the man holding the rein may lengthen it
by degrees until he has only to turn in the same spot, the man with
the whip being careful to keep the horse out to the line of the circle.
Should the horse break his pace, or plunge, the rein should be
shaken without jerking it until he returns to the trot.
The man holding the longeing-rein should have a light and easy
hand. For the first two or three days the horse must not be urged
too much; if he goes gently, without jumping or resisting, enough is
accomplished. He should be longed to the right, left, and right again,
changing from a walk to a trot and back again in each case. He
should be frequently halted by gently feeling the rein and speaking
to him.
After a few days of the above practice the horse may be urged a
little more in the trot, but the greatest care and attention are
requisite to teach him the use of his limbs without straining him.
Much harm may be done in this instruction by a sudden jerk or too
forcible pull of the longe.
Care must be taken that the lessons are not made so long as to
fatigue or fret the horse. At first they should be short and be
gradually increased in length as the instruction progresses. At the
conclusion of each lesson the horse should be led to the centre of
the ring and made much of. The man holding the longeing-rein
should take it short in one hand, at the same time patting and
rubbing the horse about the head and neck with the other; he
should then try to bend the horse's neck a little to the right and then
to the left by means of the longeing-rein. The bend should be in the
very poll of the neck, and this exercise should be repeated at the
end of every lesson, cautiously and by slow degrees, until the horse
responds easily. This exercise will greatly facilitate the future
instruction of the animal.
The running-rein is of great value in teaching a horse to keep his
head in a proper position, and affords valuable aid in his first
handling. If judiciously used, it saves the rider a great deal of trouble
and the horse much ill usage. It is especially useful in controlling
horses that are inclined to bolt. It should act directly on the snaffle-
bit itself, and is wholly independent of the reins.
The Running-rein consists of three parts—the chin-strap,
martingale, and rein.
The Chin-strap, about six to eight inches long, on which is
suspended a loose ring, is fastened to both snaffle-bit rings.
The Martingale has only one ring; the loop through which the
girth passes is made adjustable by a buckle. The martingale is so
adjusted that when taut the ring will be on a level with the points of
the horse's shoulders.
The Rein is about eight and one half feet long; one end is buckled
into the near pommel-ring; the free end is then passed through the
martingale-ring from rear to front, thence through the chin-strap
ring from left to right, thence through the martingale-ring from front
to rear, and is held in the rider's right hand. A pull on this rein will
act directly on the mouth-piece, drawing it back and somewhat
downward toward the horse's breast-bone.

PREPARATORY LESSON TO MAKE THE HORSE TRACTABLE.

Before commencing the bending lessons it is well to give the


horse a preparatory one in obedience. This first act of submission
makes the horse quiet and gives him confidence, and gives the man
such ascendancy as to prevent the horse at the outset from resisting
the means employed to bring him under control.
Go up to the horse, pat him on the neck, and speak to him; then
take the reins from the horse's neck and hold them at a few inches
from the rings of the bit with the left hand; take such position as to
offer as much resistance as possible to the horse should he attempt
to break away; hold the whip in the right hand, with the point down;
raise the whip quietly and tap the horse on the breast; the horse
naturally tries to move back to avoid the whip; follow the horse,
pulling at the same time against him, and continuing the use of the
whip; be careful to show no sign of anger nor any symptom of
yielding. The horse, tired of trying ineffectually to avoid the whip,
soon ceases to pull and moves forward; then drop the point of the
whip and make much of him. This repeated once or twice usually
proves sufficient; the horse, having found how to avoid the
punishment, no longer waits for the application of the whip, but
anticipates it by moving up at the slightest gesture.

BENDING LESSONS.

These lessons should be given to the horse each day so long as


the snaffle-bit is used alone; but the exercise should be varied, so
that the horse may not become fatigued or disgusted.
The balance of the horse's body and his lightness in hand depend
on the proper carriage of his head and neck.
A young horse usually tries to resist the bit, either by bending his
neck to one side, by setting his jaw against the bit, or by carrying his
nose too high or too low.
The bending lessons serve to make a horse manageable by
teaching him to conform to the movements of the reins and to yield
to the pressure of the bit. During the lessons the horse must never
be hurried.
To Bend the Horse's Neck to the Right.—Take a position on
the near side of the horse, in front of his shoulder and facing toward
his neck; take the off rein close up to the bit with the right hand, the
near rein the same way with the left hand, the thumbs toward each
other, the little fingers outward; bring the right hand toward the
body, at the same time extend the left arm so as to turn the head to
the horse's right.
The force employed must be gradual and proportioned to the
resistance met with, and care must be taken not to bring the horse's
nose too close to his chest. If the horse moves backward, continue
the pressure until, finding it impossible to avoid the restraint
imposed by the bit, he stands still and yields to it.
When the bend is complete, the horse holds his head there
without any restraint and champs the bit; then make much of him
and let him resume his natural position by degrees, without throwing
his head around hurriedly. A horse, as a rule, champs the bit when
he ceases to resist.
The horse's neck is bent to the left in a similar manner, the man
standing on the off side.
To Rein in.—Cross the reins behind the horse's jaw, taking the
near rein in the right hand and the off rein in the left, at about six
inches from the rings; draw them across each other till the horse
gives way to the pressure and brings his nose in. Prevent the horse
from raising his head by lowering the hands. When the horse gives
way to the cross-pressure of the reins, ease the hand and make
much of him.

SADDLING.

This should be done at first on the longeing-ground. One man,


facing the horse and taking the snaffle-reins in both hands near the
bit, should hold him while another places the saddle on his back. If
the horse shows no uneasiness or resistance, let down the cincha-
strap and cincha; fasten the cincha-strap loosely at first, and tighten
it afterwards by degrees. Care must be taken not to make the cincha
so tight as to cause uneasiness to the horse. If the horse resists or is
restless, remove the saddle and let him see and smell it; he will then
generally allow it to be placed; if necessary, strap up a leg until the
horse is saddled. The longeing is then continued with the horse
saddled.

MOUNTING.

When the horse becomes accustomed to the saddle, he should be


mounted. Two men should assist the man who is to mount. The man
with the longe, facing the horse and taking the snaffle-reins in both
hands near the bit, holds his head rather high and engages his
attention; the second man bears down on the off stirrup at the
proper moment to keep the saddle even when the third man
mounts. The man who mounts proceeds with caution, stopping and
caressing the horse if he shows any uneasiness; after being seated
the man pats the horse a few moments, and without attempting to
make him move, dismounts with the care and gentleness exercised
in mounting. This is repeated several times, until the horse submits
without fear. The rider then mounts, takes a snaffle-rein in each
hand, and feels lightly the horse's mouth; the man with the longe
leads the horse forward and afterwards longes him to the left, and
then to the right, at a walk; if the horse shows any disposition to
kick or plunge, the longe is shaken to engage his attention and to
keep up his head. After a few turns the rider dismounts, the horse is
fed from the hand, patted, and dismissed.
These lessons are continued until the horse can be mounted and
dismounted without any difficulty; and when he can be made to go
forward, to the right and left, to halt and rein back by gentle
application of the aids, the longe is dispensed with.
The horse is now exercised in the riding-hall or open manège, the
lessons for young horses not exceeding three quarters of an hour.
The horse is ridden on the track first at a walk, then at a slow trot,
and afterwards the trot and walk are alternated, care being taken to
turn the corners squarely; the horse is next marched to the right and
left, halted and reined back to accustom him to obey the bit and the
pressure of the legs. When he is obedient to the snaffle, the horse is
equipped with the curb-bit. The bit must have rings at the ends of
the mouthpiece for snaffle-reins, or a bit-bridoon must be used in
order that the horse may be accustomed by degrees to the action of
the curb-bit. The first instruction given to the horse with the curb-bit
is bending the neck and reining in, dismounted; he is then mounted
and exercised in the riding-hall or open manège as before described,
and receives the bending and reining lessons mounted.

BENDING LESSONS, MOUNTED.

The horse is now equipped with a curb-bridle.


To Bend the Horse's Neck to the Right.—Adjust the reins in
the left hand; seize the right rein with the right hand well down;
draw it gently to the right and rear until the horse's head is brought
completely around to the right, in the same position as in the bend
dismounted. When the horse champs the bit, make much of him,
and allow him to resume his natural position. The horse's neck is
bent to the left in a similar manner.
To Rein in.—Lower the bridle-hand as much as possible, turning
the back uppermost; with the right hand, nails down, take hold of
the curb-reins above and close to the left hand and shorten them by
degrees, drawing them through the left hand, which closes on the
reins each time they are shortened.
When the horse resists much and holds his nose up, keep the
reins steady; do not shorten or lengthen them; close the legs to
prevent the horse from backing; after remaining perhaps a minute or
more with his nose up and his jaw set against the bit he will yield,
bring his nose in, and champ the bit; make much of him, loosen the
reins, and after a few seconds rein in again.
This exercise gives the horse confidence, and teaches him to arch
his neck and bring his head in proper position whenever he feels the
bit.
Most young horses are afraid of the bit, and they must never be
frightened by sudden jerks on the reins, lest they should afterwards
refuse to stand the requisite pressure of the bit. A certain amount of
bearing is necessary to induce the horse to work boldly and well, as
well as to apprise the rider of what the horse is going to do.
In reining in, some horses rest the lower jaw against the breast;
to counteract this, press both legs equally and force the horse
forward to the bit.
Some horses will not work up to the hand; that is, will not bear
the bit at all. Such horses are unfit for the service.
Whenever, without an apparent cause, a horse resists or is
restive, the bit, saddle, and equipment should be carefully examined
to see if any part hurts or irritates him.

REARING.

Should the horse rear, the rider must yield the hand when the
horse is up, and urge him vigorously forward when he is coming
down; if the horse is punished while up, he may spring and fall
backward.
Use the running-rein with a rearing horse.

KICKING.

This can be prevented by holding the horse's head well up and


closing the legs; if necessary, they are closed so much as to force
the horse forward.

SHYING.

This sometimes results from defect of sight and sometimes from


fear. If from fear, the horse must be taken up to the object with
great patience and gentleness, and be allowed to touch the object
with his nose. In no case should a horse be punished for timidity.
The dread of chastisement will increase his restiveness.

TO ACCUSTOM THE HORSES TO FIRING.

Station a few men at a little distance from and on both sides of


the stable-door, and cause them to fire pistols as the horses are led
into the stable to be fed; for the same object a gun may be fired
during the hour of feeding. If a horse is nervous, he may be put on
the longe and fed from the hand and petted each time the pistol is
discharged; or he may be thrown, care being taken not to discharge
the pistol so as to burn him or injure him in any way. The horses
should be trained to be steady under the fire of the pieces, and also
under pistol-firing by the cannoneers on the chests and by the
drivers from their teams.
SWIMMING HORSES.

The horses are at first equipped with the watering-bridle, and are
without saddles. The reins are on the horse's neck just in front of
the withers, and knotted so that they will not hang low enough to
entangle the horse's feet, care being taken to have them loose
enough to permit the horse to push his nose well out, so as to have
entire freedom of the head. The horse should be watered before
putting him into the stream.
When the rider gets into deep water, he drops the reins, seizes a
lock of the mane with the up-stream hand, allows his body to drift
off quietly to the down-stream side of the horse, and floats or swims
flat on the water, guiding the horse as much as possible by splashing
water against his head, only using the reins when splashing fails.
The horse is easily controlled when swimming; he is also easily
confused, and it is therefore necessary that the rider should be
gentle and deliberate. The rider must be cautioned that the horse is
easily pulled over backward by the reins when swimming, and also
that he may plunge when he touches bottom. When the horse
touches the bottom at the landing, the rider pulls himself on the
horse's back and takes the reins.
The rider may also be required to swim, holding the horse's tail,
allowing the horse to tow him.
After the man and horse have gained confidence, the rider may
be required to be seated on the horse while swimming. As the extra
weight presses the horse down and impedes his movements, the
rider should hold his knees well up to lessen the resistance, and
steady his seat by holding on to the mane or pommel of the saddle.
The men are instructed, in crossing running water, to keep their
eyes fixed on the opposite bank.
The practice of swimming gives horses confidence in deep water
when in harness. Streams deep and wide enough to swim one and
even two pairs of a team have been crossed by light artillery in our
service.
BREAKING IN THE YOUNG HORSE TO HARNESS.

The harness should be put on the horse in the stable with


caution, and at first without the traces, so that in the event of the
horse jumping about they will not hang about his legs and frighten
him. The horse should then be fed in his harness, and after standing
for some hours be walked about in it.
When the horse has thus been fed and walked about, and has
become reconciled to the harness, the traces should be attached,
and a rope tied to the rear end of each; a man then takes the ends
of the ropes, and the horse is walked about, the man holding the
ropes, taking care that the traces do not rub against the sides of the
horse in the beginning, but accustom him to them gradually.
When the horse has become accustomed to the pressure of the
collar and traces, he may then be hitched in with a steady horse. At
first the utmost caution should be observed and a foreleg held up, if
necessary, while the traces are being fastened, and no noise or
shouting should be permitted. After being hitched in, the horse
should be permitted to stand still for some minutes before the
carriage is started, and it should be put in motion by the other
horses. The horse should be left to himself and not be required to
draw at first; all that should be demanded of him is to move forward
quietly.

MANAGEMENT OF VICIOUS HORSES.

A vicious or refractory horse may be thrown. He is thus made to


submit to control without exciting his resentment, or suffering any
other physical pain than that resulting from his own resistance.
During the operation the man acts with deliberation, speaks with a
kind voice, and never uses harsh treatment.

TO THROW THE HORSE.

The method explained is a modification of the one generally known as "Rarey's Method." The horse is equipped with a watering-
bridle and surcingle. The surcingle is buckled securely but not tightly
around the horse's body just back of the withers. The man is
provided with two strong straps. No. 1 is about ten feet long and
one inch wide, and has a loop or iron ring at one end. No. 2 is about
three feet six inches long and from one and one half to two inches
wide; one end has a strong buckle and two keepers (one on each
side of the strap). In the absence of straps as specified, the halter-
strap may be substituted for No. 1, and the stirrup-strap for No. 2.
The horse is taken to an open space, preferably covered with turf,
free from stones, etc., to prevent injuring the horse's knees. Pass the
free end of No. 1 through the ring and make a slip-loop; raise the
horse's off forefoot, and place the loop around the pastern; see that
the loop has no twist in it; let the foot down, draw the strap taut,
and pass the free end over the horse's back from the off side and
under the surcingle from front to rear, the free end hanging down on
the near side. Pass the free end of No. 2 through the inside keeper
and make a slip-loop; raise the near forefoot and place the loop
around the pastern, with the buckle outside, and make it snug; raise
the heel against the forearm, pass the free end of the strap, from
the inside, over the forearm, and buckle the strap sufficiently tight to
hold the leg in this position. Let the bridle-reins either hang down or
place them on the neck; they may be caught hold of at any time
after the first plunging is over. It is important that the off forefoot be
kept from the ground after the horse first raises it, and this will be
better accomplished if both hands are used at strap No. 1 during the
first plunge.
The man takes his place behind the surcingle on the near side of
and close to the horse, the left foot in advance, and grasps securely
with the left hand the free end of No. 1, and, if the strap is long
enough, makes a turn with it around the left hand, the right hand
grasping it loosely, forefingers close to the surcingle, back of the
hand against the horse's back. Quietly and gently urge the horse
forward; the instant he raises his foot, pull the strap with the left
hand, bring the off heel against the forearm, the strap slipping
through the right hand, which should be kept in place, but which
grasps the strap as soon as the foot is sufficiently raised, and holds
it firmly; make a turn with the strap around the right hand, and take
both reins in the left hand on the near side of the horse. The horse
is now brought to his knees; bring the horse's nose well to the left
and raised, placing the right shoulder and arm against the horse's
side, thus indicating to him that he is to lie on his right side. A horse
of a stubborn disposition may remain in this kneeling position for
some time, and this he should be allowed to do until he is willing to
lie down of his own volition. No force will be used to push the horse
down. From this kneeling position the horse may rear and plunge,
but as he moves so should the man, maintaining his relative position
to the horse, and a firm hold of the long strap, in order to deprive
the horse of the use of his right foreleg. In most cases, after
remaining in this kneeling position for a short time, the horse will lie
down. The man maintains his hold of the strap and reins until the
horse is quiet and shows no immediate disposition to attempt to
rise; or he has the strap and reins so placed that he can grasp them
directly the horse attempts to get up.
To dispel his fears and reconcile him to his unexpectedly assumed
position, he should now be petted, spoken to in a kindly tone of
voice, and generally made much of. When he becomes quiet and
ceases to struggle, the man should pass around him, handle his feet,
and straighten out and rub his legs. If the horse shows no inclination
to rise before being told to do so, the straps may be unfastened and
removed, but so long as the eye shows a wild, startled expression
the straps should not be removed. The eye is the true index of the
horse's feelings and disposition, and if closely observed will always
betray his intentions.
When he has remained in the lying position for a short time after
the straps have been removed, and he no longer struggles or
attempts to rise, or if he attempts to rise and cannot be prevented
from doing so, the man should raise his horse's head a little with the
reins and command: "Up!" When the horse gets up, he should be
made much of and given to understand that he has done what was
required of him. It will be advantageous to throw the horse three or
four times at each lesson, but the throwings should not follow each
other in rapid succession, in order to avoid the overfatigue and
constraint which might incite the horse to insubordination and
resistance.
It will be found that horses of a peculiarly wilful and stubborn
disposition will not lie quiet after the straps have been removed. To
overcome horses of this class, the long strap should be made fast to
the left forefoot so that both knees will be secured in a bent
position. The horse need be no longer held, but will be allowed to
struggle. He may rear, or plunge, or assume a kneeling position, but
whatever he may do no restraint should be put upon him. After
finding that all his struggles are of no avail, and that the only result
attained by them is suffering to himself, he will succumb and quietly
lie down. When, from his ceasing to struggle when handled, and
from the appearance of his eye, there is reason to believe that the
horse has yielded, the straps may be gradually loosened and
removed. Two or three lessons properly administered in this way will
conquer the most stubborn horse.
After a stubborn horse has been thrown several times, it may
happen that he will not permit his foreleg to be strapped up, and
will resist by rearing, plunging, striking, or kicking. In such cases
another strap, "No. 3," may be necessary. This is a strong leather
surcingle about three inches wide in which two iron rings, about two
feet six inches apart, are securely fastened. The leather girth is
secured so that the rings will be about the middle of the horse's
sides. Two long straps, "No. 1," are used. One is placed on each
front pastern without raising the foot. The free ends of the straps
are run through the rings on the surcingle so that they can be used
as a pair of driving-reins. These straps are held by one man in rear
of the horse, while another, approaching the horse on the near side,
attempts to raise his left foot. The instant the horse rears, strikes, or
plunges he is brought to his knees by the man holding the long
reins; after this is repeated several times the horse will allow his foot
to be strapped up. Should the horse stand, or refuse to move, the
whip may be used.
These means may be used to break horses of rearing, plunging,
or bucking under the saddle. In this case the surcingle is dispensed
with; the rider holds the straps and exerts sufficient force when the
horse is refractory to bring him to his knees. The same means may
be used to discipline horses which refuse to carry double, the man in
the rear holding the straps.

TO BREAK THE HORSE OF KICKING.

The horse is thrown and one end of each of the long straps is
made fast to the bit-rings; the other ends are passed through the
rings on the leather surcingle and secured to the hind pasterns.
When thus secured, all means should be resorted to in order to
make the horse kick, and this should be repeated until he no longer
struggles or attempts to move his hind legs under any provocation
whatever.

TREATMENT AND CARE OF HORSES.

Horses require gentle treatment. Docile, but bold, horses are apt
to retaliate upon those who abuse them, while persistent kindness
often reclaims vicious animals.
A horse must never be kicked in the belly, or struck about the
head with the hands, reins, or any instrument whatever.
Never threaten, strike, or otherwise abuse a horse.
Before entering a stall speak to the horse gently, and then go in
quietly.
Never take a rapid gait until the horse has been warmed up by
gentle exercise.
Never put up a horse brought to the stable or line heated, but
throw a blanket over him and rub his legs, or walk him until cool. If
he is wet, put him under shelter and wisp him against the hair until
dry.
Never feed grain to a horse, or allow him to stand uncovered,
when heated. Hay will not hurt a horse no matter how warm he may
be.
Never water a horse when heated, unless the exercise or march is
to be immediately resumed. A few mouthfuls of water, however, will
do no harm, and should ordinarily be given him.
Never throw water over a horse coming in hot, not even over his
legs or feet.
Never allow a horse's back to be cooled suddenly by washing or
even removing the blanket unnecessarily.
To cool the back gradually, the blanket may be removed and
replaced with the dry side next the horse.
At least two hours' exercise daily is necessary to the health and
good condition of horses; they should be marched a few miles when
cold weather, muddy ground, etc., prevent drill.
Horses' legs will often be hand-rubbed, particularly after severe exercise, as this removes enlargements and relieves or prevents
stiffness.
In mild weather the sheath will be washed out once a month with
warm water and castile soap and then greased; during the cold
season the intervals between washings should be longer.
Sore backs and galled shoulders are generally occasioned by
neglect. The greatest pains will be taken in the fitting of the saddles
and collars; the men must never be allowed to lounge or sit
unevenly in their saddles. Every driver should keep a pair of soft
leather pads, stuffed with hair, about six inches by four; the moment
any tenderness is noticed in a horse's shoulder, the pressure is
removed by placing these pads under the collar above and below the
tender part.

DESTRUCTION OF HORSES.

Occasions arise rendering the destruction of horses necessary.
The following instructions will enable one to arrive at a point directly over the summit of the brain which, when fired upon, will cause instantaneous death. Draw a line, A A, horizontally across the
forehead from the upper margin of one zygomatic ridge to the other,
and from its central point, B, measure vertically upward on the
forehead 3½ to 4½ inches. The point, D, thus obtained is directly
over the brain-cavity.
Fig. 76 in the original shows this construction: the line A A across the forehead, its central point B, and the point D above it.
Before firing, the horse should be induced to lower his head,
which is easily accomplished by placing a little food upon the
ground, the muzzle of the weapon being brought directly over the
spot indicated.
It is a mistake to suppose that the star, or curl, is over the brain-
cavity, for it is generally below the cavity.
CHAPTER VII.
Organization of Artillery. Composition of Light Batteries. Equipment. Equipment and Clothing for
Marches. Marches. Selection of Camps. Making Camp. Breaking Camp. Allowance of
Wagons.

ORGANIZATION OF ARTILLERY.

Artillery troops are divided into light artillery and heavy artillery. To the light
artillery belongs the service of the batteries which manœuvre with troops in the
field.
The light-artillery batteries include horse-batteries, in which the cannoneers are
mounted on horseback; field-batteries, in which the cannoneers march by the side
of their pieces, or are mounted on the ammunition-chests, axle-seats, and off
horses; and mountain-batteries, in which the pieces may be transported on pack-
animals.
Machine-batteries are designated, according to their equipment and model of
gun, as horse, field, or mountain, Gatling, Gardner, etc., batteries.
The 3.2-inch gun is used in both field- and horse-batteries; the 3.6-inch gun is
used in field-batteries only.
A field-battery equipped with the 3.2-inch gun is called a light field-battery; one
equipped with the 3.6-inch gun is called a heavy field-battery. A battalion of artillery
consists of two, three, or four batteries, and is commanded by a field-officer of
artillery.
The heavy artillery of an army in the field consists of those batteries which serve
the siege- and position-guns, and the artillery-ammunition and supply trains.
The light artillery of an army corps consists of divisional artillery and corps
artillery.
The Divisional Artillery consists of a battalion of from two to four batteries, is
an integral part of the division, and is commanded by a field-officer who has a staff
consisting of an adjutant (lieutenant), sergeant-major, quartermaster-sergeant, and
chief trumpeter.
The Corps Artillery consists of two or more battalions; it is composed of field-
and horse-batteries in suitable proportions, and is commanded by a colonel who has
a staff consisting of an adjutant (lieutenant), a quartermaster and commissary
(lieutenant), sergeant-major, quartermaster-sergeant, and chief trumpeter. All the
artillery attached to an army corps constitutes an artillery brigade. A battalion of
horse-artillery is attached to and is part of each division of cavalry. In smaller
commands a battery may be attached to an infantry or cavalry brigade.
The proportion of artillery is from three to four guns to one thousand men. The
chief of artillery of an army or corps is a brigadier-general, and is on the staff of the
commander of the corps. The corps artillery is under the orders of the brigadier-
general, chief of artillery, and he also assumes control of the divisional artillery in
action when ordered to do so by the corps commander.
The field-officer commanding the divisional artillery is the chief of artillery of the
division, and is on the staff of the division commander, but he will encamp with the
divisional artillery.

COMPOSITION OF LIGHT BATTERIES.

A battery consists of a fixed number of pieces and caissons, of a combined battery-wagon and forge, and an artillery-wagon, together with a sufficient number
of officers, men, and horses for the efficient service of the battery.
Organization of Light Batteries.—A battery is maintained on one of the
following footings: 1, for instruction; 2, for war.

                                  Instruction.               War.
                              6 Guns, 4 Caissons.     6 Guns, 9 Caissons.
Field-battery.              Officers. Men. Horses.  Officers. Men. Horses.

Captain                         1                       1
Lieutenants*                    3                       4
Staff-sergeants                       2(a)    2               3(b)    3
Sergeants                             6       6               6       6
Corporals                             9(c)    3              15(d)    9
Artificers                            4(e)                    5(f)    5
Trumpeters                            2       2               2       2
Guidon                                1       1               1       1
Wagoner                               1                       4
Drivers                              24      48              48      96
Cannoneers                           36                      84
Supernumerary drivers                                         8
Spare horses                                  4                      16
Range-finders                         2                       2
        Total                   4    84      66         5   175     144

* Commanding the platoons and caissons.
(a) First sergeant, stable and veterinary sergeant.
(b) First sergeant, quartermaster and stable and veterinary sergeants.
(c) Six gunners and three caisson corporals.
(d) Six gunners and nine caisson corporals.
(e) Two blacksmiths, one saddler, one machinist.
(f) Three blacksmiths, one saddler, one machinist.

The machinist should be conversant with the construction and mechanism of the
gun, and competent to make the ordinary repairs it may require.
The men should be intelligent, active, and muscular, and not less than five feet
five inches, nor more than six feet, in height; very large men are specially
undesirable. The great majority should be men accustomed to horses; a suitable
proportion must be mechanics.
If a public horse be allowed to each subaltern, the number of horses in the above
table will be proportionately increased.
The battery-wagon and forge and the artillery-wagon, when not horsed, must be
kept with the battery and equipped with the proper tools and stores.
When a battery on the instruction footing is ordered to march, it must be
supplied with additional horses necessary to horse all the carriages.
In horse-batteries, in addition to the number of horses above described, ten
saddle-horses (including one spare horse) are required for each gun detachment.

EQUIPMENT.

In garrison the first sergeant, quartermaster-sergeant, stable and veterinary
sergeants, and chiefs of section are armed with the sabre, and the caisson
corporals, trumpeters, guidon, and drivers also, when specially directed.
In the field the first sergeant, quartermaster-sergeant, stable sergeant, and chiefs
of section are armed with the sabre and revolver; all other men are armed with the
revolver and knife.
In preparing for a march or field service the kinds and quantities of supplies
required will depend on the duration and character of the work. Having determined
what is required, divide the work of preparing for service among the officers and
non-commissioned officers immediately in charge, and then carefully superintend
the work yourself.
Attention is called to the following points:
Rations and forage.
Medicines, veterinary medicines, instruments, and bandages.
Leather and spare parts for repairs to harness, carriages, etc.
Horseshoes and horseshoe-nails.
Blacksmith's, saddler's, and carpenter's tools (if there be no battery-wagon and forge).
Field-desk, with a supply of blanks, paper, envelopes, pens, ink, and pencils; the necessary company-books, and a book of telegraph blanks.
Ammunition (shell, shrapnel, canister, cartridges, fuzes, fuze-cutters, friction-primers, lanyards).
Oil for harness; cosmoline for guns.
Equipment and clothing for each man.
Number and kind of tents; Sibley stoves.
Axes, hatchets, mauls, scythes, sickles, buckets, spades, shovels, pickaxes.
Wagon-tongues, coupling-poles, hame-strings, open links.
Odometer, rope, axle-grease, picket-rope, light jacks, lanterns, matches.
Cooking utensils; personal outfit.

COOKING UTENSILS.

PACK IN BOX A.

Article.              75 Men.  150 Men.
Dishpans                 2        3
Coffee-mill              1        1
Bread-knives             2        2
Meat-knives              2        2
Steel                    1        1
Cleaver                  1        1
Saw                      1        1
Forks, carving           2        3
Forks, spit              2        2
Spoons, long             2        2
Can-openers              2        2
Ladles                   2        2
Frying-pans              2        2
Small rations.
PACK IN BOX B.

Article.              75 Men.  150 Men.
Coffee-boiler            1        2
Camp-kettles             4        6
Water-buckets            2        3
Dipper                   1        1
Hash machine             1        1
1 axe, 1 spade, 1 shovel, tied together and fastened to outside of box.
Put the camp-kettles inside the coffee-boilers.

Vinegar-keg, 1, Dutch ovens, 2, or Buzzacott oven, 1, for 75 men; double the number for 150
men.

One of the boxes may be large enough to contain the Buzzacott oven. In order to pack it, put the top in inverted, then invert the body of the oven and set it inside the top.

EQUIPMENT AND CLOTHING FOR MARCHES.

OFFICERS' CLOTHING, EQUIPMENT, ETC.

An officer's equipment usually consists of sabre, revolver, and ammunition, and a good binocular-glass. He should also be provided with a compass, watch, knife, and
notebook and pencil. A small watch so fitted in a leather strap that it may be worn
on the wrist is recommended as very convenient.
The clothing and bedding carried will depend on the climate and the character of
the march. The following list contains about everything one requires:

1 water-bucket
1 dipper
1 quart cup, tin
1 washbasin, rubber
1 small looking-glass
1 lantern, with oil or candles for same
1 small oil-stove, fitted in box
1 oil-can
5 gals. oil
1 tin matchbox
Box of matches
1 or 2 folding-chairs
1 folding-bed
1 folding-table
1 rubber blanket
1 small strip carpet
1 pillow
Necessary bedding
1 pair trousers, extra
1 blouse, extra
1 pair shoes
½ doz. shoestrings
1 pair overshoes
1 pair slippers
1 pair gauntlets, extra
1 campaign hat
1 overcoat
1 rubber coat
2 pairs drawers, extra
2 undershirts, extra
2 flannel shirts
4 pairs socks
4 towels
½ doz. handkerchiefs
1 sponge, in oil-silk bag
Packages of toilet-paper
1 portfolio, with pens, ink, paper, envelopes, and stamps
1 hold-all, containing comb, brush, clothes-brush, scissors, soap (in soapbox),
toothbrush and tooth-powder, shaving materials
1 housewife, with needles, thread and buttons
Pipe, tobacco, and fuzees
1 piece stout cord
1 tape-measure
1 pocket-map

In Cold Climate.

1 buffalo or felt-lined overcoat
1 fur cap
1 pair fur gloves
4 pairs woolen socks
1 pair felt boots
1 pair high arctic overshoes
Extra bedding or sleeping-bag

Sticking-plaster, lint, safety-pins, tin of mustard-leaves, and a few simple remedies in case of
dysentery, diarrhœa, constipation, etc.
If messing alone, 1 tin kettle, 1 frying-pan, 2 baking-pans (small), 1 wire gridiron, 1
corkscrew, salt- and pepper-boxes, 1 can-opener, 1 small meat-knife, 1 iron fork (long), 1 iron
spoon (long), 1 small soup-ladle, 2 plates, 2 tin cups, 2 spoons, 2 teaspoons, 2 knives, 2 forks,
tablecloths and napkins, and such stores as one may wish.

EQUIPMENT AND CLOTHING FOR ENLISTED MEN.

Equipment for Each Enlisted Man.—One hunting-knife, one pistol, one holster, one pistol-cartridge belt (woven), one screwdriver, one canteen, one cup,
one meat-ration can (knife, fork, and spoon), and for each cannoneer one
haversack.
Clothing for Each Enlisted Man.—Two blankets, one rubber blanket or
poncho, one overcoat, one campaign hat, one pair of leggings, two blouses, two
pairs of trousers, two dark blue flannel shirts, two knit undershirts, two pairs of
drawers, two pairs of shoes, three pairs of socks, two towels, toilet articles, and
stable-clothing for those requiring it. The extra articles will be carried as follows:
By Mounted Non-commissioned Officers, Trumpeters, and Guidon.—Dark-blue
flannel shirt, undershirt, drawers, socks, and screwdriver, in saddle-bag, off pocket.
Mess-kit, in saddle-bag, near pocket. Blouse, trousers, and shoes, in knapsack.
Overcoat, rolled and strapped on the cantle of saddle. Nose-bag, on off side of
cantle, the strap passing around and under the overcoat. Canteen and cup (cup on
canteen-strap) strapped to near pommel-ring.
By Drivers.—Dark-blue flannel shirt, stable-clothes, and shoes, in saddle-bag, off
pocket, near horse. Mess-kit, in saddle-bag, near pocket, near horse. Blouse,
trousers, and screwdriver, in the saddle-bag, off pocket, off horse. Undershirt,
drawers, and socks, in saddle-bag, near pocket, off horse. Overcoat, rolled and
strapped on cantle, near horse. Nose-bags, one on each side of off horse, the strap
passing around the cantle and under the overcoat. Canteen and cup (cup on
canteen-strap) strapped to near pommel-ring, near horse. Watering-bridles,
currycombs, brushes, and halters, in the nose-bags.
By Cannoneers.—Blouse, trousers, and stable-clothes, in knapsack, flap side.
Underwear, shoes, and screwdriver, in knapsack, bottom side. Mess-kit, in
haversack, worn on left side of person, or carried in wagon. Overcoat, strapped on
knapsack. Canteen and cup (cup on canteen-strap) worn on right side of person.
The blankets, folded in section bundles, are carried in the wagons. The knapsacks
are carried in the wagons.
If there be an artillery-wagon with the battery, all the men have knapsacks and
haversacks, which are utilized as prescribed for cannoneers.
When the Army of the Potomac crossed the river in October, 1862, each officer
was responsible for his own outfit; each man carried five days' short rations in his
knapsack and three in his haversack, one half shelter-tent, his blanket or overcoat,
one change of underclothing, and his arms and ammunition.
To Roll the Overcoat.—Turn one sleeve wrong side out, fold the overcoat right
side out along middle back seam, sleeve laid straight, sleeve wrong side out
underneath.
Fold cape twice from side to side, lay it on coat, collar to collar. Turn edges of
coat in so as to make sides parallel, and to measure 12 inches wide at shoulder and
16 inches at bottom. Roll from collar down to within 20 inches of bottom, turn up
bottom and pull one thickness of skirt over the roll, making all snug.

MARCHES.

The "general," sounded one hour before the time designated for marching, is the
signal to strike tents, load wagons, pack animals, and send them to the place of
assembly.
The execution of marching orders must not be delayed. If the commander is not
with the troops when they are to march, the next in rank puts the column in motion.
When a march is in prospect, it is well to go out daily, for a week or ten days
previously, for a couple of hours' march. This will harden the horses' shoulders and
discover what corrections are to be made. The average march for field-artillery on
good roads is from 15 to 20 miles a day; horse-artillery, 25 miles.
A single battery, when the march is a long one, will do well to trot occasionally;
so doing shortens the road and greatly relieves man and horse. If the country is
undulating, the platoons should march with considerable distance between them,
and the trot should be taken up by each in succession on arrival at the level ground
where the preceding platoon began to increase its pace. The walk should be
resumed in the same manner.
When marching with other troops, these liberties cannot be taken, and the walk
is, with rare exceptions, the gait used. In rapid marches the slow trot alternates with
the walk.
When the services of artillery are urgently needed, it may be required to trot four
or five miles without breaking the gait.
Long marches or expeditions should be begun moderately, particularly with new
horses. Ten or twelve miles a day is enough for the first marches, which, on good
roads, may be increased to 20 or 25 miles when necessary, after the horses are
inured to their work. Should the march be continued for a long period, at least one
day in seven should be devoted to rest. It is also important that the horses and
equipments be thoroughly inspected at least once a week. On ordinary roads horse-
artillery with cavalry marches usually at the rate of 4 or 5 miles an hour. Field-
batteries, by themselves, can march 3½ to 4 miles an hour on a good road, but on
heavy or hilly roads, or when the battery forms part of a column, the rate of
progress will depend entirely upon circumstances. Should a long march be made,
the horses should be fed on the road; ordinarily watering will be sufficient. In very
hot weather frequent watering will be advisable. To keep horses in condition, it is
essential that they should be in no wise stinted of water. No matter how warm a
horse is, six to ten swallows of water will not hurt him.
Always march with a feed of grain; if not used on the road, it enables the horses
to be fed as soon after arriving in camp as desirable. Horses should be arranged in
teams, as far as possible, so as to be of uniform pace in walking, and of similar
disposition.
On long marches it may be advisable to change the near and off horses on alternate days. Drivers should be required to ride off horses during part of each day's march;
and, unless the entire battery be dismounted by order of the captain, all mounted
men and cannoneers will ordinarily be permitted to mount and dismount at will
when the battery is moving at a walk on level ground.
Cannoneers of field-batteries should always walk up and down hill.
The care of horses on the march is one of the most important duties of an
artillery officer; by constant attention on the part of the captain, chiefs of platoon,
and chiefs of section many horses that would otherwise be disabled for months may
be kept in serviceable condition.
The men must not be allowed to lounge in their saddles, which leads to galls, and
the drivers should be made to pay continual attention to their driving, and see that
every horse does his fair share of work.
The lead-drivers of each team must keep their distance from the team in their
front; swing-drivers must keep the traces to their front stretched, and the wheel-
drivers those to their front.
Have the wheels greased daily, and oil the bearing of the lunette on the pintle-
hook.
Grease on the soles of horses' hoofs prevents snow from balling.
On starting from camp the first two miles should be made at an easy walk; a halt
of from 10 to 15 minutes should then be made to allow the men to relieve
themselves and to rearrange harness, after which a halt of from 5 to 10 minutes is
made at the end of every hour for the purpose of adjusting harness, tightening
girths, etc. When troops march for the greater part of the day, a halt of about an
hour is usually made about noon. At each halt pole-props will be let down; collars
unlocked and thrown back on the saddle or withers, and cleaned if necessary;
saddles replaced if they have moved; cinchas tightened if necessary, and horses'
feet examined.
The march is usually in column of sections; when practicable, it will be in column
of platoons at close intervals; but the front of the column must not be frequently
diminished or increased, as this unavoidably adds to the fatigue of the horses,
particularly of those in rear. The column of platoons should not be used when it fills
the road from side to side so as to prevent the passage of carriages, staff-orderlies,
etc.
A non-commissioned officer may be sent forward to reconnoitre the road or
ground that the battery is to pass over.
The distance of two yards between carriages is maintained, except in bad or
difficult ground, when it may be increased to four or more yards. The strictest
attention should be paid by the chiefs of platoon and of section to the preservation
of distances, which should not be increased more than is absolutely necessary. The
leading guide should maintain a slow and steady walk, and under no circumstances
is a carriage to move at a trot without the orders of the battery commander; when
necessary to close up, it should be done at a quick walk; no practice is more
fatiguing to horses and injurious to their shoulders than the alternate trotting and
walking so often seen at the rear of a column.
If the leading carriage is temporarily stopped for any cause, the rear carriages
should, if practicable, draw up alongside each other, in order to avoid or diminish as
much as possible any check to the column.
Chiefs of platoons must never be permitted to leave their platoons to march at
the head of the column; when not marching at the rear of their platoons, they will
halt frequently to see that their carriages are well up and marching properly.
Chiefs of platoon and of section, without waiting for express instructions, give
such orders as may be necessary for helping horses out of difficulty, for the passage
of obstacles, etc.; the cannoneers assist at the piece or caisson as may be required.
A small bunch of bale-wire, in lengths of from one to two feet, if carried by each
chief of section in his saddle-pouch, is very useful for temporary repairs of harness.
If the ruts be very deep, the carriages quarter the road, unless it be very narrow
and sunken; in this case the horses will be left to themselves and not hurried; a
skilful driver can help his horses greatly, particularly the wheelers.
When water-call is sounded, the chiefs of section, under the supervision of the
chiefs of platoon or of the first sergeant, have the watering-buckets taken off the
carriages, and their horses watered without confusion. When water is very scarce,
the nostrils may be sponged, which gives great relief, particularly in hot weather,
when it is not possible to let the horses drink.
Toward the close of the march an officer or non-commissioned officer may be
sent forward to select a camp-ground. The last two miles or more should be made
at a walk, and the horses brought into camp without excitement.
Upon the arrival of the battery in camp damages must be repaired without delay,
horses shod, wheels and pintles greased, etc. On the march artificers and cooks
should always ride, or be mounted on the chests; if fatigued from marching, they
cannot be expected to work efficiently after getting into camp.
The march of larger bodies of artillery is conducted on the same principles. A long
column cannot move as rapidly as a small one, and at the same time preserve equal
order; an allowance is therefore made for every column proportionate to its length.
When the roads are good, or even tolerable, the artillery is always obliged to wait
for the infantry, which is attended with much additional fatigue to the horses, from
having the harness so much longer on them. Likewise, when the roads are at all
bad, artillery can only keep up with cavalry, when the latter are marching at the
ordinary rate, by forcing their horses too much and wearing them out very rapidly.
When, therefore, there is no danger, the artillery should be allowed to march by
itself so as to regulate its own rate of march.
Chiefs of section should carry nippers in their saddle-pouches to cut wire fences if
necessary.

ACCIDENTS TO CARRIAGES.

When an accident happens to a carriage, it is pulled out of the column, if possible, so as not to interrupt the march; otherwise the carriages in rear pass it by
the most convenient flank, and close to proper distance. The disabled carriage
resumes its place as soon as the damage is repaired. If the road be narrow, it must
fall into the first interval it finds, and regain its proper place as soon as the ground
permits. If a field-piece is disabled, the cannoneers left to repair it, who cannot be
carried on the limber-chest, mount on the axle-seats and off horses whenever the
piece takes the trot to regain its place. If a caisson is disabled, the caisson corporal
and the men necessary to repair it are left with it.
When a piece and its carriage are overturned, it is better to disengage the piece
by letting the breech rest on the ground, or on a block of wood, and then raise the
muzzle with a handspike while the cap-squares are taken off; the carriage is then
righted and the piece mounted.
To right the carriage without disengaging the piece, detach the limber, secure the
cap-squares, and lash the breech to the stock; place the middle of a rope over the
nave of one wheel, pass the ends of it downward between the lower spokes of that
wheel, then under the carriage, through the corresponding spokes of the other
wheel, and then upward over the wheel and across the top of the carriage to the
side where it was first attached. The ends of the rope and the wheel to be raised
