Preface
Acknowledgements
1 Basics of R Programming
1.1 Download and install R and RStudio
1.2 R Tutorial
1.2.1 R as a smart calculator
1.2.2 Define a sequence in R
1.2.3 Define a function in R
1.2.4 Plot with R
1.2.5 Symbolic calculations by R
1.2.6 Vectors and matrices
1.2.7 Simple statistics by R
1.3 Online Tutorials
1.3.1 YouTube tutorial: for true beginners
1.3.2 YouTube tutorial: for some basic statistical summaries
1.3.3 YouTube tutorial: Input data by reading a csv file into R
References
Exercises
This book is the instruction manual used for a short course on R Programming for
Climate Data Analysis and Visualization, first taught at the U.S. National Centers
for Environmental Information (NCEI), Asheville, North Carolina, 30 May - 2 June
2017. The purpose of the course is to train NCEI scientists and the personnel of
the Cooperative Institute for Climate and Satellites (CICS) - North Carolina to write
simple R programs for the climate data managed by the U.S. National Oceanic
and Atmospheric Administration (NOAA), so that the NOAA data can be easily
accessed, understood, and utilized by the general public, such as school students
and teachers. NOAAGlobalTemp is the primary dataset used for the examples of this
book.
R is an open-source programming language and software environment, originally
designed for statistical computing and graphics, which first appeared in 1993. In its first
ten years, R was used more or less only in the statistics community, but by 2017 R had
become one of the top 20 most popular computer programming languages, as ranked
by Cleveroad, Techworm, and others. R and its interface RStudio are free and have
become a very popular tool for handling big data, making calculations, and plotting.
R programs are often shorter because of R's sophisticated design and mathematical
optimization. R calculation and plotting code can be incorporated into the readme
file of a NOAA dataset, so that a data user can easily use the R code in the readme file
to read the data, change the data format, make some quick calculations, and plot
critical figures for their applications. R maps and numerous visualization functions
make R programming a convenient tool not only for NOAA data professionals, climate
research scientists, and business analysts, but also for teachers and students.
Thus, R programming is a convenient tool for climate data delivery, transparency,
accuracy checking, and documentation.
This course is divided into six chapters, which are taken from the book entitled
“Climate Mathematics with R”, authored by Samuel Shen and Richard Somerville,
to be published by Cambridge University Press. Chapter 1 describes R basics,
such as arithmetic, simple curve plotting, functions, loops, matrix operations, doing
statistics, if-else syntax, and logical variables. Chapter 2 uses R for observed
data, which are often space-time incomplete due to missing data. The NOAAGlobalTemp
dataset is used as an example and is analyzed extensively. We show area-weighted
spatial averaging, polynomial fitting, trend calculation with missing data, efficient
extraction of a subset of the data, data formatting, and data writing. Chapter 3 discusses
more advanced R graphics, including maps, multiple curves on the same
figure, and margin setup and font changes for publication. Chapter 4 shows how to
handle large datasets in different formats, such as .nc, .bin, .csv, .asc, and .dat. It
uses the NCEP/NCAR reanalysis monthly mean temperature data in .nc format as an
example to show data reading, data conversion into a standard space-time data matrix,
writing the matrix into a .csv file, plotting temperature maps, calculating
empirical orthogonal functions (EOFs) and principal components (PCs) efficiently
by the singular value decomposition (SVD) method, and plotting the EOFs and PCs.
Chapters 5 and 6 cover the basics of linear algebra and statistics using R. They can
be omitted in formal teaching and used as reference material for the previous four
chapters.
The book is intended for a wide range of audiences. A high school student with
some knowledge of matrices can understand most of its material. An undergraduate
student with two semesters of calculus and one semester of linear algebra can
understand the entire book. Some sophisticated R programming tricks and examples
are useful to climate scientists, engineers, professors, and graduate students.
Finally, a layman user can simply copy and paste the R code in this book to
produce the desired graphics, as long as he or she can spend ten minutes installing R and
RStudio following a YouTube instruction.
The book is designed for a one-week course totaling 20 hours. Half of the time is used
for teaching and demonstration, and the other half is for student practice guided by
an instructor. Each student is encouraged to produce an R code for her/his own
work or interest with the instructor's help.
The book's typesetting follows a Cambridge University Press LaTeX template.
The project was supported in part by the U.S. National Oceanic and Atmospheric
Administration (Award No. 13342-Z7812001). The students of the short course held at the U.S.
National Centers for Environmental Information (NCEI), Asheville, North Carolina,
30 May - 2 June 2017, helped clarify some R codes and correct some typos.
1 Basics of R Programming
The same author also has a YouTube video about R installation for Mac (2
minutes):
[Link]
When R is installed, one can open R, and the R Console window will appear (see
Fig. 1.1). One can use the R Console to perform calculations, such as typing 2+3 and
hitting return. However, most people today prefer using RStudio as the interface.
To install RStudio, visit
[Link]
This site allows one to choose Windows, Mac OS, or Unix.
Fig. 1.1 The R Console window after opening R.
After both R and RStudio are installed, one can use either R or RStudio, or
both, depending on one's preference. However, RStudio will not work without R. Thus,
always install R first.
When opening RStudio, four windows will appear, as shown in Fig. 1.2. The top
left window is the R script window, for writing R code. The green arrow at the top of
the window can be clicked to run the code. Each run is shown in the lower left
R Console window and recorded in the upper right R History window. When
plotting, the figure appears in the lower right Plots window. For example,
plot(x,x*x) renders eight points in the Plots window, because x=1:8 defines
a sequence of numbers from 1 to 8 and x*x yields the sequence from 1^2 to 8^2.
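These two lines reproduce the example just described, as a minimal sketch:
x = 1:8        #the sequence 1, 2, ..., 8
plot(x, x*x)   #plots the eight points (1,1), (2,4), ..., (8,64)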
1.2 R Tutorial
Many excellent tutorials for quickly learning R programming in a few hours or a
few evenings are available online and on YouTube, such as
Fig. 1.2 The RStudio windows.
[Link]
You can google around and find your preferred tutorials.
It is very hard for beginners of R to navigate through the official, formal,
detailed, and massive R Project documentation:
[Link]
1.2.1 R as a smart calculator
R can be used like a smart calculator that allows fancier calculations than those
done on regular calculators.
1+4
[1] 5
2+pi/4-0.8
[1] 1.985398
x<-1
y<-2
z<-4
t<-2*x^y-z
t
[1] -2
u=2 # The "=" sign and "<-" are almost equivalent
v=3 # The text after the "#" sign is a comment
u+v
[1] 5
sin(u*v) # u*v = 6 is treated as radians
[1] -0.2794155
Directly enter a sequence of daily maximum temperature data at the San Diego
International Airport during 1-7 May 2017 [unit: °F].
tmax <- c(77, 72, 75, 73,66,64,59)
The data are from the United States Historical Climatology Network (USHCN)
[Link]/cdo-web/quickdata
The command c() holds a data sequence, which here is assigned the name tmax. Entering
tmax renders the temperature data sequence:
tmax
[1] 77 72 75 73 66 64 59
1.2.2 Define a sequence in R
The following commands all generate the same sequence of integers from 1 to 8:
seq(1,8)
seq(8)
seq(1,8, by=1)
seq(1,8, length=8)
seq(1,8, length.out=8)
The most useful sequence commands are seq(1,8, by=1) and seq(1,8, length=8)
or seq(1,8, len=8). The former is determined by a begin value, end value, and
step size, and the latter by a begin value, end value, and number of values in the
sequence. For example, seq(1951,2016, len=66*12) renders a sequence of all the
months from January 1951 to December 2016.
1.2.3 Define a function in R
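A function in R is defined with the function keyword. The minimal sketch below (an illustration, not taken from the original course code) defines a function that converts Fahrenheit to Celsius and then calls it:
fahrenheit_to_celsius <- function(tempF) {
  (tempF - 32)*5/9   #the last expression is returned automatically
}
fahrenheit_to_celsius(77)
#[1] 25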
1.2.4 Plot with R
R can plot all kinds of curves, surfaces, statistical plots, and maps. Below are a
few very simple examples for R beginners. Adding labels, ticks, color, and other
features to a plot is covered in later parts of the book; one can also google
"R plot" to find the commands for including the desired features.
R plotting is based on coordinate data. The following command plots the
seven days of San Diego Tmax data entered above:
plot(1:7, tmax)
The resulting figure is shown in Fig. 1.3.
Fig. 1.3 The daily maximum temperatures at the San Diego International Airport during 1-7 May 2017.
plot(sin, -pi, 2*pi) #plot the curve of y=sin(x) from -pi to 2 pi
# Plot a 3D surface
x <- seq(-1, 1, length=100)
y <- seq(-1, 1, length=100)
z <- outer(x, y, function(x, y)(1-x^2-y^2))
#outer (x,y, function) renders z function on the x, y grid
persp(x,y,z, theta=330)
# yields a 3D surface with perspective angle 330 deg
#Contour plot
contour(x,y,z) #lined contours
filled.contour(x,y,z) #color map of contours
The color map of contours resulting from the last command is shown in Fig. 1.4.
Fig. 1.4 The color map of contours for the function z = 1 - x^2 - y^2.
1.2.5 Symbolic calculations by R
People used to think that R can only handle numbers. Actually, R can also do symbolic
calculations, such as finding a derivative, although R is so far not the
best symbolic calculation tool. One can use WolframAlpha, SymPy, or Yacas for
free symbolic calculations, or use the paid software packages Maple or Mathematica.
Googling "symbolic calculation for calculus" yields a long list of symbolic calculation
software packages, e.g., [Link]
D(expression(x^2,'x'), 'x')
# Take the derivative of x^2 w.r.t. x
2 * x #The answer is 2x
fx = expression(x^2*sin(x),'x')
#Change the expression and use the same derivative command
D(fx,'x')
2 * x * sin(x) + x^2 * cos(x)
integrate(cos,0,pi/2)
#The integral of cos(x) from 0 to pi/2 equals 1; R reports the result below
#1 with absolute error < 1.1e-14
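A second definite integral, added here as an illustration (not from the original code), integrates the standard normal density; the result should be 1 up to a small numerical error:
integrate(dnorm, -Inf, Inf)
#should report 1 with a small absolute error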
The above two integration examples are definite integrals. It seems that no
efficient R packages are available for finding antiderivatives, i.e., indefinite integrals.
1.2.6 Vectors and matrices
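The example below uses a 2 × 2 matrix my that is assumed to have been defined beforehand; one choice consistent with the printed solution is:
my = matrix(1:4, nrow=2)  #columns (1,2) and (3,4), i.e., rows (1,3) and (2,4)
With this choice, solve(my, c(1,3)) indeed yields 2.5 and -0.5, as shown below.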
ysol=solve(my,c(1,3))
#solve linear equations matrix %*% x = b
ysol #solve(matrix, b)
#[1] 2.5 -0.5
my%*%ysol #verifies the solution
# [,1]
#[1,] 1
#[2,] 3
1.2.7 Simple statistics by R
The global average annual mean temperature data for 2007-2016 can be analyzed with
R's simple statistics commands; the results are displayed in Fig. 1.5, together with
their linear trend line. A sketch of the commands involved is given below.
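A minimal sketch of how such statistics and a trend line can be obtained with lm(); the anomaly values below are synthetic placeholders for illustration, not the actual NOAAGlobalTemp values:
yr = 2007:2016
anom = 0.4 + 0.02*(yr - 2007) + rnorm(10, sd=0.05) #synthetic illustrative anomalies [deg C]
mean(anom); sd(anom); var(anom)  #simple statistics of the series
plot(yr, anom, type="o", xlab="Year", ylab="Temperature anomaly [deg C]")
reg = lm(anom ~ yr)              #linear regression of anomaly on year
abline(reg, col="red")           #add the fitted trend line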
Fig. 1.5 The 2007-2016 global average annual mean surface air temperature anomalies with respect to the 1971-2000 climate normal. The red line is a linear trend computed from a linear regression model.
1.3 Online Tutorials
This is a very good and slow-paced 22-minute YouTube tutorial: Chapter 1. An
Introduction to R
[Link]
An Excel file can be saved as a csv file: [Link]. This 15-minute YouTube video by
Layth Alwan shows how to read a csv file into R. He also shows linear regression.
[Link]
R can input many kinds of data files, including xlsx, netCDF, Fortran, and
SAS data. Some commands are below. One can google to find the proper data-reading
command for a particular data format.
ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fortran(ff, c("F2.1","F2.0","I2")) #read a Fortran-formatted file
library(ncdf)
ncin <- open.ncdf(ncfname) # open a netCDF file
lon <- get.var.ncdf(ncin, "lon") #read a netCDF variable into R
Many more details of reading and reformatting of .nc file will be discussed later
when dealing with NCEP/NCAR Reanalysis data.
Some packages are no longer available from the R Project repositories. For example,
library(ncdf) #The following error message pops up
Error in library(ncdf) : there is no package called ncdf
One can then google "r data reading netcdf R-project" and go to the R Project
website, where the following can be found:
Package ncdf was removed from the CRAN repository.
Formerly available versions can be obtained from the archive.
Archived on 2016-01-11: use ’RNetCDF’ or ’ncdf4’ instead.
This means that one should use RNetCDF or ncdf4 instead, which can be downloaded from the Internet.
Thus, if a library gives an error message, google the package, download
and install it, and then read the data with the specified format.
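As an illustration of the replacement (ncdf4's nc_open() and ncvar_get() are also used later in this book), a minimal sketch with a hypothetical file name is:
install.packages("ncdf4")         #only needed once
library(ncdf4)
ncin <- nc_open("your_file.nc")   #hypothetical file name
lon <- ncvar_get(ncin, "lon")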
References
Exercises
1.1 For some purposes, climatology or climate is defined as the mean state, or nor-
mal state, of a climate parameter, and is calculated from data over a period
of time called the climatology period (e.g., 1961-1990). Thus the surface air
temperature climate or climatology at a given location may be calculated by
averaging observational temperature data over a period such as 1961 through
1990. Thirty years are often considered in the climate science community
as the standard length of a climatology period. Due to the relatively high
density of weather stations in 1961-1990, compared to earlier periods, investi-
gators have often used 1961-1990 as their climatology period, although some
may now choose 1971-2000 or 1981-2010. Surface air temperature (SAT) is
often defined as the temperature inside a white-painted louvered instrument
container or box, known as a Stevenson screen, located on a stand about 2
meters above the ground. The purpose of the Stevenson screen is to shelter
the instruments from radiation, precipitation, animals, leaves, etc., while al-
lowing the air to circulate freely inside the box. Daily maximum temperature
(Tmax) is the maximum temperature measured inside the screen box by a
maximum temperature thermometer within 24 hours.
Go to the United States Historical Climatology Network (USHCN) website
[Link]
and download the monthly Tmax, Tmin, and Tmean data of the Cuyamaca
station (USHCN Site No. 042239) near San Diego, California. Use R to
calculate the August climatology of this station according to the 1961-1990
climatology period.
Fig. 1.6
Inside a Stevenson screen, invented by Thomas Stevenson in 1864, and recommended
by the World Meteorological Organization (WMO) to measure Tmax and Tmin using
two thermometers. The data were recorded every 24 hours. Tmax and Tmin are the
temperature extremes over the previous 24 hours and depend on the time of
observation. Thus, the observations have the time of observation bias (TOB) due to
the inconsistent time of data recording. A much-used dataset called the USHCN
dataset includes data corrected for TOB, as well as the raw (uncorrected) data.
1.2 Express the Tmax climatology as an integral when regarding Tmax as a func-
tion of time t, using the definition of an integral from the statistics perspective.
1.3 Use R (a) to plot the Cuyamaca January Tmin data from 1951 to 2010
as a continuous curve, and (b) to plot the linear trend lines of Tmin on the
same plot as (a) for the following time periods:
(i) 1951-2010,
(ii) 1961-2010,
(iii) 1971-2010, and
(iv) 1981-2010.
Finally, what is the temporal trend per decade for each of the four periods
above?
1.4 Trend and derivative:
(a) Use the derivative to explain the trends of the above exercise problem,
and
(b) Treat the time series of the Cuyamaca January Tmin in the above exercise
problem as a smooth function from 1951 to 2010. Use the curve and
its derivative to explain the instantaneous rate of change, using the
concept of the derivative.
(c) Use the average rate of change for a given period of time to explain the
linear trend in each of the four periods. Use the concept of mean value
theorem in the integral form.
1.5 Use the integral concept to describe the rainfall deficit or surplus history of San
Diego since January 1 of this year according to the USHCN daily precipitation
data, or do this for another location you are familiar with. You may use the
integral to describe the precipitation deficit or surplus. The daily data can be
found and downloaded from
[Link]
Requirements: You should use at least one figure. Your English text must be
longer than 100 words.
1.6 Time series and trend line plots for the NOAA global average annual mean
temperature anomaly data:
[Link]
noaa-global-surface-temperature-noaaglobaltemp
(a) Plot the global average annual mean temperature from 1880 to 2015.
(b) Find the linear trend of the temperature from 1880 to 2015. Plot the
trend line on the same figure as (a).
(c) Find the linear trend from 1900 to 1999. Plot the trend line on the same
figure as (a).
1.7 Use the gridded NOAA global monthly temperature data from the following
website or another data source
[Link]
noaa-global-surface-temperature-noaaglobaltemp
(a) Choose two 5-by-5 degrees lat-lon grid boxes of your interest. Plot the
temperature anomaly time series of the two boxes on the same figure
using two different colors.
(b) Choose sufficiently many grid boxes to cover the state of Texas. Compute
the average temperature anomaly of these boxes. Then plot the monthly
average temperature anomalies and show the trend line on the
same figure.
1.8 Research problem: Use the integral of temperature with respect to time to
interpret the concept of cumulative degree-days in agriculture. Consider the
energy needed by plants to grow.
Requirements: You must use at least one figure and one table. Your English
text must be longer than 100 words.
2 R Analysis of Incomplete Climate Data
Unlike climate model data, which are space-time complete, observed data
are often space-time incomplete, i.e., some space-time grid points or boxes do not
have data. We call this the missing data problem.
Missing data problems can be of many kinds and can be very complicated. Here
we use the NOAAGlobalTemp dataset to illustrate a few methods often used in
analyzing datasets with missing data. NOAAGlobalTemp is the merged land and
oceanic observed surface temperature anomaly dataset with respect to the 1971-2000 base
period climatology, produced by the U.S. National Centers for Environmental
Information in 2015.
[Link]
noaa-global-surface-temperature-noaaglobaltemp
This is a monthly dataset from January 1880 to the present on a 5° × 5°
latitude-longitude grid. The earlier years have many missing data, while
the recent years are better covered. Figure 2.1 shows the history of the percentage
of the global area covered by the data; one hundred minus this percentage is the percentage
of missing data. The minimum coverage is nearly 60%, much of which is due to
the good coverage provided by NOAA ERSST (Extended Reconstructed Sea Surface
Temperature).
Using software 4DVD (4-dimensional visual delivery of big climate data) devel-
oped at San Diego State University, one can easily see where and when data are
missing. Figure 2.2 shows the NOAAGlobalTemp data distribution over the globe
for January 1917. The data cover 72% of the global area. The black region includes
28% of the global area and has missing data. The data void regions include the
polar areas which could not be accessed at that time, the central tropical Pacific
regions which were not on the tracks of commercial ships, central Asia, part of
Africa, and the Amazon region. Figure 2.3 shows that the grid box (12.5S, 117.5W)
in the Amazon region did not begin to have data until 1918, and the data time
series after 1918 is discontinuous with missing data around 1921 and 1922.
Fig. 2.1 Percentage of the global surface area covered by the NOAAGlobalTemp dataset.
Fig. 2.2 The January 1917 distribution of the NOAAGlobalTemp data. The black regions indicate missing data.
Fig. 2.3 Time series of the monthly temperature anomalies for a grid box over the Amazon region.
This section describes how to use R to read the data and convert them into a
standard space-time matrix for various kinds of analyses.
First, we download the NOAAGlobalTemp gridded data from its ftp site
[Link]
The anomalies are with respect to the 1971-2000 climatology.
The ftp site offers two data formats: asc and bin. We use the asc format as an example
to describe the R analysis. The following R code reads the asc data and makes the
conversion.
rm(list=ls(all=TRUE))
# Download .asc file
setwd("/Users/sshen/Desktop/MyDocs/teach/SIOC290-ClimateMath2016/Rcodes/NOAAGlobalTemp")
da1=scan("[Link]")
length(da1)
#[1] 4267130
da1[1:3]
#[1] 1.0 1880.0 -999.9 #means mon, year, temp
#data in 72 rows (2.5, ..., 357.5) and
#data in 36 columns (-87.5, ..., 87.5)
tm1=seq(1,4267129, by=2594) #positions of the month values: each monthly record has 2 + 36*72 = 2594 numbers
tm2=seq(2,4267130, by=2594) #positions of the year values
length(tm1)
length(tm2)
mm1=da1[tm1] #Extract months
yy1=da1[tm2] #Extract years
head(mm1)
head(yy1)
length(mm1)
length(yy1)
rw1<-paste(yy1, sep="-", mm1) #Combine YYYY with MM
head(tm1)
head(tm2)
tm3=cbind(tm1,tm2)
tm4=[Link](t(tm3))
head(tm4)
#[1] 1 2 2595 2596 5189 5190
da2<-da1[-tm4] #Remove the month and year values from the scanned data
length(da2)/(36*72)
#[1] 1645 #months, 137 yrs 1 mon: Jan 1880-Jan 2017
da3<-matrix(da2,ncol=1645) #Generate the space-time data
#2592 (=36*72) rows and 1645 months (=137 yrs 1 mon)
To facilitate the use of space-time data, we add the latitude and longitude co-
ordinates for each grid box as the first two columns, and the time mark for each
month as the first row. This can be done by the following R code.
colnames(da3)<-rw1
lat1=seq(-87.5, 87.5, length=36)
lon1=seq(2.5, 357.5, length=72)
LAT=rep(lat1, each=72)
LON=rep(lon1,36)
gpcpst=cbind(LAT, LON, da3)
head(gpcpst)
dim(gpcpst)
#[1] 2592 1647 #The first two columns are Lat and Lon
#-87.5 to 87.5 and then 2.5 to 357.5
#The first row for time is a header, not counted as data.
write.csv(gpcpst,file="[Link]")
#Output the data as a csv file
With this space-time data matrix, one can plot a data map for a given month or a data time
series for a given location. For example, the code sketched below plots the temperature
data map for December 2015, an El Niño month (see Fig. 2.4).
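A minimal sketch of such a map (an illustration, not the original course code); it assumes the gpcpst matrix built above, and that December 2015 is month (2015-1880)*12 + 12 = 1632, i.e., column 1634 of gpcpst:
library(maps)
Lat = seq(-87.5, 87.5, by=5)
Lon = seq(2.5, 357.5, by=5)
mapmat = matrix(gpcpst[,1634], nrow=72)    #72 longitudes by 36 latitudes
mapmat[mapmat < -490.0] = NA               #mark the -999.9 missing values
mapmat = pmin(pmax(mapmat, -6), 6)         #clip extremes to the plotting range
int = seq(-6, 6, length.out=81)
rgb.palette = colorRampPalette(c('black','blue','darkgreen','green',
  'yellow','pink','red','maroon'), interpolate='spline')
filled.contour(Lon, Lat, mapmat, color.palette=rgb.palette, levels=int,
  plot.title=title(main="NOAAGlobalTemp Anomalies [deg C]: Dec 2015",
    xlab="Longitude", ylab="Latitude"),
  plot.axes={axis(1); axis(2); map('world2', add=TRUE); grid()},
  key.title=title(main="[oC]"))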
Fig. 2.4 Monthly mean temperature anomalies of December 2015 based on the NOAAGlobalTemp data.
If one wishes to study the data over a particular region, say, the tropical Pacific
for El Niño characteristics, one can extract the data for that region over a given time
interval. The following code extracts the space-time data for the tropical Pacific
region (20°S-20°N, 160°E-120°W) from 1951 to 2000.
#Keep only the data for the Pacific region
n2<-which(gpcpst[,1]>-20&gpcpst[,1]<20&gpcpst[,2]>160&gpcpst[,2]<260)
dim(gpcpst)
length(n2)
#[1] 160 #8 latitude bands times 20 longitude bands
pacificdat=gpcpst[n2,855:1454]
Here, we have used the powerful and convenient which() search command. This very
useful command is easier to program with and runs faster than if conditions.
Despite the good coverage of ERSST, there are still a few missing data in this tropical
Pacific area. Because the missing data are assigned the value -999.9, they can significantly
distort computed results, such as an SVD, if left in the data. We therefore
set the missing data to zero instead of -999.9 (see the one-line sketch below). The following code then plots the
December 1997 temperature data for the tropical Pacific region (20°S-20°N, 160°E-
120°W) (see Fig. 2.5).
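The zero replacement just described can be done, for instance, by the following line (an illustration; the original code for this step is not shown here):
pacificdat[pacificdat < -490.0] = 0   #replace the -999.9 missing values by zero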
Fig. 2.5 Tropical Pacific SST anomalies of December 1997 based on the NOAAGlobalTemp data.
Lat=seq(-17.5,17.5, by=5)
Lon=seq(162.5, 257.5, by=5)
[Link]()
par(mar=c(4,5,3,0))
mapmat=matrix(pacificdat[,564], nrow=20)
int=seq(-5,5,length.out=81)
rgb.palette=colorRampPalette(c('black','blue', 'darkgreen',
  'green', 'yellow','pink','red','maroon'),interpolate='spline')
#mapmat= mapmat[,seq(length(mapmat[1,]),1)]
filled.contour(Lon, Lat, mapmat, color.palette=rgb.palette, levels=int,
  xlim=c(120,300),ylim=c(-40,40),
  plot.title=title(main="Tropical Pacific SAT Anomalies [deg C]: Dec 1997",
    xlab="Longitude",ylab="Latitude", cex.lab=1.5),
  plot.axes={axis(1, cex.axis=1.5); axis(2, cex.axis=1.5);
    map('world2', add=TRUE);grid()},
  key.title=title(main="[oC]"),
  key.axes={axis(4, cex.axis=1.5)})
A special case is to extract data for a specified grid box with given latitude and
longitude, e.g., the San Diego box (32.5N, 117.5W) or (+32.5, 242.5). This can be
easily done by the following R code that includes a simple plotting command.
#Extract data for a specified box with given lat and lon
n2 <- which(gpcpst[,1]==32.5&gpcpst[,2]==242.5)
SanDiegoData <- gpcpst[n2,855:1454]   #Jan 1951-Dec 2000
plot(seq(1951,2000, len=length(SanDiegoData)),
  SanDiegoData, type="l",
  xlab="Year", ylab="Temp [oC]",
  main="San Diego temperature anomalies: Jan 1951-Dec 2000")
The area-weighted average, also called the spatial average, of a temperature field T(φ, θ, t)
on a sphere is mathematically defined as
$$\bar{T}(t) = \frac{1}{4\pi}\iint T(\phi, \theta, t)\cos\phi \, d\phi \, d\theta, \qquad (2.1)$$
where φ is latitude, θ is longitude, and t is time. The discrete form of the above formula
for a grid of resolution ∆φ × ∆θ is
$$\hat{\bar{T}}(t) = \sum_{i,j} T(i,j,t)\, \frac{\cos(\phi_{ij})\,\Delta\phi\,\Delta\theta}{4\pi}, \qquad (2.2)$$
where (i, j) are the coordinate indices of grid box (i, j), and ∆φ and ∆θ are in
radians. For a 5° resolution, ∆φ = ∆θ = (5/180)π.
If NOAAGlobalTemp had data in every box, then the global average would be
easy to calculate according to the above formula:
$$\hat{\bar{T}}(t) = \sum_{i,j} T(i,j,t)\, \frac{\cos(\phi_{ij})\,(5\pi/180)^2}{4\pi}. \qquad (2.3)$$
However, NOAAGlobalTemp has missing data, so we should not average over the
data-void regions. One method is to treat the spatial average as a weighted
average: a box with data is assigned a weight proportional to cos φij and a data-void
box is assigned zero weight. We thus generate a weight matrix areaw corresponding
to the data matrix temp by the following R code.
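The code below uses two objects whose definitions are not shown in this excerpt: temp, the space-time data matrix, and veca, the cosine-of-latitude weight of each grid row. Assuming temp is the gpcpst matrix built earlier, they can be set up as:
temp = gpcpst                 #space-time matrix; columns 1-2 are lat and lon
veca = cos(temp[,1]*pi/180)   #cosine of latitude for each of the 2592 grid rows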
#36-by-72 boxes and Jan 1880-Jan 2017 = 1645 months, plus lat and lon columns
areaw=matrix(0,nrow=2592,ncol = 1647)
dim(areaw)
#[1] 2592 1647
areaw[,1]=temp[,1]
areaw[,2]=temp[,2]
#create an area-weight matrix: cosine weight for boxes with data and zero for missing data
for(j in 3:1647) {for (i in 1:2592) {if(temp[i,j]> -290.0) {areaw[i,j]=veca[i]} }}
Then compute an area-weighted temperature data matrix and its average:
#area-weighted data matrix; keep the first two columns as lat-lon
tempw=areaw*temp
tempw[,1:2]=temp[,1:2]
#create monthly global average vector for 1645 months
#Jan 1880- Jan 2017
avev=colSums(tempw[,3:1647])/colSums(areaw[,3:1647])
Figure 2.6 shows the spatial average of the monthly temperature data from
NOAAGlobalTemp from January 1880 to January 2017 and can be generated by
the following R code.
timemo=seq(1880,2017,length=1645)
plot(timemo,avev,type="l", cex.lab=1.4,
xlab="Year", ylab="Temperature anomaly [oC]",
main="Area-weighted global average of monthly SAT anomalies: Jan 1880-Jan 2017")
abline(lm(avev ~ timemo),col="blue",lwd=2)
text(1930,0.7, "Linear trend: 0.69 [oC] per century",
cex=1.4, col="blue")
As a byproduct of the above weighted average, the matrix areaw can be used to
calculate the percentage of area covered by the data.
Fig. 2.6 Spatial average of monthly temperature anomalies with respect to the 1971-2000 climatology, based on the NOAAGlobalTemp data.
rcover=100*colSums(areaw[,3:1647])/sum(veca)
The following R code plots this time series, i.e., the percentage of the globe covered
by data, against time; the result is shown in Fig. 2.1 at the beginning of this chapter.
#Plot this time series
motime=seq(1880, 2017, length=1645)
plot(motime,rcover,type="l",ylim=c(0,100),
main="NOAAGlobalTemp Data Coverage: Jan 1880-Jan 2017",
xlab="Year",ylab="Percent area covered [\%]")
The NOAA National Centers for Environmental Information (NCEI) also computes
monthly mean global averages, which can be downloaded from the NOAAGlobalTemp
website. The differences between our monthly means and NCEI's
monthly means are less than 0.02°C. Figure 2.7 shows our data minus the NCEI
data, and can be generated by the following R code.
#Download the NCEI spatial average time series of monthly data
#[Link]
setwd("/Users/sshen/Desktop/MyDocs/teach/SIOC290-ClimateMath2016/Rcodes/Ch15-Rgraphics")
aveNCEI<-read.table("[Link].land_ocean.[Link]", header=FALSE)
Fig. 2.7 Shen's spatial averages of monthly anomalies minus the NCEI time series.
It is known that climate changes are not uniform across the year. We thus plot
the trend of each calendar month, January to December, over the period 1880-2016.
Figure 2.8 shows the strongest trend, 0.75°C per century, in March, and the weakest trend,
0.656°C per century, in September. This type of analysis is even more meaningful for
hemispheric or regional averages, such as the United States. Figure 2.8
can be produced by the following R code.
#Plot each month's anomalies with a trend line in 12 panels
[Link]()
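The rest of the original code for this figure is not reproduced in this excerpt; the following sketch, using the avev monthly global averages computed above (Jan 1880-Jan 2017), produces a comparable 12-panel trend figure:
monmat = matrix(avev[1:1644], ncol=12, byrow=TRUE)   #137 years (1880-2016) by 12 months
yrs = 1880:2016
mons = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
par(mfrow=c(4,3), mar=c(3,4,2,1))
for (m in 1:12) {
  fit = lm(monmat[,m] ~ yrs)
  plot(yrs, monmat[,m], type="l", xlab="Year", ylab="Anomaly [oC]",
    main=paste(mons[m], ": trend =", round(100*coef(fit)[2], 2), "oC/century"))
  abline(fit, col="red", lwd=2)
}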
Fig. 2.8 The trend of the spatial average for each calendar month based on the NOAAGlobalTemp data from 1880-2016.
The annual mean can be computed and plotted by first converting the vector of
monthly spatial averages into a 12-column matrix, in which each column is a calendar
month; the row means then yield the annual means. A sketch is given below.
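A sketch of this computation (not the original course code), again using avev:
avem = matrix(avev[1:1644], ncol=12, byrow=TRUE)   #rows are the years 1880-2016
annv = rowMeans(avem)                              #annual means
plot(1880:2016, annv, type="l", xlab="Year", ylab="Temperature anomaly [oC]",
  main="Global average annual mean SAT anomalies")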
Fig. 2.9 Annual mean of the monthly spatial average anomalies from the NOAAGlobalTemp data.
The global average annual mean temperature apparently does not vary linearly with
time. It is thus useful to examine the underlying nonlinear variation of the annual
temperature time series. The simplest nonlinear trend exploration is through a
polynomial fit. Usually, orthogonal polynomial fits are more efficient and have better
fidelity to the data. Figure 2.10 shows two fits, by 9th-order and 20th-order
orthogonal polynomials. The 9th order was chosen because it is the lowest order
polynomial that can reflect the oscillation of temperature from the high in the
1880s to the low in the 1910s, the rise until the 1940s, and the decrease in the 1960s
and 1970s. The 20th-order polynomial was chosen because it is the lowest
order orthogonal polynomial that can mimic the detailed climate variations, such
as the local highs around 1900 and 1945. We have tried higher order polynomials,
which often show unphysical overfitting.
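In R, orthogonal polynomial fits can be obtained with poly(); a minimal sketch (not the original code) using the annual means annv from the sketch above:
yrs = 1880:2016
fit9 = lm(annv ~ poly(yrs, 9))     #9th-order orthogonal polynomial fit
fit20 = lm(annv ~ poly(yrs, 20))   #20th-order orthogonal polynomial fit
plot(yrs, annv, type="l", xlab="Year", ylab="Temperature anomaly [oC]")
lines(yrs, predict(fit9), col="blue", lwd=2)
lines(yrs, predict(fit20), col="red", lwd=2)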
Fig. 2.10 The annual mean time series and its fits by orthogonal polynomials.
It is widely known that the global average temperature has increased, especially
in recent decades since the 1970s; this is known to the general public as "global
warming." However, the increase is non-uniform, and a few areas have even experienced
cooling, such as the 1900-1999 cooling over the North Atlantic off the coast
of Greenland. Figure 2.11 shows the uneven spatial distribution of the linear trend
of the monthly SAT anomalies from January 1900 to December 1999. Most parts of
the world experienced warming, particularly over the land areas; Canada and Russia
thus experienced more warming in the 20th century than other regions
around the world.
Many grid boxes do not have complete data streams from January 1900 to December
1999. Our trend calculation's R code allows some missing data in the middle of a
data stream, but it requires data at both the beginning month (January 1900) and
the end month (December 1999). When a grid box does not satisfy this requirement,
the trend for the box is not calculated. The large white areas in Fig. 2.11 over the polar
regions, the Pacific, Africa, and Central America are boxes that do not satisfy the requirement. For
missing data in the middle of a grid box's data stream, our linear regression
omits the missing values and carries out the regression with a shorter temperature
data stream and a correspondingly shorter time data stream.
We used lm(temp1[i,243:1442] ~ timemo1, na.action=na.omit) to treat
the missing data between the beginning month and the end month. The missing
data have been replaced by NA. The argument na.action=na.omit means that
the missing data are omitted in the regression, and the fitted values at the missing
data's time locations are omitted too and are not output. One can instead use
lm(temp1[i,243:1442] ~ timemo1, na.action=na.exclude) to do the
linear regression with missing data. The slope and intercept computed by
the two commands are the same; the only difference is that the latter outputs NA
for the fitted values at the missing data's time locations. For example,
x=1:8
y=c(2,4,NA,3,6.8,NA,NA,9)
fitted(lm(y ~ x, na.action=na.exclude))
# 1 2 3 4 5 6 7 8
#2.08 3.04 NA 4.96 5.92 NA NA 8.80
##
fitted(lm(y ~ x, na.action=na.omit))
# 1 2 4 5 8
#2.08 3.04 4.96 5.92 8.80
Fig. 2.11 Linear trend of SAT from January 1900 to December 1999. The trend was calculated for each grid box using the NOAAGlobalTemp data, and the procedure required that the box have no missing data for the first month (January 1900) and the last month (December 1999). The white regions are those where the data did not satisfy our calculation conditions, i.e., regions with an insufficient amount of data.
If we relax our trend calculation condition and allow a trend to be computed for a
grid box when the box has less than one third of its data missing, then trends
can be computed for more grid boxes. Figure 2.12 shows the trend map computed
under this relaxed condition.
Figure 2.12 uses °C per decade as the unit, while Fig. 2.11 uses °C per century.
The patterns of the two figures are consistent, which implies that the relaxed con-
dition for trend calculation has not led to spatially inconsistent trends. Thus, Fig.
2.12 can be regarded as an accurate spatial extension of Fig. 2.11.
Figure 2.12 can be generated by the following R code.
#Trend for each box for the 20th century: Version 2: Allow 2/3 of data
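#The rest of the original code is not shown here. The sketch below (an assumption,
#not the author's exact code) implements the relaxed rule: compute a trend whenever
#fewer than 1/3 of the 1200 months are missing, in deg C per decade as in Fig. 2.12.
#temp1 is temp with missing values set to NA, as in the code for Fig. 2.13 below.
timemo1=seq(1900, 2000, len=1200)
trend20c=rep(NA, 2592)
for (i in 1:2592){
  if (sum(is.na(temp1[i, 243:1442])) < 400)
  {trend20c[i]=10*lm(temp1[i, 243:1442] ~ timemo1,
     na.action=na.omit)$coefficients[2]}  #deg C per decade
}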
Fig. 2.12 Linear trend of SAT from January 1900 to December 1999. The trend was calculated for each grid box of the NOAAGlobalTemp data when the box had less than 1/3 of its data missing.
Our recent period of long-term rapid warming (the four decades from 1976 to 2016)
exhibits a warming that is greater than the previous long-term warming from the 1910s to
the early 1950s, which also lasted about four decades. Figure 2.13 shows the strong
global warming trend from January 1976 to December 2016. It shows that during
this period the world became warmer on every continent except Antarctica.
Fig. 2.13 Linear trend of SAT from January 1976 to December 2016. The white regions indicate an insufficient amount of data.
The trend data for Fig. 2.13 can be calculated using the following R code.
timemo2=seq(1976,2017, len=492)
temp1=temp
temp1[temp1 < -490.00] <- NA
trend7616=rep(0,2592)
for (i in 1:2592){
if(is.na(temp1[i,1155])==FALSE & is.na(temp1[i,1646])==FALSE)
{trend7616[i]=lm(temp1[i,1155:1646] ~ timemo2, na.action=na.omit)$coefficients[2]}
else
{trend7616[i]=NA}
}
The R code for plotting Fig. 2.13 is almost identical to that for the 20th century
trend of Fig. 2.12 and is omitted here.
References
[1] Huang, B., V.F. Banzon, E. Freeman, J. Lawrimore, W. Liu, T.C. Peterson,
T.M. Smith, P.W. Thorne, S.D. Woodruff, and H.M. Zhang (2015): Extended
reconstructed sea surface temperature version 4 (ERSST.v4). Part I: Upgrades
and intercomparisons. Journal of Climate, 28, 911-930.
[2] Karl, T.R., A. Arguez, B. Huang, J.H. Lawrimore, J.R. McMahon, M.J. Menne,
[Link], R.S. Vose, and H.M. Zhang (2015): Possible artifacts of data
biases in the recent global surface warming hiatus. Science, 348, 1469-1472.
[3] Smith, T.M. and R.W. Reynolds (2003): Extended reconstruction of global sea
surface temperatures based on COADS data (1854-1997). Journal of Climate,
16, 1495-1510.
Exercises
2.1 Following the R code for generating Fig. 2.6 for the monthly global average
SAT anomalies, write an R code to generate a similar figure but for the North-
ern Hemisphere’s SAT anomalies from January 1880 to December 2016, based
on the gridded 5-deg NOAAGlobalTemp.
2.2 Compute and plot the spatial average of the annual mean SAT for the North-
ern Hemisphere from 1880 to December 2016.
2.3 Do the same as the previous problem, but for the Southern Hemisphere.
2.4 Plot and compare the maps of the January SAT anomalies’ linear trends from
1948 to 2016 based on the gridded January SAT anomalies around the 1971-
2000 climatology period for two datasets: the NCEP/NCAR Reanalysis data
and the NOAAGlobalTemp data. Use 200-500 words to describe your results.
2.5 (a) Plot the time series of the spatially averaged annual mean SAT anomalies
for the contiguous United States using the NOAAGlobalTemp data from 1880
to 2016.
(b) Add a linear trend line to the time series plot in (a). Mark the trend value
on the figure with the unit [◦ C per century].
3 R Graphics for Climate Science
This chapter is an introduction to the basic skills needed to use R graphics for
climate science. These skills are sufficient to meet most needs for climate science
research, teaching and publications. We have divided these skills into the following
categories:
(i) Plotting multiple data time series in the same figure, including multiple panels
in a figure, adjusting margins, and using proper fonts for text, labels, and
axes;
(ii) Creating color maps of a climate parameter, such as the surface air temperature
on the globe or over a given region; and
(iii) Animation.
Chapter 1 already showed how to plot a simple time series using plot(xtime, ydata).
Climate science often requires one to plot two different quantities, such as two time
series, on the same plot so that direct comparisons can be made. For example, to
see whether a hot year is also a dry year, one may plot the temperature data on the
same figure as the precipitation data. The left side of the y-axis shows temperature
and the right side shows precipitation. The code sketched below produces a figure containing
the contiguous United States (CONUS) annual mean temperature and annual total
precipitation for 2001-2010 (see Fig. 3.1).
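A minimal two-axis sketch (not the author's original code), reusing the Tmean and Prec values listed later in this chapter:
Time <- 2001:2010
Tmean <- c(12.06,11.78,11.81,11.72,12.02,12.36,12.03,11.27,11.33,11.66)
Prec <- c(737.11,737.87,774.95,844.55,764.03,757.43,741.17,793.50,820.42,796.80)
par(mar=c(4,4,3,4))                      #leave room on the right for a second axis
plot(Time, Tmean, type="o", col="red", xlab="Year", ylab="Tmean [deg C]",
  main="CONUS annual mean temperature and annual total precipitation")
par(new=TRUE)                            #overlay a second plot on the same figure
plot(Time, Prec, type="o", col="blue", axes=FALSE, xlab="", ylab="")
axis(4)                                  #right-hand axis for precipitation
mtext("Precipitation [mm]", side=4, line=2.5, col="blue")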
Fig. 3.1 Contiguous United States annual mean temperature and annual total precipitation.
Figure 3.1 shows that during the ten years from 2001 to 2010, the CONUS precip-
itation and temperature are in opposite phase: higher temperature tends to occur
in dry years with less precipitation, and lower temperature tends to occur in wet
years with more precipitation.
R has the flexibility to create plots with specific margins, mathematical symbols for
text and labels, text fonts, text size, and more. R also allows one to merge multiple
figures. These capabilities are often useful in producing a high-quality figure for
presentations or publication.
The command par(mar=c(2,5,3,1)) specifies the four margins of a figure. The first number, 2
(i.e., two lines of space), is for the bottom (the x-axis side), the second, 5, is for the left (the y-axis side), 3 is for the top, and
1 is for the right. One can change the numbers in par(mar=c(2,5,3,1)) to adjust
the margins. A simple example is shown in Fig. 3.2, which may be generated by
an R program like the sketch below.
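A minimal sketch consistent with the figure caption (set margins, insert mathematical symbols, and write text outside the figure); it is an illustration rather than the author's original code:
par(mar=c(2,5,3,1))                  #margins: bottom, left, top, right (in lines)
plot(1:10, 10:1, xlab="", ylab=expression(alpha + beta^2),
  main=expression("A simple plot with a math symbol" ~ mu))
mtext("Text written outside the plot region", side=3, line=2)   #placed in the top margin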
Fig. 3.2 Set margins, insert mathematical symbols, and write text outside a figure.
Similar to using cex.axis=1.8 to change the font size of the axis tick values, one can
use
cex.lab=1.5, cex.main=1.5, cex.sub=1.5
to change the font sizes for axis labels, the main title, and the sub-title. An example
is shown in Fig. 3.3 generated by the R code below.
par(mar=c(8,6,3,2))
par(mgp=c(2.5,1,0))
plot(1:200/20, rnorm(200),sub="Sub-title: 200 random values",
xlab= "Time", ylab="Random values", main="Normal random values",
cex.lab=1.5, cex.axis=2, cex.main=2.5, cex.sub=2.0)
par(mgp=c(2,1,0))
plot(sin,xlim=c(10,20))
The above R code used many R plot functions. An actual climate science line
plot is often simpler than this. One can simply remove the redundant functions in
the above R code to produce the desired figure.
Let us plot the global average annual mean surface air temperature (SAT) from
1880 to 2016 using the above plot functions (see Fig. 3.4). The data are from the
NOAAGlobalTemp dataset
[Link]
noaa-global-surface-temperature-noaaglobaltemp
We write the data in two columns in a file named NOAATemp: the first column is
the year, and the second is the temperature anomaly.
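A sketch of the reading and plotting; the file name and reading command here are assumptions, since only the description of the file is given:
NOAATemp = read.table("NOAATemp.txt", header=FALSE)   #hypothetical file name
plot(NOAATemp[,1], NOAATemp[,2], type="l",
  xlab="Year", ylab="Temperature anomaly [deg C]",
  main="Global average annual mean SAT anomalies: 1880-2016",
  cex.lab=1.5, cex.axis=1.5)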
Fig. 3.4 Global average annual mean SAT based on the United States' NOAAGlobalTemp data.
Another way to compare the temperature and precipitation time series is to plot
them in different panels and display them in one figure, as shown in Fig. 3.5.
Fig. 3.5 (a) Contiguous United States annual mean temperature; (b) annual total precipitation.
Figure 3.5 can be generated by the following R code. This figure’s arrangement
has used the setups described in the above sub-section.
#Plot US temp and prec times series on the same figure
par(mfrow=c(2,1))
par(mar=c(0,5,3,1)) #Zero space between (a) and (b)
Time <- 2001:2010
Tmean <- c(12.06, 11.78,11.81,11.72,12.02,12.36,12.03,11.27,11.33,11.66)
Prec <- c(737.11,737.87,774.95,844.55,764.03,757.43,741.17,793.50,820.42,796.80)
plot(Time,Tmean,type="o",col="red",xaxt="n", xlab="",ylab="Tmean [deg C]")
text(2006, 12,font=2,"US Annual Mean Temperature", cex=1.5)
text(2001.5,12.25,"(a)")
#Plot the panel on row 2
par(mar=c(3,5,0,1))
plot(Time, Prec,type="o",col="blue",xlab="Time",ylab="Prec [mm]")
text(2006, 800, font=2, "US Annual Total Precipitation", cex=1.5)
text(2001.5,840,"(b)")
After completing this figure, the R console may “remember” the setup. When
you plot the next figure expecting the default setup, R may still use the previous
setup. One can remove the R “memory” by
rm(list=ls())
dev.off()
A more flexible way to stack multiple panels into a single figure is to
use a layout matrix. The example sketched below puts three panels on a 2-by-2 matrix
space: the first panel occupies the two positions of the first row, and panels 2 and 3 occupy
the two positions of the second row.
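A minimal sketch of such a layout (illustrative data, not the author's original example):
layout(matrix(c(1,1,2,3), nrow=2, byrow=TRUE))   #panel 1 spans the whole first row
plot(1:10, rnorm(10), main="Panel 1")
plot(1:10, rnorm(10), main="Panel 2")
plot(1:10, rnorm(10), main="Panel 3")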
Making a contour plot in R generally involves the following:
(i) The main purpose of a contour plot is to show a 3D surface with contours or
filled contours, or simply a color map of a climate parameter;
(ii) (x, y, z) coordinate data or a function z = f(x, y) should be given; and
(iii) A color scheme should be defined, such as color.palette = rainbow.
A few simple examples are below.
x <- y <- seq(-1, 1, len=25)
z <- matrix(rnorm(25*25),nrow=25)
contour(x,y,z, main="Contour Plot of Normal Random Values")
filled.contour(x,y,z, main="Filled Contour Plot of Normal Random Values")
filled.contour(x,y,z, color.palette = heat.colors)
filled.contour(x,y,z, color.palette = colorRampPalette(c("red", "white", "blue")))
For climate applications, a contour plot is often overlaid on a geographic map, such
as a world map or a map of a country or region. Our first example shows a very
simple color plot over the world: standard normal random values plotted on a
5° × 5° grid over the globe.
library(maps)
Lat3<-seq(-87.5, 87.5, by=5)    #36 latitude grid centers
Lon3<-seq(2.5, 357.5, by=5)     #72 longitude grid centers
mapdat<-matrix(rnorm(72*36),nrow=72)
int=seq(-3,3,length.out=81)
rgb.palette=colorRampPalette(c('black','purple','blue','white',
  'green', 'yellow','pink','red','maroon'),
  interpolate='spline')
filled.contour(Lon3, Lat3, mapdat, color.palette=rgb.palette, levels=int,
  plot.title=title(main="Standard Normal Random Values on a World Map: 5-deg grid",
    xlab="Lon", ylab="Lat"),
  plot.axes={axis(1); axis(2);map('world2', add=TRUE);grid()})
Reanalysis products, such as the NCEP/NCAR Reanalysis used below, combine climate
model output with observational data which have been adjusted in a physically consistent way with the
assistance of climate models. The data assimilation system is the tool used to accomplish
such a data adjustment process correctly.
rm(list=ls(all=TRUE))
setwd("/Users/sshen/Desktop/Papers/KarlTom/Recon2016/Test-with-Gregori-prec-data")
# 4 dimensions: lon,lat,level,time
nc=ncdf4::nc_open("[Link]")
nc
nc$dim$lon$vals # output values 0.0->357.5
nc$dim$lat$vals #output values 90->-90
nc$dim$time$vals
#nc$dim$time$units
#nc$dim$level$vals
Lon <- ncvar_get(nc, "lon")
Lat1 <- ncvar_get(nc, "lat")
Time<- ncvar_get(nc, "time")
head(Time)
#[1] 65378 65409 65437 65468 65498 65529
library(chron)
month.day.year(1297320/24,c(month = 1, day = 1, year = 1800))
#1948-01-01
precnc<- ncvar_get(nc, "air")
dim(precnc)
#[1] 144 73 826, i.e., 826 months=1948-01 to 2016-10, 68 years 10 mons
#plot the 90S-90N temperature along a meridional line at 35E
plot(seq(90,-90,length=73),precnc[15,,1],
  type="l", xlab="Lat", ylab="Temp [oC]",
  main="90S-90N temperature [deg C]
  along a meridional line at 35E: Jan 1948",
  lwd=3)
Fig. 3.8 The surface air temperature along a meridional line at 35°E: January 1948.
Here, our first example plots the temperature variation in the meridional
(i.e., north-south) direction from pole to pole, for a given longitude.
Next we plot global color contour maps showing the January temperature
climatology, computed as the average of the January temperatures from 1948 to 2015; the
surface air temperature of January 1983; and finally its anomaly, defined as the
January 1983 data minus the January climatology. The R
code is below, and the results are shown in Figs. 3.9-3.11.
#Compute and plot climatology and standard deviation Jan 1948-Dec 2015
library(maps)
climmat=matrix(0,nrow=144,ncol=73)
sdmat=matrix(0,nrow=144,ncol=73)
Jmon<-12*seq(0,67,1)+1 #time indices of the Januaries: 1, 13, ..., 805 (Jan 1948-Jan 2015)
for (i in 1:144){
for (j in 1:73) {climmat[i,j]=mean(precnc[i,j,Jmon]);
sdmat[i,j]=sd(precnc[i,j,])
}
}
mapmat=climmat
#filled.contour() requires the coordinates to be increasing: latitude from -90 to 90
#and longitude from 0 to 360. The netCDF file stores latitude from 90 to -90, so we
#reverse the latitude vector (negating the symmetric sequence does this) and flip the
#columns of the data matrix accordingly, so that each value stays with its latitude.
Lat=-Lat1
mapmat= mapmat[,length(mapmat[1,]):1]  #flip the matrix left to right (reverse the latitude columns)
#mapmat= t(apply(t(mapmat),2,rev))
int=seq(-50,50,length.out=81)
rgb.palette=colorRampPalette(c('black','blue','darkgreen','green',
  'white','yellow','pink','red','maroon'),interpolate='spline')
filled.contour(Lon, Lat, mapmat, color.palette=rgb.palette, levels=int,
  plot.title=title(main="NCEP RA 1948-2015 January climatology [deg C]",
    xlab="Longitude",ylab="Latitude"),
  plot.axes={axis(1); axis(2);map('world2', add=TRUE);grid()},
  key.title=title(main="[oC]"))
Fig. 3.9 NCEP Reanalysis January climatology (upper panel), computed as the mean of the January temperatures from 1948-2015. The lower panel shows the standard deviation of the same 1948-2015 January temperature data.
Fig. 3.10 NCEP Reanalysis temperature of January 1983: an El Niño event.
To see the El Niño, we compute the temperature anomaly, which is the January
1983 temperature minus the January climatology. A large tongue-shaped region over
the eastern tropical Pacific appears, with temperatures up to almost 6°C warmer
than the climatological average temperatures (Fig. 3.11). This is the typical El Niño
signal.
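A sketch (not the author's exact code) of the anomaly computation behind Figs. 3.10-3.11, reusing climmat and the latitude flip from above; January 1983 is month (1983-1948)*12 + 1 = 421 of precnc:
anomat = precnc[,,421] - climmat        #Jan 1983 minus the January climatology
anomat = anomat[,length(anomat[1,]):1]  #same latitude flip as for the climatology map
#anomat (and precnc[,,421], similarly flipped) can then be passed to filled.contour as above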
Fig. 3.11 NCEP Reanalysis temperature anomaly of January 1983, showing the eastern tropical Pacific's El Niño warming tongue.
To describe the use of arrow.plot() from the fields package, we use the ideal geostrophic wind field as an
example of plotting a vector field on a map (see Fig. 3.13). The geostrophic wind field is
the result of the balance between the pressure gradient force (PGF) and the Coriolis
force (CF).
Figure 3.13 can be generated by the following R code.
#Wind directions due to the balance between PGF and Coriolis force
#using an arrow plot for vector fields on a map
library(fields)
library(maps)
library(mapproj)
lat<-rep(seq(-75,75,len=6),12)
Fig. 3.13 Vector field of the ideal geostrophic wind field.
lon<-rep(seq(-165,165,len=12),each=6)
x<-lon
y<-lat
#u<- rep(c(-1,1,-1,-1,1,-1), each=12)
#v<- rep(c(1,-1,1,-1,1,-1), each=12)
u<- rep(c(-1,1,-1,-1,1,-1), 12)
v<- rep(c(1,-1,1,-1,1,-1), 12)
wmap<-map(database="world", boundary=TRUE, interior=TRUE)
grid(nx=12,ny=6)
#map.grid(wmap,col=3,nx=12,ny=6,label=TRUE,lty=2)
points(lon, lat,pch=16,cex=0.8)
arrow.plot(lon,lat,u,v, arrow.ex=.08, length=.08, col='blue', lwd=2)
box()
axis(1, at=seq(-165,135,60), lab=c("165W","105W","45W","15E","75E","135E"),
  col.axis="black",tck = -0.05, las=1, line=-0.9,lwd=0)
axis(1, at=seq(-165,135,60),
  col.axis="black",tck = 0.05, las=1, labels = NA)
axis(2, at=seq(-75,75,30),lab=c("75S","45S","15S","15N","45N","75N"),
  col.axis="black", tck = -0.05, las=2, line=-0.9,lwd=0)
axis(2, at=seq(-75,75,30),
  col.axis="black", tck = 0.05, las=1, labels = NA)
text(30, 0, "Intertropical Convergence Zone (ITCZ)", col="red")
text(75, 30, "Subtropical High", col="red")
text(75, -30, "Subtropical High", col="red")
mtext(side=3, "Polar High", col="red", line=0.0)
3.3.2 Plot a sea wind field from netCDF data
This sub-section uses vectorplot() in the rasterVis package to plot a wind velocity field,
with the NOAA surface wind data over the global ocean as the example. The procedure
is described from the data download to the final product, a plotted wind field. The NOAA
wind data were generated from multiple satellite observations, such as QuikSCAT,
SSMIs, TMI, and AMSR-E, on a global 1/4° × 1/4° grid with a time resolution
of 6 hours.
Fig. 3.14 The NOAA sea wind field of 1 January 1995: UTC 00Z, at 1/4° × 1/4° resolution.
library(ncdf4)
library(chron)
library(RColorBrewer)
library(lattice)
download.file("[Link]",
  "[Link]", method = "curl")
mincwind <- nc_open("[Link]")
dim(mincwind)
#[Link](mincwind)
u <- ncvar_get(mincwind, "u")
class(u)
dim(u)
v <- ncvar_get(mincwind, "v")
class(v)
dim(v)
u9 <- raster(t(u[, , 9])[ncol(u):1, ])
v9 <- raster(t(v[, , 9])[ncol(v):1, ])
filled.contour(u[, , 9])
filled.contour(u[, , 9], color.palette = heat.colors)
filled.contour(u[, , 9], color.palette = colorRampPalette(c("red", "white", "blue")))
contourplot(u[, , 9])
[Link]("raster")
library(raster)
library(sp)
library(rgdal)
u9 <- raster(t(u[, , 9])[ncol(u):1, ])
v9 <- raster(t(v[, , 9])[ncol(v):1, ])
w <- brick(u9, v9)
wlon <- ncvar_get(mincwind, "lon")
wlat <- ncvar_get(mincwind, "lat")
range(wlon)
range(wlat)
plot(w[[1]])
plot(w[[2]])
[Link]("rasterVis")
[Link]("latticeExtra")
library(latticeEtra)
library(rasterVis)
vectorplot(w * 10, isField = "dXY", region = FALSE, margin = FALSE, narrows = 10000)
Also see the following websites for more vector field plots [Link]
[Link]
Fig. 3.15 The "Lower 48" contiguous states of the United States.
References
Exercises
3.1 Use R to plot the temperature and precipitation anomaly time series from the
NCEP Reanalysis data for the grid boxes of Tahiti and Darwin. Put the four
time series on the same figure, and explain their behaviors during the El
Niño and La Niña periods.
3.2 Use R and NCEP Reanalysis data to display the El Niño temperature anomaly
for January 2016. Find the latitude and longitude of the grid box where
the maximum temperature anomaly of the month occurred. What is the maximum
anomaly?
3.3 Use R to compute the 1971-2000 climatology from the NCEP Reanalysis’
annual mean temperature data for each grid box. Plot the climatology map.
3.4 Use R to compute the 1948-2010 standard deviation from the NCEP Re-
analysis’ annual mean temperature data for each grid box. Plot the standard
deviation map.
3.5 Use R to plot a map of North America and use arrows to indicate the
Alaska jet stream.
4 Advanced R Analysis and Plotting for Climate Data
The empirical orthogonal function (EOF) method is a commonly used tool in
modern climate data analysis. EOFs show spatial patterns of climate data,
such as the El Niño warm anomaly pattern of the eastern tropical Pacific. The
corresponding temporal patterns are called principal components (PCs); thus,
EOF analysis is also called PC analysis. We describe the EOFs and PCs as a
natural space-time decomposition of a space-time data matrix using the singular
value decomposition (SVD) method and the simple R command svd(datamatrix).
This is different from the traditional approach via the eigenvalue problem of a covariance
matrix. This chapter provides recipe-like R codes and their explanations for EOF
and PC calculations. It also describes temporal trend calculations of climate data
and the influence of trends on the first few EOFs.
The spatial fields of many climate data applications are two-dimensional, or 2Dim
for short. The corresponding EOFs are defined over a 2Dim domain on the Earth's surface,
and the corresponding PCs are defined on a time interval. This section describes the basic
concepts of using SVD to compute EOFs and PCs and of using R graphics to display
them. We use a simple synthetic data set to illustrate the procedures. The next
section features real climate data.
The spatial domain is Ω = [0, 2π] × [0, 2π], and the time interval is T = [1, 10]. The
synthetic data are generated by the function
$$z(x, y, t) = c_1(t)\,\psi_1(x, y) + c_2(t)\,\psi_2(x, y), \qquad (4.3)$$
where $\psi_1(x, y)$ and $\psi_2(x, y)$ are two orthonormal basis functions:
$$\psi_1(x, y) = (1/\pi) \sin x \sin y, \qquad (4.4)$$
$$\psi_2(x, y) = (1/\pi) \sin(8x) \sin(8y), \qquad (4.5)$$
with
$$\int_\Omega \psi_k^2(x, y)\, d\Omega = 1, \quad k = 1, 2, \qquad (4.6)$$
$$\int_\Omega \psi_1(x, y)\, \psi_2(x, y)\, d\Omega = 0. \qquad (4.7)$$
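A sketch (an assumption rather than the original code) of how the synthetic array mydat used below could be generated on a 100 × 100 grid, taking c1(t) = sin t and c2(t) = exp(-0.3t) as suggested by the PC discussion later in this section:
x <- seq(0, 2*pi, length=100)
y <- seq(0, 2*pi, length=100)
mydat <- array(0, dim=c(100,100,10))
for (t in 1:10) {
  zfun <- function(x, y) sin(t)*(1/pi)*sin(x)*sin(y) +
                         exp(-0.3*t)*(1/pi)*sin(8*x)*sin(8*y)
  mydat[,,t] <- outer(x, y, zfun)
}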
Fig. 4.1 The z(x, y, t) function at t = 1 and t = 10.
4.2.2 SVD for the synthetic data: EOFs, variances and PCs
We first convert the synthetic data into a 10000 × 10 space-time data matrix:
each of the 100 × 100 grid points corresponds to a row and each of the 10 time steps
to a column. SVD can then be applied to this matrix to generate the EOFs, variances,
and PCs.
The following code converts the 3Dim array mydat into the 2Dim space-time
data matrix da1.
da1<- matrix(0,nrow=length(x)*length(y),ncol=10)
for (i in 1:10) {da1[,i]=c(t(mydat[,,i]))}
Applying SVD on this space-time data is shown below.
da2<-svd(da1)
uda2<-da2$u
vda2<-da2$v
dda2<-da2$d
dda2
#[1] 3.589047e+01 1.596154e+01 7.764115e-14 6.081008e-14
#The first mode variance 36/(36+16)= 69%
The EOFs shown in Fig. 4.2 can be plotted by the following R code.
par(mgp=c(2,1,0))
filled.contour(x,y,matrix(-uda2[,1],nrow=100), color.palette=rainbow,
  plot.title=title(main="SVD Mode 1: EOF1", xlab="x", ylab="y", cex.lab=1.0),
  key.title = title(main = "Scale"),
  plot.axes = {axis(1,seq(0,2*pi, by = 1), cex=1.0)
    axis(2,seq(0, 2*pi, by = 1), cex=1.0)})
Figure 4.2 shows that the EOF patterns from the SVD are similar to the original
orthonormal basis functions ψ1(x, y) and ψ2(x, y). This means that the SVD has recovered the
original orthonormal basis functions. However, this is not always the case when the
variances of the two modes are close to each other. The two SVD eigenvalues will
then be close to each other. Consequently, the EOFs, as eigenfunctions, will have
large differences from the original true orthonormal basis functions. This is quan-
tified by the North’s rule-of-thumb, which states that both EOFs will have large
errors, which are inversely proportional to the difference between the two eigenval-
ues. Thus, when the two eigenvalues have a small difference, the two corresponding
eigenfunctions will have large errors due to mode mixing. In linear algebra terms,
this means that when two eigenvalues are close to each other, the corresponding
eigenspaces tend to be close to each other. They form a 2-dimensional eigenspace,
which has infinitely many eigenvectors due to the mixture of the two eigenvectors.
A physically meaningful eigenvector should have no ambiguity, and infinitely many
eigenvectors imply uncertainties, large errors, and no physical interpretation.
Fig. 4.2 The first row shows the two EOFs from the SVD; the second row shows the two orthonormal basis functions on the xy-domain: ψ1(x, y) = −(1/π) sin x sin y and ψ2(x, y) = (1/π) sin 8x sin 8y.
The original orthonormal basis functions can be plotted with similar R code.
Fig. 4.3 The two principal components (PCs) from the SVD and the two accurate time coefficients: −sin t and exp(−0.3t).
PC1 demonstrates a sinusoidal oscillation, while PC2 shows a wavy increase. These
two temporal patterns are similar to the original time coefficients −sin(t) and
−exp(−0.3t). Here, the negative signs are added to make the EOF patterns have
the same sign as the original basis functions, because the EOFs are determined only up
to sign, i.e., the plus or minus sign is indeterminate.
PC1 and PC2 are orthogonal, but the coefficients c1(t) and c2(t) are not orthogonal.
This can be verified by the following code.
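A minimal verification sketch (assuming c1(t) = sin t and c2(t) = exp(-0.3t), as above):
t <- 1:10
sum(vda2[,1] * vda2[,2])    #inner product of the two PCs: essentially zero
sum(sin(t) * exp(-0.3*t))   #inner product of the original coefficients: clearly nonzero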
The SVD theory tells us that the original data can be recovered from the EOFs,
PCs, and variances by the formula
$$z = U D V', \qquad (4.10)$$
Because the eigenvalues of this problem, except the first two, are close to zero,
we can obtain an accurate reconstruction by using only the first two EOFs, PCs, and
their corresponding eigenvalues. The R code for both the 2-mode approximation and
the all-mode recovery is below.
B<-uda2[,1:2]%*%diag(dda2)[1:2,1:2]%*%t(vda2[,1:2])
B1<-uda2%*%diag(dda2)%*%t(vda2)
The left panel of Fig. 4.4 shows the recovered z at time t = 5 using only two
EOF modes, and can be plotted by the following R code.
[Link]()
filled.contour(x,y,matrix(B[,5],nrow=100), color.palette=rainbow,
  plot.title=title(main="2-mode SVD reconstructed field t=5",
    xlab="x", ylab="y", cex.lab=1.0),
  key.title = title(main = "Scale"),
  plot.axes = {axis(1,seq(0,2*pi, by = 1), cex=1.0)
    axis(2,seq(0, 2*pi, by = 1), cex=1.0)})
The full recovery of the original field at time t = 5, shown in the right panel
of Fig. 4.4, is virtually identical to the 2-mode approximation; the difference is
less than $10^{-10}$ at any given point. This high level of accuracy may not always be
achieved, even with the full recovery $UDV'$, when high spatial variability is present.
Specifically, the full recovery $UDV'$ may have non-negligible numerical truncation
errors (single or double precision), which can cause large errors
in the recovered results when high spatial variability is involved.
Fig. 4.4
Recovery of the original data at time t = 5 using two modes (the left panel) and using all modes (the right panel).
This section presents an example of computing EOFs and PCs from a netCDF file
downloaded from the Internet. In climate research and teaching, data from netCDF
files are often encountered. We shall download the data and make an EOF analysis.
The example is the surface temperature data from the NCEP/NCAR Reanalysis I,
which outputs the 2.5 degree monthly data from January 1948 to the present. We
choose the most frequently used surface air temperature (SAT) field.
#Read the netCDF file: 4 dimensions: lon, lat, level, time
library(ncdf4)
nc = nc_open("/Users/sshen/Desktop/Papers/KarlTom/Recon2016/Test-with-Gregori-prec-data/air.mon.mean.nc")
#(the file name above is assumed: the NCEP monthly mean surface air temperature file)
nc
nc$dim$lon$vals  #output lon values 0.0->357.5
nc$dim$lat$vals  #output lat values 90->-90
nc$dim$time$vals #output time values in GMT hours: 1297320, 1298064, ...
nc$dim$time$units
#[1] "hours since 1800-01-01 00:00:0.0"
#nc$dim$level$vals
Lon <- ncvar_get(nc, "lon")
Lat1 <- ncvar_get(nc, "lat")
Time <- ncvar_get(nc, "time")
#Time is the same as nc$dim$time$vals
head(Time)
#[1] 1297320 1298064 1298760 1299504 1300224 1300968
library(chron)
#Convert Julian hours to a calendar date; c(month=1, day=1, year=1800) is the reference time
Tymd <- month.day.year(Time[1]/24, c(month = 1, day = 1, year = 1800))
Tymd
#$month
#[1] 1
#$day
#[1] 1
#$year
#[1] 1948
#i.e., 1948-01-01
precnc <- ncvar_get(nc, "air") #read the SAT field
dim(precnc)
#[1] 144 73 826, i.e., 826 months=1948-01 to 2016-10, 68 years 10 mons
To check whether our downloaded data appear to be reasonable, we plot the first month's temperature data at longitude 180°E from the South Pole to the North Pole (see Fig. 4.5). The figure shows a reasonable temperature distribution: a high temperature of nearly 30°C over the tropics, a lower temperature below -30°C over the Antarctic region at the left, and between -20°C and -10°C over the Arctic region at the right. We are thus reasonably confident that our downloaded data are correct and that the data values correctly correspond to their assigned positions on the latitude-longitude grid.
Figure 4.5 may be generated by the following R code:
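A minimal sketch (not necessarily the book's original code), using precnc from the code above; longitude 180°E corresponds to grid column 73, since the longitudes run from 0 to 357.5 in 2.5-degree steps.

plot(seq(-90, 90, length = 73), rev(precnc[73, , 1]),
     type = "o", xlab = "Latitude", ylab = "Temperature [deg C]",
     main = "NCEP/NCAR RA1 Jan 1948 SAT at longitude 180E")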
The very large standard deviation, more than 5°C over the high-latitude Northern Hemisphere shown in Fig. 3.9, may be artificially amplified by the climate model employed to carry out the NCEP/NCAR reanalysis. This error may be due to the model's handling of a physically complex phenomenon, namely the sea ice and albedo feedback. The actual standard deviation might thus be smaller over the same region. This type of error highlights a caution that should be kept in mind when using reanalysis datasets. Combining a complex climate model with observational data can improve the realism of datasets, but it can also introduce a type of error that could not exist if no model were used.
Fig. 4.6
Percentage variance and cumulative variance of the covariance matrix of the January SAT from 1948-2015.
#plot eigenvalues; svdJ is the SVD of the January SAT anomaly space-time matrix
par(mar=c(3,4,2,4))
plot(100*(svdJ$d)^2/sum((svdJ$d)^2), type="o", ylab="Percentage of variance [%]",
     xlab="Mode number", main="Eigenvalues of covariance matrix")
legend(20,5, col=c("black"), lty=1, lwd=2.0,
       legend=c("Percentage variance"), bty="n",
       text.font=2, cex=1.0, text.col="black")
par(new=TRUE)
plot(cumsum(100*(svdJ$d)^2/sum((svdJ$d)^2)), type="o",
     col="blue", lwd=1.5, axes=FALSE, xlab="", ylab="")
legend(20,50, col=c("blue"), lty=1, lwd=2.0,
       legend=c("Cumulative variance"), bty="n",
       text.font=2, cex=1.0, text.col="blue")
axis(4)
mtext("Cumulative variance [%]", side=4, line=2)
The EOFs are from the column vectors of the SVD’s U matrix and the PCs are
the SVD’s V columns. The first three EOFs and PCs are shown in Figs. 4.7-4.9,
which may be generated by the following R code.
Fig. 4.7
The first EOF and PC from the January SAT's standardized area-weighted anomalies.
#plot PC1
pcdat<-svdJ$v[,1]
Time<-seq(1948,2015)
plot(Time, -pcdat, type="o", main="PC1 of NCEP RA Jan SAT: 1948-2015",
xlab="Year",ylab="PC values",
lwd=2, ylim=c(-0.3,0.3))
Often, only the first two or three EOFs and PCs have physical interpretations, because the eigenvalues of the higher modes are too close to each other and hence carry a large amount of uncertainty.
In the case of the January SAT Reanalysis data here, PC1 shows an increasing trend. The corresponding EOF1 shows the spatially non-uniform pattern of the temperature increase.
EOF2 shows an El Niño pattern, with a warm tongue over the eastern tropical Pacific. PC2 shows the temporal variation of the El Niño signal. For example, the January 1983 and 1998 peaks correspond to two strong El Niños.
EOF3 appears to correspond to a mode known as the Pacific Decadal Oscillation.
Fig. 4.9
The third EOF and PC from the January SAT’s standardized area-weighted anomalies.
It shows a dipole pattern over the Northern Pacific. PC3 shows quasi-periodicity
in this mode with a period of about 20 to 30 years.
Fig. 4.10
The first EOF and PC from the January SAT’s de-trended standardized area-weighted
anomalies.
One can then use the EOF plotting code described in the previous sub-subsection to make the plots of the eigenvalues, EOFs and PCs. Compared with the EOFs and PCs of the previous sub-subsection, it is clear that the de-trended EOF1 here is similar to the non-de-trended EOF2. However, they are not exactly the same. Thus, the de-trending process is approximately, though not exactly, equivalent to filtering out EOF1. Similar statements can be made for the other EOFs and PCs.
Fig. 4.11
The second EOF and PC from the January SAT’s de-trended standardized
area-weighted anomalies.
The first eigenvalue of the de-trended anomalies explains about 10% of the total variance, approximately equivalent to the 8% of the total variance explained by EOF2 of the non-de-trended anomalies (see Fig. 4.6). This can be derived from the non-de-trended SVD results. Let
c_i = 100 \frac{d_i^2}{\sum_{k=1}^{K} d_k^2}, \quad i = 1, 2, \cdots, K    (4.11)
be the percentage of variance explained by the ith mode, where K is the total number of modes available. In our case of January temperature from 1948-2015, K = 68. The SVD calculation of the previous sub-subsection found that c_1 = 16.63% and c_2 = 8.25%. Writing
c_2 = 100 \frac{d_2^2}{\sum_{k=1}^{K} d_k^2} = 100 \frac{d_2^2}{d_1^2 + \sum_{k=2}^{K} d_k^2},    (4.13)
one can derive the percentage of variance explained by the second mode after the first (trend) mode has been removed:
100 \frac{d_2^2}{\sum_{k=2}^{K} d_k^2} = \frac{c_2}{1 - c_1/100} = \frac{8.25}{1 - 0.1663} = 9.9 [%].    (4.14)
To verify that PC1 of the non-de-trended anomalies represents the trend, we compute and plot the area-weighted January SAT anomalies (see Fig. 4.12) from the NCEP/NCAR RA1 data using the following R code.
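A sketch of such a computation is given below; it assumes the latitude column of the space-time matrix gpcpst and the January anomaly matrix anomJ constructed as in the trend-calculation code later in this section, with weights proportional to the cosine of latitude.

areaw <- cos(gpcpst[, 1] * pi / 180)  #weights proportional to grid-box area
areaw <- areaw / sum(areaw)           #normalize the weights
avgJ <- colSums(areaw * anomJ)        #area-weighted spatial average for each January
plot(seq(1948, 2015), avgJ, type = "o", xlab = "Year",
     ylab = "Temperature anomaly [deg C]",
     main = "Global area-weighted January SAT anomalies: 1948-2015")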
Fig. 4.12
The global area-weighted January SAT anomalies from 1948-2015 based on the
NCEP/NCAR RA1 data.
Next we compute and plot the temperature trend from 1948 to 2015 based on the NCEP/NCAR Reanalysis' January temperature data. Each grid box has a time series of 68 years of January SAT anomalies from 1948 to 2015, and each grid box has a linear trend. These trends form a spatial pattern (see Fig. 4.13), which is similar to that of EOF1 for the non-de-trended SAT anomaly data. The trend value for each grid box is computed by R's linear model command: lm(anomJ[i,] ~ Time)$coefficients[2]. The anomaly data are assumed to be written in the space-time data matrix gpcpst with the first two columns as latitude and longitude. The following R code makes the trend calculation and plot.
#plot the trend of the Jan SAT non-standardized anomaly data
#Begin with the space-time data matrix gpcpst: columns 1-2 are lat and lon
library(maps) #for adding coastlines with map()
monJ=seq(1,816,12)            #column indices of the 68 Januaries
gpcpdat=gpcpst[,3:818]        #816 months of data
gpcpJ=gpcpdat[,monJ]          #January-only space-time matrix
plot(gpcpJ[,23])              #a quick check of one January
climJ<-rowMeans(gpcpJ)        #January climatology of each grid box
anomJ=(gpcpdat[,monJ]-climJ)  #January anomalies
trendV<-rep(0,len=10512)      #trend for each grid box: a vector
for (i in 1:10512) {
  trendV[i]<-lm(anomJ[i,] ~ Time)$coefficients[2]
}
mapmat1=matrix(10*trendV,nrow=144) #trend in deg C per decade
mapv1=pmin(mapmat1,1)  #compress the values > 1 to 1
mapmat=pmax(mapv1,-1)  #compress the values < -1 to -1
rgb.palette=colorRampPalette(c('blue','green','white','yellow','red'),
                             interpolate='spline')
int=seq(-1,1,length.out=61)
mapmat=mapmat[, seq(length(mapmat[1,]),1)] #reverse the latitude order
filled.contour(Lon, Lat, mapmat, color.palette=rgb.palette, levels=int,
  plot.title=title(
    main="Trend of the NCEP RA1 Jan 1948-2015 Anom Temp",
    xlab="Longitude", ylab="Latitude"),
  plot.axes={axis(1); axis(2); map('world2', add=TRUE); grid()},
  key.title=title(main="oC/10a"))
Fig. 4.13
Linear trend of NCEP Reanalysis’ January SAT from 1948-2015.
Figure 4.13 shows that the trends are non-uniform. The trend magnitudes over
land are larger than those over ocean. The largest positive trends are over the Arctic
region, while the Antarctic region has negative trends. These trends may not be
reliable since the reanalysis climate model has an amplified variance over the polar
regions, another example of model deficiencies leading to erroneous information in
the reanalysis produced by using that model.
Indeed, the spatial distribution of the trends appears similar to EOF1 of the non-de-trended data (see Fig. 4.7). The spatial correlation between the trend map of Fig. 4.13 and the EOF1 map of Fig. 4.7 is very high, in fact 0.97. This result implies that temperature's spatial patterns are more coherent than its temporal patterns, which is consistent with the existence of large spatial correlation scales for the monthly mean temperature field.
References
[1] Monahan, A.H., J.C. Fyfe, M.H. Ambaum, D.B. Stephenson, and G.R. North, 2009: Empirical orthogonal functions: The medium is the message. Journal of Climate, 22, 6501-6514.
[2] North, G. R., F. J. Moeng, T. J. Bell and R. F. Cahalan, 1982: Sampling Errors
in the Estimation of Empirical Orthogonal Functions. Mon. Wea. Rev., 110,
699-706.
[3] Strang, G., 2016: Introduction to Linear Algebra, 5th edition, Wellesley-
Cambridge Press, Wellesley, U.S.A.
Exercises
4.1 Use the SVD approach to find the EOFs and PCs for the Northern Hemi-
sphere’s January SAT based on the anomaly data from the NCEP/NCAR
Reanalysis. Use the 1981-2010 January mean as the January climatology for
each grid box. Plot the squared SVD eigenvalues against mode number for
the first 30 modes. The suggested steps are below.
(a) Convert the Reanalysis data into a space-time data matrix.
(b) Extract the Northern Hemisphere’s January data using proper row and
column indices.
(c) Apply the SVD to the extracted space-time matrix.
(d) Plot the squared eigenvalues.
4.2 Use R to plot the first three EOFs and PCs from the previous problem. Interpret the climatic meaning as much as you can, but limit your answer to 200-500 words.
4.3 Compute and plot the first three EOFs and PCs for the January SAT anomalies for the contiguous United States. Use these EOFs and PCs to describe the U.S. climate patterns, but limit your answer to 200-500 words. [Hint: You may first use the Internet to find a grid mask table for the contiguous U.S. Use the table and R's which command to extract the U.S. data out of the January Northern Hemispheric data.]
5 Climate Data Matrices and Linear Algebra
This chapter introduces matrices from the perspective of space-time climate data and emphasizes the singular value decomposition (SVD), which decomposes a space-time data matrix into three matrices: the spatial pattern matrix, the "energy" matrix, and the temporal pattern matrix. An extensive analysis is made of the sea level pressure data of Darwin and Tahiti and the optimal formation of a weighted Southern Oscillation Index. The chapter also contains the conventional basic materials of linear algebra: matrix operations, linear equations, multivariate regression by R, and various applications, such as balancing the number of molecules in a chemical reaction equation.
Fig. 5.1
Annual precipitation anomaly data of the Northern Hemisphere at longitude 2.5°E [units: mm/day]. Multiply the anomalies by 365 to obtain annual totals.
20th century to study matrices. Computer software systems, such as R, have also
been developed in recent years that greatly facilitate working with matrices.
This chapter will discuss the following topics:
(i) Matrix algebra of addition, subtraction, multiplication, and division (i.e., the inverse matrix);
Matrix algebra is quite different from the algebra of scalars x, y, z that we learned in high school. For example, matrix multiplication does not have the commutative property; i.e., matrix A times matrix B is not always the same as matrix B times matrix A, as the short R check below illustrates. This section describes a set of rules for doing matrix algebra.
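A quick R check of this non-commutativity, with two small matrices chosen only for illustration:

A <- matrix(c(1, 1, 1, -1), nrow = 2)
B <- matrix(c(1, 3, 2, 4), nrow = 2)
A %*% B   #AB
B %*% A   #BA, different from AB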
5.2.1 Matrix equality, addition and subtraction
In the above example, the second matrix multiplication (5.15) involves the same
matrices as the first one (5.14), but in a different order: If eq. (5.14) is denoted by
AB, then eq. (5.15) is BA. Clearly, the results are different. In general,
AB ≠ BA    (5.17)
for a matrix multiplication. Thus, matrix multiplication does not have the commu-
tative property which the multiplication of two scalars x and y does have: xy = yx.
Example 5.
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad A^t = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}    (5.18)
It is obvious that
(A + B)^t = A^t + B^t.    (5.19)
However, a true but less obvious formula is the transpose of a matrix multiplication:
(AB)^t = B^t A^t.    (5.20)
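A quick numerical check of formula (5.20) in R; A is the matrix of Example 5 and B is an illustrative 2 × 2 matrix:

A <- matrix(c(1, 3, 2, 4), nrow = 2)   #the matrix A of Example 5
B <- matrix(c(5, 7, 6, 8), nrow = 2)
t(A %*% B)
t(B) %*% t(A)   #the same result as t(A %*% B)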
A matrix whose only non-zero elements are on the diagonal is called a diagonal matrix:
D = \begin{pmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{pmatrix}.    (5.21)
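In R, a diagonal matrix can be created with the diag() function (the values here are illustrative):

diag(c(2, 5, 7))
#     [,1] [,2] [,3]
#[1,]    2    0    0
#[2,]    0    5    0
#[3,]    0    0    7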
Thus, the division problem becomes a multiplication problem when the inverse is found. The inverse of a scalar x is defined by x^{-1} × x = 1. Matrix division is defined in the same way:
A/B = A B^{-1},    (5.24)
B^{-1} B = I,    (5.25)
where I is the identity matrix.
Example 6.
solve(matrix(c(1,1,1,-1), nrow=2))
# [,1] [,2]
#[1,] 0.5 0.5
#[2,] 0.5 -0.5
That is,
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}^{-1} = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & -0.5 \end{pmatrix}.    (5.26)
Finding the inverse of a matrix B “by hand” is usually very difficult and involves a long procedure for a large matrix, say, a 4 × 4 matrix. Modern climate models can involve multiple n × n matrices with n ranging from several hundred to several million, or even a billion. In this book, we use R to find inverse matrices and do not attempt to explain how to find the inverse of a matrix by hand. Mastering the material in this book will not require you to have this skill. A typical linear algebra textbook will devote a large portion of its material to finding the inverse of a matrix. A commonly used scheme is called echelon reduction through row operations. For detailed information, see the excellent text Introduction to Linear Algebra by Gilbert Strang.
The system of two linear equations
x_1 + x_2 = 20
x_1 - x_2 = 4    (5.27)
can be written in matrix form as
A x = b,    (5.28)
which involves three matrices:
A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad b = \begin{pmatrix} 20 \\ 4 \end{pmatrix}.    (5.29)
Here, Ax means the matrix multiplication A_{2×2} x_{2×1}.
The matrix notation of a system of two linear equations can be extended to systems of many linear equations, hundreds or millions of equations in climate modeling and climate data analysis. Typical linear algebra textbooks introduce matrices in this way, by describing linear equations in matrix form. However, this approach may be less intuitive for climate science, which emphasizes data. Thus, our book uses data to introduce matrices, as shown at the beginning of this chapter.
Although one can easily guess that the solution to the above simple matrix equation (5.28) is x_1 = 12 and x_2 = 8, a more general method for computing the solution is to use the R code shown below:
solve(matrix(c(1,1,1,-1),nrow=2),c(20,4))
#[1] 12 8 #This is the result x1=12, and x2=8.
Mathematically, the solution is
x = A^{-1} b,    (5.30)
where A^{-1} is the inverse matrix of A found earlier in eq. (5.26). One can verify that
A^{-1} b = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & -0.5 \end{pmatrix} \begin{pmatrix} 20 \\ 4 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix}    (5.31)
is indeed the solution of the system of two linear equations.
A=matrix(c(1,-1,2,0,3,1),nrow=2)
A
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] -1 0 1
covm=(1/(dim(A)[2]))*A%*%t(A)
covm #is the covariance matrix.
# [,1] [,2]
#[1,] 4.6666667 0.6666667
#[2,] 0.6666667 0.6666667
The covariance matrix times a vector u yields a new vector in a different direction.
u=c(1,-1)
v=covm%*%u
v
# [,1]
#[1,] 4
#[2,] 0
#u and v are in different directions.
An eigenvector of the covariance matrix C is a special vector w whose direction is not changed by C, i.e., C w = λ w, where λ is a scalar that simply scales w so that the equation holds. This scalar λ is called an eigenvalue (i.e., a “self-value”, “own-value”, or “characteristic value” of the matrix C), and w is called an eigenvector.
R can calculate the eigenvalues and eigenvectors of a covariance matrix covm with the command eigen(covm). The output is a list with two components: for ew <- eigen(covm), ew$values holds the eigenvalues and ew$vectors holds the eigenvectors, as shown below.
ew <- eigen(covm)
ew$values
#[1] 4.7748518 0.5584816
ew$vectors
#          [,1]       [,2]
#[1,] -0.9870875  0.1601822
#[2,] -0.1601822 -0.9870875
#Verify the eigenvectors and eigenvalues
covm%*%ew$vectors[,1]/ew$values[1]
#           [,1]
#[1,] -0.9870875
#[2,] -0.1601822
#This is the first eigenvector
A 2 × 2 covariance matrix has two eigenvalue-eigenvector pairs (λ1, w1) and (λ2, w2), which are shown below from the R computation above for our example covariance matrix C:
\lambda_1 = 4.7748518, \quad w_1 = \begin{pmatrix} -0.9870875 \\ -0.1601822 \end{pmatrix},    (5.34)
\lambda_2 = 0.5584816, \quad w_2 = \begin{pmatrix} 0.1601822 \\ -0.9870875 \end{pmatrix}.    (5.35)
We encounter space-time data every day, a simple example being the air tempera-
ture at different locations at different times. If you take a plane to travel from San
Diego to New York, you may experience the temperature at San Diego in the morn-
ing when you depart and that at New York in the evening after your arrival. Such
data have many important applications. We may need to examine the precipitation
conditions around the world on different days in order to monitor agricultural yields. A cellphone company may need to monitor its market share and the temporal variations of that quantity in different countries. A physician may need to monitor a patient's symptoms in different areas of the body at different times. The observed
data in all these examples can form a space-time data matrix with the row posi-
tion corresponding to the spatial location and the column position corresponding
to time, as in Table 5.1.
Graphically, the space-time data may typically be plotted as a time series at each given spatial position, or as a spatial map at each given time. Although these straightforward graphical representations can sometimes provide very useful information as input for signal detection, the signals are often buried in the data and may need to be detected by different linear combinations in space and time. Sometimes the data matrices are extremely large, with millions of data points in either space or time. The question then arises as to how we can extract the essential information from such a big data matrix. Can we somehow manage to represent the data in a simpler and yet more useful way? A very useful approach to such a task involves a space-time separation. Singular value decomposition (SVD) is a method designed for this purpose. SVD decomposes a space-time data matrix into a spatial pattern matrix U, a diagonal energy level matrix D, and a temporal matrix V^t, i.e., the data matrix A is decomposed into A = U D V^t. The columns of U form a set of orthonormal spatial eigenvectors, i.e., u_k^t u_l = δ_{kl}, where u_l are the column vectors of U, and δ_{kl} is the Kronecker delta, equal to zero when k ≠ l and one when k = l. The columns of V are also a set of orthonormal vectors, known as temporal eigenvectors.
Usually, the elements of the U and V matrices are unitless (i.e., dimensionless), and the unit of the D elements is the same as that of the elements of the data matrix. For example, if A is a space-time precipitation data matrix with units of [mm/day], then the unit of the D elements is also [mm/day].
Example 1. SVD for the 2 × 3 data matrix A in Section 5.4.
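A sketch of this example in R, using the 2 × 3 matrix A defined earlier in this chapter:

A <- matrix(c(1, -1, 2, 0, 3, 1), nrow = 2)
svdA <- svd(A)
svdA$u   #spatial patterns (EOFs)
svdA$d   #singular values ("energy" levels)
svdA$v   #temporal patterns (PCs)
svdA$u %*% diag(svdA$d) %*% t(svdA$v)   #recovers A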
Figure 5.2 shows the standardized Tahiti and Darwin SLP data, the SOI, the cumulative SOI index (CSOI), and the Atlantic Multi-decadal Oscillation (AMO) index. The AMO index is defined as the standardized average sea surface temperature (SST) over the North Atlantic region (80°W-0°E, 0°N-60°N). The AMO data from January 1951 to December 2015 can be downloaded from the NOAA Earth System Research Laboratory website
[Link]
It appears that the CSOI and the AMO index follow a similar nonlinear trend. Both CSOI and AMO decrease from the 1950s to a bottom in the 1970s, increase to around year 1998, remain on a plateau for about a decade, and then start to decrease around 2006. The CSOI has a much smaller variance than the AMO index and hence a clearer signal.
The first three panels of Fig. 5.2 can be generated by the R code below, and the fourth panel can be generated in a similar way.
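A minimal sketch (not the book's original code) for the first three panels, assuming ptamonv and pdamonv are the standardized monthly SLP anomalies of Tahiti and Darwin for 1951-2015, as used later in this section:

xtime <- seq(1951, 2016 - 1/12, by = 1/12)
plot(xtime, ptamonv, type = "l", xlab = "Year", ylab = "Pressure",
     main = "Standardized Tahiti SLP anomalies")
plot(xtime, pdamonv, type = "l", xlab = "Year", ylab = "Pressure",
     main = "Standardized Darwin SLP anomalies")
soi <- ptamonv - pdamonv   #SOI: standardized Tahiti minus standardized Darwin
plot(xtime, soi, type = "l", xlab = "Year", ylab = "SOI index",
     main = "SOI time series")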
Fig. 5.2
Standardized sea level pressure anomalies of Tahiti (upper left panel) and of Darwin (upper right panel), the SOI time series (lower left), and the cumulative negative SOI (CSOI) together with the AMO index (lower right).
The space-time data matrix of the SLP at Tahiti and Darwin from January 1951 to December 2015 can be obtained from
ptada <- cbind(ptamonv, pdamonv)
This is a matrix of two columns: the first column is the Tahiti SLP and the second column is the Darwin SLP. Because the spatial position is normally indicated by the row and time by the column, we transpose the matrix:
ptada <- t(ptada)
This is the 1951-2015 standardized SLP data for Tahiti and Darwin: 2 rows and 780 columns.
dim(ptada)
#[1]   2 780
The spatial matrix U is a 2 × 2 orthogonal matrix since there are only two points. Each column is an eigenvector of the covariance matrix C = (1/t) A A^t, where A_{n×t} is the original data matrix with spatial dimension n and temporal dimension t. The SVD of the matrix is computed by svdptd <- svd(ptada), and the spatial modes are
U=svdptd$u
U
#           [,1]      [,2]
#[1,] -0.6146784 0.7887779
#[2,]  0.7887779 0.6146784
The two column vectors of U are the covariance matrix's eigenvectors, i.e., the empirical orthogonal functions (EOFs) in the atmospheric science literature as described earlier. The EOFs represent spatial patterns. The first column is the first spatial mode, u_1 = (−0.61, 0.79), meaning opposite signs for Tahiti and Darwin, which justifies the SOI index as one pressure minus another. This result further suggests that a better index could be the weighted SOI, WSOI1 = −0.61 p_T + 0.79 p_D, where p_T and p_D are the standardized Tahiti and Darwin SLP anomalies.
This mode’s energy level, i.e., the temporal variance, is d1 = 31.35 given by
svdptd$d
[1] 31.34582 22.25421
D=diag(svdptd$d)
D
# [,1] [,2]
#[1,] 31.34582 0.00000
#[2,] 0.00000 22.25421
which forms the diagonal matrix D in the SVD formula. In nature, the second eigenvalue is often much smaller than the first, but that is not true in this example. Here the second mode's energy level is d_2 = 22.25, which is equal to 71% of the first energy level, 31.35.
The second weighted SOI mode, i.e., the second column u_2 of U, is thus WSOI2 = 0.79 p_T + 0.61 p_D. From the SVD formula A = U D V^t, the two weighted SOIs are the rows of
U^t A = D V^t,    (5.49)
V=svdptd$v
V
# [,1] [,2]
# [1,] -5.820531e-02 1.017018e-02
# [2,] -4.026198e-02 -4.419324e-02
# [3,] -2.743069e-03 -8.276652e-02
# ......
The first temporal mode v_1 is the first row of V^t (equivalently, the first column of V) and is called the first principal component (PC1). The above formulas imply that
v_1 = WSOI1/d_1,    (5.50)
v_2 = WSOI2/d_2.    (5.51)
The two PCs are orthonormal vectors, meaning that the dot product of two different PC vectors is zero and the dot product of a PC vector with itself is one. The two EOFs are also orthonormal vectors. Thus, the SLP data at Tahiti and Darwin have been decomposed into a set of spatially and temporally orthonormal vectors, EOFs and PCs, together with energy levels.
The WSOIs' standard deviations are d_1 and d_2, reflecting the magnitudes of the WSOI oscillations.
We also have the relations WSOI1 = d_1 v_1 and WSOI2 = d_2 v_2, which are used in the plotting code below.
#Plot WSOI1
xtime<-seq(1951, 2016-1/12, 1/12)
wsoi1=D[1,1]*t(V)[1,]
plot(xtime, wsoi1,type="l",xlab="Year",ylab="Weighted SOI 1",
     col="black",xlim=range(xtime), ylim=range(wsoi1), lwd=1)
axis(3, at=seq(1951,2015,4), labels=seq(1951,2015,4))
#Plot WSOI2
wsoi2=D[2,2]*t(V)[2,]
plot(xtime, wsoi2,type="l",xlab="Year",ylab="Weighted SOI 2",
     col="black",xlim=range(xtime), ylim=c(-2,2), lwd=1)
axis(3, at=seq(1951,2015,4), labels=seq(1951,2015,4))
When the cumulative WSOI1 decreases, so does the SH surface air temperature from 1951 to 1980. When the cumulative WSOI1 increases, so does the temperature from the 1980s to the 1998 peak. Later, the cumulative WSOI1 decreases to a plateau from 1998 to 2002, remains on the plateau until 2007, and then decreases again. This also agrees with the nonlinear trend of the SH surface air temperature anomalies before 1998.
Fig. 5.3
Weighted SOI1 (upper left panel), weighted SOI2 (upper right), the cumulative WSOI1 (CWSOI1) compared with the smoothed AMO index (lower left), and the cumulative WSOI2, i.e., the cumulative PC2 (lower right).
The CWSOI2 decreases from 1951 to the 1980s, remains in a flat valley, and then increases from around 2007. This increase coincides with the persistent global surface air temperature increase of the last decade.
Therefore, SVD results may lead to physical interpretations and may help provide physical insight; SVD is thus a valuable and convenient tool.
We present a visualization of the EOFs from the above SVD results using ggplot. The space-time data matrix ptada of the SLP at Tahiti and Darwin from January 1951 to December 2015 has 2 rows for space and 780 columns for time. The U matrix from the SVD is a 2 × 2 matrix. Its first column represents the El Niño mode. Note that the eigenvectors are determined only up to a positive or negative sign. Because Tahiti has a positive SST anomaly during an El Niño, we choose Tahiti 0.61 and hence make Darwin -0.79. This is the negative of the first eigenvector from the SVD. The second mode is Tahiti 0.79 and Darwin 0.61. These two modes are orthogonal because (−0.79, 0.61) · (0.61, 0.79) = 0. They are displayed in Fig. 5.4, which may be generated by the following R code.
par(mfrow=c(2,1))
Fig. 5.4
The two orthogonal ENSO modes from the Tahiti and Darwin standardized SLP data. The relative sizes of the plotted points are proportional to the component values of each eigenvector in the U matrix. Red indicates positive values, and blue indicates negative values.
The EOFs and PC time series from the SVD can also be computed and plotted in other ways with R.
However, the conservation of mass requires that the atomic weights on both
sides of the equation be equal. The photons have no mass. Thus, the chemical
equation as written above is incorrect. The correct equation should specify precisely
how many CO2 molecules react with how many H2 O molecules to generate how
many C6 H12 O6 and O2 molecules. Suppose that these coefficients are x1 , x2 , x3 , x4 .
We then have
x1 CO2 + x2 H2O −→ x3 C6H12O6 + x4 O2    (5.54)
Making the number of atoms of carbon on the left and right sides of the equation
equal yields
x1 = 6x3 (5.55)
because water and oxygen contain no carbon. Doing the same for hydrogen atoms leads to
2x2 = 12x3.    (5.56)
Balancing the number of oxygen atoms gives
2x1 + x2 = 6x3 + 2x4.    (5.57)
We thus have three equations in four variables, so these equations have infinitely many solutions. We can fix any one variable and express the other three in terms of it. Since the largest molecule is glucose, we fix its coefficient x3. Then we have
x1 = 6x3,  x2 = 6x3,  x4 = 6x3.    (5.58)
Thus, the chemical equation is
6x3 CO2 + 6x3 H2O −→ x3 C6H12O6 + 6x3 O2.    (5.59)
If we want to produce one glucose molecule, i.e., x3 = 1, then we need 6 carbon dioxide and 6 water molecules:
6CO2 + 6H2O −→ C6H12O6 + 6O2.    (5.60)
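The balance can also be found with R's solve() as a small linear-algebra sketch: fix x3 = 1 and solve the three balance equations (carbon, hydrogen, oxygen) for (x1, x2, x4).

#rows: carbon, hydrogen and oxygen balances with x3 = 1
A <- matrix(c(1, 0,  0,
              0, 2,  0,
              2, 1, -2),
            nrow = 3, byrow = TRUE)
b <- c(6, 12, 6)   #right-hand sides 6*x3, 12*x3, 6*x3 with x3 = 1
solve(A, b)        #returns 6 6 6, i.e., x1 = x2 = x4 = 6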
Similarly, one can write chemical equations for many common reactions, such as iron oxidation,
3Fe + 4H2O −→ 4H2 + Fe3O4,    (5.61)
and the redox reaction in a human body, which consumes glucose and converts it into energy, water and carbon dioxide:
C6H12O6 + 6O2 −→ 6CO2 + 6H2O.    (5.62)
Example 2. This example will show that four arbitrarily specified points cannot
all be on a plane. The fitted plane has the shortest distance squares, i.e., the least
squares (LS), or minimal mean square error (MMSE). Thus, the residuals are non-
zero, in contrast to the zero residuals in the previous example.
u=c(1,2,3,1)
v=c(2,4,3,-1)
w=c(1,-2,3,4)
mydata=data.frame(u,v,w)
myfit <- lm(w ~ u + v, data=mydata)
summary(myfit) #Show the result
Call:
lm(formula = w ~ u + v, data = mydata)
Residuals:
   1    2    3    4
 1.0 -1.0  0.5 -0.5
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0000     1.8708   0.535    0.687
u             2.0000     1.2472   1.604    0.355
v            -1.5000     0.5528  -2.714    0.225
The 95% confidence interval for W's coefficient is 0.12 ± 2 × 0.21, that for X's coefficient is −0.53 ± 2 × 0.19, and that for Y's coefficient is −0.04389 ± 2 × 0.15. Each confidence interval includes zero. Thus, there is no significant non-zero trend for the Z data with respect to W, X, Y. This result is to be expected, because the data are randomly generated and thus should not have a trend.
In practical applications, a user can simply convert the data into the same data frame format as shown here. Then, the R command
lm(formula = Z ~ W + X + Y, data = fdat)
can do the regression job.
R can also do nonlinear regression with specified functions, such as quadratic
functions and exponential functions. See examples from the URLs
[Link]
[Link]
References
[1] Golub, G.H. and C. Reinsch, 1970: Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5), 403-420.
[2] Larson, R., 2013: Elementary Linear Algebra, 7th edition. Brooks/Cole Cengage Learning, Boston, 390pp.
[3] Strang, G., 1993: Introduction to Linear Algebra. Wellesley-Cambridge Press, Wellesley, MA, 400pp.
Exercises
5.1 The following is the R output of svd(mat) for a 2 × 2 space-time data matrix mat (the matrix itself is displayed after the output):
svd(mat)
$d
[1] 1.414214 1.414214
$u
[,1] [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068 0.7071068
$v
[,1] [,2]
[1,] -1 0
[2,] 0 -1
mat
     [,1] [,2]
[1,]    1    1
[2,]    1   -1
Show detailed calculations of all the relevant matrices and vectors. Use
space-time decomposition to describe your results. For extra credit: Describe
the spatial and temporal modes, and their corresponding variances or energies.
5.2 Use R and the updated Darwin and Tahiti standardized SLP data to repro-
duce the EOFs and PCs and to plot the EOF pattern maps and PC time
series.
5.3 Do the same procedures in the previous problem but for original non-standardized
data. Comment on the difference of the results from the previous problem.
5.4 (a) Download the monthly precipitation data at five different stations over
the United States from the USHCN website:
[Link]
(b) Use R to organize the January data from 1951 to 2010 into the space-time
format.
(c) Compute the climatology of each station as the 1971-2000 mean.
(d) Compute the space-time anomaly data matrix A as the original space-time
data matrix minus the climatology.
(e) Use R to make the SVD decomposition of the space-time anomaly data
matrix A = U DV t .
(f) Write the U and D matrices.
5.5 In the previous problem, use R and the formula U DV t to reconstruct the
original data matrix A. This is a verification of the SVD decomposition, and
is also called EOF-PC reconstruction.
5.6 Use R to plot the maps of the first three EOF modes, similar to the two El
Niño mode maps shown in Fig. 5.4. Try to explain the climate meaning of the
EOF maps.
5.7 Use R to plot the first three PC time series. Try to explain the climate meaning
of the time series.
5.8 (a) A covariance matrix C can be computed from a space-time observed anomaly data matrix X, which has N rows for spatial locations and Y columns for time in years:
C = X X^t / Y    (5.66)
This is an N × N matrix. Choose the data matrix X from the USHCN annual total precipitation data at three California stations from north to south [Berkeley, CA (040693); Santa Barbara, CA (047902); Cuyamaca, CA (042239)] and five years from 2001 to 2005, and calculate the covariance matrix for N = 3 and Y = 5.
(b) Use R to find the inverse matrix of the covariance matrix C.
(c) Use R to find the eigenvalues and eigenvectors of C.
(d) Use R to make SVD decomposition of the data matrix X = U DV t . Ex-
plicitly write out the three matrices U, D and V .
(e) Use R to explore the relationship between the eigenvalues of C and the
matrix D.
(f) Compare the eigenvectors of C and the matrix U .
(g) Plot the PC time series and describe their behavior.
5.9 The burning of methane (CH4) with oxygen (O2) produces water (H2O) and carbon dioxide (CO2). Balance the chemical reaction equation.
5.10 The burning of propane (C3H8) with oxygen (O2) produces water (H2O) and carbon dioxide (CO2). Balance the chemical reaction equation.
5.11 The burning of gasoline (C8H18) with oxygen (O2) produces water (H2O) and carbon dioxide (CO2). Balance the chemical reaction equation.
6 Basic Statistical Methods for Climate Data Analysis
The word “statistics” comes from the Latin “status” meaning “state.” We use the
term “statistics” to mean a suite of scientific methods for analyzing data and for
drawing credible conclusions from the data. Statistical methods are routinely used
for analyzing and drawing conclusions from climate data, such as for calculating
the climate “normal” of precipitation at a weather station and for quantifying the
reliability of the calculation. Statistical methods are often used for demonstrating
that global warming is occurring, based on a significant upward trend of surface
air temperature (SAT) anomalies, and on establishing a given significance level for
this trend. Statistical methods are also used for inferring a significant shift from
a lower state of North Pacific sea level pressure (SLP) to a higher state or from
a lower temperature regime to a higher one. A list of questions such as those just
cited can be infinitely long. The purpose of this chapter is to provide basic concepts
and a kind of “user manual” covering the most commonly used statistical methods
in climate data analysis, so that users can arrive at credible conclusions based on
the data, together with a given error probability.
R codes will be supplied for examples in this chapter. Users can easily apply these
codes, and the formulas given in this book, for their data analysis needs without
any need for an extensive background knowledge of calculus, and without a deep
understanding of statistics. To interpret the statistical results in a meaningful way,
however, knowledge of the domain of climate science will be very useful, when
using statistical concepts and the results of calculations to establish conclusions
from specific climate datasets.
The statistical methods in this chapter have been chosen in order to focus on
making credible inferences about the climate state, with a given error probability,
based on the analysis of climate data, so that observational data can lead to ob-
jective and reliable conclusions. We will first describe a list of statistical indices,
such as the mean, variance and quantiles, for climate data. We will then take up
the topics of probability distributions and statistical inferences.
6.1 Statistical indices from the global temperature
data from 1880 to 2015
The following link provides the data for the global average annual mean surface air temperature anomalies from 1880 to 2015 (Karl et al. 2015, NOAAGlobalTemp dataset at NCDC):
[Link]
In the data list, the first datum corresponds to 1880 and the last to 2015. These 136 years of data are used to illustrate the following statistical concepts: mean, variance, standard deviation, skewness, kurtosis, median, 5th percentile, 95th percentile, and other quantiles. The anomalies are with respect to the 20th century mean, i.e., the 1900-1999 climatology period. The global average of the 20th century mean is 12.7°C. The 2015 anomaly was 0.65°C. Thus, 2015's global average annual mean temperature is 13.4°C.
Because we have just quoted numbers that purport to be observations of annual
mean global mean surface temperatures, this may be a good place to mention
an important caveat. The caveat is that observational estimates of the global mean
surface temperature are less accurate than similar estimates of year-to-year changes.
This is one of several reasons why global mean surface temperature data are almost
always plotted as anomalies (such as differences between the observed temperature
and a long-term average temperature) rather than as the temperatures themselves.
It is also important to realize that the characteristic spatial correlation length scale
for surface temperature anomalies is much larger (hundreds of kilometers) than the
spatial correlation length scale for surface temperatures. The use of anomalies is
also a way of reducing or eliminating individual station biases that are invariant
with time. A simple example of such biases is that due to station location, which
usually does not change with time. It is easy to understand, for instance, that a
station located in a valley in the middle of a mountainous region might report
surface temperatures that are higher than an accurate mean surface temperature
for the entire region, but the anomalies at the station might more accurately reflect the characteristics of the anomalies for the region. For a clear and concise
summary of these important issues, with references, see
[Link]
observations-reanalyses-and-the-elusive-absolute-global-mean-temperature/
We use R to calculate all the needed statistical parameters. The data is read as
tmean15.
setwd("/Users/sshen/Desktop/MyDocs/teach/SIOC290-ClimateMath2017/
Book-ClimMath-Cambridge-PT1-2017-07-21/Data")
dat1 <- read.table("NOAAGlobalTemp.ann.land_ocean.txt")
#(read.table and the file name are assumptions; use the NOAAGlobalTemp annual land-ocean anomaly file)
dim(dat1)
tmean15=dat1[,2] #Take only the second column of this data matrix
head(tmean15) #The first six values
mean(tmean15)
#[1] -0.2034367
sd(tmean15)
#[1] 0.3038567
var(tmean15)
#[1] 0.09232888
library(e1071)
#This R library is needed to compute the following parameters
skewness(tmean15)
#[1] 0.7141481
kurtosis(tmean15)
#[1] -0.3712142
median(tmean15)
#[1] -0.29694
quantile(tmean15,probs= c(0.05,0.25, 0.75, 0.95))
# 5% 25% 75% 95%
#-0.5792472 -0.4228540 -0.0159035 0.3743795
The following R commands can plot the time series of the temperature data with
a linear trend (see Fig. 6.1).
yrtime15=seq(1880,2015)
reg8015<-lm(tmean15 ~ yrtime15)
# Display regression results
reg8015
#Call:
#lm(formula = tmean15 ~ yrtime15)
#Coefficients:
#(Intercept) yrtime15
#-13.208662 0.006678
# Plot the temperature time series and its trend line
plot(yrtime15,tmean15,xlab="Year",ylab="Temperature deg C",
main="Global Annual Mean Land and Ocean Surface
Temperature Anomalies 1880-2015", type="l")
abline(reg8015, col="red")
text(1930, 0.4, "Linear temperature trend 0.6678 oC per century",
col="red",cex=1.2)
The above statistical indices were computed using the following mathematical formulas, in which x = {x_1, x_2, \cdots, x_n} denotes the sampled time series:
mean: \mu(x) = \frac{1}{n} \sum_{k=1}^{n} x_k,    (6.1)
variance by unbiased estimate: \sigma^2(x) = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \mu(x))^2,    (6.2)
The significance of these indices is as follows. The mean gives the average of sam-
ples. The variance and standard deviation measure the spread of samples. They are
large when the samples have a broad spread. Skewness is a dimensionless quan-
tity. It measures the asymmetry of samples. Zero skewness signifies a symmetric
distribution. For example, the skewness of a normal distribution is zero. Negative
skewness denotes a skew to the left, meaning that the long distribution tail is on
the left side of the distribution. Positive skewness has a long tail on the right side.
Fig. 6.1
Time series of the global average annual mean temperature with respect to the 1900-1999 climatology of 12.7°C.
We will use the 136 years of temperature data and R to illustrate some commonly
used statistical figures, namely the histogram, boxplot, scatter plot, qq-plot, and
linear regression trend line.
Figure 6.3 is the box plot of the 136 years of global average annual mean tempera-
ture data, and can be made from the following R command
b=boxplot(tmean15, ylab="Temperature anomalies")
The rectangular box's mid line indicates the level of the median, which is -0.30°C. The rectangular box's lower boundary is the first quartile, i.e., the 25th percentile.
Fig. 6.2
Histogram of the global average annual mean temperature anomalies from 1880-2015.
The box's upper boundary is the third quartile, i.e., the 75th percentile. The box's height is the third quartile minus the first quartile, and is called the interquartile range (IQR). The upper "whisker" is at the third quartile plus 1.5 IQR. The lower whisker is supposed to be at the first quartile minus 1.5 IQR. However, this whisker would then be lower than the lower extreme. Thus, the lower whisker takes the value of the lower extreme, which is -0.68°C. The points outside of the two whiskers are considered outliers. Our dataset has one outlier, which is 0.65°C. This is the hottest year in the dataset, the year 2015.
Sometimes, one may need to plot multiple box plots on the same figure, which
can be done by R. One can look at an example in the R-project document
[Link]
The scatter plot is convenient for displaying whether two datasets are correlated
with one another. We use the southern oscillation index (SOI) and the contiguous
United States temperature as an example to describe the scatter plot. The data
can be downloaded from
[Link]/teleconnections/enso/indicators/soi/
[Link]/temp-and-precip/
The following R code can produce the scatter plot shown in Fig. 6.4.
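A minimal sketch (not necessarily the book's original code), assuming soij and ustj hold the January SOI and the January contiguous-U.S. temperature for 1951-2016 as used below:

plot(soij, ustj, xlab = "SOI [dimensionless]", ylab = "US temperature",
     main = "January US temperature vs. SOI: 1951-2016")
abline(lm(ustj ~ soij), col = "red")   #linear trend line
cor(soij, ustj)                        #correlation between the two series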
Fig. 6.3
Box plot of the global average annual mean temperature anomalies [deg C] from 1880-2015.
The correlation between the two datasets is nearly zero. Thus, the slope of the red trend line is also nearly zero.
Fig. 6.4
Scatter plot of the January U.S. temperature vs. the January SOI from 1951-2016.
The scatter plot shows that the nearly zero correlation is mainly due to the five
negative SOI values, which are El Niño Januarys: 1983 (-3.5), 1992 (-2.9), 1998
(-2.7), 2016 (-2.2), 1958 (-1.9). When these strong El Niño Januarys are removed,
then the correlation is 0.2. The slope is then 0.64, compared with 1.0 for perfect
correlation.
The R commands to retain the data without the above five El Niño years are below:
soijc=soij[c(1:7,9:32,34:41,43:47,49:65)]
ustjc=ustj[c(1:7,9:32,34:41,43:47,49:65)]
With these data, the scatter plot and trend line can be produced in the same way.
We thus may say that the SOI has some predictive skill for the January temper-
atures of the contiguous United States, for the non-El Niño years. This correlation
is stronger for specific regions of the U.S. The physical reason for this result has to
do with the fact that the temperature field over the U.S. is inhomogeneous, and in
different regions, it is related to the tropical ocean dynamics in different ways. This
gives us a hint as to how to find the predictive skill for a specific objective field:
to create a scatter plot using the objective field, which is being predicted, and the
field used for making the prediction. The objective field is called the predicant or
predictand, and the field used to make the prediction is called the predictor. A very
useful predictive skill would be that the predictor leads the predicant by a certain
time, say one month. Then the scatter plot will be made from the pairs between
predictor and predicant data with one-month lead. The absolute value of the cor-
relation can then be used as a measure of the predictive skill. Since the 1980s, the
U.S. Climate Prediction Center has been using sea surface temperature (SST) and
sea level pressure (SLP) as predictors for the U.S. temperature and precipitation
via the canonical correlation analysis method (CCA). Therefore, before a prediction
is made, it is a good idea to examine the predictive skill via scatter plots, which
can help identify the best predictors.
However, the scatter plot approach above for maximum correlation is only applicable for linear predictions or for weakly nonlinear relationships. Nature can sometimes be very nonlinear, which requires more sophisticated assessments of predictive skill, such as neural networks and time-frequency analysis. The CCA and
other advanced statistical prediction methods are beyond the scope of this book.
6.2.4 QQ-plot
The data of Table 6.1 can also be displayed by the bar chart in Fig. 6.6. This
figure visually displays the different cloudiness climates of the three cities. Thus,
either the table or the figure demonstrates that a probability distribution can be a
good description of important properties of a random variable, such as cloud cover.
Here, a random variable means a variable that can take on a value in a random
way, such as weather conditions (sunny, rainy, snowy, cloudy, stormy, windy, etc).
Almost anything we deal with in our daily lives is a random variable, that is to say,
a variable which has a random nature, in contrast to a deterministic variable. We
describe a random variable by probability and explore what is the probability of
the variable having a certain value or a certain interval of values. This description
is the probability distribution.
Figure 6.6 can be generated by the following R code.
layout(matrix(c(1,2,3), 1, 3, byrow = TRUE),
       widths=c(3,3,3), heights=c(1,1,1))
lasvegas=c(0.58,0.42)
sandiego=c(0.4,0.6)
seattle=c(0.16,0.84)
names(lasvegas)=c("Clear","Cloudy")
names(sandiego)=c("Clear","Cloudy")
Fig. 6.6
Probability distributions of different climate conditions according to cloudiness for three cities in the United States.
names(seattle)=c("Clear","Cloudy")
barplot(lasvegas,col=c("skyblue","gray"),ylab="Probability")
mtext("Las Vegas", side=3,line=1)
barplot(sandiego,col=c("skyblue","gray"))
mtext("San Diego", side=3,line=1)
barplot(seattle,col=c("skyblue","gray"))
mtext("Seattle", side=3,line=1)
mtext("Probability Distribution of Weather",
      cex=1.3, side = 3, line = -1.5, outer = TRUE)
\int_D f(x)\,dx = 1,
where D is the domain of the pdf, i.e., the entire range of the possible x values, e.g., D = (−50, 50)°C in the case of temperature for the U.S. This formula is called the probability normalization condition, as illustrated in Fig. 6.7.
Fig. 6.7
Normalization condition of a probability distribution function: the total area under the density curve f(x) equals one, i.e., \int f(x)\,dx = 1.
Of course, the normalization condition for a discrete random variable, such as clear versus cloudy skies, is a summation rather than the above integral. Consider the San Diego case in Table 6.1. The normalization condition is 0.40 + 0.60 = 1.0.
Figure 6.8 shows five different normal distributions, each of which is a bell-shaped
curve with the highest density when the random variable x takes the mean value,
and approaches zero as x goes to infinity. The figure can be generated by the
following R code.
Fig. 6.8
Probability density function for five normal distributions.
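A sketch of such a plot is given below; the means and standard deviations are chosen only for illustration, since the parameters of the five curves in the figure are not listed here.

x <- seq(-8, 8, length = 401)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", lwd = 2, col = "red",
     ylim = c(0, 0.9), xlab = "Random variable x",
     ylab = "Probability density", main = "Normal distributions")
lines(x, dnorm(x, mean = 0, sd = 2), lwd = 2, col = "blue")
lines(x, dnorm(x, mean = 0, sd = 0.5), lwd = 2, col = "black")
lines(x, dnorm(x, mean = 2, sd = 1), lwd = 2, col = "purple")
lines(x, dnorm(x, mean = -2, sd = 1), lwd = 2, col = "green")
legend("topleft",
       legend = c("mean 0, sd 1", "mean 0, sd 2", "mean 0, sd 0.5",
                  "mean 2, sd 1", "mean -2, sd 1"),
       lty = 1, lwd = 2, col = c("red", "blue", "black", "purple", "green"), bty = "n")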
#Plot t-distribution by R
x <- seq(-4, 4, length=200)
plot(x,dt(x, df=3), type="l", lwd=4, col="red",
ylim = c(0,0.6),
xlab="Random variable t",
ylab ="Probability density",
main="Student t-distribution T(t,df)")
lines(x,dt(x, df=1), type="l", lwd=2, col="blue")
lines(x,dt(x, df=2), type="l", lwd=2, col="black")
lines(x,dt(x, df=6), type="l", lwd=2, col="purple")
lines(x,dt(x, df=Inf), type="l", lwd=2, col="green")
#ex.cs1 <- expression(plain(sin) * phi, paste("cos", phi))
ex.cs1 <- c("df=3", "df=1","df=2","df=6","df=Infinity")
legend("topleft",legend = ex.cs1, lty=1,
       col=c('red','blue','black','purple','green'), cex=1, bty="n")
When df, the number of degrees of freedom (df = n − 1), is infinity, the t-distribution is exactly the same as the standard normal distribution N(0, 1). Even when df = 6, the t-distribution is already very close to the standard normal distribution. Thus, the t-distribution is meaningfully different from the standard normal distribution only when the sample size is small, say, n = 5 (i.e., df = 4).
The exact mathematical expression of the pdf for the t-distribution is quite complicated and uses a Gamma function, which is a special function beyond the scope of this book.
If the data (x_1, x_2, \cdots, x_n) are normally distributed with the same mean µ and standard deviation σ, then the sample mean, i.e., the mean of the data
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,    (6.12)
is normally distributed with mean equal to µ and standard deviation equal to σ/\sqrt{n}.
Given the sample size n, mean µ, and standard deviation σ for a set of normal data, what is the interval [a, b] such that 95% of the sample means will occur within the interval [a, b]? Intuitively, the sample mean should be close to the true mean µ most of the time. However, because the sample data are random, the sample means are also random and may be very far away from the true mean. For the example of the global temperature, we might assume that the "true" mean is 14°C and the "true" standard deviation is 0.3°C. Here, "true" is an assumption, however, since no one knows the truth. The sample means are close to 14 most of the time, but climate variations may lead to a sample mean being equal to 16°C or 12°C, thus far away from the "true" mean 14°C. We can use the interval [a, b] to quantify the probability of the sample mean being inside this interval. We wish to say that with 95% probability, the sample mean is inside this interval [a, b]. This leads to the following confidence interval formula.
For a normally distributed population (x_1, x_2, \cdots, x_n) with the same mean µ and standard deviation σ, the confidence interval at the 95% confidence level is
(\mu - 1.96\,\sigma/\sqrt{n},\ \mu + 1.96\,\sigma/\sqrt{n}).    (6.13)
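A sketch of such a simulation in R is given below; the sample size n is an assumption, since it is not stated here, and the confidence limits are computed from n so the counts remain self-consistent.

set.seed(1)
n <- 50        #assumed sample size
m <- 10000     #number of simulated samples
xbar <- replicate(m, mean(rnorm(n, mean = 14, sd = 0.3)))
lower <- 14 - 1.96 * 0.3 / sqrt(n)
upper <- 14 + 1.96 * 0.3 / sqrt(n)
sum(xbar > lower & xbar < upper)   #roughly 95% of the 10,000 sample means
hist(xbar, breaks = 31, xlab = "Temperature [deg C]",
     main = "Histogram of 10,000 simulated sample means")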
This simulation shows that 9,496 of the 10,000 simulations have the sample means
inside the confidence interval. The probability is thus 0.9496, or approximately 0.95.
Figure 6.10 displays the histogram of the simulation results. It shows that 9,496
sample means from among 10,000 are in the confidence interval (13.92, 14.08). Only
504 sample means are outside the interval with 254 in (−∞, 13.92) and 250 in
(14.08, ∞). Thus, the confidence level is the probability of the sample mean falling
into the confidence interval. Intuitively, when the confidence interval is small, the
confidence level is low since there is a smaller chance for the sample mean to fall
into a smaller interval.
Fig. 6.10
Histogram of 10,000 simulated sample mean temperatures based on the assumption of a normal distribution with "true" mean equal to 14°C and "true" standard deviation 0.3°C. Approximately 95% of the sample means are within the confidence interval (13.92, 14.08), 2.5% are in (14.08, ∞), and 2.5% in (−∞, 13.92).
The confidence level is low when the confidence interval is small. The extreme case is that the confidence interval has zero length, which would mean that with 95% chance the sample mean is exactly equal to the true mean, with only a 5% chance of being wrong. To be more accurate, our intuition suggests that we need to have a small standard deviation and a large sample. The above confidence interval formula (6.13) quantifies this intuition: (\mu - 1.96\,\sigma/\sqrt{n},\ \mu + 1.96\,\sigma/\sqrt{n}). A small σ and a large n enable us to have a small confidence interval, and hence an accurate estimation of the mean. Thus, to obtain an accurate result in a survey, one should use a large sample.
This subsection shows a method to find out how large a sample should be, for
the case when the confidence probability is given. We also want to deal with the
practical situation where the true mean and standard deviation are almost never
known. Furthermore, it is usually not known whether the random variable is in
fact normally distributed. These two problems can be solved by a very important
theoretical result of mathematical statistics, called the central limit theorem (CLT),
which says that when the sample size n is sufficiently large, the sample mean \bar{x} = \sum_{i=1}^{n} x_i / n is approximately normally distributed, regardless of the distributions of the x_i (i = 1, 2, \cdots, n). The approximation becomes better as n becomes larger.
Some textbooks suggest that n = 30 is good enough to be considered a “large”
sample; others use n = 50. In climate science, we often use n = 30.
When the number of samples is large in this sense, the normal distribution as-
sumption for the sample mean is taken care of. We then compute the sample mean
and sample standard deviation by the following formulas
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,    (6.15)
S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.    (6.16)
The observed data y can be written as
y = x \pm \epsilon,    (6.21)
where ε stands for the errors, x is the true but never-known value to be observed, and y is the observational data. Thus, data are equal to the truth plus errors. The expected value of the error is zero, and the standard deviation of the error is S/\sqrt{n}, also called the standard error.
The confidence level 95% comes into the equation when we require that the
observed value must lie in the interval (µ − EM, µ + EM ) with a probability equal
to 0.95. This corresponds to the requirement that the standard normal random
variable z is found in the interval (z− , z+ ) with a probability equal to 0.95, which
implies that z− = −1.96 and z+ = 1.96. Thus, the confidence interval of the sample
mean at the 95% confidence level is
(\bar{x} - 1.96\,S/\sqrt{n},\ \bar{x} + 1.96\,S/\sqrt{n}),    (6.22)
or
(\bar{x} - z_{\alpha/2}\,S/\sqrt{n},\ \bar{x} + z_{\alpha/2}\,S/\sqrt{n}),    (6.23)
where zα/2 = z0.05/2 = 1.96. So, 1 − α = 0.95 is used to represent the probability
inside the confidence interval, while α = 0.05 is the “tail probability” outside of the
confidence interval. Outside of the confidence interval means occurring on either
the left side or the right side of the distribution. Each side represents α/2 = 0.025
tail probability. The red area of Fig. 6.11 indicates the tail probability.
Figure 6.11 can be plotted by the following R code.
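A sketch of such a figure (not the book's original code) plots the standard normal density and marks the 68% and 95% limits of the sample mean in units of the standard error (SE):

z <- seq(-3, 3, length = 301)
plot(z, dnorm(z), type = "l", lwd = 2,
     xlab = "Sample mean, in units of SE from the true mean",
     ylab = "Probability density")
points(c(-1.96, 1.96), c(0, 0), pch = 19, col = "red")   #about 2 SE: 95% level
points(c(-1, 1), c(0, 0), pch = 19, col = "blue")        #1 SE: 68% level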
Fig. 6.11
Schematic illustration of confidence intervals and confidence levels of a sample mean
for a large sample size. The confidence interval at 95% confidence level is between the
two red points, and that at 68% is between the two blue points. SE stands for the
standard error, and 1.96 SE is approximately regarded as 2 SE in this figure.
In practice, we often regard 1.96 as 2.0, and the 2σ-error bar as the 95% confidence
interval.
Example 1. Estimate (a) the mean of the 1880-2015 global average annual mean
temperatures of the Earth, and (b) the confidence interval of the sample mean at
the 95% confidence level.
The answer is that the mean is −0.2034◦ C and the confidence interval is
(−0.2545, −0.1524)◦ C. These values may be obtained by the following R code.
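A sketch of the computation, using the tmean15 data read earlier in this chapter:

n <- length(tmean15)               #136 years
xbar <- mean(tmean15)              #sample mean: about -0.2034 deg C
EM <- 1.96 * sd(tmean15) / sqrt(n) #error margin
c(xbar - EM, xbar + EM)            #95% confidence interval: about (-0.2545, -0.1524)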
Example 2. The standard deviation of the global average annual mean temperature is given to be 0.3°C. The standard error is required to be less than or equal to 0.05°C. Find the minimal sample size required.
The solution is (0.3/0.05)^2 = 36. The sample size must be greater than or equal to 36.
Fig. 6.12
The standard normal distribution chart for statistical inference: z-score, p-value for
x̄ < µ, and significance level 2.5%. The value z0.025 = −1.96 is called the critical
z-score for this hypothesis test.
rm(list=ls())
par(mgp=c(1.4,0.5,0))
curve(dnorm(x,0,1), xlim=c(-3,3), lwd=3,
main=’Z-score, p-value, and significance level’,
xlab="z: standard normal random variable",
ylab=’Probability density’,xaxt="n",
cex.lab=1.2, ylim=c(-0.02,0.5))
lines(c(-3,3),c(0,0))
lines(c(-1.96,-1.96),c(0, dnorm(-1.96)),col=’red’)
polygon(c(-3.0,seq(-3.0, -2.5, length=100), -2.5),
c(0, dnorm(seq(-3.0, -2.5, length=100)), 0.0),col=’skyblue’)
points(-1.96,0, pch=19, col="red")
points(-2.5,0,pch=19, col="skyblue")
text(-1.8,-0.02, expression(z[0.025]), cex=1.3)
text(-2.40,-0.02, "z-score", cex=1.1)
arrows(-2.8,0.06,-2.6,0.003, length=0.1)
text(-2.5,0.09, "p-value", cex=1.3)
To make this inference, we compute the parameter
z = \frac{\bar{x} - \mu}{S/\sqrt{n}},    (6.25)
where x̄ is the sample mean, S is the sample standard deviation, and n is the sample
size. This z value is called the z-statistic, or simply the z-score, which follows the
standard normal distribution, because the sample size n = 60 is large. From the z-
score, we can determine the probability of the random variable z falling in a given interval, such as (−∞, zs). A significance level of 2.5% corresponds to zs = −1.96, as shown in Figs. 6.11 and 6.12. Thus, the z-score quantifies how significantly z differs from zero, which is equivalent to asking whether the sample mean differs significantly from the assumed or given value µ. The associated tail probability, e.g., the probability in (−∞, z), is called the p-value; it measures the chance of making a wrong inference. We
want this p-value to be small in order to be able to claim significance. The typical
significance levels used in practice are 5%, 2.5%, and 1%. Choosing which level to
use depends on the nature of the problem. For drought conditions, one may use 5%,
while for flood control and dam design, one may choose 1%. A statistical inference
is significant when the p-value is less than the given significance level.
For our problem of 60 years of data from 1880-1939, the sample size is n = 60. The
sample mean can be computed by an R command xbar=mean(tmean15[1:60]),
and the sample standard deviation can be computed by S=sd(tmean15[1:60]).
The results are x̄ = −0.4500 and S = 0.1109.
When µ = 0, the z-score computed using formula (6.25) is −31.43. The probability in the interval (−∞, z) is tiny, namely 4.4 × 10−217, which can be regarded as zero. We can thus conclude that the sample mean for 1880-1939 is significantly less than zero, with a p-value equal to 4.4 × 10−217, far below the 2.5% significance level.
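A minimal sketch of this computation, again assuming that tmean15 holds the annual anomalies starting in 1880:
# z-score and p-value for the 1880-1939 mean, following formula (6.25) with mu = 0
xbar <- mean(tmean15[1:60])        # sample mean for 1880-1939
S <- sd(tmean15[1:60])             # sample standard deviation
n <- 60
z <- (xbar - 0) / (S / sqrt(n))    # z-score, about -31.4
pnorm(z)                           # p-value: probability in (-infinity, z)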
The formal statistical term for the above inference is a hypothesis test. Here we test a null hypothesis
H0 : x̄ ≥ 0, (Null hypothesis: the mean is not less than zero) (6.26)
against an alternative hypothesis
H1 : x̄ < 0. (Alternative hypothesis: the mean is less than zero) (6.27)
Our question about the average temperature from 1880-1939 is whether we can reject the null hypothesis and confirm the alternative hypothesis. The method is to examine where
the z-score point is on a standard normal distribution chart and what is the corre-
sponding p-value. Thus, the statistical inference becomes a problem of z-score and
p-value using the standard normal distribution chart (See Fig. 6.12). Our z-score
is -31.43 in the H1 region, and our p-value is 4.4 × 10−217 , much less than 0.025.
We thus accept the alternative hypothesis, i.e., we reject the null hypothesis with
a tiny p-value 4.4 × 10−217 . We conclude that the 1880-1939 mean temperature is
significantly less than zero.
One can similarly formulate a hypothesis test for a warming period from 1981-
2015 and ask whether the average temperature during this period is significantly
greater than zero. The two hypotheses are
H0 : x̄ ≤ 0, (Null hypothesis: the mean is not greater than zero) (6.28)
and an alternative hypothesis
H1 : x̄ > 0, (Alternative hypothesis: the mean is greater than zero). (6.29)
One can follow the same procedure to compute the z-score, see whether it is in the H0 region or the H1 region, and compute the p-value. Finally, an inference can be made based on the z-score and the p-value.
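A sketch of this procedure, assuming tmean15 covers 1880-2015 so that the years 1981-2015 occupy entries 102 through 136:
# One-sided z-test for H1: the 1981-2015 mean is greater than zero
x <- tmean15[102:136]                      # 1981-2015, assuming the data start in 1880
z <- mean(x) / (sd(x) / sqrt(length(x)))   # z-score from formula (6.25) with mu = 0
1 - pnorm(z)                               # p-value for the right-sided test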
For the standard normal distribution, z0.975 = 1.96, which is less than t0.975 = 2.2622 (the critical t value for 9 degrees of freedom), because the t-distribution is flatter than the corresponding normal distribution and has fatter tails. Thus, the critical t-scores are larger than the corresponding critical z-scores.
Clearly, one should use the t-test to make the inference when the sample size is very small, say, n = 7. However, it is less clear whether one should use the t-test or the z-test when the sample size is moderate, say, 27. The recommendation is to always use the t-test if you are not sure whether the z-test is applicable, because the t-test is exact for normally distributed data, while the z-test is an approximation. Since the t-distribution approaches the normal distribution as the dof approaches infinity, the t-test yields practically the same result as the z-test whenever the z-test is applicable.
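This convergence of critical values can be checked directly in R:
# Critical values at the 97.5th percentile: t exceeds z for small dof and
# approaches it as the dof grows
qnorm(0.975)                          # about 1.960
qt(0.975, df = c(9, 26, 100, 1000))   # about 2.262, 2.056, 1.984, 1.962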
H0 : T̄1 = T̄2 (Null hypothesis: the temperatures of the two decades are the same) (6.39)
and
H1 : T̄1 ≠ T̄2 (Alternative hypothesis: the two decades are different). (6.40)
This is a two-sided test. The alternative (rejection) region is the union of the two tails (−∞, t0.025) and (t0.975, ∞) when the significance level is set to 5%. We will compute the t-score using formula (6.33). The result is below:
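A sketch of how such a two-sample comparison could be computed in R (the specific decades and their data are not listed here, so T1 and T2 are hypothetical vectors of 10 annual means each):
# Two-sided two-sample t-test for the means of two decades (pooled variance)
# T1 and T2 are hypothetical vectors of 10 annual mean anomalies each
t.test(T1, T2, var.equal = TRUE)    # reports the t-score, dof = 18, and the p-value
qt(c(0.025, 0.975), df = 18)        # critical t-scores, about -2.101 and 2.101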
For a one-sided test, the hypotheses are
H0 : T̄1 ≤ T̄2 (Null hypothesis: the mean of the first decade is not greater than that of the second) (6.41)
and
H1 : T̄1 > T̄2 (Alternative hypothesis: the first decade is warmer than the second). (6.42)
The t-score is the same as the above, but the critical t-score is now t0.95 = 1.734.
Again, the t-score 2.57836 is in the H1 region.
When studying climate change, one often fits a linear regression and asks whether the linear trend is significantly positive, significantly negative, or significantly different from zero. For example, is the linear trend of the global average annual mean temperature from 1880-2015 shown in Fig. 6.1 significantly greater than zero? This is again a t-test problem, because the appropriately standardized trend estimate b̂1 from a linear regression follows a t-distribution.
With the given data pairs {(xi , yi ), i = 1, 2, · · · , n} and their regression line
discussed in Chapter 3
ŷ = b̂0 + b̂1 x, (6.43)
the t-score for the trend b̂1 is defined by the following formula
t = b̂1 / (Sn/√Sxx),    dof = n − 2.    (6.44)
Here,
Sn = √(SSE / (n − 2))    (6.45)
with the sum of squared errors SSE defined as
SSE = Σ_{i=1}^{n} [yi − (b̂0 + b̂1 xi)]²,    (6.46)
and
Sxx = Σ_{i=1}^{n} (xi − x̄)².    (6.47)
For the 1880-2015 NOAAGlobalTemp data, the regression output below gives a t-score of 20.05 for the trend, with dof = 134. Clearly, the t-score 20.05 is in the H1 region. We conclude that the trend is significantly greater than zero, with a p-value smaller than 2 × 10−16 as reported by R.
The above results were computed by the following R code.
setwd("/Users/sshen/Desktop/MyDocs/teach/SIOC290-ClimateMath2017/Book-ClimMath-Cambridge
# The [Link] tokens below are extraction placeholders: the file is the NOAAGlobalTemp
# annual land-ocean anomaly series, presumably read with read.table (an assumption)
dat1 <- [Link]("[Link].land_ocean.[Link]")
tm = dat1[,2]           # column 2: global average annual mean temperature anomalies
x = 1880:2015           # years
summary(lm(tm ~ x))     # linear regression of the anomalies on year
#Coefficients:
#              Estimate  Std. Error  t value  Pr(>|t|)
#(Intercept) -1.321e+01   6.489e-01   -20.36    <2e-16 ***
# x           6.678e-03   3.331e-04    20.05    <2e-16 ***
Sometimes one may need to check if the trend is greater than a specified value
β1 . Then, the t-score is defined by the following formula
t = (b̂1 − β1) / (Sn/√Sxx),    dof = n − 2.    (6.51)
In this case, the t-score must be computed from the formulas, not from the summary
of a linear regression by R.
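A sketch of such a direct computation, continuing from the regression code above, with a hypothetical reference trend β1 = 0.005◦ C per year for illustration:
# t-score for testing the trend against a reference value beta1, using
# formulas (6.45)-(6.47) and (6.51); tm and x are defined in the code above
beta1 <- 0.005                                 # hypothetical reference trend, deg C/yr
reg <- lm(tm ~ x)
n <- length(x)
Sn <- sqrt(sum(residuals(reg)^2) / (n - 2))    # formula (6.45) with SSE from (6.46)
Sxx <- sum((x - mean(x))^2)                    # formula (6.47)
b1 <- coef(reg)[2]                             # estimated trend b1-hat
tscore <- (b1 - beta1) / (Sn / sqrt(Sxx))      # formula (6.51)
pt(tscore, df = n - 2, lower.tail = FALSE)     # p-value for H1: trend > beta1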
This chapter has presented a very brief course in statistics, but it provides sufficient statistical basics and R code for doing simple statistical analysis of climate data. It also provides a foundation for expanding a reader's statistics knowledge and skills through more comprehensive or advanced materials on climate statistics. A few free statistics tutorials available online are introduced below.
The manuscript by David Stephenson of the University of Reading, the United
Kingdom, provides the basics of statistics with climate data as examples:
[Link]
This online manuscript is appropriate for readers who have virtually no statistics
background.
Eric Gilleland of NCAR authored a set of slides on using R for climate statistics, in particular the analysis of extreme values:
[Link]
Malta_Posters_and_Talks/[Link]
This set of lecture notes provides many R codes for analyzing climate data, such
as risk estimation. The material is very useful for climate data users, and does not
require much mathematical background.
The “Statistical methods for the analysis of simulated and observed climate data
applied in projects and institutions dealing with climate change impact and adap-
tation” by the Climate Service Center, Hamburg, Germany, is particularly useful
for weather and climate data.
[Link]
projekte/csc-report13_englisch_final-mit_umschlag.pdf
This online report provides a “user’s manual” for a large number of statistical methods used in climate data analysis, with real climate data examples. The material is an excellent reference for users of statistics for climate data.
References
[1] Climate Service Center, Germany, 2013: Statistical methods for the analysis of
simulated and observed climate data. Report 13, Version 2.0,
[Link]
projekte/csc-report13_englisch_final-mit_umschlag.pdf
[2] Gilleland, E., 2009: Statistical software for weather and climate: The R pro-
gramming language.
[Link]
[3] Stephenson, D.B., 2005: Data analysis methods in weather and climate research.
Lecture notes, 98pp:
[Link]
Exercises
6.1 Assume that the average bank balance of U.S. residents is $5,000 and that bank balances are normally distributed. A sample of 25 balances was taken. The sample data have a mean equal to $5,000 and a standard deviation of $1,000. Find the confidence interval of the sample mean at the 95% confidence level.
6.2 The two most commonly used datasets of global ocean and land average an-
nual mean surface air temperature (SAT) anomalies are those credited to the
research groups led by Dr. James E. Hansen of NASA (relative to 1951-1980
climatology period) and Professor Phil Jones, of the University of East Anglia
(relative to 1961-1990 climatology period):
[Link]
[Link]
(a) Find the average anomalies for each period of 15 years, starting at 1880.
(b) Use the t-distribution to find the confidence interval of each 15-year period’s SAT average at the 95% confidence level. You can use either Hansen’s data or Jones’ data. Figure SPM.1(a) of IPCC 2013 (AR5) is a helpful reference.
(c) Find the hottest and the coldest 15-year periods from 1880-2014, which
is divided into nine disjoint 15-year periods. Use the t-distribution to check
whether the temperature difference in the hottest 15-year period minus that
in the coldest 15-year period is significantly greater than zero. Do this problem
for either Hansen’s data or Jones’ data.
(d) Discuss the differences between the Hansen and Jones datasets.
6.3 To test if the average of temperature in Period 1 is significantly different from
that in Period 2, one can use the t-statistic
t∗ = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2),    (6.52)
where x̄i and si² are the sample mean and sample variance of Period i (i = 1, 2). The degrees of freedom (df) of the relevant t-distribution is equal to the smaller of n1 − 1 and n2 − 1. The null hypothesis is that the two averages have no significant difference, i.e., their difference is zero (in a statistical sense, within a confidence interval). The alternative hypothesis is that the difference is significantly different from zero. You may choose a one-sided test when the difference is positive. Use a significance level of 5% or 1%, or another level of your own choosing.
(a) Choose two 15-year periods which have very different average anomalies.
Use the higher one minus the lower one. Use the t-test method for a one-sided
test to check if the difference is significantly greater than zero. Do this for the
global average annual mean temperature data from either Hansen’s dataset
or Jones’ dataset.
(b) Choose two 15-year periods which have very similar average anomalies.
Use the higher one minus the lower one. Use the t-test method for a two-sided
test to check if the difference is not significantly different from zero. Do this
for the global average annual mean temperature data from either Hansen’s
dataset or Jones’ dataset.