Lab 5 Regression Analysis Math 2B
The field of agriculture has always provided many interesting statistical applications. In fact,
R.A. Fisher, considered to be the founder of modern statistics, used agriculture to pioneer
statistical procedures and methodologies that are still in use today. Least squares or regression
analysis is one such procedure.
I. Linear Regression
The basic idea of linear regression is to find the equation of a line 𝑦 = 𝑚𝑥 + 𝑏 that best fits a
collection of data points (𝑥₁, 𝑦₁), …, (𝑥ₙ, 𝑦ₙ). If we denote the line's approximation of each
point by 𝑦̃ₖ = 𝑚𝑥ₖ + 𝑏, then by best fit we mean the line that minimizes the sum of the squares
of the errors (SSE).
\[ SSE = \sum_{k=1}^{n} (y_k - \tilde{y}_k)^2 \]
This can be done by using multivariate calculus to find the minimum of the resulting function of
𝑚 and 𝑏. However, in the context of linear algebra this amounts to finding the least squares
approximation of the system of equations
𝑚𝑥₁ + 𝑏 = 𝑦₁
𝑚𝑥₂ + 𝑏 = 𝑦₂
⋮
𝑚𝑥ₙ + 𝑏 = 𝑦ₙ
Then the SSE is just the square of the least squares error.
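As a minimal sketch of the linear algebra viewpoint, the system above can be written as 𝐴𝐯 = 𝐲, where each row of 𝐴 is [𝑥ₖ, 1] and 𝐯 = [𝑚, 𝑏]. The Python code below assumes NumPy is available and uses made-up data points that are not taken from any exercise in this lab; the variable names are illustrative.

```python
import numpy as np

# Illustrative data only (not taken from the lab exercises).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 4.0])

# Coefficient matrix of the system m*x_k + b = y_k: one row [x_k, 1] per point.
A = np.column_stack([x, np.ones_like(x)])

# Least squares solution of A @ [m, b] = y.
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# SSE is the squared length of the residual vector y - (m*x + b).
residual = y - (m * x + b)
sse = float(residual @ residual)

print(f"m = {m:.4f}, b = {b:.4f}, SSE = {sse:.4f}")
```

For these four points the least squares line works out to 𝑦 = 𝑥 + 1.5 with SSE = 1, which can be checked by hand from the residuals −0.5, 0.5, 0.5, −0.5.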
1) Find the line that best approximates the points (1, 1), (2, 4), (3, 9), (5, 25).
2) Use the regression line to predict the value of 𝑦 when 𝑥 = 4, when 𝑥 = 6, and when 𝑥 = 15.
Given that the initial data fits the model 𝑦 = 𝑥², what can you say about the accuracy of these
predictions?
3) Find the SSE of the regression line.
II. Statistical Viewpoint
In general, a large SSE indicates the regression line doesn't fit the data well. How large counts
as "large" is for statisticians to determine. From a statistical point of view, the predictions 𝑦̃ based on the
regression line are called expected values or averages of 𝑦.
The sample variance and standard deviation are given by
\[ \tilde{\sigma}_y^2 = \frac{SSE}{n-2}, \qquad \tilde{\sigma}_y = \sqrt{\frac{SSE}{n-2}} \]
where 𝑛 is the number of data points. Approximately 70% of the time, the actual 𝑦 value for a
given 𝑥 falls in the interval $(\tilde{y} - \tilde{\sigma}_y,\ \tilde{y} + \tilde{\sigma}_y)$.
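The arithmetic in these formulas can be sketched in a few lines of Python. The helper name `sample_variance`, the SSE value, the number of points, and the prediction `y_hat` are all made up for illustration and do not come from the lab data.

```python
import math

def sample_variance(sse: float, n: int) -> float:
    """Sample variance SSE / (n - 2) for a regression line fitted to n points."""
    if n <= 2:
        raise ValueError("need more than two data points")
    return sse / (n - 2)

# Illustrative numbers only: a line fitted to 4 points with SSE = 1.0.
sse, n = 1.0, 4
variance = sample_variance(sse, n)
sigma = math.sqrt(variance)

# One-standard-deviation interval around a hypothetical prediction y_hat.
y_hat = 3.0
interval = (y_hat - sigma, y_hat + sigma)
print(variance, sigma, interval)
```

The divisor is 𝑛 − 2 rather than 𝑛 because two parameters, 𝑚 and 𝑏, were estimated from the data, so the formula only makes sense for more than two points.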
Suppose it is believed that the yield of wheat, in bushels per acre, depends on the amount of
fertilizer used. Of course, the yield of any type of crop is known to depend on several other
variables. In particular, the amount of sunshine, soil composition, and irrigation are most
important. For the purposes of this question, all other pertinent variables have been controlled.
Thus, the relationship between yield and amount of fertilizer can be studied without confounding
concerns. Some data on this topic is given in the table below.
Pounds of fertilizer per acre    Bushels of wheat per acre
100                              40
200                              45
300                              50
400                              65
500                              70
600                              70
800                              80
4) Find the regression line that best fits this data.
5) Find the sum of squares error, sample variance and sample standard deviation for this
regression line.
6) Use the regression line to predict the yield associated with 350 pounds of fertilizer per acre.
What is the interpretation of this prediction?
III. Non-linear models using Least Squares
Data is not always modeled best by linear functions. One way to obtain nonlinear models is by
transforming the data so that it matches a linear function. In general, if you want to find a model
of the form 𝑦 = 𝑎 + 𝑏𝑓(𝑥) to model a data set (𝑥₁, 𝑦₁), …, (𝑥ₙ, 𝑦ₙ), you can define a variable
𝑋 = 𝑓(𝑥). Then you can find a linear regression for the data set (𝑋₁, 𝑦₁), …, (𝑋ₙ, 𝑦ₙ), where 𝑋ₖ = 𝑓(𝑥ₖ).
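The substitution 𝑋 = 𝑓(𝑥) can be sketched as follows. This is only an illustration, assuming NumPy is available: the helper name `fit_transformed` is hypothetical, and the data were chosen to lie exactly on 𝑦 = 2 + 3𝑥², so they are not the data from any exercise below.

```python
import numpy as np

def fit_transformed(x, y, f):
    """Least squares fit of y = a + b*f(x); returns (a, b)."""
    X = f(np.asarray(x, dtype=float))          # transformed variable X = f(x)
    A = np.column_stack([np.ones_like(X), X])  # one row [1, X_k] per data point
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return a, b

# Illustrative data lying exactly on y = 2 + 3*x**2, so f(x) = x**2.
a, b = fit_transformed([1, 2, 3], [5, 14, 29], np.square)
print(a, b)
```

Because the model is linear in 𝑎 and 𝑏, the only change from ordinary linear regression is that the column of 𝑥 values is replaced by the column of 𝑓(𝑥) values.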
7) Find a curve of the form 𝑦 = 𝑎 + 𝑏/𝑥 that best fits the data points (1, 7), (3, 3), (6, 1). Graph
the curve and plot the data on the same coordinate system.
8) Find a curve of the form 𝑦 = 𝑎 + 𝑏√𝑥 that best fits the data points (3, 1.5), (7, 2.5), (10, 3).
Graph the curve and plot the data on the same coordinate system.
Gauge is a measure of shotgun bore. Gauge numbers originally referred to the number of lead
balls, each with diameter equal to that of the gun barrel, that could be made from a pound of lead.
Thus a 16-gauge shotgun's bore was smaller than a 12-gauge shotgun's. Today, an international
agreement assigns millimeter measures to each gauge. The following table gives such
information for popular gauges of shotguns.
Gauge    Bore Diameter (in mm)
6        23.34
10       19.67
12       18.52
14       17.60
16       16.81
20       15.90
9) Find a model for this data of the form 𝑦 = 𝑟𝑒^(𝑘𝑥). (Hint: apply a log transformation to the
model to obtain a model with a more linear form.) Use this model to estimate the bore diameter
of an 18-gauge shotgun.
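To illustrate the log transformation mentioned in the hint: taking natural logarithms of an exponential model gives ln 𝑦 = ln 𝑟 + 𝑘𝑥, which is linear in 𝑥. The sketch below, assuming NumPy is available, applies this to synthetic data generated from 𝑦 = 2𝑒^(0.5𝑥); it deliberately does not use the gauge table.

```python
import numpy as np

# Synthetic data generated from y = 2 * exp(0.5 * x); not the gauge table.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * x)

# ln(y) = ln(r) + k*x is linear in x, so fit a line to the points (x, ln y).
A = np.column_stack([x, np.ones_like(x)])
(k, ln_r), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)

# Undo the transformation to recover r from the fitted intercept.
r = np.exp(ln_r)
print(k, r)
```

Note that this approach minimizes the squared error in ln 𝑦 rather than in 𝑦 itself, so the result can differ slightly from a direct least squares fit of the exponential curve when the data are noisy.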
IV. Multi-variate Linear Regression
Some situations depend on more than one variable. If the output 𝑦 depends on the variables
𝑥₁, …, 𝑥ₙ, then we can create a linear model of the form 𝑦 = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ₙ𝑥ₙ. To fit the
points (𝑥₁₁, 𝑥₁₂, …, 𝑥₁ₙ, 𝑦₁), (𝑥₂₁, 𝑥₂₂, …, 𝑥₂ₙ, 𝑦₂), …, (𝑥ₘ₁, 𝑥ₘ₂, …, 𝑥ₘₙ, 𝑦ₘ), we consider
the system of equations
𝑏₀ + 𝑏₁𝑥₁₁ + 𝑏₂𝑥₁₂ + ⋯ + 𝑏ₙ𝑥₁ₙ = 𝑦₁
𝑏₀ + 𝑏₁𝑥₂₁ + 𝑏₂𝑥₂₂ + ⋯ + 𝑏ₙ𝑥₂ₙ = 𝑦₂
⋮
𝑏₀ + 𝑏₁𝑥ₘ₁ + 𝑏₂𝑥ₘ₂ + ⋯ + 𝑏ₙ𝑥ₘₙ = 𝑦ₘ
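A minimal sketch of solving such a system by least squares is shown below, assuming NumPy is available. The points are made up so that they lie exactly on the plane 𝑦 = 1 + 2𝑥₁ − 𝑥₂ and are not the points from any exercise; the leading column of ones in the coefficient matrix corresponds to the constant term 𝑏₀.

```python
import numpy as np

# Illustrative points (x1, x2, y) lying exactly on y = 1 + 2*x1 - x2.
points = np.array([
    [0, 0, 1],
    [1, 0, 3],
    [0, 1, 0],
    [1, 1, 2],
    [2, 1, 4],
], dtype=float)

X, y = points[:, :-1], points[:, -1]

# One row [1, x_i1, x_i2] per data point; the 1 multiplies b0.
A = np.column_stack([np.ones(len(X)), X])

# Least squares solution [b0, b1, b2] of A @ b = y.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)
```

With more data points than unknowns the system usually has no exact solution, and the least squares solution is the choice of coefficients that minimizes the SSE.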
10) Find a model of the form 𝑦 = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ that best fits the points (2, 3, 5), (4, 5, 4),
(1, 2, 0) and (0, 1, −3).
We will use multi-variate regression to find more non-linear models in only two variables, by
treating powers of 𝑥 as separate input variables. For instance, we can find a quadratic least
squares model 𝑦 = 𝑎𝑥² + 𝑏𝑥 + 𝑐 for data points (𝑥₁, 𝑦₁), …, (𝑥ₙ, 𝑦ₙ) by considering the system
of equations, which is linear in the unknowns 𝑎, 𝑏, and 𝑐:
𝑎𝑥₁² + 𝑏𝑥₁ + 𝑐 = 𝑦₁
𝑎𝑥₂² + 𝑏𝑥₂ + 𝑐 = 𝑦₂
⋮
𝑎𝑥ₙ² + 𝑏𝑥ₙ + 𝑐 = 𝑦ₙ
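The quadratic system can be set up the same way as the multi-variate one, with columns 𝑥², 𝑥, and 1. The following is a sketch only, assuming NumPy is available and using made-up, slightly noisy data rather than the points from the exercises. As a cross-check it compares the result with `np.polyfit`, which also performs a least squares polynomial fit.

```python
import numpy as np

# Hypothetical noisy data, roughly following y = x**2 + 1.
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([2.1, 0.9, 2.0, 5.1, 9.9])

# One row [x_k**2, x_k, 1] per data point, matching a*x^2 + b*x + c = y_k.
A = np.column_stack([x**2, x, np.ones_like(x)])

# Least squares solution [a, b, c].
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b, c = coeffs

# Cross-check: np.polyfit returns coefficients from highest degree to lowest.
check = np.polyfit(x, y, 2)
print(coeffs, check)
```

The same construction extends to cubic or higher-degree fits by adding further columns of powers of 𝑥, as long as there are at least as many distinct 𝑥 values as unknown coefficients.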
11) Find the best quadratic fit to the points (1, 4), (−2, 5), (3, 1), and (4, 1).
In our first lab we found a cubic polynomial to model the US population based on 4 data points.
When we added a fifth data point we found a quartic model for the US population. This quartic
model predicted that the population of the United States would have become extinct before the
turn of the 21st century. Now we have a way to find different models regardless of the number of
data points. The data we used in lab 1 was
(−20, 106), (−10, 123), (0, 132), (10, 151), (20, 179)
where the 𝑥 coordinate represents years after 1940 and the 𝑦 coordinate represents the US
population in millions of people.
12) Use this data to find a cubic model for the US population. The population of the United
States in 2017 was approximately 326 million. How does this new model compare?
13) Use the five data points to find quadratic and linear models for the US population.
14) How do the models you found in (13) compare to the population of the US in 2017? Can you
think of a way to find a better model using those five data points? Find the best model you can
using only those five data points.