3.6 Lab: Linear Regression


3.6.1 Importing packages
We import our standard libraries at this top level.
In [1]: import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots

New imports
Throughout this lab we will introduce new functions and libraries. However,
we will import them here to emphasize these are the new code objects in
this lab. Keeping imports near the top of a notebook makes the code more
readable, since scanning the first few lines tells us what libraries are used.
In [2]: import statsmodels.api as sm

We will provide relevant details about the functions below as they are
needed.
Besides importing whole modules, it is also possible to import only a
few items from a given module. This will help keep the namespace clean.
We will use a few specific objects from the statsmodels package which we
import here.
In [3]: from statsmodels.stats.outliers_influence \
import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

As one of the import statements above is quite a long line, we inserted a
line break \ to ease readability.
We will also use some functions written for the labs in this book in the
ISLP package.

In [4]: from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)

Inspecting Objects and Namespaces


The function dir() provides a list of objects in a namespace.
In [5]: dir()

Out[5]: ['In',
'MS',
'_',
'__',
'___',
'__builtin__',
'__builtins__',
...
'poly',
'quit',
'sm',
'summarize']

This shows you everything that Python can find at the top level. There
are certain objects like __builtins__ that contain references to built-in
functions like print().
Every Python object has its own notion of namespace, also accessible
with dir(). This will include both the attributes of the object as well as
any methods associated with it. For instance, we see 'sum' in the listing
for an array.
In [6]: A = np.array([3, 5, 11])
dir(A)

Out[6]: ...
'strides',
'sum',
'swapaxes',
...

This indicates that the object A.sum exists. In this case it is a method that
can be used to compute the sum of the array A as can be seen by typing
A.sum?.

In [7]: A.sum()

Out[7]: 19

3.6.2 Simple Linear Regression


In this section we will construct model matrices (also called design matri-
ces) using the ModelSpec() transform from ISLP.models.
We will use the Boston housing data set, which is contained in the ISLP
package. The Boston dataset records medv (median house value) for 506
neighborhoods around Boston. We will build a regression model to pre-
dict medv using 13 predictors such as rm (average number of rooms per
house), age (proportion of owner-occupied units built prior to 1940), and
lstat (percent of households with low socioeconomic status). We will use
statsmodels for this task, a Python package that implements several com-
monly used regression methods.
We have included a simple loading function load_data() in the ISLP package:
In [8]: Boston = load_data("Boston")
Boston.columns

Out[8]: Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv'],
dtype='object')

Type Boston? to find out more about these data.


We start by using the sm.OLS() function to fit a simple linear regression
model. Our response will be medv and lstat will be the single predictor.
For this model, we can create the model matrix by hand.
In [9]: X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
                          'lstat': Boston['lstat']})
X[:4]

Out[9]: intercept lstat


0 1.0 4.98
1 1.0 9.14
2 1.0 4.03
3 1.0 2.94

We extract the response, and fit the model.


In [10]: y = Boston['medv']
model = sm.OLS(y, X)
results = model.fit()

Note that sm.OLS() does not fit the model; it specifies the model, and then
model.fit() does the actual fitting.
Our ISLP function summarize() produces a simple table of the parameter
estimates, their standard errors, t-statistics and p-values. The function
takes a single argument, such as the object results returned here by the
fit method, and returns such a summary.
In [11]: summarize(results)

Out[11]: coef std err t P>|t|


intercept 34.5538 0.563 61.415 0.0
lstat -0.9500 0.039 -24.528 0.0

Before we describe other methods for working with fitted models, we
outline a more useful and general framework for constructing a model
matrix X.

Using Transformations: Fit and Transform


Our model above has a single predictor, and constructing X was straight-
forward. In practice we often fit models with more than one predictor,
typically selected from an array or data frame. We may wish to introduce
transformations to the variables before fitting the model, specify interac-
tions between variables, and expand some particular variables into sets of
variables (e.g. polynomials). The sklearn package has a particular notion
for this type of task: a transform. A transform is an object that is created
with some parameters as arguments. The object has two main methods:
fit() and transform().
We provide a general approach for specifying models and constructing
the model matrix through the transform ModelSpec() in the ISLP library.
ModelSpec() (renamed MS() in the preamble) creates a transform object,
and then a pair of methods transform() and fit() are used to construct a
corresponding model matrix.

We first describe this process for our simple regression model using a
single predictor lstat in the Boston data frame, but will use it repeatedly
in more complex tasks in this and other labs in this book. In our case the
transform is created by the expression design = MS(['lstat']).
The fit() method takes the original array and may do some initial com-
putations on it, as specified in the transform object. For example, it may
compute means and standard deviations for centering and scaling. The
transform() method applies the fitted transformation to the array of data,
and produces the model matrix.
In [12]: design = MS(['lstat'])
design = design.fit(Boston)
X = design.transform(Boston)
X[:4]

Out[12]: intercept lstat


0 1.0 4.98
1 1.0 9.14
2 1.0 4.03
3 1.0 2.94

In this simple case, the fit() method does very little; it simply checks that
the variable 'lstat' specified in design exists in Boston. Then transform()
constructs the model matrix with two columns: an intercept and the vari-
able lstat.
These two operations can be combined with the fit_transform() method.
In [13]: design = MS(['lstat'])
X = design.fit_transform(Boston)
X[:4]

Out[13]: intercept lstat


0 1.0 4.98
1 1.0 9.14
2 1.0 4.03
3 1.0 2.94

Note that, as in the previous code chunk when the two steps were done
separately, the design object is changed as a result of the fit() operation.
The power of this pipeline will become clearer when we fit more complex
models that involve interactions and transformations.
Let’s return to our fitted regression model. The object results has several
methods that can be used for inference. We already presented a function
summarize() for showing the essentials of the fit. For a full and somewhat
exhaustive summary of the fit, we can use the summary() method (output
not shown).
In [14]: results.summary()

The fitted coefficients can also be retrieved as the params attribute of
results.

In [15]: results.params

Out[15]: intercept 34.553841


lstat -0.950049
dtype: float64

The get_prediction() method can be used to obtain predictions, and
produce confidence intervals and prediction intervals for the prediction of
medv for given values of lstat.
We first create a new data frame, in this case containing only the variable
lstat, with the values for this variable at which we wish to make
predictions. We then use the transform() method of design to create the
corresponding model matrix.

In [16]: new_df = pd.DataFrame({'lstat': [5, 10, 15]})
newX = design.transform(new_df)
newX

Out[16]: intercept lstat


0 1.0 5
1 1.0 10
2 1.0 15

Next we compute the predictions at newX, and view them by extracting
the predicted_mean attribute.

In [17]: new_predictions = results.get_prediction(newX);
new_predictions.predicted_mean

Out[17]: array([29.80359411, 25.05334734, 20.30310057])

We can produce confidence intervals for the predicted values.

In [18]: new_predictions.conf_int(alpha=0.05)

Out[18]: array([[29.00741194, 30.59977628],
[24.47413202, 25.63256267],
[19.73158815, 20.87461299]])

Prediction intervals are computed by setting obs=True:

In [19]: new_predictions.conf_int(obs=True, alpha=0.05)

Out[19]: array([[17.56567478, 42.04151344],
[12.82762635, 37.27906833],
[ 8.0777421 , 32.52845905]])

For instance, the 95% confidence interval associated with an lstat value of
10 is (24.47, 25.63), and the 95% prediction interval is (12.82, 37.28). As
expected, the confidence and prediction intervals are centered around the
same point (a predicted value of 25.05 for medv when lstat equals 10), but
the latter are substantially wider.
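As an aside not shown in the original lab, the prediction object returned by get_prediction() in statsmodels also offers a summary_frame() method that collects the point predictions, confidence intervals and prediction intervals in a single data frame; a minimal sketch, assuming the new_predictions object from above:

# Not part of the original lab: gather predictions and both interval types in
# one table; 'mean_ci_*' columns are confidence bounds, 'obs_ci_*' are
# prediction bounds (statsmodels naming).
pred_frame = new_predictions.summary_frame(alpha=0.05)
pred_frame[['mean', 'mean_ci_lower', 'mean_ci_upper',
            'obs_ci_lower', 'obs_ci_upper']]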
Next we will plot medv and lstat using DataFrame.plot.scatter(), and
wish to add the regression line to the resulting plot.

Defining Functions
While there is a function within the ISLP package that adds a line to an
existing plot, we take this opportunity to define our first function to do so.
In [20]: def abline(ax, b, m):
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim)

A few things are illustrated above. First we see the syntax for defining a
function: def funcname(...). The function has arguments ax, b, m where
ax is an axis object for an existing plot, b is the intercept and m is the slope
of the desired line. Other plotting options can be passed on to ax.plot by
including additional optional arguments as follows:
In [21]: def abline(ax, b, m, *args, **kwargs):
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)

The addition of *args allows any number of non-named arguments to
abline, while **kwargs allows any number of named arguments (such as
linewidth=3) to abline. In our function, we pass these arguments verbatim
to ax.plot above. Readers interested in learning more about functions are
referred to the section on defining functions in docs.python.org/tutorial.
Let’s use our new function to add this regression line to a plot of medv
vs. lstat.
In [22]: ax = Boston.plot.scatter('lstat', 'medv')
abline(ax,
       results.params[0],
       results.params[1],
       'r--',
       linewidth=3)

Thus, the final call to ax.plot() is ax.plot(xlim, ylim, 'r--', linewidth=3).
We have used the argument 'r--' to produce a red dashed line, and added
an argument to make it of width 3. There is some evidence for non-linearity
in the relationship between lstat and medv. We will explore this issue later
in this lab.
As mentioned above, there is an existing function to add a line to a plot
— ax.axline() — but knowing how to write such functions empowers us
to create more expressive displays.
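For comparison, a minimal sketch of what the built-in alternative might look like, assuming the same ax and results objects as above (the styling arguments here are illustrative, not from the original lab):

# ax.axline() draws an infinite line through a point with a given slope;
# here the point (0, intercept) and the slope come from the fitted coefficients.
ax.axline((0, results.params[0]), slope=results.params[1],
          color='r', linestyle='--', linewidth=3)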

Next we examine some diagnostic plots, several of which were discussed
in Section 3.3.3. We can find the fitted values and residuals of the fit as
attributes of the results object. Various influence measures describing the
regression model are computed with the get_influence() method. As we
will not use the fig component returned as the first value from subplots(),
we simply capture the second returned value in ax below.
In [23]: ax = subplots(figsize=(8, 8))[1]

ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');

We add a horizontal line at 0 for reference using the ax.axhline() method,
indicating it should be black (c='k') and have a dashed linestyle (ls='--').
On the basis of the residual plot (not shown), there is some evidence
of non-linearity. Leverage statistics can be computed for any number of
predictors using the hat_matrix_diag attribute of the value returned by the
get_influence() method.

In [24]: infl = results.get_influence()
ax = subplots(figsize=(8, 8))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag)

Out[24]: 374

The np.argmax() function identifies the index of the largest element of an
array, optionally computed over an axis of the array. In this case, we
maximized over the entire array to determine which observation has the
largest leverage statistic.
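As a brief illustration beyond the original lab text, one might then look up the leverage value and the corresponding row of the data, assuming the infl and Boston objects defined above:

idx = np.argmax(infl.hat_matrix_diag)   # position of the largest leverage value
infl.hat_matrix_diag[idx]               # the leverage statistic itself
Boston.iloc[idx]                        # the corresponding observation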

3.6.3 Multiple Linear Regression


In order to fit a multiple linear regression model using least squares, we
again use the ModelSpec() transform to construct the required model matrix
and response. The arguments to ModelSpec() can be quite general, but in
this case a list of column names suffices. We consider a fit here with the two
variables lstat and age.
In [25]: X = MS(['lstat', 'age']).fit_transform(Boston)
model1 = sm.OLS(y, X)
results1 = model1.fit()
summarize(results1)

Out[25]: coef std err t P>|t|


intercept 33.2228 0.731 45.458 0.000
lstat -1.0321 0.048 -21.416 0.000
age 0.0345 0.012 2.826 0.005

Notice how we have compacted the first line into a succinct expression
describing the construction of X.
The Boston data set contains 12 variables, and so it would be cumbersome
to have to type all of these in order to perform a regression using all of the
predictors. Instead, we can use the following short-hand:
In [26]: terms = Boston.columns.drop('medv')
terms

Out[26]: Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
'rad', 'tax', 'ptratio', 'lstat'],
dtype='object')

We can now fit the model with all the variables in terms using the same
model matrix builder.
In [27]: X = MS(terms).fit_transform(Boston)
model = sm.OLS(y, X)
results = model.fit()
summarize(results)

Out[27]: coef std err t P>|t|


intercept 41.6173 4.936 8.431 0.000
crim -0.1214 0.033 -3.678 0.000
zn 0.0470 0.014 3.384 0.001
indus 0.0135 0.062 0.217 0.829
chas 2.8400 0.870 3.264 0.001
nox -18.7580 3.851 -4.870 0.000
rm 3.6581 0.420 8.705 0.000
age 0.0036 0.013 0.271 0.787
dis -1.4908 0.202 -7.394 0.000
rad 0.2894 0.067 4.325 0.000
tax -0.0127 0.004 -3.337 0.001
ptratio -0.9375 0.132 -7.091 0.000
lstat -0.5520 0.051 -10.897 0.000

What if we would like to perform a regression using all of the variables but
one? For example, in the above regression output, age has a high p-value.
So we may wish to run a regression excluding this predictor. The following
syntax results in a regression using all predictors except age (output not
shown).
In [28]: minus_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())

3.6.4 Multivariate Goodness of Fit


We can access the individual components of results by name (dir(results)
shows us what is available). Hence results.rsquared gives us the R², and
np.sqrt(results.scale) gives us the RSE.
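For instance, a short snippet along these lines, using the full-model results object fitted above, would display both quantities:

# R^2 of the fit and the residual standard error (RSE)
print(results.rsquared)
print(np.sqrt(results.scale))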
Variance inflation factors (Section 3.3.3) are sometimes useful to assess
the effect of collinearity in the model matrix of a regression model. We will
compute the VIFs in our multiple regression fit, and use the opportunity
to introduce the idea of list comprehension.
List Comprehension
Often we encounter a sequence of objects which we would like to transform
for some other task. Below, we compute the VIF for each feature in our X
matrix and produce a data frame whose index agrees with the columns of
X. The notion of list comprehension can often make such a task easier.
