3.6 Lab: Linear Regression
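We begin by importing a few standard libraries at the top level, mirroring the usual first cell of these labs; these imports supply the np, pd, and subplots names used below.
In [1]: import numpy as np
        import pandas as pd
        from matplotlib.pyplot import subplots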
New imports
Throughout this lab we will introduce new functions and libraries. However,
we will import them here to emphasize these are the new code objects in
this lab. Keeping imports near the top of a notebook makes the code more
readable, since scanning the first few lines tells us what libraries are used.
In [2]: import statsmodels.api as sm
We will provide relevant details about the functions below as they are
needed.
Besides importing whole modules, it is also possible to import only a
few items from a given module. This will help keep the namespace clean.
We will use a few specific objects from the statsmodels package which we
import here.
In [3]: from statsmodels.stats.outliers_influence \
             import variance_inflation_factor as VIF
        from statsmodels.stats.anova import anova_lm
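We will also use some functions from the ISLP package written for the labs in this book, in particular load_data() and the model-building objects in ISLP.models; these imports supply the MS, summarize, and poly names that appear in the namespace listing below.
In [4]: from ISLP import load_data
        from ISLP.models import (ModelSpec as MS,
                                 summarize,
                                 poly)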
The dir() function lists all of the objects in a namespace.
In [5]: dir()
Out[5]: ['In',
 'MS',
 '_',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 ...
 'poly',
 'quit',
 'sm',
 'summarize']
This shows you everything that Python can find at the top level. There
are certain objects like __builtins__ that contain references to built-in
functions like print().
Every Python object has its own notion of namespace, also accessible
with dir(). This will include both the attributes of the object as well as
any methods associated with it. For instance, we see 'sum' in the listing
for an array.
In [6]: A = np.array([3, 5, 11])
        dir(A)
Out[6]: ...
 'strides',
 'sum',
 'swapaxes',
 ...
This indicates that the object A.sum exists. In this case it is a method that
can be used to compute the sum of the array A, as can be seen by typing
A.sum?.
In [7]: A.sum()
Out[7]: 19
We will use the Boston housing data set, which is contained in the ISLP
package and can be loaded with load_data().
In [8]: Boston = load_data('Boston')
        Boston.columns
Out[8]: Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
       'rad', 'tax', 'ptratio', 'lstat', 'medv'],
      dtype='object')
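The results object examined below comes from specifying and then fitting an ordinary least squares model. A minimal sketch, consistent with the cells that follow: we build a model matrix X by hand, with a column of ones for the intercept, extract the response y, and fit.
In [9]: X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
                          'lstat': Boston['lstat']})
In [10]: y = Boston['medv']
         model = sm.OLS(y, X)    # specify the model
         results = model.fit()   # carry out the fit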
Note that sm.OLS() does not fit the model; it specifies the model, and then
model.fit() does the actual fitting.
Our ISLP function summarize() produces a simple table of the parameter
estimates, their standard errors, t-statistics and p-values. The function
takes a single argument, such as the object results returned here by the
fit method, and returns such a summary.
In [11]: summarize(results)
Constructing a model matrix is a two-step process carried out by a
transform object such as MS(): a fit() step followed by a transform()
step. We first describe this process for our simple regression model using a
single predictor lstat in the Boston data frame, but will use it repeatedly
in more complex tasks in this and other labs in this book. In our case the
transform is created by the expression design = MS(['lstat']).
The fit() method takes the original array and may do some initial com-
putations on it, as specified in the transform object. For example, it may
compute means and standard deviations for centering and scaling. The
transform() method applies the fitted transformation to the array of data,
and produces the model matrix.
In [12]: design = MS(['lstat'])
         design = design.fit(Boston)
         X = design.transform(Boston)
         X[:4]
In this simple case, the fit() method does very little; it simply checks that
the variable 'lstat' specified in design exists in Boston. Then transform()
constructs the model matrix with two columns: an intercept and the vari-
able lstat.
These two operations can be combined with the fit_transform() method.
In [13]: design = MS(['lstat'])
         X = design.fit_transform(Boston)
         X[:4]
Note that, as in the previous code chunk when the two steps were done
separately, the design object is changed as a result of the fit() operation.
The power of this pipeline will become clearer when we fit more complex
models that involve interactions and transformations.
Let’s return to our fitted regression model. The object results has several
methods that can be used for inference. We already presented a function
summarize() for showing the essentials of the fit. For a full and somewhat
exhaustive summary of the fit, we can use the summary() method (output
not shown).
In [14]: results.summary()
The fitted coefficients can also be retrieved as the params attribute of
results.
In [15]: results.params
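The intervals quoted below can be computed with the get_prediction() method of the fitted results; a brief sketch, where new_df and new_predictions are illustrative names:
In [16]: new_df = pd.DataFrame({'lstat': [5, 10, 15]})
         newX = design.transform(new_df)   # model matrix at the new points
In [17]: new_predictions = results.get_prediction(newX)
         new_predictions.predicted_mean    # point predictions
In [18]: new_predictions.conf_int(alpha=0.05)            # 95% confidence intervals
In [19]: new_predictions.conf_int(obs=True, alpha=0.05)  # 95% prediction intervals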
For instance, the 95% confidence interval associated with an lstat value of
10 is (24.47, 25.63), and the 95% prediction interval is (12.82, 37.28). As
expected, the confidence and prediction intervals are centered around the
same point (a predicted value of 25.05 for medv when lstat equals 10), but
the latter are substantially wider.
Next we will plot medv and lstat using DataFrame.plot.scatter(), and
wish to add the regression line to the resulting plot.
Defining Functions
While there is a function within the ISLP package that adds a line to an
existing plot, we take this opportunity to define our first function to do so.
In [20]: def abline(ax, b, m):
            "Add a line with slope m and intercept b to ax"
            xlim = ax.get_xlim()
            ylim = [m * xlim[0] + b, m * xlim[1] + b]
            ax.plot(xlim, ylim)
A few things are illustrated above. First we see the syntax for defining a
function: def funcname(...). The function has arguments ax, b, m, where
ax is an axis object for an existing plot, b is the intercept and m is the slope
of the desired line. Other plotting options can be passed on to ax.plot() by
including additional optional arguments as follows:
In [21]: def abline(ax, b, m, *args, **kwargs):
            "Add a line with slope m and intercept b to ax"
            xlim = ax.get_xlim()
            ylim = [m * xlim[0] + b, m * xlim[1] + b]
            ax.plot(xlim, ylim, *args, **kwargs)
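We use this new version of abline() to add the least squares line to a scatter plot of medv against lstat; the intercept and slope are taken from results.params (a sketch following the plotting plan described above):
In [22]: ax = Boston.plot.scatter('lstat', 'medv')
         abline(ax,
                results.params[0],
                results.params[1],
                'r--',
                linewidth=3)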
The fitted values and residuals of the fit are available as attributes of
results, so we can produce a diagnostic plot of residuals against fitted
values directly:
In [23]: ax = subplots(figsize=(8, 8))[1]
         ax.scatter(results.fittedvalues, results.resid)
         ax.set_xlabel('Fitted value')
         ax.set_ylabel('Residual')
         ax.axhline(0, c='k', ls='--');
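Leverage statistics can be computed using the hat_matrix_diag attribute of the value returned by results.get_influence(); np.argmax() then reports the index of the observation with the largest leverage. A sketch of a cell producing the output below:
In [24]: infl = results.get_influence()
         ax = subplots(figsize=(8, 8))[1]
         ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
         ax.set_xlabel('Index')
         ax.set_ylabel('Leverage')
         np.argmax(infl.hat_matrix_diag)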
Out[24]: 374
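To fit a model with more than one predictor we again use the model matrix builder; for example, a multiple regression of medv on lstat and age can be specified and fit in one compact cell (a sketch following the same pattern as above):
In [25]: X = MS(['lstat', 'age']).fit_transform(Boston)
         model1 = sm.OLS(y, X)
         results1 = model1.fit()
         summarize(results1)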
Notice how we have compacted the first line into a succinct expression
describing the construction of X.
The Boston data set contains 12 variables, and so it would be cumbersome
to have to type all of these in order to perform a regression using all of the
predictors. Instead, we can use the following short-hand:
In [26]: terms = Boston.columns.drop('medv')
         terms
Out[26]: Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
        'rad', 'tax', 'ptratio', 'lstat'],
       dtype='object')
We can now fit the model with all the variables in terms using the same
model matrix builder.
In [27]: X = MS(terms).fit_transform(Boston)
         model = sm.OLS(y, X)
         results = model.fit()
         summarize(results)
What if we would like to perform a regression using all of the variables but
one? For example, in the above regression output, age has a high p-value.
So we may wish to run a regression excluding this predictor. The following
syntax results in a regression using all predictors except age (output not
shown).
In [28]: minus_age = Boston.columns.drop(['medv', 'age'])
         Xma = MS(minus_age).fit_transform(Boston)
         model1 = sm.OLS(y, Xma)
         summarize(model1.fit())
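The VIF() function imported at the start of this lab computes a variance inflation factor from a model matrix and a column index. As a brief illustration (the names vals and vif here are ours), the VIF for each predictor in the full model can be collected with a list comprehension, skipping the intercept in column 0:
In [29]: vals = [VIF(X, i)
                 for i in range(1, X.shape[1])]
         vif = pd.DataFrame({'vif': vals},
                            index=X.columns[1:])
         vif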