3. In this problem, you will implement and evaluate regression models on a publicly
available data set.
(a) Install the Scikit-learn library by following the instructions from this link:
https://scikit-learn.org/stable/install.html. In most cases, you should be
able to install it using
pip install scikit-learn
After installing the library, you can fetch the data set by adding this line to
your code:
from sklearn.datasets import fetch_california_housing
Now you can fetch the data set and check a few of its properties. Let’s start by
checking the size of the observations and targets. This can be done as follows:
housing = fetch_california_housing()
print(housing.data.shape, housing.target.shape)
print(housing.feature_names[0:6])
You should be able to see the shapes of the observations and targets printed as
(20640, 8) and (20640,), respectively. In other words, the data set includes
20640 observations, each of which is represented by 8 features. The last line of
code prints the names of the features. Finally, print out and read the
description of the data set using the following line of code:
print(housing.DESCR)
Include the previous four lines of code in the main function in the hw3.py file.
Then assign the observations to a variable X and the targets to a variable t.
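As a rough sketch, the main function might contain something like the following at this point (the names X and t follow the assignment; everything else is illustrative):

from sklearn.datasets import fetch_california_housing

def main():
    housing = fetch_california_housing()
    print(housing.data.shape, housing.target.shape)
    print(housing.feature_names[0:6])
    print(housing.DESCR)
    X = housing.data    # observations, shape (20640, 8)
    t = housing.target  # targets, shape (20640,)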
(b) Now that we have the observations and targets, we need to standardize our data
using mean removal and variance scaling. Standardization of datasets is
a common requirement for many machine learning estimators since they might
behave badly if the individual features do not more or less look like standard
normally distributed data: Gaussian with zero mean and unit variance. In the
hw3.py file, complete the standradscalar function by subtracting the mean of
the input array and dividing the result by the standard deviation as
x_scaled = (x − µ) / σ
After completing the function, call it in the main function and standardize the
observations. Make sure that the standardization is being applied along the
intended axis.
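A minimal sketch of one way to complete the function, assuming the input is a 2-D array with one row per observation, so the statistics are taken along axis 0 (per feature column):

import numpy as np

def standradscalar(x):
    # Subtract the per-column mean and divide by the per-column standard deviation.
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / sigma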
(c) The standardization process has been implemented in the scikit-learn library
as well. In the main function, use the StandardScaler utility class from the
preprocessing module (import it with a different name to avoid conflicts) and
standardize your raw data again, but this time assign the results to a different
variable. Write a simple test to show that the results from both functions (your
implementation of standradscalar vs. the StandardScaler class) are identical.
This can be done by using numpy functions such as all, allclose, and
array_equal, as well as by comparing their mean and std values.
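One possible comparison in the main function, assuming X holds the raw observations and using the alias SkStandardScaler only to avoid a naming conflict:

import numpy as np
from sklearn.preprocessing import StandardScaler as SkStandardScaler

X_scaled = standradscalar(X)                       # your implementation
X_scaled_sk = SkStandardScaler().fit_transform(X)  # scikit-learn implementation

print(np.allclose(X_scaled, X_scaled_sk))          # True if they agree up to float error
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(X_scaled_sk.mean(axis=0), X_scaled_sk.std(axis=0))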
(d) In this step, we would like to split our data set into a training and a test set.
To do so, first you should complete the train_test_split function. The first
step is to shuffle the data. To do this, you can apply the numpy.random.shuffle
function to the indices of the data set. Then use the shuffled indices to assign
the points to new arrays for both the observations and the targets. The second
step is to split the data by assigning the first portion of the data points (e.g.,
the first 80% of the points) to the training set and the rest to the test set.
Finally, call the completed function in the main function and return Xtrain,
Xtest, ttrain, and ttest. Note that the same process can be done using the
train_test_split function from the model_selection module in scikit-learn.
Verify this, but make sure you avoid a naming conflict.
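A sketch of the shuffling-and-splitting logic, assuming an 80% training split and illustrative variable names (adjust to match the starter code):

import numpy as np

def train_test_split(X, t, train_ratio=0.8):
    # Shuffle the indices, then reorder observations and targets consistently.
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_shuffled, t_shuffled = X[indices], t[indices]

    # The first train_ratio fraction goes to the training set, the rest to the test set.
    n_train = int(train_ratio * X.shape[0])
    return (X_shuffled[:n_train], X_shuffled[n_train:],
            t_shuffled[:n_train], t_shuffled[n_train:])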
(e) Complete the implementation of the Ridge Regression class in the regression.py
file. The file already includes the Least Squares class that we used in Homework 1.
First complete the fit function using the solution provided in (3.28) in the
textbook. Hint: this can be solved as an Ax = b equation without inverting A directly.
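The key computation is sketched below as a standalone function for clarity; in regression.py it would form the body of the fit method, with the regularization coefficient and the resulting weights stored on the class. The name ridge_fit and the argument alpha are illustrative:

import numpy as np

def ridge_fit(X, t, alpha):
    # Solve (alpha * I + X^T X) w = X^T t, as in (3.28), without forming an explicit inverse.
    A = alpha * np.eye(X.shape[1]) + X.T @ X
    b = X.T @ t
    return np.linalg.solve(A, b)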
(f) Finish the implementation of the Ridge Regression class in the regression.py
file by completing the predict function.
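The corresponding prediction is a single matrix product; a sketch under the same illustrative naming, where w is the weight vector returned by the fit step:

def ridge_predict(X, w):
    # Predicted targets are the linear combination of the features and the learned weights.
    return X @ w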
(g) In the main function in hw3.py, use the RidgeRegression and LinearRegression
classes to train two regression models on the training set. Then use the trained
models to make predictions on the test set.
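Assuming the class and method names above, an alpha constructor argument for the ridge model, and the split variables from part (d) (adjust all of these to match regression.py and your own code), this step might look like:

ridge = RidgeRegression(alpha=1.0)   # assumed constructor argument
ridge.fit(X_train, t_train)
y_ridge = ridge.predict(X_test)

linear = LinearRegression()
linear.fit(X_train, t_train)
y_linear = linear.predict(X_test)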
(h) In a previous homework you implemented the root-mean-squared-error. The
Scikit-learn library also includes many evaluation metrics. Import the R2 and
root-mean-squared-error metrics and use them to evaluate the models you
trained. Print the results.
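A sketch of the evaluation step, reusing the illustrative prediction variables from part (g); note that root_mean_squared_error is only available in recent scikit-learn releases (older releases expose the same quantity through mean_squared_error with squared=False):

from sklearn.metrics import r2_score, root_mean_squared_error

for name, y_pred in [("RidgeRegression", y_ridge), ("LinearRegression", y_linear)]:
    print(name, "R2:", r2_score(t_test, y_pred),
          "RMSE:", root_mean_squared_error(t_test, y_pred))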
(i) In this step, you are comparing the results of your regression models with the
regression models from Scikit-learn. Import the LinearRegression and Ridge
classes from the library and train two models using the same data you used for
training the previous models. Then evaluate the models and print the results.
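Assuming the aliases SkLinearRegression and SkRidge to avoid clashes with your own classes, the comparison might look like:

from sklearn.linear_model import LinearRegression as SkLinearRegression, Ridge as SkRidge

sk_linear = SkLinearRegression().fit(X_train, t_train)
sk_ridge = SkRidge(alpha=1.0).fit(X_train, t_train)   # use the same alpha as your own model

y_sk_linear = sk_linear.predict(X_test)
y_sk_ridge = sk_ridge.predict(X_test)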
(j) In a 2 × 2 plot, use the scatter function to plot the targets ttest vs. y(xtest)
for each model. On each subplot add a line (using the plot function) starting
from the minimum target value to the maximum target value. This line gives us
a view of how predictions are distributed around the ‘mean’ of the data.
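One possible layout with matplotlib; the dictionary keys and the prediction variables are placeholders for the four models you trained:

import matplotlib.pyplot as plt

predictions = {"My LinearRegression": y_linear, "My RidgeRegression": y_ridge,
               "sklearn LinearRegression": y_sk_linear, "sklearn Ridge": y_sk_ridge}

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (name, y_pred) in zip(axes.ravel(), predictions.items()):
    ax.scatter(t_test, y_pred, s=5)
    # Reference line from the minimum to the maximum target value.
    ax.plot([t_test.min(), t_test.max()], [t_test.min(), t_test.max()], "r-")
    ax.set_title(name)
    ax.set_xlabel("target")
    ax.set_ylabel("prediction")
plt.tight_layout()
plt.show()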
(k) If you implement the regression correctly, you should get identical results
from all four models. If your models are predicting worse than the Scikit-learn
models, can you find out why and how to fix it? Explain your fix and show
identical results.