8/23/23, 11:42 PM 20MIS1025_Regression.ipynb - Colaboratory
Predicting Continuous Target Variables with Regression Analysis
Overview
Introducing a simple linear regression model
Exploring the Housing Dataset
Visualizing the important characteristics of a dataset
Implementing an ordinary least squares linear regression model
Solving regression for regression parameters with gradient descent
Estimating the coefficient of a regression model via scikit-learn
Evaluating the performance of linear regression models
Summary
from IPython.display import Image
%matplotlib inline
Introducing a simple linear regression model
Exploring the Housing dataset
Source: https://archive.ics.uci.edu/ml/datasets/Housing
Attributes:
1. CRIM     per capita crime rate by town
2. ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS    proportion of non-retail business acres per town
4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX      nitric oxides concentration (parts per 10 million)
6. RM       average number of rooms per dwelling
7. AGE      proportion of owner-occupied units built prior to 1940
8. DIS      weighted distances to five Boston employment centres
9. RAD      index of accessibility to radial highways
10. TAX     full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B       1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT   % lower status of the population
14. MEDV    Median value of owner-occupied homes in $1000's
import pandas as pd
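# Original Housing data source, kept for reference (not loaded in this run):
# df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data', header=None, sep=r'\s+')
# This run loads KDD_Train.csv instead, so the Housing attribute list above does not describe df.
df = pd.read_csv('KDD_Train.csv')
X = df.iloc[:, 22:26].values  # feature columns at positions 22-25
y = df.iloc[:, 27].values     # target column at position 27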
print(X)
[[ 2. 2. 0. 0.]
[ 13. 1. 0. 0.]
[123. 6. 1. 1.]
...
[ 1. 1. 0. 0.]
[144. 8. 1. 1.]
[ 1. 1. 0. 0.]]
Visualizing the important characteristics of a dataset
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
cols = ['count', 'srv_count', 'serror_rate']
sns.pairplot(df[cols], height=2.5)
plt.tight_layout()
# plt.savefig('./figures/scatter.png', dpi=300)
plt.show()
import numpy as np
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(cm,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 15},
                 yticklabels=cols,
                 xticklabels=cols)
# plt.tight_layout()
# plt.savefig('./figures/corr_mat.png', dpi=300)
plt.show()
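Each cell of the heatmap above is a Pearson correlation coefficient. As a minimal sketch of what np.corrcoef computes for one pair of columns, the snippet below evaluates the coefficient by hand and compares it with NumPy's result. The helper name pearson_r and the synthetic arrays u and v are assumptions made for illustration; they are not part of the notebook or its dataset.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # center both variables
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

# Synthetic stand-in data (hypothetical, not the notebook's columns)
rng = np.random.default_rng(0)
u = rng.normal(size=200)
v = 0.5 * u + rng.normal(scale=0.5, size=200)

r_manual = pearson_r(u, v)
# np.corrcoef treats each row as a variable, which is why the notebook passes .T
r_numpy = np.corrcoef(np.vstack([u, v]))[0, 1]
print(f"manual: {r_manual:.4f}, np.corrcoef: {r_numpy:.4f}")
```

The two values should agree up to floating-point error, and the coefficient always lies between -1 and 1.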
sns.reset_orig()
%matplotlib inline
Estimating the coefficient of a regression model via scikit-learn
from sklearn.linear_model import LinearRegression
X = df[['count']].values
y = df['srv_count'].values
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
y_std
array([-0.35434285, -0.36811021, -0.2992734 , ..., -0.36811021,
-0.27173867, -0.36811021])
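The standardized arrays X_std and y_std are computed here, but the fit that follows uses the unscaled X and y. As a hedged sketch of how the standardized versions could be used, the snippet below fits on standardized data and maps predictions back to the original scale with inverse_transform. All *_demo names and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the notebook's columns)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 1))
y_demo = 0.3 * X_demo[:, 0] + 2.5 + rng.normal(scale=1.0, size=50)

# Standardize features and target separately, as the notebook does
sc_x_demo = StandardScaler()
sc_y_demo = StandardScaler()
X_demo_std = sc_x_demo.fit_transform(X_demo)
y_demo_std = sc_y_demo.fit_transform(y_demo[:, np.newaxis]).flatten()

# Fit in standardized space, then undo the target scaling on the predictions
model_std = LinearRegression().fit(X_demo_std, y_demo_std)
pred_std = model_std.predict(X_demo_std)
pred_orig = sc_y_demo.inverse_transform(pred_std[:, np.newaxis]).flatten()

# For plain least squares the back-transformed predictions match a fit on raw data
model_raw = LinearRegression().fit(X_demo, y_demo)
pred_raw = model_raw.predict(X_demo)
print(np.allclose(pred_orig, pred_raw))
```

For ordinary least squares, scaling changes the coefficients but not the predictions once they are mapped back. Scaling matters more for gradient-based or regularized fitting, where the penalty or step size depends on feature scale.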
slr = LinearRegression()
slr.fit(X, y)
y_pred = slr.predict(X)
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)
Slope: 0.299
Intercept: 2.605
y_pred
array([ 3.20265828, 6.48965824, 39.35965793, ..., 2.9038401 ,
45.63483969, 2.9038401 ])
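With a single feature, the slope and intercept reported above have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) - slope * mean(x). The following is a minimal sketch of that calculation on made-up numbers; the function name fit_simple_ols and the arrays x_demo and y_demo are hypothetical and not taken from the notebook.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Small made-up sample (hypothetical values)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = fit_simple_ols(x_demo, y_demo)
# Cross-check against NumPy's degree-1 polynomial fit
ref_slope, ref_intercept = np.polyfit(x_demo, y_demo, deg=1)
print('Slope: %.3f' % slope)
print('Intercept: %.3f' % intercept)
```

LinearRegression solves the same least-squares problem, so on one feature its coef_[0] and intercept_ should agree with these values up to numerical precision.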
def lin_regplot(X, y, model):
    # Scatter the raw samples and overlay the fitted regression line
    plt.scatter(X, y, c='lightblue')
    plt.plot(X, model.predict(X), color='red', linewidth=2)
    return
lin_regplot(X, y, slr)
plt.xlabel('[count]')
plt.ylabel('[srv_count]')
plt.tight_layout()
# plt.savefig('./figures/scikit_lr_fit.png', dpi=300)
plt.show()
Evaluating the performance of linear regression models
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
slr = LinearRegression()
slr.fit(X_train, y_train)
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)
plt.scatter(y_train_pred, y_train_pred - y_train,
            c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test,
            c='lightgreen', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
plt.xlim([-10, 50])
plt.tight_layout()
# plt.savefig('./figures/slr_residuals.png', dpi=300)
plt.show()
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print('MSE train: %.3f, test: %.3f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
    r2_score(y_train, y_train_pred),
    r2_score(y_test, y_test_pred)))
MSE train: 4121.121, test: 4067.767
R^2 train: 0.222, test: 0.221
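To make the two metrics above concrete: MSE is the mean of the squared residuals, and R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of y. An R^2 of about 0.22 therefore means the single-feature model accounts for roughly 22% of the variance in the target. The sketch below computes both by hand on made-up numbers; the helper names mse and r2 and the demo arrays are hypothetical, not from the notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values (hypothetical)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.5, 9.5])

mse_demo = mse(y_true_demo, y_pred_demo)
r2_demo = r2(y_true_demo, y_pred_demo)
print('MSE: %.3f, R^2: %.3f' % (mse_demo, r2_demo))  # MSE: 0.250, R^2: 0.950
```

These definitions match what sklearn.metrics.mean_squared_error and r2_score compute with their default arguments for a single target.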
Using regularized methods for regression
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)
print(lasso.coef_)
[0.29923578]
print('MSE train: %.3f, test: %.3f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
    r2_score(y_train, y_train_pred),
    r2_score(y_test, y_test_pred)))
MSE train: 4121.121, test: 4067.767
R^2 train: 0.222, test: 0.221
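The Lasso scores above are identical, to three decimals, to the unregularized ones. One possible reading is that alpha=0.1 is small relative to the scale of the feature, so the L1 penalty barely shrinks the coefficient. The sketch below illustrates, on synthetic data, how the coefficient shrinks as alpha grows and eventually reaches zero. For a single feature with an intercept, the solution is the soft-thresholded covariance divided by the feature variance. The *_demo names, the alpha grid, and the helper soft_threshold_coef are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (hypothetical): one roughly unit-variance feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

def soft_threshold_coef(x, y, alpha):
    """Single-feature lasso solution for sklearn's objective
    (1 / (2 * n)) * ||y - Xw||^2 + alpha * |w|, with an intercept."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    cov = np.mean(x_c * y_c)
    var = np.mean(x_c ** 2)
    return float(np.sign(cov) * max(abs(cov) - alpha, 0.0) / var)

alphas = [0.01, 0.1, 1.0, 10.0]
coefs = [float(Lasso(alpha=a).fit(X_demo, y_demo).coef_[0]) for a in alphas]
expected = [soft_threshold_coef(X_demo[:, 0], y_demo, a) for a in alphas]

for a, c, e in zip(alphas, coefs, expected):
    print('alpha=%5.2f  coef=%.4f  soft-threshold=%.4f' % (a, c, e))
```

Because the shrinkage is alpha divided by the feature variance, a feature with a large variance is barely affected by a small alpha, while a standardized feature would be shrunk more noticeably.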