Week 6 LAB

This document discusses using linear regression on the Boston housing dataset to predict housing prices based on various features. It describes loading and exploring the data, preprocessing steps like checking for missing values, and preparing to build a linear regression model to predict median housing prices.


Linear Regression on the Boston Housing Dataset
This dataset was originally part of the UCI Machine Learning Repository and has since been removed. It also ships with the scikit-learn library (where it is deprecated; see the warning below). There are 506 samples and 13 feature variables in this dataset. The objective is to predict the median price of a house from the given features.

The description of all the features is given below:

CRIM: Per capita crime rate by town

ZN: Proportion of residential land zoned for lots over 25,000 sq. ft

INDUS: Proportion of non-retail business acres per town

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX: Nitric oxide concentration (parts per 10 million)

RM: Average number of rooms per dwelling

AGE: Proportion of owner-occupied units built prior to 1940

DIS: Weighted distances to five Boston employment centers

RAD: Index of accessibility to radial highways

TAX: Full-value property-tax rate per $10,000

PTRATIO: Pupil-teacher ratio by town

B: 1000(Bk − 0.63)², where Bk is the proportion of people of African American descent by town

LSTAT: Percentage of lower status of the population

MEDV: Median value of owner-occupied homes in $1000s

Import the required libraries

In [1]: import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

Load the Boston Housing dataset from scikit-learn

In [2]: from sklearn.datasets import load_boston

boston_dataset = load_boston()

# boston_dataset is a dictionary
# let's check what it contains
boston_dataset.keys()

C:\Users\gptkgf\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.

  warnings.warn(msg, category=FutureWarning)

Out[2]: dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])

Load the data into a pandas dataframe

In [3]: boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)


boston.head()

Out[3]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

In [4]: boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
dtypes: float64(13)
memory usage: 51.5 KB

The target values are missing from the dataframe. Create a new column of target values and
add it to the dataframe.

In [5]: boston['PRICE'] = boston_dataset.target

In [6]: boston

Out[6]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

506 rows × 14 columns

Data preprocessing

In [5]: # check for missing values in all the columns
boston.isnull().sum()

Out[5]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
PRICE      0
dtype: int64

Data Visualization

Correlation matrix

In [7]: # compute the pair wise correlation for all columns


correlation_matrix = boston.corr().round(2)

In [8]: # use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)

Out[8]: <AxesSubplot:>
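
To read the heatmap off programmatically rather than by eye, one can sort each feature's correlation with PRICE; a small convenience sketch, not part of the original lab:

In [ ]: # correlation of every feature with the target, strongest first
print(correlation_matrix['PRICE'].drop('PRICE').abs().sort_values(ascending=False))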

Observations

From the above correlation plot we can see that PRICE is strongly correlated with LSTAT and
RM.

RAD and TAX are strongly correlated with each other, so we don't include both of them in our
feature set, to avoid multicollinearity.
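
Beyond eyeballing the heatmap, multicollinearity can be quantified with the variance inflation factor. A sketch using statsmodels (assumed to be installed; not part of the original lab):

In [ ]: import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# a VIF above ~10 is a common rule of thumb for problematic multicollinearity
X_vif = sm.add_constant(boston.drop(columns='PRICE'))
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop('const').sort_values(ascending=False))  # RAD and TAX rank near the top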

In [7]: plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston['PRICE']

for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('PRICE')

Prepare the data for training

In [10]: X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['PRICE']

Split the data into training and testing sets

In [11]: from sklearn.model_selection import train_test_split

# split the data into training and test sets in an 80% : 20% ratio
# assign random_state to any fixed value (5 assumed here; the exact value was
# cut off in the export) -- this ensures a consistent, reproducible split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(404, 2)
(102, 2)
(404,)
(102,)

Train the model using sklearn LinearRegression

In [12]: from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

Out[12]: LinearRegression()


In [13]: # model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))

The model performance for training set
--------------------------------------
RMSE is 5.6371293350711955
R2 score is 0.6300745149331701

The model performance for testing set
--------------------------------------
RMSE is 5.13740078470291
R2 score is 0.6628996975186954
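
The training and test scores are close (RMSE ≈ 5.64 vs 5.14), so the model does not appear badly overfit. For an estimate that depends less on one particular split, the same model can be cross-validated; a sketch, not part of the original lab:

In [ ]: from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validated R² for the same two-feature model;
# shuffling matters because the rows of this dataset are ordered
cv = KFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(LinearRegression(), X, Y, cv=cv, scoring='r2')
print(scores.mean(), scores.std())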

In [14]: # plotting y_test vs y_test_predict
# ideally the points should lie close to the straight line y = x
plt.scatter(Y_test, y_test_predict)
plt.show()
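
The comparison is easier to read with the ideal y = x reference line drawn on the same axes (a small addition, not in the original lab):

In [ ]: plt.scatter(Y_test, y_test_predict)
lims = [min(Y_test.min(), y_test_predict.min()),
        max(Y_test.max(), y_test_predict.max())]
plt.plot(lims, lims, color='red', linestyle='--')  # perfect predictions fall on this line
plt.xlabel('Actual PRICE')
plt.ylabel('Predicted PRICE')
plt.show()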

Gradient Descent

In [80]: import numpy as np
import matplotlib.pyplot as plt

In [81]: %matplotlib inline

def gradient_descent(x, y):
    # start from an initial guess for slope m and intercept b
    m = b = 1
    rate = 0.01  # learning rate
    n = len(x)
    plt.scatter(x, y)
    for i in range(100):
        y_predicted = m * x + b
        plt.plot(x, y_predicted, color='green')  # draw the current fit line
        # gradients of the mean squared error with respect to m and b
        md = -(2/n) * sum(x * (y - y_predicted))
        bd = -(2/n) * sum(y - y_predicted)
        # step against the gradient
        m = m - rate * md
        b = b - rate * bd

In [82]: x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])

In [83]: gradient_descent(x,y)
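
The function above only draws the successive fit lines. A variant that returns the learned parameters (a sketch, not part of the original lab) makes it easy to verify convergence: this data was generated from y = 2x + 3, so m and b should approach 2 and 3.

In [ ]: def gradient_descent_params(x, y, rate=0.01, iterations=10000):
    # same update rule as gradient_descent above, without the plotting
    m = b = 1
    n = len(x)
    for _ in range(iterations):
        y_predicted = m * x + b
        md = -(2/n) * sum(x * (y - y_predicted))
        bd = -(2/n) * sum(y - y_predicted)
        m -= rate * md
        b -= rate * bd
    return m, b

m_fit, b_fit = gradient_descent_params(x, y)
print(m_fit, b_fit)  # ≈ 2.0 and 3.0 for this data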

Polynomial Regression

In [1]: # Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
datas = pd.read_csv('data.csv')
datas

Out[1]: sno Temperature Pressure
0 1 0 0.0002
1 2 20 0.0012
2 3 40 0.0060
3 4 60 0.0300
4 5 80 0.0900
5 6 100 0.2700

In [2]: X = datas.iloc[:, 1:2].values
y = datas.iloc[:, 2].values

In [3]: # Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression

lin = LinearRegression()
lin.fit(X, y)

Out[3]: LinearRegression()

In [13]: # Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)  # fit_transform both fits poly and expands X

lin2 = LinearRegression()
lin2.fit(X_poly, y)

Out[13]: LinearRegression()

In [14]: # Visualising the Linear Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin.predict(X), color='red')
plt.title('Linear Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()


In [15]: # Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')
# transform (not fit_transform) reuses the PolynomialFeatures object fitted above
plt.plot(X, lin2.predict(poly.transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
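
To predict the pressure at a new temperature, the input must be expanded with the same PolynomialFeatures object before it is passed to lin2. A sketch with a hypothetical temperature of 110, chosen only for illustration:

In [ ]: new_temp = [[110]]  # hypothetical input
print(lin.predict(new_temp))                   # linear model
print(lin2.predict(poly.transform(new_temp)))  # degree-4 polynomial model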

Regression

In [1]: # import packages
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]: from sklearn.model_selection import train_test_split

In [3]: df = pd.read_csv('Advertising.csv')
df

Out[3]: TV radio newspaper sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9
... ... ... ... ...
195 38.2 3.7 13.8 7.6
196 94.2 4.9 8.1 9.7
197 177.0 9.3 6.4 12.8
198 283.6 42.0 66.2 25.5
199 232.1 8.6 8.7 13.4

200 rows × 4 columns

In [4]: # dropping rows which have null values
df.dropna(inplace=True, axis=0)
df


Out[4]: TV radio newspaper sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9
... ... ... ... ...
195 38.2 3.7 13.8 7.6
196 94.2 4.9 8.1 9.7
197 177.0 9.3 6.4 12.8
198 283.6 42.0 66.2 25.5
199 232.1 8.6 8.7 13.4

189 rows × 4 columns

In [5]: y = df['sales']
X = df.drop('sales',axis=1)

In [6]: # splitting the dataframe into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [7]: # standardizing the values in each column; the scaler is fitted on the
# training set only, then applied to both sets, to avoid data leakage
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [8]: lr = LinearRegression()
model = lr.fit(X_train, y_train)

In [9]: y_pred = model.predict(X_test)

ydf = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
rslt_df = ydf.sort_values(by='y_test')

In [10]: print(mean_squared_error(y_test, y_pred))  # lower is better

2.7506859249500466

In [11]: print(r2_score(y_test, y_pred))  # closer to 1 is better

0.9148625826187149
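
For comparability with the Boston notebook above, the same MSE can be reported as an RMSE, which is in the units of sales (a one-line sketch):

In [ ]: print(np.sqrt(mean_squared_error(y_test, y_pred)))  # ≈ 1.66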

In [12]: import matplotlib.pyplot as plt

plt.scatter(ydf['y_test'], ydf['y_pred'])
# for a good model the points should lie close to the straight line y = x;
# normality is an assumption about the residuals (y_test - y_pred), checked separately

Out[12]: <matplotlib.collections.PathCollection at 0x2364e818b20>


In [15]: model.coef_

Out[15]: array([ 3.78101153,  2.58704901, -0.03067692])

In [16]: model.intercept_

Out[16]: 13.945454545454544
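
Because the scaler converted X_train to a NumPy array, the coefficients come back unnamed; pairing them with the original column names (a small convenience, not part of the original lab) makes the model easier to read:

In [ ]: coef = pd.Series(model.coef_, index=X.columns)
print(coef)  # TV ≈ 3.78, radio ≈ 2.59, newspaper ≈ -0.03, matching Out[15]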

