Week 6 LAB
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
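import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # deprecated in scikit-learn 1.0, removed in 1.2
%matplotlib inline
boston_dataset = load_boston()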
localhost:8888/nbconvert/html/Downloads/Boston_Housing_Linear_Regression.ipynb?download=false 1/7
12/15/22, 4:10 PM Boston_Housing_Linear_Regression
# boston_dataset is a dictionary
# let's check what it contains
boston_dataset.keys()
C:\Users\gptkgf\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np
        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]
Out[3]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [4]: boston.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
The target values are missing from the data. Create a new column of target values and
add it to the dataframe.
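A minimal sketch of this step, assuming the target is exposed under the `target` key of the loaded dataset and that the new column is named `MEDV` (the name that appears in the missing-value check further down). Because `load_boston` is removed from recent scikit-learn releases, the sketch uses a small hand-made stand-in dictionary: the feature rows are copied from the first table, and the target values are illustrative placeholders.

```python
import numpy as np
import pandas as pd

# Stand-in for the object returned by load_boston(): a dict with the same keys.
# Feature rows are copied from the table above; the target values are placeholders.
boston_dataset = {
    "data": np.array([
        [0.00632, 18.0, 2.31],
        [0.02731, 0.0, 7.07],
        [0.02729, 0.0, 7.07],
    ]),
    "feature_names": ["CRIM", "ZN", "INDUS"],
    "target": np.array([24.0, 21.6, 34.7]),
}

# Build the feature dataframe, then append the target as a new column.
boston = pd.DataFrame(boston_dataset["data"], columns=boston_dataset["feature_names"])
boston["MEDV"] = boston_dataset["target"]

print(boston.columns.tolist())
```

With the real dataset object the same assignment would read `boston['MEDV'] = boston_dataset.target`, since the returned object supports both key and attribute access.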
In [6]: boston
Out[6]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTA
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.9
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.1
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.9
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.3
... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.6
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.0
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.6
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.4
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.8
Data preprocessing
Out[5]:
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
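The cell that produced these counts is not shown. A per-column count of missing values like this is typically obtained with `isnull().sum()`; the sketch below assumes that call and uses a tiny hypothetical dataframe with one deliberately missing value so the effect is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataframe; one RM value is set to NaN on purpose.
df = pd.DataFrame({
    "RM": [6.575, 6.421, np.nan],
    "LSTAT": [4.98, 9.14, 4.03],
})

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

A result of all zeros, as in the output above, means no column needs imputation or row dropping.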
Data Visualization
Correlation matrix
In [8]: # use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
Out[8]: <AxesSubplot:>
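The cell that defines `correlation_matrix` is not shown. A common way to build it, assumed here, is `DataFrame.corr()` followed by rounding so the annotations stay readable. The sketch uses the first five rows of three columns from the table above, not the full dataset, and skips the plot itself.

```python
import pandas as pd

# First five rows of three columns, copied from the table above.
df = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    "TAX": [296.0, 242.0, 242.0, 222.0, 222.0],
})

# Pairwise Pearson correlations, rounded to two decimals for display.
correlation_matrix = df.corr().round(2)
print(correlation_matrix)

# The heatmap call from the notebook would then be:
# sns.heatmap(data=correlation_matrix, annot=True)
```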
Observations
From the correlation plot above, we can see that the house price (the MEDV column) is
strongly correlated with LSTAT and RM.
RAD and TAX are strongly correlated with each other, so we do not include both of them
in our features, to avoid multicollinearity.
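The feature selection and split cells are not shown. The sketch below assumes the two features are LSTAT and RM, that the split is 80/20, and it uses randomly generated stand-in values with the same number of rows (506) as the real data; the `random_state` value is arbitrary. It reproduces the shapes printed next.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data with 506 rows; the values are NOT the real Boston data.
rng = np.random.default_rng(0)
n_rows = 506
boston = pd.DataFrame({
    "LSTAT": rng.uniform(2, 38, n_rows),
    "RM": rng.uniform(3.5, 8.8, n_rows),
    "MEDV": rng.uniform(5, 50, n_rows),
})

# Two features chosen from the correlation observations, plus the target.
X = boston[["LSTAT", "RM"]]
Y = boston["MEDV"]

# Hold out 20% of the rows for testing (random_state chosen arbitrarily).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=5
)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
```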
(404, 2)
(102, 2)
(404,)
(102,)
from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
Out[12]: LinearRegression()
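from sklearn.metrics import mean_squared_error, r2_score

# model evaluation on the training set
y_train_predict = lin_model.predict(X_train)
rmse_train = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2_train = r2_score(Y_train, y_train_predict)

# model evaluation on the test set
y_test_predict = lin_model.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2_test = r2_score(Y_test, y_test_predict)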
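To make the two metrics concrete, here is a small NumPy-only sketch of what they compute: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The helper names and the four sample values are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: residuals are 1, 0, -1, 0.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(rmse(y_true, y_pred))  # sqrt(0.5)
print(r2(y_true, y_pred))    # 1 - 2/20
```

Comparing the training and test values of both metrics is what shows whether the model generalizes or overfits.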
12/15/22, 4:09 PM GradientDescent
In [82]: x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])
In [83]: gradient_descent(x,y)
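The definition of `gradient_descent` is not included in this export. The following is a sketch of one plausible version, assuming plain batch gradient descent on the mean squared error of a line y = m*x + b; the learning rate and iteration count are hypothetical defaults chosen so the loop converges on this data. The sample points lie exactly on y = 2x + 3, so the fitted slope and intercept should approach 2 and 3.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, iterations=2000):
    """Fit y = m*x + b by batch gradient descent on the mean squared error."""
    m = 0.0
    b = 0.0
    n = len(x)
    for _ in range(iterations):
        y_pred = m * x + b
        # Partial derivatives of (1/n) * sum((y - y_pred)^2) w.r.t. m and b.
        m_grad = -(2.0 / n) * np.sum(x * (y - y_pred))
        b_grad = -(2.0 / n) * np.sum(y - y_pred)
        # Step against the gradient.
        m -= learning_rate * m_grad
        b -= learning_rate * b_grad
    return m, b

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])

m, b = gradient_descent(x, y)
print(m, b)
```

The learning rate matters here: for this data, values much above roughly 0.08 make the updates diverge, while very small values need many more iterations.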
12/15/22, 4:07 PM PolynomialRegression
0 1 0 0.0002
1 2 20 0.0012
2 3 40 0.0060
3 4 60 0.0300
4 5 80 0.0900
5 6 100 0.2700
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X, y)
Out[3]: LinearRegression()
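from sklearn.preprocessing import PolynomialFeatures

# expand X into polynomial terms up to degree 4
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# fit a linear model on the expanded features
lin2 = LinearRegression()
lin2.fit(X_poly, y)
Out[13]: LinearRegression()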
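A self-contained sketch of the comparison between the straight-line fit and the degree-4 fit, using the six rows from the table above. The table's column headers are missing from this export, so treating the second column as the single feature and the third as the target is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Values copied from the table above; the feature/target roles are assumed.
X = np.array([0, 20, 40, 60, 80, 100], dtype=float).reshape(-1, 1)
y = np.array([0.0002, 0.0012, 0.0060, 0.0300, 0.0900, 0.2700])

# Baseline: a straight line through the data.
lin = LinearRegression().fit(X, y)

# Degree-4 expansion: each row becomes [1, x, x^2, x^3, x^4].
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Ordinary linear regression on the expanded features.
lin2 = LinearRegression().fit(X_poly, y)

print(X_poly.shape)
print(lin.score(X, y), lin2.score(X_poly, y))
```

The model is still linear in its coefficients; only the inputs are transformed. New inputs must go through `poly.transform` before being passed to `lin2.predict`.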
plt.show()
plt.show()
12/15/22, 4:09 PM regression
In [3]: df = pd.read_csv('Advertising.csv')
df
In [5]: y = df['sales']
X = df.drop('sales',axis=1)
In [8]: lr = LinearRegression()
model = lr.fit(X_train,y_train)
2.7506859249500466
0.9148625826187149
Out[12]: <matplotlib.collections.PathCollection at 0x2364e818b20>
In [15]: model.coef_
In [16]: model.intercept_
Out[16]: 13.945454545454544
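As a reminder of how these two attributes are used, a fitted `LinearRegression` predicts with `intercept_ + X @ coef_`. The sketch below checks this on synthetic, noise-free data; the three feature columns and the coefficient values are hypothetical and unrelated to `Advertising.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with hypothetical coefficients and intercept.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
true_coef = np.array([0.05, 0.2, 0.01])
y = 3.0 + X @ true_coef

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the learned parameters.
manual = model.intercept_ + X @ model.coef_

print(model.coef_)
print(model.intercept_)
print(np.allclose(manual, model.predict(X)))
```

Each entry of `coef_` is the change in the prediction for a one-unit increase in that feature with the others held fixed, and `intercept_` is the prediction when all features are zero.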