
MACHINE LEARNING PROJECT

TITLE: Predicting the Sale Price of a House using Linear Regression

DONE BY:
N.R.DIVYASREE
N.JHANSI

Predicting the Sale Price of a House using Linear Regression

Problem Statement:
Consider a real estate company that has a dataset containing the prices of properties in a
region. It wishes to use the data to optimise the sale prices of the properties based
on important factors such as area, bedrooms, parking, etc.
Essentially, the company wants —
• To identify the variables affecting house prices, e.g. area, number of rooms,
bathrooms, etc.
• To create a linear model that quantitatively relates house prices with variables such
as number of rooms, area, number of bathrooms, etc.
• To know the accuracy of the model, i.e. how well these variables can predict house
prices.
Data
Use the Housing dataset (Housing.csv).

Reading and Understanding the Data


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas package


import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

housing = pd.DataFrame(pd.read_csv("C:/Users/NrDiv/OneDrive//Desktop/Housing.csv"))
housing.head()
OUTPUT:

Data Inspection
housing.shape
housing.info()
housing.describe()

OUTPUT:
(545, 13)
Data Cleaning

# Checking Null values


housing.isnull().sum()*100/housing.shape[0]

OUTPUT:
fig, axs = plt.subplots(2,3, figsize = (10,5))
plt1 = sns.boxplot(housing['price'], ax = axs[0,0])
plt2 = sns.boxplot(housing['area'], ax = axs[0,1])
plt3 = sns.boxplot(housing['bedrooms'], ax = axs[0,2])
plt1 = sns.boxplot(housing['bathrooms'], ax = axs[1,0])
plt2 = sns.boxplot(housing['stories'], ax = axs[1,1])
plt3 = sns.boxplot(housing['parking'], ax = axs[1,2])
plt.tight_layout()
OUTPUT:

# outlier treatment for price


plt.boxplot(housing.price)
q1=housing.price.quantile(0.25)
q3=housing.price.quantile(0.75)
IQR = q3-q1

housing = housing[(housing.price>= q1-1.5*IQR) & (housing.price<= q3+1.5*IQR)]


OUTPUT:

# outlier treatment for area


plt.boxplot(housing.area)
q1 = housing.area.quantile(0.25)
q3 = housing.area.quantile(0.75)
IQR = q3 - q1

housing = housing[(housing.area >= q1 - 1.5*IQR) & (housing.area <= q3 + 1.5*IQR)]


OUTPUT:
# Outlier Analysis
fig, axs = plt.subplots(2,3, figsize = (10,5))
plt1 = sns.boxplot(housing['price'], ax = axs[0,0])
plt2 = sns.boxplot(housing['area'], ax = axs[0,1])
plt3 = sns.boxplot(housing['bedrooms'], ax = axs[0,2])
plt1 = sns.boxplot(housing['bathrooms'], ax = axs[1,0])
plt2 = sns.boxplot(housing['stories'], ax = axs[1,1])

plt3 = sns.boxplot(housing['parking'], ax = axs[1,2])


plt.tight_layout()

OUTPUT:

Exploratory Data Analytics


#VISUALIZING NUMERICAL DATA
sns.pairplot(housing)
plt.show()
OUTPUT:
#visualising Categorical Variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'mainroad', y = 'price', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'guestroom', y = 'price', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'basement', y = 'price', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'hotwaterheating', y = 'price', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'airconditioning', y = 'price', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'furnishingstatus', y = 'price', data = housing)
plt.show()

OUTPUT:
We can also visualise some of these categorical features in parallel by using the hue argument. Below is the plot for furnishingstatus with airconditioning as the hue.
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = housing)
plt.show()
OUTPUT:

Data Preparation
You can see that the dataset has many columns with values 'yes' or 'no'.
But in order to fit a regression line, we need numerical values, not strings. Hence,
we need to convert them to 1s and 0s, where 1 is a 'yes' and 0 is a 'no'.

# List of variables to map


varlist = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
'prefarea']
def binary_map(x):
    return x.map({'yes': 1, 'no': 0})
# Applying the function to the housing list
housing[varlist] = housing[varlist].apply(binary_map)
housing.head()
OUTPUT:

Dummy Variables
The variable furnishingstatus has three levels. We need to convert these levels into
integers as well.
For this, we will use something called dummy variables.

status = pd.get_dummies(housing['furnishingstatus'])
status.head()
OUTPUT:

Now, you don't need three columns. You can drop the furnished column, as the type
of furnishing can be identified with just the last two columns where —
00 will correspond to furnished
01 will correspond to unfurnished
10 will correspond to semi-furnished
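
A quick way to see this mapping is a toy check on the three levels alone (illustrative only, independent of the housing data):

# Toy check of the drop_first encoding: 'furnished' becomes 0/0,
# 'semi-furnished' becomes 1/0 and 'unfurnished' becomes 0/1.
pd.get_dummies(pd.Series(['furnished', 'semi-furnished', 'unfurnished']), drop_first = True)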

status = pd.get_dummies(housing['furnishingstatus'], drop_first = True)


housing = pd.concat([housing, status], axis = 1)
housing.head()
OUTPUT:

# Drop 'furnishingstatus' as we have created the dummies for it


housing.drop(['furnishingstatus'], axis = 1, inplace = True)
housing.head()

OUTPUT:

Splitting the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split


np.random.seed(0)

df_train, df_test = train_test_split(housing, train_size = 0.7, test_size = 0.3, random_state = 100)
Rescaling the Features
As you saw in the demonstration for Simple Linear Regression, scaling does not change the
model's predictive power; it only changes the scale of the coefficients. Here we can see
that, except for area, all the columns have small integer values, so it is important to
rescale the variables so that they have a comparable scale. If we don't have comparable
scales, some of the coefficients obtained by fitting the regression model may be very large
or very small compared to the others, which makes them hard to interpret. There are two
common ways of rescaling (this project uses the first; a sketch of the second is shown below):
1. Min-Max scaling
2. Standardisation (mean = 0, sigma = 1)
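
For reference, a minimal sketch of the Standardisation alternative using scikit-learn's StandardScaler might look like the following (illustrative only; it works on a copy so the Min-Max workflow below is unaffected):

# Standardisation alternative (illustrative sketch; the project itself uses Min-Max scaling below)
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking', 'price']

# Work on a copy so df_train stays untouched; each scaled column ends up
# with mean 0 and standard deviation 1.
df_std = df_train.copy()
df_std[num_vars] = std_scaler.fit_transform(df_std[num_vars])
df_std[num_vars].describe()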

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()

num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']


df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_train.head()

OUTPUT:
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:334:
DataConversionWarning: Data with input dtype int64 were all converted to float64
by MinMaxScaler.

# Let's check the correlation coefficients to see which variables are highly correlated
plt.figure(figsize = (16, 10))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()
OUTPUT:

Dividing into X and Y sets for the model building


y_train = df_train.pop('price')
X_train = df_train

Model Building

This time, we will be using the LinearRegression class from scikit-learn for its
compatibility with RFE (Recursive Feature Elimination), which is also a scikit-learn utility.

from sklearn.feature_selection import RFE


from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm, n_features_to_select = 6)   # select the 6 strongest features
rfe = rfe.fit(X_train, y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
col = X_train.columns[rfe.support_]

print(col)
X_train.columns[~rfe.support_]

OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None, normalize=False)

[('area', True, 1),


('bedrooms', False, 7),
('bathrooms', True, 1),
('stories', True, 1),
('mainroad', False, 5),
('guestroom', False, 6),
('basement', False, 4),
('hotwaterheating', False, 2),
('airconditioning', True, 1),
('parking', True, 1),
('prefarea', True, 1),

('semi-furnished', False, 8),


('unfurnished', False, 3)]
Index(['area', 'bathrooms', 'stories', 'airconditioning', 'parking','prefarea'],
dtype='object')
Index(['bedrooms', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'semi-furnished', 'unfurnished'], dtype='object')

Building the model using statsmodels, for the detailed statistics


# Creating the X_train_rfe dataframe with the RFE-selected variables
X_train_rfe = X_train[col]
# Adding a constant variable
import statsmodels.api as sm
X_train_rfe = sm.add_constant(X_train_rfe)
lm = sm.OLS(y_train,X_train_rfe).fit() # Running the linear model
print(lm.summary())

OUTPUT:
OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.611
Model:                            OLS   Adj. R-squared:                  0.605
Method:                 Least Squares   F-statistic:                     92.83
Date:                Tue, 12 Mar 2019   Prob (F-statistic):           1.31e-69
Time:                        09:18:14   Log-Likelihood:                 222.77
No. Observations:                 361   AIC:                            -431.5
Df Residuals:                     354   BIC:                            -404.3
Df Model:                           6
Covariance Type:            nonrobust
===================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const             0.1097      0.015      7.442      0.000       0.081       0.139
area              0.3502      0.037      9.361      0.000       0.277       0.424
bathrooms         0.2012      0.033      6.134      0.000       0.137       0.266
stories           0.1884      0.026      7.219      0.000       0.137       0.240
airconditioning   0.0965      0.016      5.890      0.000       0.064       0.129
parking           0.1009      0.026      3.916      0.000       0.050       0.152
prefarea          0.1102      0.018      6.288      0.000       0.076       0.145

==============================================================================
Omnibus:                       54.330   Durbin-Watson:                   2.060
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              125.403
Skew:                           0.762   Prob(JB):                     5.87e-28
Kurtosis:                       5.453   Cond. No.                         6.98
==============================================================================

Residual Analysis of the train data


So, now to check whether the error terms are also normally distributed (which is, in fact, one of the
major assumptions of linear regression), let us plot the histogram of the error terms and see
what it looks like.

y_train_price = lm.predict(X_train_rfe)
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_price), bins = 20)
fig.suptitle('Error Terms', fontsize = 20) # Plot heading
plt.xlabel('Errors', fontsize = 18)

OUTPUT:
Text(0.5,0,'Errors')
res = y_train - y_train_price   # residuals
plt.scatter(y_train, res)
plt.show()
OUTPUT:
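
As a complementary check on the normality of the error terms (beyond the histogram above), a Q-Q plot of the residuals can be drawn with statsmodels; a minimal sketch, assuming sm, y_train and y_train_price are defined as above:

# Q-Q plot of residuals: points lying close to the 45-degree line suggest
# the error terms are approximately normally distributed.
res = y_train - y_train_price
sm.qqplot(res, line = '45', fit = True)
plt.show()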
Model Evaluation

num_vars = ['area', 'stories', 'bathrooms', 'airconditioning', 'prefarea', 'parking', 'price']

# Note: strictly, only scaler.transform (with the scaler fitted on the training data)
# should be applied to the test set, to avoid data leakage.
df_test[num_vars] = scaler.fit_transform(df_test[num_vars])

Dividing into X_test and y_test

y_test = df_test.pop('price')
X_test = df_test

X_test = sm.add_constant(X_test)
# Creating X_test_new dataframe by dropping variables from X_test
X_test_rfe = X_test[X_train_rfe.columns]
# Making predictions
y_pred = lm.predict(X_test_rfe)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

# Plotting y_test and y_pred to understand the spread.


fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)   # Plot heading
plt.xlabel('y_test', fontsize = 18)               # X-label
plt.ylabel('y_pred', fontsize = 16)               # Y-label

OUTPUT:
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:334:
DataConversionWarning: Data with input dtype int64 were all converted to float64
by MinMaxScaler.
return self.partial_fit(X, y)
0.579124777439774
Text(0,0.5,'y_pred')

We can see that the equation of our best-fitted line is:

price = 0.35 × area + 0.20 × bathrooms + 0.19 × stories + 0.10 × airconditioning + 0.10 × parking + 0.11 × prefarea
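
As a quick illustration of how this equation is applied, the prediction for a single house can be computed directly from the coefficients. The snippet below uses the rounded coefficients and the intercept (const ≈ 0.11) from the OLS summary above, with purely hypothetical Min-Max-scaled feature values:

# Worked example: plug hypothetical scaled feature values into the fitted equation.
# Coefficients are the rounded values from the OLS summary; the inputs are illustrative only.
const, b_area, b_bath, b_stories, b_ac, b_park, b_pref = 0.11, 0.35, 0.20, 0.19, 0.10, 0.10, 0.11

example = {'area': 0.4, 'bathrooms': 0.5, 'stories': 0.33,
           'airconditioning': 1, 'parking': 0.33, 'prefarea': 0}   # hypothetical scaled inputs

pred_scaled_price = (const
                     + b_area * example['area']
                     + b_bath * example['bathrooms']
                     + b_stories * example['stories']
                     + b_ac * example['airconditioning']
                     + b_park * example['parking']
                     + b_pref * example['prefarea'])
print(pred_scaled_price)   # about 0.55 on the scaled price axis

To express such a prediction in actual price units, it would have to be mapped back through the inverse of the Min-Max scaling that was applied to the price column.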
