
MACHINE LEARNING PROJECT

TITLE: Predicting the Sale Price of a House using Linear Regression

DONE BY:
N.R.DIVYASREE
N.JHANSI

Predicting the Sale Price of a House using Linear Regression

Problem Statement:
Consider a real estate company that has a dataset containing the prices of properties in a
region. It wishes to use the data to optimise the sale prices of the properties based
on important factors such as area, bedrooms, parking, etc.
Essentially, the company wants —
• To identify the variables affecting house prices, e.g. area, number of rooms,
bathrooms, etc.
• To create a linear model that quantitatively relates house prices with variables such
as number of rooms, area, number of bathrooms, etc.
• To know the accuracy of the model, i.e. how well these variables can predict house
prices.
Data
Use the Housing dataset (Housing.csv).

Reading and Understanding the Data


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas package


import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

housing = pd.DataFrame(pd.read_csv("C:/Users/NrDiv/OneDrive//Desktop/Housing.csv"))
housing.head()
OUTPUT:

Data Inspection
housing.shape
housing.info()
housing.describe()

OUTPUT:
(545, 13)
Data Cleaning

# Checking Null values


housing.isnull().sum()*100/housing.shape[0]

OUTPUT:
fig, axs = plt.subplots(2,3, figsize = (10,5))
plt1 = sns.boxplot(housing['price'], ax = axs[0,0])
plt2 = sns.boxplot(housing['area'], ax = axs[0,1])
plt3 = sns.boxplot(housing['bedrooms'], ax = axs[0,2])
plt1 = sns.boxplot(housing['bathrooms'], ax = axs[1,0])
plt2 = sns.boxplot(housing['stories'], ax = axs[1,1])
plt3 = sns.boxplot(housing['parking'], ax = axs[1,2])
plt.tight_layout()
OUTPUT:

# outlier treatment for price


plt.boxplot(housing.price)
q1=housing.price.quantile(0.25)
q3=housing.price.quantile(0.75)
IQR = q3-q1

housing = housing[(housing.price>= q1-1.5*IQR) & (housing.price<= q3+1.5*IQR)]


OUTPUT:

# outlier treatment for area


plt.boxplot(housing.area)
q1 = housing.area.quantile(0.25)
q3 = housing.area.quantile(0.75)
IQR = q3 - q1

housing = housing[(housing.area >= q1 - 1.5*IQR) & (housing.area <= q3 + 1.5*IQR)]


OUTPUT:
# Outlier Analysis
fig, axs = plt.subplots(2,3, figsize = (10,5))
plt1 = sns.boxplot(housing['price'], ax = axs[0,0])
plt2 = sns.boxplot(housing['area'], ax = axs[0,1])
plt3 = sns.boxplot(housing['bedrooms'], ax = axs[0,2])
plt1 = sns.boxplot(housing['bathrooms'], ax = axs[1,0])
plt2 = sns.boxplot(housing['stories'], ax = axs[1,1])

plt3 = sns.boxplot(housing['parking'], ax = axs[1,2])


plt.tight_layout()

OUTPUT:

Exploratory Data Analytics


#VISUALIZING NUMERICAL DATA
sns.pairplot(housing)
plt.show()
OUTPUT:
#visualising Categorical Variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'mainroad', y = 'price', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'guestroom', y = 'price', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'basement', y = 'price', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'hotwaterheating', y = 'price', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'airconditioning', y = 'price', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'furnishingstatus', y = 'price', data = housing)
plt.show()

OUTPUT:
We can also visualise some of these categorical features in parallel by using the hue argument. Below is the plot for furnishingstatus with airconditioning as the hue.
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = housing)
plt.show()
OUTPUT:

Data Preparation
You can see that the dataset has many columns with values 'yes' or 'no'.
But in order to fit a regression line, we need numerical values, not strings. Hence,
we need to convert them to 1s and 0s, where 1 is a 'yes' and 0 is a 'no'.

# List of variables to map


varlist = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
'prefarea']
def binary_map(x):
    return x.map({'yes': 1, 'no': 0})
# Applying the function to the housing list
housing[varlist] = housing[varlist].apply(binary_map)
housing.head()
OUTPUT:

Dummy Variables
The variable furnishingstatus has three levels. We need to convert these levels into
integers as well.
For this, we will use something called dummy variables.

status = pd.get_dummies(housing['furnishingstatus'])
status.head()
OUTPUT:

Now, you don't need three columns. You can drop the furnished column, as the type
of furnishing can be identified with just the last two columns where —
00 will correspond to furnished
01 will correspond to unfurnished
10 will correspond to semi-furnished
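
A quick way to see this mapping is a toy check on the three levels alone (illustrative only, independent of the housing data):

# Toy check of the drop_first encoding: 'furnished' becomes 0/0,
# 'semi-furnished' becomes 1/0 and 'unfurnished' becomes 0/1.
pd.get_dummies(pd.Series(['furnished', 'semi-furnished', 'unfurnished']), drop_first = True)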

status = pd.get_dummies(housing['furnishingstatus'], drop_first = True)


housing = pd.concat([housing, status], axis = 1)
housing.head()
OUTPUT:

# Drop 'furnishingstatus' as we have created the dummies for it


housing.drop(['furnishingstatus'], axis = 1, inplace = True)
housing.head()

OUTPUT:

Splitting the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split


np.random.seed(0)

df_train, df_test = train_test_split(housing, train_size = 0.7, test_size = 0.3, random_state = 100)
Rescaling the Features
As you saw in the demonstration for Simple Linear Regression, scaling does not change the
model's predictive power; it only changes the scale of the coefficients. Here we can see
that, except for area, all the columns have small integer values, so it is important to
rescale the variables so that they have a comparable scale. If we don't have comparable
scales, some of the coefficients obtained by fitting the regression model may be very large
or very small compared to the others, which makes them hard to interpret. There are two
common ways of rescaling (this project uses the first; a sketch of the second is shown below):
1. Min-Max scaling
2. Standardisation (mean = 0, sigma = 1)
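
For reference, a minimal sketch of the Standardisation alternative using scikit-learn's StandardScaler might look like the following (illustrative only; it works on a copy so the Min-Max workflow below is unaffected):

# Standardisation alternative (illustrative sketch; the project itself uses Min-Max scaling below)
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking', 'price']

# Work on a copy so df_train stays untouched; each scaled column ends up
# with mean 0 and standard deviation 1.
df_std = df_train.copy()
df_std[num_vars] = std_scaler.fit_transform(df_std[num_vars])
df_std[num_vars].describe()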

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()

num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']


df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_train.head()

OUTPUT:
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:334:
DataConversionWarning: Data with input dtype int64 were all converted to float64
by MinMaxScaler.

# Let's check the correlation coefficients to see which variables are highly correlated
plt.figure(figsize = (16, 10))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()
OUTPUT:

Dividing into X and Y sets for the model building


y_train = df_train.pop('price')
X_train = df_train

Model Building

This time, we will be using the LinearRegression class from scikit-learn for its
compatibility with RFE (Recursive Feature Elimination), which is also a scikit-learn utility.

from sklearn.feature_selection import RFE


from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm, n_features_to_select = 6)   # select the 6 strongest features
rfe = rfe.fit(X_train, y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
col = X_train.columns[rfe.support_]

print(col)
X_train.columns[~rfe.support_]

OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None, normalize=False)

[('area', True, 1),


('bedrooms', False, 7),
('bathrooms', True, 1),
('stories', True, 1),
('mainroad', False, 5),
('guestroom', False, 6),
('basement', False, 4),
('hotwaterheating', False, 2),
('airconditioning', True, 1),
('parking', True, 1),
('prefarea', True, 1),

('semi-furnished', False, 8),


('unfurnished', False, 3)]
Index(['area', 'bathrooms', 'stories', 'airconditioning', 'parking','prefarea'],
dtype='object')
Index(['bedrooms', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'semi-furnished', 'unfurnished'], dtype='object')

Building the model using statsmodels, for the detailed statistics


# Creating the X_train_rfe dataframe with the RFE-selected variables
X_train_rfe = X_train[col]
# Adding a constant variable
import statsmodels.api as sm
X_train_rfe = sm.add_constant(X_train_rfe)
lm = sm.OLS(y_train,X_train_rfe).fit() # Running the linear model
print(lm.summary())

OUTPUT:
OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.611
Model:                            OLS   Adj. R-squared:                  0.605
Method:                 Least Squares   F-statistic:                     92.83
Date:                Tue, 12 Mar 2019   Prob (F-statistic):           1.31e-69
Time:                        09:18:14   Log-Likelihood:                 222.77
No. Observations:                 361   AIC:                            -431.5
Df Residuals:                     354   BIC:                            -404.3
Df Model:                           6
Covariance Type:            nonrobust
===================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const             0.1097      0.015      7.442      0.000       0.081       0.139
area              0.3502      0.037      9.361      0.000       0.277       0.424
bathrooms         0.2012      0.033      6.134      0.000       0.137       0.266
stories           0.1884      0.026      7.219      0.000       0.137       0.240
airconditioning   0.0965      0.016      5.890      0.000       0.064       0.129
parking           0.1009      0.026      3.916      0.000       0.050       0.152
prefarea          0.1102      0.018      6.288      0.000       0.076       0.145

==============================================================================
Omnibus:                       54.330   Durbin-Watson:                   2.060
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              125.403
Skew:                           0.762   Prob(JB):                     5.87e-28
Kurtosis:                       5.453   Cond. No.                         6.98
==============================================================================

Residual Analysis of the train data


So, now to check whether the error terms are also normally distributed (which is, in fact, one of the
major assumptions of linear regression), let us plot the histogram of the error terms and see
what it looks like.

y_train_price = lm.predict(X_train_rfe)
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_price), bins = 20)
fig.suptitle('Error Terms', fontsize = 20) # Plot heading
plt.xlabel('Errors', fontsize = 18)

OUTPUT:
Text(0.5,0,'Errors')
res = y_train - y_train_price   # residuals
plt.scatter(y_train, res)
plt.show()
OUTPUT:
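
As a complementary check on the normality of the error terms (beyond the histogram above), a Q-Q plot of the residuals can be drawn with statsmodels; a minimal sketch, assuming sm, y_train and y_train_price are defined as above:

# Q-Q plot of residuals: points lying close to the 45-degree line suggest
# the error terms are approximately normally distributed.
res = y_train - y_train_price
sm.qqplot(res, line = '45', fit = True)
plt.show()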
Model Evaluation

num_vars = ['area', 'stories', 'bathrooms', 'airconditioning', 'prefarea', 'parking', 'price']

# Note: strictly, only scaler.transform (with the scaler fitted on the training data)
# should be applied to the test set, to avoid data leakage.
df_test[num_vars] = scaler.fit_transform(df_test[num_vars])

Dividing into X_test and y_test

y_test = df_test.pop('price')
X_test = df_test

X_test = sm.add_constant(X_test)
# Creating X_test_new dataframe by dropping variables from X_test
X_test_rfe = X_test[X_train_rfe.columns]
# Making predictions
y_pred = lm.predict(X_test_rfe)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

# Plotting y_test and y_pred to understand the spread.


fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)   # Plot heading
plt.xlabel('y_test', fontsize = 18)               # X-label
plt.ylabel('y_pred', fontsize = 16)               # Y-label

OUTPUT:
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:334:
DataConversionWarning: Data with input dtype int64 were all converted to float64
by MinMaxScaler.
return self.partial_fit(X, y)
0.579124777439774
Text(0,0.5,'y_pred')

We can see that the equation of our best-fitted line is:

price = 0.35 × area + 0.20 × bathrooms + 0.19 × stories + 0.10 × airconditioning + 0.10 × parking + 0.11 × prefarea
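
As a quick illustration of how this equation is applied, the prediction for a single house can be computed directly from the coefficients. The snippet below uses the rounded coefficients and the intercept (const ≈ 0.11) from the OLS summary above, with purely hypothetical Min-Max-scaled feature values:

# Worked example: plug hypothetical scaled feature values into the fitted equation.
# Coefficients are the rounded values from the OLS summary; the inputs are illustrative only.
const, b_area, b_bath, b_stories, b_ac, b_park, b_pref = 0.11, 0.35, 0.20, 0.19, 0.10, 0.10, 0.11

example = {'area': 0.4, 'bathrooms': 0.5, 'stories': 0.33,
           'airconditioning': 1, 'parking': 0.33, 'prefarea': 0}   # hypothetical scaled inputs

pred_scaled_price = (const
                     + b_area * example['area']
                     + b_bath * example['bathrooms']
                     + b_stories * example['stories']
                     + b_ac * example['airconditioning']
                     + b_park * example['parking']
                     + b_pref * example['prefarea'])
print(pred_scaled_price)   # about 0.55 on the scaled price axis

To express such a prediction in actual price units, it would have to be mapped back through the inverse of the Min-Max scaling that was applied to the price column.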
