Document From Jahnavi
DONE BY:
N.R.DIVYASREE
N.JHANSI
Predicting the Sale Price of a House Using Linear Regression
Problem Statement:
Consider a real estate company that has a dataset containing the prices of properties in a
region. It wishes to use the data to optimise the sale prices of the properties based
on important factors such as area, bedrooms, parking, etc.
Essentially, the company wants —
• To identify the variables affecting house prices, e.g. area, number of rooms,
bathrooms, etc.
• To create a linear model that quantitatively relates house prices with variables such
as number of rooms, area, number of bathrooms, etc.
• To know the accuracy of the model, i.e. how well these variables can predict house
prices.
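The model the company is after has the general linear form price ≈ β0 + β1·area + β2·bedrooms + …, with the coefficients estimated from data. A minimal sketch on synthetic data (all numbers and feature names here are illustrative, not from the housing dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic illustration: price depends linearly on two scaled features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))        # columns: area, bedrooms (both 0-1 scaled)
y = X @ np.array([0.7, 0.2]) + 0.05 + rng.normal(scale=0.01, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # recovers roughly [0.7, 0.2] and 0.05
```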
Data
Use housing dataset.
# Data Visualisation
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

housing = pd.read_csv("C:/Users/NrDiv/OneDrive//Desktop/Housing.csv")
housing.head()
OUTPUT:
Data Inspection
housing.shape
housing.info()
housing.describe()
OUTPUT:
(545, 13)
Data Cleaning
fig, axs = plt.subplots(2,3, figsize = (10,5))
plt1 = sns.boxplot(housing['price'], ax = axs[0,0])
plt2 = sns.boxplot(housing['area'], ax = axs[0,1])
plt3 = sns.boxplot(housing['bedrooms'], ax = axs[0,2])
plt1 = sns.boxplot(housing['bathrooms'], ax = axs[1,0])
plt2 = sns.boxplot(housing['stories'], ax = axs[1,1])
plt3 = sns.boxplot(housing['parking'], ax = axs[1,2])
plt.tight_layout()
OUTPUT:
We can also visualise some of these categorical features in parallel by using the hue argument. Below is the plot for furnishingstatus with airconditioning as the hue.
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = housing)
plt.show()
OUTPUT:
Data Preparation
You can see that the dataset has many columns with values of 'Yes' or 'No'. But in order to fit a regression line, we need numerical values, not strings. Hence, we need to convert them to 1s and 0s, where 1 is 'Yes' and 0 is 'No'.
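The conversion code itself is not shown in the original; a typical sketch uses `Series.map` over the binary columns (the toy frame and the full column list are stand-ins assumed from the usual Housing.csv schema):

```python
import pandas as pd

# Toy frame standing in for the housing data (real columns assumed from Housing.csv:
# mainroad, guestroom, basement, hotwaterheating, airconditioning, prefarea).
housing = pd.DataFrame({'mainroad': ['yes', 'no'],
                        'airconditioning': ['no', 'yes']})

varlist = ['mainroad', 'airconditioning']  # extend with the other yes/no columns

# Map each 'yes'/'no' column to 1/0.
housing[varlist] = housing[varlist].apply(lambda col: col.map({'yes': 1, 'no': 0}))
print(housing)
```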
Dummy Variables
The variable furnishingstatus has three levels. We need to convert these levels into integers as well.
For this, we will use dummy variables.
status = pd.get_dummies(housing['furnishingstatus'])
status.head()
OUTPUT:
Now, you don't need three columns. You can drop the furnished column, as the type of furnishing can be identified with just the last two columns, where —
• 00 corresponds to furnished
• 01 corresponds to unfurnished
• 10 corresponds to semi-furnished
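The drop can be done in one step with `drop_first=True`, which removes the alphabetically first level (here, furnished); a self-contained sketch with a toy frame:

```python
import pandas as pd

housing = pd.DataFrame({'furnishingstatus':
                        ['furnished', 'semi-furnished', 'unfurnished']})

# drop_first=True drops the alphabetically first level ('furnished'),
# leaving indicator columns for the other two levels.
status = pd.get_dummies(housing['furnishingstatus'], drop_first=True)
housing = pd.concat([housing.drop('furnishingstatus', axis=1), status], axis=1)
print(housing.columns.tolist())  # ['semi-furnished', 'unfurnished']
```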
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:334:
DataConversionWarning: Data with input dtype int64 were all converted to float64
by MinMaxScaler.
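The warning above comes from a train/test split and MinMaxScaler step whose code is not shown in the original; a typical sketch (the toy frame and the 70/30 split ratio are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy numeric frame standing in for the prepared housing data.
housing = pd.DataFrame({'price': [100, 200, 300, 400, 500, 600],
                        'area': [50, 60, 70, 80, 90, 100]})

df_train, df_test = train_test_split(housing, train_size=0.7, random_state=100)
df_train = df_train.copy()

# Rescale the numeric columns to the 0-1 range; int64 input is converted
# to float64, which is what triggers the DataConversionWarning above.
scaler = MinMaxScaler()
num_vars = ['price', 'area']  # in the notebook: area, bedrooms, ..., price
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
```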
# Let's check the correlation coefficients to see which variables are highly correlated
plt.figure(figsize = (16, 10))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()
OUTPUT:
Model Building
This time, we will use the LinearRegression function from scikit-learn for its compatibility with RFE (a feature-selection utility from sklearn).
RFE (Recursive Feature Elimination)
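The RFE fitting step that defines `col` is not shown in the original; a self-contained sketch on toy data (the column names, toy target, and the number of features to keep are all assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy data; in the notebook X_train/y_train come from the scaled housing split.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.uniform(size=(50, 5)),
                       columns=['area', 'bedrooms', 'bathrooms', 'stories', 'noise'])
y_train = 0.6 * X_train['area'] + 0.3 * X_train['bathrooms']

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain.
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=2)
rfe.fit(X_train, y_train)

col = X_train.columns[rfe.support_]      # selected features
print(col)
print(X_train.columns[~rfe.support_])    # rejected features
```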
print(col)
X_train.columns[~rfe.support_]
OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None, normalize=False)
OUTPUT:
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.611
==============================================================================
Omnibus: 54.330 Durbin-Watson: 2.060
y_train_price = lm.predict(X_train_rfe)
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_price), bins = 20)
fig.suptitle('Error Terms', fontsize = 20) # Plot heading
plt.xlabel('Errors', fontsize = 18)
OUTPUT:
Text(0.5,0,'Errors')
# Residuals of the training predictions
res = y_train - y_train_price
plt.scatter(y_train, res)
plt.show()
OUTPUT:
Model Evaluation
y_test = df_test.pop('price')
X_test = df_test
X_test = sm.add_constant(X_test)
# Creating X_test_new dataframe by dropping variables from X_test
X_test_rfe = X_test[X_train_rfe.columns]
# Making predictions
y_pred = lm.predict(X_test_rfe)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
OUTPUT:
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:334:
DataConversionWarning: Data with input dtype int64 were all converted to float64
by MinMaxScaler.
return self.partial_fit(X, y)
0.579124777439774
We can see that the equation of our best fitted line is:
price = 0.35 × area + 0.20 × bathrooms + 0.19 × stories + 0.10 × airconditioning + 0.10 × parking + 0.11 × prefarea
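As a sanity check, the fitted equation can be evaluated directly. The input values below are illustrative (all variables are on the 0-1 MinMax scale used during training), and the helper function is hypothetical, not part of the notebook:

```python
# Coefficients from the fitted line above (features on the 0-1 MinMax scale).
coef = {'area': 0.35, 'bathrooms': 0.20, 'stories': 0.19,
        'airconditioning': 0.10, 'parking': 0.10, 'prefarea': 0.11}

def predict_scaled_price(x):
    """Predicted (scaled) price for a dict of scaled feature values."""
    return sum(coef[k] * x.get(k, 0.0) for k in coef)

# Illustrative house: large area, mid-range stories, with air conditioning.
example = {'area': 0.8, 'stories': 0.5, 'airconditioning': 1.0}
print(round(predict_scaled_price(example), 3))  # 0.28 + 0.095 + 0.10 = 0.475
```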