House Price Prediction Using Machine Learning Techniques
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this predictor's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
'The most difficult thing in life is to know yourself.' The quote belongs to Thales of Miletus, a Greek philosopher, mathematician and astronomer. Knowing your data may not be the most difficult thing in data science, but it is time-consuming, which makes it easy to overlook this initial step and jump into the water too soon. We spent most of our time understanding the data.
Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit, because each of its predictions comes from historical data on only the few houses at its leaf. A shallow tree with few leaves will perform poorly, because it fails to capture enough distinctions in the raw data. Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting, but many models have clever ideas that lead to better performance. So we first implemented a random forest to see the results, and after the random forest we used gradient boosting, a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak learners, typically decision trees.
There is much related work with different prediction models and different approaches, such as XGBoost, gradient boosting and random forests. The reference we used to implement our project [5] built its prediction model with linear regression, after data cleaning and data visualization to see how common factors affect house prices. The approach follows:
• Imported the dependencies; for linear regression we used sklearn (a Python library) and imported linear regression from it.
• Initialized LinearRegression to a variable reg.
• Set the labels (output) to the price column, and converted dates to 1's and 0's so that they don't influence the data much.
• Imported another dependency to split the data into train and test sets, 90% and 10% respectively, randomizing the split with random_state.
• Fitted the train data and labels to the linear regression model.
• After fitting, checked the model's score, i.e. its prediction accuracy; in this case it is 73%.
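The steps in the bullet list above can be sketched roughly as follows. The original dataset is not shown, so the column names and the synthetic values here are assumptions; only the workflow (labels from the price column, a date reduced to 1's and 0's, a 90/10 split with random_state, fit, then score) comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: real schema not shown).
rng = np.random.default_rng(0)
sqft = rng.integers(600, 3000, size=40)
df = pd.DataFrame({
    "sqft": sqft,
    "renovated": rng.integers(0, 2, size=40),       # dates reduced to 1's and 0's
    "price": 150 * sqft + rng.normal(0, 10_000, 40) # label column
})

labels = df["price"]                  # set labels (output) to the price column
features = df.drop("price", axis=1)

# 90% train / 10% test, with the split randomized via random_state
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=42)

reg = LinearRegression()              # initialize LinearRegression to reg
reg.fit(x_train, y_train)             # fit the train data and labels

score = reg.score(x_test, y_test)     # R^2 on held-out data (their "73%")
```

Note that `score` here is the coefficient of determination (R^2), which is what sklearn's `score` reports for regressors; the reference's "73%" figure is presumably this value.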
The accuracy of this model is too low to be useful. To achieve satisfactory accuracy, we decided to implement random forest and gradient boosting, with better data cleaning, data exploration and data visualization, on a different data set that covers almost every aspect on which house price prediction depends.

3.1.4 Noisy Data
Fields that contain information which machines cannot comprehend and interpret accurately, such as unstructured text. For example, the dataset had numerous unstructured fields; some contained strange symbols that cannot be read by a machine.
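The random forest and gradient boosting models mentioned above can be sketched with scikit-learn's ensemble estimators. The estimator choices and hyperparameters here are assumptions (the paper does not state its configuration), and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features (assumption: real data not shown).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 4))   # e.g. area, quality, age, rooms
y = 300 * X[:, 0] + 100 * X[:, 1] ** 2 + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: averages many independently grown trees to reduce overfitting
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Gradient boosting: grows shallow trees sequentially, each one fitting
# the residual errors of the ensemble built so far
gb = GradientBoostingRegressor(n_estimators=200, random_state=0)
gb.fit(X_train, y_train)

rf_r2 = rf.score(X_test, y_test)
gb_r2 = gb.score(X_test, y_test)
```

The two ensembles address the deep-vs-shallow tree tension in opposite ways: bagging deep trees (random forest) versus boosting shallow ones (gradient boosting).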
3. Data Set
Searching for a related dataset that contains most of the variables needed to predict house prices was the hard part of this task. The expectation for the dataset was that it should have essential and adequate variables from which a composite decision parameter can be framed and results acquired.

3.1.5 Inconsistent Data
Fields containing inconsistencies (an absence of similarity or comparability between at least two facts). The frequency of this kind of data was high in all of the fields where one fact was represented in various ways: code names, symbols, abbreviations and so on.
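One common remedy for the problem described in 3.1.5, where a single fact appears under several code names or abbreviations, is to map every variant onto one canonical label. A minimal pandas sketch, with an illustrative column and mapping that are assumptions (not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical field where one fact ("New York") is represented in
# various ways, as described in 3.1.5.
df = pd.DataFrame({"city": ["NY", "new york", "N.Y.", "New York", "NYC"]})

canonical = {"ny": "New York", "n.y.": "New York",
             "nyc": "New York", "new york": "New York"}

# Normalize case and whitespace first, then map each variant to one label;
# values with no mapping fall back to their original spelling.
df["city"] = (df["city"].str.strip().str.lower()
              .map(canonical)
              .fillna(df["city"]))
```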
2. Univariable study: We just focus on the dependent variable ('SalePrice') and try to know a little bit more about it. We analyzed the 'SalePrice' variable and its relationship to the categorical and numerical features. Based on that, we concluded:

1) 'GrLivArea' and 'TotalBsmtSF' seem to be linearly related with 'SalePrice'. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of 'TotalBsmtSF', the slope of the linear relationship is particularly high.
2) 'OverallQual' and 'YearBuilt' also seem to be related with 'SalePrice'. The relationship seems stronger in the case of 'OverallQual', where the box plot shows how sale prices increase with the overall quality.

3. Multivariate study: We'll try to understand how the dependent variable and the independent variables relate. To explore the universe, we started with some practical recipes to make sense of our 'plasma soup' (source: http://umich.edu/~gs265/bigbang.htm):
• Correlation matrix (heatmap style).
• 'SalePrice' correlation matrix.

4. Basic cleaning: We'll clean the dataset and handle the missing data, outliers and categorical variables.
Variables like 'PoolQC', 'MiscFeature' and 'FireplaceQu' are strong candidates for outliers, so we deleted them. Regarding 'MasVnrArea' and 'MasVnrType', we consider these variables non-essential; furthermore, they have a strong correlation with 'YearBuilt' and 'OverallQual', which are already considered. We have one missing observation in 'Electrical'; since it is just one observation, we delete that observation and keep the variable.
Univariate analysis: The primary concern here is to establish a threshold that defines an observation as an outlier. To do so, we standardized the data, converting the data values to have a mean of 0 and a standard deviation of 1.

5. Test assumptions: We'll check whether our data meets the assumptions required by most multivariate techniques. According to Hair et al. (2013), four assumptions should be tested: normality, homoscedasticity, linearity, and absence of correlated errors.
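The standardization step used for outlier detection in the cleaning stage above can be sketched as follows. The prices and the |z| > 2 cut-off are illustrative assumptions; the text says only that values were rescaled to mean 0 and standard deviation 1 before a threshold was applied.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sale prices; the extreme value plays the outlier.
prices = np.array([150_000, 180_000, 210_000, 175_000, 160_000,
                   200_000, 190_000, 165_000, 185_000, 900_000], dtype=float)

# Standardize to mean 0, standard deviation 1, as in the text
z = StandardScaler().fit_transform(prices.reshape(-1, 1)).ravel()

# Flag observations beyond the chosen threshold (|z| > 2 here; the exact
# cut-off is an assumption, not stated in the text)
outliers = prices[np.abs(z) > 2]
```

Any observation whose standardized value exceeds the threshold is then a candidate for removal before fitting the models.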
We are currently testing these assumptions.

6. References

[1] House Prices: Advanced Regression Techniques (Dataset)
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

[2] A Beginner's Guide to Recurrent Networks and LSTMs - Deeplearning4j: Open-source, Distributed Deep Learning for the JVM
https://deeplearning4j.org/lstm.html

[3] Using Recurrent Neural Networks in DL4J - Deeplearning4j: Open-source, Distributed Deep Learning for the JVM
https://deeplearning4j.org/usingrnns

[4] Pythonic Data Cleaning With NumPy and Pandas
https://realpython.com/python-data-cleaning-numpy-pandas/

[5] Create a Model to Predict House Prices Using Python
https://towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f

[6] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

[7] How to Implement Random Forest From Scratch in Python
https://machinelearningmastery.com/implement-random-forest-scratch-python/

[8] Bagging and Random Forest Ensemble Algorithms for Machine Learning
https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/