property price prediction system for the local property market (Malaysia)
Hi, my project is a property price prediction system for the local property market (Malaysia). I have
already completed the introduction and a brief background study for the documentation. I need
your help with the "study of existing machine learning models": compare existing
models such as multiple linear regression, gradient boosting and artificial neural networks, and decide on
the best one to use for my project (which is multiple linear regression). This is what I have done so far in
my background study; I need help to complete the literature review and touch up the background
study. You may add any section to it if necessary for the project. (Online property price prediction
system)
CHAPTER 2: BACKGROUND STUDY
2.1 Introduction
Many Automated Valuation Models (AVMs) for property valuation exist in Malaysia, but they are kept
private within the companies that develop them. Each company uses its model to help
property valuers set a property price for their clients. Although AVMs that are open to the public are
prevalent in the United States, Australia, Canada, and Singapore, they are still uncommon in
Malaysia.
In Malaysia, only a single company provides its AVM to the public: EdgeProp.my, which
offers services such as property price prediction and rental price
prediction. Therefore, this chapter studies EdgeProp.my, as it is currently the only AVM in
Malaysia open to the public.
Besides that, a study on existing machine learning models is carried out in this chapter to
determine the best machine learning model to utilize for our local property market.
2.2 Study on Existing AVM in Malaysia
2.2.1 User Interface Design
The layout of EdgeProp.my is simple, as it is not divided into rows or columns.
The blue and white colour scheme is consistent throughout the website, and the font used is consistent and
readable. Besides that, the navigation bar stays in the same location on all pages. This helps the
user become familiar with the website’s navigation, as it behaves the same way as the other
sites they already know.
The following figure shows the layout of the EdgeProp.my property price predictor.
Figure 2.1 Layout of EdgeProp.my
The layout of EdgeProp.my places the navigation bar at the top and uses no rows or
columns. The navigation bar also highlights the name of the company. However, the web
elements used in EdgeProp.my are not attractive. More than a quarter of the page is filled with
advertisements that do not relate to the functionality of the AVM.
Figure 2.2 Screenshot of EdgeProp.my
2.2.2 Outcome of Analyzing User Interface Design
A consistent and straightforward user interface is necessary for a good
AVM. It must be clear and free of distractions, as the user only needs to enter their property
specifications on the website to obtain a price or rental prediction.
2.2.3 Features
The AVM of EdgeProp.my gives the option of predicting either the price or the rental of a chosen
property. The user needs to state the property type and either the project’s name, township or
street. If the user selects landed as the property type, they need to choose a sub-property type
from the list given on the website, which consists of bungalow, semi-detached house,
terrace house, townhouse, and cluster house. In addition to the sub-property
type, the user also needs to state the built-up area and the land size of the selected property in square
feet.
If the user selects non-landed as the property type, the block and floor number of the property and
the built-up area in square feet must be entered in the AVM.
After the process above is completed, the system produces a predicted property price.
Rather than giving an exact figure, the system provides a range of values that can be
considered an acceptable property price. Along with the price prediction, the system also shows the last five
transacted properties that match the user’s specifications.
Figure 2.3 Screenshot showing the Prediction Result of the AVM
2.2.4 Outcome of Analyzing Features of AVM
In EdgeProp.my’s AVM, the section for predicting a property’s rental price did not work well.
The system kept returning errors stating that a price could not be derived due to the lack of recent
similar property transactions. The same error also appeared at times when trying to predict
the price of a property located in a less densely populated area.
The system should have limited its locations to places with sufficient transaction data to
avoid these errors. Users living in less densely populated areas will waste their time running into
error after error when using this system. Besides that, the rental prediction feature adds little
value to the website, as most of the property specifications entered returned a prediction
error (even in highly populated areas).
Figure 2.4 Screenshot showing the error in the rental prediction section
2.3 Study on Suitable Machine Learning Models
Human beings require a place to live, and the quality of their lives is influenced by the type of
home they rent or purchase. For most of human history, people have lived in their own homes, but
since the Industrial Revolution a large number of people have moved from rural to urban
areas, where much of the population resides in rented apartments and houses. A large
number of property marketplaces have sprung up as a result, allowing users to search for and book
property based on their specific requirements. Because tourists travel to various
locations and book accommodation through these platforms, the prices of these properties are
highly correlated with the economy and the weather. With advances in the internet
and technology, these platforms have shifted from physical offices to websites, which allows
users to search for their desired accommodation with greater flexibility.
These platforms can show users expected prices based on previous seasons by means of time series
forecasting. Machine learning models are used in conjunction
with time series forecasting models in order to perform these functions effectively.
Initially, expert systems were employed; however, with the availability of more capable hardware
and resources, as well as a large number of algorithms, machine learning algorithms have come
to be employed instead.
2.3.1 Comparison of Machine Learning Models
In this study, we compare different machine learning models used to predict property
prices and analyse their advantages, disadvantages and relative
performance. A number of studies have addressed property price prediction
using different machine learning techniques. In [1], the case of Fairfax County, Virginia is studied
for the housing price prediction problem. The authors analyse the housing data of 2359 townhouses using
multiple machine learning models, namely C4.5, RIPPER, Naïve Bayes and AdaBoost. Among
these techniques, the RIPPER model outperforms the others and achieves better
prediction accuracy. Moreover, [2] proposes a reliable price prediction model for Airbnb using Natural
Language Processing (NLP), machine learning and deep learning approaches to help both
property owners and customers.
Initially, they extract the property owner reviews, customer and property details and then
combine these inputs to extract useful information, which helps the machine learning model to
predict the best price for both the property owner and the customer. They use Support Vector
Regression (SVR), K-means clustering and neural networks for prediction. An ARIMA model
combined with a deep neural network is proposed for forecasting house prices in
[3]. The authors first crawl the data from websites using the Scrapy library and then compare their
approach with different techniques. Similarly, the gradient boosting based method XGBoost is
combined with deep learning algorithms to appraise property in [4].
Ensemble based approaches are also used for property price prediction tasks. In [5], a
rotation based random forest with features selected through Principal Component Analysis (PCA)
is proposed; in experiments with 10-fold cross validation it achieves state of the art
results.
In [6], the authors proposed a Multiple Linear Regression (MLR) based approach for
predicting house prices. They used features such as the number of bedrooms, lot size, number of
stories, driveway, recreation room, full basement and air conditioning.
They compared the proposed approach with
other regression techniques such as Ridge regression, Lasso and Random Forest regression.
Assessment of the results shows that the proposed approach outperforms the other techniques.
Gradient Boosting:
Gradient boosting is a boosting ensemble technique which provides good prediction speed and
accuracy. Boosting is an ensemble technique in which multiple models are trained on the
data in sequence, each new model correcting the errors of the previous ones, and the combination of these
models is used for the prediction. Combining several models in this way gives
less error and better performance than any single one of them.
Gradient boosting relies on the intuition that the best
next model, when combined with the previous ones, minimizes the overall prediction error. The idea is to set the
target outcomes for the next model so that the error is reduced. The target outcome for each case
depends on how much changing that case’s prediction affects the overall prediction error:
1) If a small change in the prediction for a case causes a large drop in error, the
next target outcome of the case is a high value. Predictions from the new model that are
close to its targets will then reduce the error.
2) If a small change in the prediction for a case causes no change in error, the next target outcome of
the case is zero, because changing this prediction does not decrease the error.
The name of this algorithm contains "gradient" because the target outcomes for each new model
are set from the gradient of the loss with respect to the previous model’s predictions. In the space of
possible predictions for each training
example, each new model takes a step in the direction that minimizes prediction error.
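The procedure described above can be illustrated with a minimal sketch. It assumes a squared-error loss, for which the negative gradient is simply the residual (actual value minus current prediction), and it uses a one-split "stump" on a single feature as the weak model. The function names, the learning rate and the sample data are hypothetical and chosen only for illustration; a real system would more likely rely on an established library.

```python
# Minimal gradient boosting for regression with squared-error loss.
# Weak learner: a one-split "stump" on a single feature (illustrative only).

def fit_stump(xs, targets):
    """Find the threshold on x whose two-sided means best fit the targets."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (needs >= 2 distinct x values)
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # initial model: predict the mean
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - pred)^2 with respect to pred is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # next model is trained on these targets
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data: built-up area (sq ft) vs. price (RM thousand).
areas = [600, 800, 1000, 1200, 1500, 1800, 2200, 2600]
prices = [210, 260, 330, 380, 470, 560, 690, 800]

model = gradient_boost(areas, prices)
baseline = lambda x: sum(prices) / len(prices)
print("baseline MSE:", mse(baseline, areas, prices))
print("boosted MSE: ", mse(model, areas, prices))
```

Each round fits the next stump to the residuals of the current ensemble, so the training error can only go down from one round to the next; the learning rate controls how large a step each new model takes.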
Artificial Neural Network:
Neural networks mimic the working of the human brain and try to learn patterns from data for
predictions. The ability of the human brain to make complex decisions makes this approach attractive for solving
real-world problems. Based on this idea, Artificial Neural Network (ANN) based
algorithms were proposed. Neural Networks (NN) were first
proposed in the 1950s, when the idea of the perceptron was introduced. A perceptron is a unit which multiplies
each input x with some weight w, sums the results and then thresholds the sum to produce an output y. The input
X = {x1, x2, x3, ..., xN} consists of multiple input values and the weight W = {w1, w2, w3, ...,
wN} consists of the corresponding weight values. The weighted sum is z = w1*x1 + w2*x2 + ... + wN*xN,
and the output y is the activation (threshold) applied to this sum.
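As a small illustration of the perceptron just described, the sketch below computes the weighted sum of the inputs and thresholds it. The weights and the threshold are hypothetical values picked by hand so that the unit behaves like a logical AND; they are not learned from data.

```python
def perceptron(inputs, weights, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return 1 if z > threshold else 0

# Hypothetical hand-picked weights implementing a logical AND of two binary inputs.
and_weights = [1.0, 1.0]
outputs = [perceptron([a, b], and_weights, threshold=1.5)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```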
The problem with the perceptron is that it can only model a linear decision function: it can solve
linearly separable problems but is unable to
solve non-linearly separable problems. To address this, neural networks apply a non-linear activation
function f(z) to the weighted sum z of the inputs x and the weights w.
Different activation functions have been proposed in the literature. Some of
the well-known ones are sigmoid, Rectified Linear Unit (ReLU) and
tanh.
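The three activation functions named above have simple definitions, sketched here with the Python standard library only. The sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and clips negative values to zero."""
    return max(0.0, z)

def tanh(z):
    """Squashes any real value into the range (-1, 1)."""
    return math.tanh(z)

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), relu(z), round(tanh(z), 3))
```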
Multi-Layer Perceptron:
A Multi-Layer Perceptron (MLP) consists of multiple layers of neurons interconnected with each
other such that the outputs of one layer are passed to the next layer to capture more refined
features. Because of its multiple layers, it learns features automatically, weighting the relevant
input features and combining them into more refined ones.
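To make the layered structure concrete, the following sketch runs a forward pass through a tiny MLP with one hidden layer of two ReLU neurons. The weights are hypothetical and set by hand (not trained) so that the network computes XOR, a classic problem that is not linearly separable and therefore cannot be solved by a single thresholded unit.

```python
def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_xor(x1, x2):
    # Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
    hidden = dense([x1, x2], [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    # Output layer (linear): y = h1 - 2 * h2
    output = dense(hidden, [[1.0, -2.0]], [0.0], lambda z: z)
    return output[0]

results = [mlp_xor(a, b) for a in (0, 1) for b in (0, 1)]
print(results)  # [0.0, 1.0, 1.0, 0.0]
```

The hidden layer turns the two raw inputs into two intermediate features, and the output layer combines them; this is the sense in which later layers work on more refined features than the raw inputs.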
Linear Regression:
Linear regression is an approach to modelling the relationship between one or more input variables and
a target variable. The algorithm is well known in statistics and serves as a baseline for many
machine learning algorithms and techniques. In its simplest form, the output is a linear
function of a single input x with two weights, B0 and B1. B0 is the bias (intercept),
the value of the output when x is zero, and B1 is the slope, the weight multiplied with x. It is
represented mathematically
as:
y = B0 + B1*x
With more than one input, the fitted line becomes a plane or, in higher dimensions, a hyperplane. There are
different types of linear regression depending on usage and application. The simplest
type is simple linear regression, which takes only a single input.
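For the single-input case, the least-squares values of B0 and B1 have a closed form, sketched below. The data are hypothetical and generated exactly from y = 50 + 0.3*x so that the recovered coefficients can be checked by eye.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of the intercept B0 and the slope B1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical data: built-up area (sq ft) and a price that follows y = 50 + 0.3 * x.
areas = [600, 800, 1000, 1200, 1500]
prices = [50 + 0.3 * a for a in areas]

b0, b1 = fit_simple_linear_regression(areas, prices)
print(round(b0, 6), round(b1, 6))
```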
Multiple Linear Regression:
Multiple linear regression (MLR), often known as multiple regression, is a statistical approach
that uses a number of explanatory variables to predict the outcome of a response variable. MLR’s
objective is to model the linear relationship between the explanatory (independent) and
response (dependent) variables. It may be seen
as an extension of simple linear regression, fitted by ordinary least squares (OLS), to more than one
explanatory variable.
It has one weight per input variable, plus the intercept.
An error term ϵ, also called the residual, is added to capture what the model does not explain. It is represented
mathematically as:
yi = B0 + B1*xi1 + B2*xi2 + B3*xi3 + ... + Bn*xin + ϵi
where yi is the dependent variable for observation i, xi1 to xin are the independent variables, B0 is the
intercept and B1 to Bn are the coefficients. The above equation
maps the relationship between independent and dependent variables.
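A minimal sketch of fitting the MLR coefficients by ordinary least squares is given below. It solves the normal equations (X^T X) B = X^T y with a small Gaussian elimination routine so that it needs only the standard library; in practice a numerical library would normally be used instead. The two features (built-up area and number of bedrooms) and the data are hypothetical, generated without noise from y = 20 + 0.25*area + 15*bedrooms.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_mlr(rows, ys):
    """OLS estimates [B0, B1, ..., Bn]; a leading 1 is added to each row for B0."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: (built-up area in sq ft, bedrooms) -> price in RM thousand.
rows = [(800, 2), (1000, 3), (1200, 3), (1500, 4), (1800, 4), (2000, 5)]
ys = [20 + 0.25 * area + 15 * beds for area, beds in rows]

b0, b1, b2 = fit_mlr(rows, ys)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

Because the sample prices contain no noise, the fitted coefficients should match the generating values up to rounding; with real transaction data the residual term ϵ would be non-zero.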
2.3.2 Advantages and Disadvantages of Machine Learning Models
A number of machine learning models, and many approaches based on them, have been
proposed in the literature as discussed above. Each model has certain advantages and
disadvantages depending on its use and application. The problem with neural network-based
approaches is that they do not necessarily perform better on regression tasks, where we
must map the independent variables to the dependent variable and fit a curve to perform
predictions. On the other hand, neural network techniques provide the flexibility to automatically
extract features, which saves the time and effort of manual feature extraction and feature selection.
Linear regression and MLR perform well in regression tasks where the variables are approximately
linearly related.
2.3.3 Outcome of Study on Suitable Machine Learning Models
The study of the above approaches shows the benefits and disadvantages of each. The
comparison shows that the suitable machine learning model for this
problem is MLR, because it shows good results as discussed in the literature and is well suited to
regression tasks.
MLR is preferred over the other algorithms because it is versatile and widely applicable.
All of the models can perform prediction, but MLR also allows us to understand the relationships between
variables using statistical measures such as R-squared, which indicates how much of the
total variability in the data the model explains [7]. Regression analysis also tells us whether the model
is statistically significant, which the other models do not report directly. For example, if we have 40
features, it can indicate which features are
good predictors and which are not, so that the important features can be selected.
Moreover, the other models are black-box models, in which we do not know the inner
workings or how the predictor learns from the data, whereas MLR is an explanatory model which helps
us to understand this process. For each regression coefficient that is estimated, regression analysis
can provide a confidence interval. For each feature, we can obtain a range of plausible coefficient values
at a chosen confidence level (e.g., 99 percent) in addition to a single point estimate [8], [9].
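The two measures mentioned above, R-squared and a confidence interval for a coefficient, can be sketched for the single-input case as follows. This is an illustration under stated assumptions: the data are hypothetical, and the interval uses a normal quantile as a large-sample stand-in for the Student's t quantile that a proper analysis of such a small sample would use (for example via scipy.stats.t.ppf).

```python
import math
from statistics import NormalDist

def regression_summary(xs, ys, confidence=0.95):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)                     # total variation
    r_squared = 1.0 - sse / sst
    # Standard error of the slope from the residual variance (n - 2 degrees of freedom).
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    # ASSUMPTION: normal quantile used as an approximation to the t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {"b0": b0, "b1": b1, "r_squared": r_squared,
            "b1_interval": (b1 - z * se_b1, b1 + z * se_b1)}

# Hypothetical data: built-up area (sq ft) vs. price (RM thousand).
areas = [600, 800, 1000, 1200, 1500, 1800, 2200, 2600]
prices = [210, 260, 330, 380, 470, 560, 690, 800]

summary = regression_summary(areas, prices, confidence=0.99)
print(summary)
```

If the interval for a coefficient excludes zero at the chosen confidence level, the corresponding feature is a statistically significant predictor; this is the kind of evidence that supports selecting or discarding features in MLR.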
Summary of reviewed articles (fields: Article Title; Article Link; Problem / Limitation of Existing
Reviews; Purpose (Study); Goal(s) of Review / Research Questions)

1. Systematic literature reviews in software engineering - A systematic literature review
Article Link: https://www.sciencedirect.com/science/article/pii/S0950584908001390
Purpose (Study): "The purpose of this study is to review the status of EBSE since 2004 using a tertiary
study to review articles related to EBSE and we concentrate on articles describing systematic
literature reviews (SLRs)."
Research Questions:
RQ1. How much SLR activity has there been since 2004?
RQ2. What research topics are being addressed?
RQ3. Who is leading SLR research?
RQ4. What are the limitations of current research?

2. Using machine learning algorithms for housing price prediction: The case of Fairfax County,
Virginia housing data.
Problem / Limitation: Limited number of machine learning models used and less performance.
Purpose (Study): The purpose of this study is to predict the expected price of houses based on the
data of Fairfax houses.
Research Questions:
RQ1. How to perform analysis on the Fairfax housing data?
RQ2. How to find suitable knowledge and decisions from the data?
RQ3. Which model gives the better prediction?

3. Airbnb Price Prediction Using Machine Learning and Sentiment Analysis.
Problem / Limitation: Complex model based on NLP and other techniques. No end-to-end approach
proposed.
Research Questions:
RQ1. How to efficiently parse data from the Airbnb website?
RQ2. How to extract relevant data from the crawled data using NLP approaches?
RQ3. Analysis of results using comparison of machine learning models?
RQ4. How to find the best price recommendation which satisfies both consumer and property owner?

4. House Prices prediction using ARIMA and deep learning model
Problem / Limitation: Model is complex because it combines ARIMA and deep learning.
Research Questions:
RQ1. How to effectively extract important features from the scraped data?
RQ2. Which technique should be effective for feature extraction from natural text?
RQ3. Which performance measures to use?
RQ4. Which algorithms should be compared with the proposed approach?

5. Property price prediction with XGBoost combined with deep learning
Problem / Limitation: Limited data used.
Research Questions:
RQ1. How to propose a new technique to overcome the flaws of traditional techniques?
RQ2. How to perform sale analysis of the technique?
RQ3. What are the key factors which impact the performance of the system?

6. Investigation of rotation forest method applied to property price prediction
Problem / Limitation: Proposed approach is slow.
Research Questions:
RQ1. How to utilize the rotation mechanism with random forest?
RQ2. Which performance measures should be used?
RQ3. Assessment of property prediction model?

7. Housing price prediction using Multiple Linear Regression
Problem / Limitation: -
Research Questions:
RQ1. How to gather the data and clean it?
RQ2. How to perform data pre-processing?
RQ3. Which features are important for training of the model?
References:
[1] Park, Byeonghwa, and Jae Kwon Bae. "Using machine learning algorithms for housing price
prediction: The case of Fairfax County, Virginia housing data." Expert systems with applications 42.6
(2015): 2928-2934.
[2] Rezazadeh Kalehbasti, Pouya, Liubov Nikolenko, and Hoormazd Rezaei. "Airbnb Price Prediction
Using Machine Learning and Sentiment Analysis." International Cross-Domain Conference for Machine
Learning and Knowledge Extraction. Springer, Cham, 2021.
[3] Wang, Feng, et al. "House price prediction approach based on deep learning and ARIMA
model." 2019 IEEE 7th International Conference on Computer Science and Network Technology
(ICCSNT). IEEE, 2019.
[4] Zhao, Yun, Girija Chetty, and Dat Tran. "Deep learning with XGBoost for real estate appraisal." 2019
IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2019.
[5] Lasota, Tadeusz, Tomasz Łuczak, and Bogdan Trawiński. "Investigation of rotation forest method
applied to property price prediction." International Conference on Artificial Intelligence and Soft
Computing. Springer, Berlin, Heidelberg, 2012.
[6] Kaushal, Anirudh, and Achyut Shankar. "House Price Prediction Using Multiple Linear
Regression." Available at SSRN 3833734 (2021).
[7] https://online.stat.psu.edu/stat501/lesson/7 [Date accessed: 27/10/2021]
[8] Bayer, Anita, et al. "A comparison of feature-based MLR and PLS regression techniques for the
prediction of three soil constituents in a degraded South African ecosystem." Applied and Environmental
Soil Science 2012 (2012).
[9] https://www.kdnuggets.com/2021/08/3-reasons-linear-regression-instead-neural-networks.html [Date
accessed: 27/10/2021]