Ita5007 Da2

Uploaded by

Mśď ŃàŃdy

Download an unprocessed dataset and apply some of the data-cleaning processes
that you have learnt in the subject, using Python or R.


Dataset: House Price Dataset of Bengaluru City
1. Import Libraries: In Python, you'll typically use libraries like pandas, numpy,
and matplotlib/seaborn for data analysis and visualization. Start by importing
these libraries.
2. Load the Dataset: Use pandas to read the dataset into a DataFrame.
3. Explore the Data: Use functions like df.head(), df.info(), and df.describe() to
get an overview of the dataset, including missing values.
4. Remove Unwanted Attributes: Some attributes are not needed, so they can
be removed.
5. Handle Missing Values: Identify and handle missing values using methods like
df.isnull(), df.dropna(), or df.fillna().
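Steps 2 to 5 can be sketched roughly as follows, assuming pandas is installed. The column names (area_type, availability, location, size, society, total_sqft, bath, balcony, price), the choice of which columns to drop, and the sample rows are all assumptions made for illustration; a small inline sample stands in for the real CSV file so that the block runs on its own.

```python
import io

import pandas as pd

# Inline sample standing in for the real CSV.
# Column names and values are illustrative assumptions.
SAMPLE_CSV = """area_type,availability,location,size,society,total_sqft,bath,balcony,price
Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2,1,39.07
Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5,3,120.0
Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2,3,62.0
Super built-up Area,Ready To Move,,2 BHK,,1100,,1,45.0
Super built-up Area,Ready To Move,Kothanur,,,1200,2,1,51.0
"""

# Step 2: load the dataset (with the real file, pass its path to pd.read_csv).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Step 3: explore the data.
print(df.head())
df.info()
print(df.describe())
print(df.isnull().sum())  # number of missing values per column

# Step 4: drop attributes assumed not to be needed for price prediction.
df = df.drop(columns=["area_type", "availability", "society", "balcony"])

# Step 5: drop the rows that still contain missing values.
df = df.dropna().reset_index(drop=True)
print(df)
```

Dropping rows is only one option; df.fillna() with a median or a placeholder value may be preferable when too many rows would otherwise be lost.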
6. Feature Engineering: Create new features or modify existing ones to improve
model performance.
• Create a new column named bhk.
• Convert total_sqft to a numeric type, since some of its values are stored as strings.
• Add another column named price per sqft.
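A possible sketch of these three feature-engineering steps is shown below. It assumes that bhk is taken from the leading number of a size column, that non-numeric total_sqft entries are either ranges (averaged here) or values with units (dropped here), and that price is recorded in lakh rupees. None of these details are stated in the text, and total_sqft is converted to float rather than integer so that averaged ranges are not truncated.

```python
import pandas as pd

# Illustrative rows; column names and values are assumptions.
df = pd.DataFrame({
    "size": ["2 BHK", "4 Bedroom", "3 BHK", "1 BHK"],
    "total_sqft": ["1056", "2100 - 2850", "34.46Sq. Meter", "600"],
    "price": [39.07, 120.0, 62.0, 20.0],  # assumed to be in lakh rupees
})

# bhk: the leading number of the size string ("2 BHK" -> 2, "4 Bedroom" -> 4).
df["bhk"] = df["size"].str.split(" ").str[0].astype(int)


def sqft_to_number(value):
    """Convert a total_sqft entry to float; average ranges, return NaN otherwise."""
    parts = str(value).split("-")
    try:
        if len(parts) == 2:  # a range such as "2100 - 2850"
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:  # entries with units, e.g. "34.46Sq. Meter"
        return float("nan")


df["total_sqft"] = df["total_sqft"].apply(sqft_to_number)
df = df.dropna(subset=["total_sqft"]).reset_index(drop=True)

# price_per_sqft: price converted from lakh to rupees, divided by the area.
df["price_per_sqft"] = df["price"] * 100000 / df["total_sqft"]
print(df)
```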

7. Handling Outliers: Identify and handle outliers using statistical methods or
visualization.
• Remove the locations whose location stats (number of samples) are less than 10.
• Remove samples with implausibly little area per bedroom: taking approximately 300
square ft per bedroom as the expected minimum, remove the samples that do not satisfy this rule.
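The two rules above could be applied along these lines. The column names (location, total_sqft, bhk) and the sample rows are assumptions; the thresholds of 10 samples per location and 300 square ft per bedroom come from the text.

```python
import pandas as pd

# Illustrative data: one location with 10 samples, one with only 2.
df = pd.DataFrame({
    "location": ["Whitefield"] * 10 + ["Hebbal"] * 2,
    "total_sqft": [1200.0] * 9 + [500.0] + [1000.0, 900.0],
    "bhk": [2] * 9 + [3] + [2, 2],
})

MIN_SAMPLES_PER_LOCATION = 10  # locations with fewer samples are removed
MIN_SQFT_PER_BEDROOM = 300     # approximate minimum area per bedroom

# Count the samples per location and keep only the well-represented locations.
location_stats = df["location"].value_counts()
frequent = location_stats[location_stats >= MIN_SAMPLES_PER_LOCATION].index
df = df[df["location"].isin(frequent)]

# Keep only samples with at least 300 square ft per bedroom.
df = df[df["total_sqft"] / df["bhk"] >= MIN_SQFT_PER_BEDROOM].reset_index(drop=True)
print(df)
```

An alternative to removing rare locations outright is to relabel them with a single category such as "other", which keeps the rows while still reducing the number of distinct locations.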

• For every location, we find the mean and standard deviation and remove anything
that lies more than one standard deviation below the mean or more than one standard
deviation above it. In this dataset, there are also many samples in which the price
of a 3-bedroom apartment is lower than that of a 2-bedroom apartment with similar
square feet. There may be many reasons for that, but such outliers can affect the
accuracy of the prediction. So we calculate the mean, standard deviation and count
for one- and two-bedroom apartments and filter out the two-bedroom apartments
whose price is less than the mean of the one-bedroom apartments.
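Both filters can be sketched as below. This is an illustration under several assumptions: the statistics are computed on a price_per_sqft column, the comparison is generalized from one and two bedrooms to any n and n-1 bedrooms within the same location, and the min_count parameter (how many smaller apartments are needed before the comparison is trusted) is hypothetical. The function names are also made up for this sketch.

```python
import pandas as pd


def remove_pps_outliers(df):
    """Keep rows within one standard deviation of their location's mean."""
    grouped = df.groupby("location")["price_per_sqft"]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample standard deviation (ddof=1)
    keep = (df["price_per_sqft"] > mean - std) & (df["price_per_sqft"] <= mean + std)
    # Locations with a single row have an undefined std and are dropped.
    return df[keep].reset_index(drop=True)


def remove_bhk_outliers(df, min_count=5):
    """Drop n-bedroom rows priced below the mean of (n-1)-bedroom rows nearby."""
    stats = (
        df.groupby(["location", "bhk"])["price_per_sqft"]
        .agg(["mean", "std", "count"])
        .reset_index()
        .rename(columns={"mean": "smaller_mean", "std": "smaller_std",
                         "count": "smaller_count"})
    )
    # Shift the statistics up by one bedroom so that each row is matched
    # with the statistics of the next smaller apartment size.
    stats["bhk"] = stats["bhk"] + 1
    merged = df.merge(stats, on=["location", "bhk"], how="left")
    is_outlier = (merged["smaller_count"] > min_count) & (
        merged["price_per_sqft"] < merged["smaller_mean"]
    )
    return merged.loc[~is_outlier, df.columns].reset_index(drop=True)


# Illustrative data for a single, made-up location.
sample = pd.DataFrame({
    "location": ["A"] * 7,
    "bhk": [1, 1, 1, 2, 2, 2, 2],
    "price_per_sqft": [5000.0, 5200.0, 4800.0, 4000.0, 5500.0, 6000.0, 20000.0],
})
step1 = remove_pps_outliers(sample)               # drops the 20000 row
step2 = remove_bhk_outliers(step1, min_count=2)   # drops the 4000 two-bedroom row
print(step2)
```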

Before and after outlier removal in RAJAJI NAGAR.


• The majority of the prices lie between 0 and 10000, and the data is approximately
normally distributed.
In this dataset, some samples have more than 10 bathrooms. If such a sample has the
same number of bedrooms, the bathroom count is plausible and it is not an outlier. So, if
the number of bathrooms exceeds the number of bedrooms by 2 or more, the sample is
considered an outlier and removed from the dataset.
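The bathroom rule can be expressed as a single filter. The column names bhk and bath and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Illustrative rows: bedrooms (bhk) and bathrooms (bath).
df = pd.DataFrame({
    "bhk": [2, 3, 4, 10, 3],
    "bath": [2, 5, 7, 11, 4],
})

# A sample is an outlier when bath exceeds bhk by 2 or more,
# so keep only the rows where bath < bhk + 2.
df = df[df["bath"] < df["bhk"] + 2].reset_index(drop=True)
print(df)
```

The row with 11 bathrooms is kept because it has 10 bedrooms, while the rows with 5 and 7 bathrooms are removed because they exceed their bedroom counts by 2 and 3.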
Outliers of bathroom features
8. Visualize the Cleaned Data: Here is one example of a visualization of
the data.

9. After that, build a machine learning model suited to this kind of data.
For this data, random forest or linear regression can be used.
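As a rough illustration of this last step, the sketch below fits both model types with scikit-learn. It uses synthetic data generated on the spot, not the Bengaluru dataset; the feature set (total_sqft, bath, bhk), the price formula, and the hyperparameters are all assumptions, so the printed scores say nothing about the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200
total_sqft = rng.uniform(500, 3000, n)
bhk = rng.integers(1, 5, n)
bath = bhk + rng.integers(0, 2, n)
price = 0.05 * total_sqft + 5 * bhk + rng.normal(0, 5, n)

X = np.column_stack([total_sqft, bath, bhk])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out split
    print(name, round(scores[name], 3))
```

With the real dataset, the location column would also need to be encoded numerically (for example with pd.get_dummies) before it can be passed to either model.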
