OGBE STEPHEN AGENE
Data Analysis Class Project: Analyzing and
Predicting Housing Prices
Objective
The goal of this project is to explore and analyze a housing dataset, apply various data analysis
techniques, and build predictive models to forecast housing prices. We will use web scraping to
gather additional data, apply Principal Component Analysis (PCA) for dimensionality reduction,
perform ANOVA for feature significance testing, and build regression models to predict house
prices.
1. Introduction
In this project, we will analyze a publicly available housing dataset to study factors influencing
house prices. We will use multiple techniques in data science, including web scraping,
visualization, statistical analysis (ANOVA), and machine learning (PCA, Regression) to gain
insights into the housing market. Our end goal is to predict house prices based on multiple
features and assess the model's performance.
2. Dataset
For this project, we will use the Kaggle Ames Housing Dataset, a popular dataset containing
information on houses sold in Ames, Iowa, with various features like square footage, number of
bedrooms, age of the house, and sale price. We will also use web scraping to gather additional
external data (e.g., real estate trends, interest rates) that might influence house prices.
Kaggle Ames Housing Dataset: This dataset contains 79 features of houses sold in
Ames, Iowa, including both numerical and categorical attributes.
3. Data Collection
Step 1: Data Acquisition
1. Load Dataset:
o Download the dataset from Kaggle
(https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
o Use Python libraries such as Pandas to load the dataset.
2. Web Scraping for Additional Data:
o We will scrape additional data on economic indicators or housing trends from
websites such as Zillow or from public APIs.
o We can use BeautifulSoup and requests in Python for scraping; a minimal sketch
follows this list.
Example URL to scrape: https://www.zillow.com/homes/for_sale/
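As a starting point, the sketch below loads the Kaggle training file with Pandas and fetches one listings page with requests and BeautifulSoup. The filename train.csv matches the Kaggle download; the Zillow URL and the tag selector are placeholders for illustration only, and any real scraping should respect the site's terms of service and robots.txt.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load the Kaggle training file (train.csv ships with the competition download)
df = pd.read_csv('train.csv')
print(df.shape)  # the competition training set has 1460 rows and 81 columns

# Hedged scraping sketch: fetch a listings page and parse it with BeautifulSoup.
# The 'article' selector below is a placeholder; inspect the real page structure first.
url = 'https://www.zillow.com/homes/for_sale/'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
listings = soup.find_all('article')  # placeholder selector
print(f'Fetched {len(listings)} listing elements')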
4. Data Preprocessing
Step 2: Cleaning and Preprocessing
Handling Missing Data: Investigate any missing values and handle them by imputation
or deletion.
Feature Engineering: Convert categorical variables into numerical formats (e.g., one-hot
encoding).
Scaling: Normalize or standardize the data to ensure all features are on the same scale for
PCA and regression. A minimal preprocessing sketch follows this list.
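The sketch below assumes the Kaggle training frame is loaded as df. The median/mode imputation and one-hot encoding are illustrative choices; in the Ames data many categorical NaNs actually mean the feature is absent (e.g., no garage), so domain-specific handling may be preferable.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Impute missing values: medians for numeric columns, modes for categorical ones
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(exclude='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# One-hot encode categorical variables into a separate modeling frame;
# keep df with its original categorical columns for EDA and ANOVA
df_encoded = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)

# Standardize the features (but not the target) for PCA and regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded.drop('SalePrice', axis=1))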
5. Exploratory Data Analysis (EDA)
Step 3: Visualize the Data
Univariate Analysis:
o Histograms and boxplots for numerical features like Sale Price, Square Footage,
Number of Bedrooms, etc.
o Bar plots for categorical variables (e.g., Neighborhood, Garage Type).
Bivariate Analysis:
o Scatter plots to visualize the relationship between continuous variables (e.g., Sale
Price vs. GrLivArea).
o Correlation heatmap to see how variables are correlated with Sale Price.
Multivariate Analysis:
o Pair plots to visualize relationships between multiple variables.
o Facet grid plots for categorical variables against Sale Price.
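A few of these plots can be sketched with seaborn and matplotlib (column names such as GrLivArea come from the Ames dataset; the choice of plots is illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of the target
sns.histplot(df['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.show()

# Bivariate: living area vs. sale price
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df)
plt.title('SalePrice vs. GrLivArea')
plt.show()

# Correlation heatmap of the numeric features most correlated with SalePrice
corr = df.select_dtypes(include='number').corr()
top_features = corr['SalePrice'].abs().sort_values(ascending=False).head(10).index
sns.heatmap(df[top_features].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation with SalePrice (top features)')
plt.show()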
Step 4: Web Scraping Data Integration
If we scrape external data, we will merge it with the original dataset for further analysis.
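As an illustration, if the scraped indicators were aggregated into a small table keyed by sale year and month (a hypothetical trends_df; the MortgageRate column and its values are placeholders, not real data), the join could look like this:

import pandas as pd

# Hypothetical external table: one row per (YrSold, MoSold) pair; values are placeholders
trends_df = pd.DataFrame({
    'YrSold': [2008, 2009, 2010],
    'MoSold': [6, 6, 6],
    'MortgageRate': [6.0, 5.5, 5.0],
})

# Left-join the external indicators onto the housing data by sale year and month
df = df.merge(trends_df, on=['YrSold', 'MoSold'], how='left')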
6. Statistical Analysis
Step 5: ANOVA (Analysis of Variance)
Purpose: Identify which categorical features have a significant effect on house prices.
Hypothesis:
o Null Hypothesis (H₀): There is no significant difference in sale price
between different categories (e.g., neighborhood).
o Alternative Hypothesis (H₁): There is a significant difference in sale price
between different categories.
Implementation:
o Perform ANOVA tests on categorical variables like Neighborhood, BldgType,
and GarageFinish using scipy.stats.f_oneway.
from scipy import stats

# Compare SalePrice distributions across three example neighborhoods
anova_result = stats.f_oneway(
    df['SalePrice'][df['Neighborhood'] == 'OldTown'],
    df['SalePrice'][df['Neighborhood'] == 'Edwards'],
    df['SalePrice'][df['Neighborhood'] == 'Somerst'],
)
print(anova_result)  # F statistic and p-value
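Rather than hand-picking three neighborhoods, the same test can be run across every level of a categorical column; a sketch using groupby:

# Build one group of SalePrice values per neighborhood and test them jointly
groups = [group['SalePrice'].values for _, group in df.groupby('Neighborhood')]
f_stat, p_value = stats.f_oneway(*groups)
print(f'F = {f_stat:.2f}, p = {p_value:.4g}')

A small p-value (e.g., below 0.05) would lead us to reject H₀ and conclude that mean sale prices differ across neighborhoods.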
7. Dimensionality Reduction
Step 6: Principal Component Analysis (PCA)
Purpose: Reduce the number of features while retaining most of the variance in the data.
Implementation:
o Standardize the features and apply PCA to reduce the number of features.
o Visualize the explained variance ratio to decide how many components to keep.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use numeric (or already-encoded) features only; PCA requires numeric input
# and assumes missing values were imputed in the preprocessing step
features = df.drop('SalePrice', axis=1).select_dtypes(include='number')

scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

explained_variance = pca.explained_variance_ratio_
print(explained_variance)  # share of variance captured by each component
Visualize the first two principal components in a scatter plot.
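A sketch of both visualizations, assuming the X_scaled and X_pca arrays from the block above (a full PCA fit is used only to plot the cumulative explained variance):

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance across all components, to choose how many to keep
pca_full = PCA().fit(X_scaled)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_), marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()

# Scatter plot of the first two principal components, colored by SalePrice
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['SalePrice'], cmap='viridis', s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='SalePrice')
plt.show()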
8. Machine Learning Model
Step 7: Linear Regression
Purpose: Build a regression model to predict house prices.
Implementation:
o Split the dataset into training and testing sets.
o Apply a linear regression model to predict SalePrice from the features.
o Evaluate the model using metrics such as Mean Absolute Error (MAE) and R²
score.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Split data (assumes the features have already been imputed and one-hot encoded
# in Step 2, e.g. using the encoded frame from the preprocessing sketch)
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
print(f'R² Score: {r2}')
Step 8: Feature Importance
Use RandomForestRegressor or Lasso regression to evaluate feature importance.
Plot the feature importances to understand which variables matter most in
predicting house prices; a sketch is shown below.
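A sketch using RandomForestRegressor, reusing the X_train/y_train split from Step 7 (n_estimators is an arbitrary illustrative choice, and impurity-based importances are only one of several ways to rank features):

from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import pandas as pd

# Fit a random forest on the (already numeric, imputed) training data
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance and plot the top 15
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().tail(15).plot(kind='barh')
plt.title('Top 15 feature importances (random forest)')
plt.xlabel('Importance')
plt.show()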
9. Conclusion
Summarize the findings from the data analysis, statistical tests, and predictive modeling.
Discuss which features were most significant in determining house prices.
Evaluate the performance of the regression model and suggest improvements (e.g., feature
engineering, using more complex models like XGBoost).
Provide recommendations based on the analysis, such as which factors homebuyers or sellers
should focus on for price negotiation.
10. Future Work
Implement more advanced machine learning models (e.g., Random Forest, XGBoost) to improve
prediction accuracy.
Explore feature interactions and non-linear relationships using techniques like decision trees or
neural networks.
Extend the project by scraping more economic data or market trends to enhance predictions
further.
References
Kaggle Ames Housing Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Scikit-learn: https://scikit-learn.org/stable/
PCA Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html