OGBE STEPHEN AGENE
Data Analysis Class Project: Analyzing and
Predicting Housing Prices
Objective
The goal of this project is to explore and analyze a housing dataset, apply various data analysis
techniques, and build predictive models to forecast housing prices. We will use web scraping to
gather additional data, apply Principal Component Analysis (PCA) for dimensionality reduction,
perform ANOVA for feature significance testing, and build regression models to predict house
prices.
1. Introduction
In this project, we will analyze a publicly available housing dataset to study factors influencing
house prices. We will use multiple techniques in data science, including web scraping,
visualization, statistical analysis (ANOVA), and machine learning (PCA, Regression) to gain
insights into the housing market. Our end goal is to predict house prices based on multiple
features and assess the model's performance.
2. Dataset
For this project, we will use the Kaggle Ames Housing Dataset, a popular dataset containing
information on houses sold in Ames, Iowa, with various features like square footage, number of
bedrooms, age of the house, and sale price. We will also use web scraping to gather additional
external data (e.g., real estate trends, interest rates) that might influence house prices.
Kaggle Ames Housing Dataset: This dataset contains 79 features of houses sold in
Ames, Iowa, including both numerical and categorical attributes.
3. Data Collection
Step 1: Data Acquisition
1. Load Dataset:
o Download the dataset from Kaggle
(https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
o Use Python libraries such as Pandas to load the dataset.
2. Web Scraping for Additional Data:
o We will scrape additional data on economic indicators or housing trends from
websites such as Zillow or from public APIs.
o We can use BeautifulSoup and requests in Python for scraping; a minimal sketch
follows this list.
Example URL to scrape: https://www.zillow.com/homes/for_sale/
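As a starting point, the sketch below loads the Kaggle training file with Pandas and fetches one listings page with requests and BeautifulSoup. The filename train.csv matches the Kaggle download; the Zillow URL and the tag selector are placeholders for illustration only, and any real scraping should respect the site's terms of service and robots.txt.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load the Kaggle training file (train.csv ships with the competition download)
df = pd.read_csv('train.csv')
print(df.shape)  # the competition training set has 1460 rows and 81 columns

# Hedged scraping sketch: fetch a listings page and parse it with BeautifulSoup.
# The 'article' selector below is a placeholder; inspect the real page structure first.
url = 'https://www.zillow.com/homes/for_sale/'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
listings = soup.find_all('article')  # placeholder selector
print(f'Fetched {len(listings)} listing elements')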
4. Data Preprocessing
Step 2: Cleaning and Preprocessing
Handling Missing Data: Investigate any missing values and handle them by imputation
or deletion.
Feature Engineering: Convert categorical variables into numerical formats (e.g., one-hot
encoding).
Scaling: Normalize or standardize the data to ensure all features are on the same scale for
PCA and regression. A minimal preprocessing sketch follows this list.
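The sketch below assumes the Kaggle training frame is loaded as df. The median/mode imputation and one-hot encoding are illustrative choices; in the Ames data many categorical NaNs actually mean the feature is absent (e.g., no garage), so domain-specific handling may be preferable.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Impute missing values: medians for numeric columns, modes for categorical ones
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(exclude='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# One-hot encode categorical variables into a separate modeling frame;
# keep df with its original categorical columns for EDA and ANOVA
df_encoded = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)

# Standardize the features (but not the target) for PCA and regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded.drop('SalePrice', axis=1))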
5. Exploratory Data Analysis (EDA)
Step 3: Visualize the Data
Univariate Analysis:
o Histograms and boxplots for numerical features like Sale Price, Square Footage,
Number of Bedrooms, etc.
o Bar plots for categorical variables (e.g., Neighborhood, Garage Type).
Bivariate Analysis:
o Scatter plots to visualize the relationship between continuous variables (e.g., Sale
Price vs. GrLivArea).
o Correlation heatmap to see how variables are correlated with Sale Price.
Multivariate Analysis:
o Pair plots to visualize relationships between multiple variables.
o Facet grid plots for categorical variables against Sale Price.
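A few of these plots can be sketched with seaborn and matplotlib (column names such as GrLivArea come from the Ames dataset; the choice of plots is illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of the target
sns.histplot(df['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.show()

# Bivariate: living area vs. sale price
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df)
plt.title('SalePrice vs. GrLivArea')
plt.show()

# Correlation heatmap of the numeric features most correlated with SalePrice
corr = df.select_dtypes(include='number').corr()
top_features = corr['SalePrice'].abs().sort_values(ascending=False).head(10).index
sns.heatmap(df[top_features].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation with SalePrice (top features)')
plt.show()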
Step 4: Web Scraping Data Integration
If we scrape external data, we will merge it with the original dataset for further analysis.
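As an illustration, if the scraped indicators were aggregated into a small table keyed by sale year and month (a hypothetical trends_df; the MortgageRate column and its values are placeholders, not real data), the join could look like this:

import pandas as pd

# Hypothetical external table: one row per (YrSold, MoSold) pair; values are placeholders
trends_df = pd.DataFrame({
    'YrSold': [2008, 2009, 2010],
    'MoSold': [6, 6, 6],
    'MortgageRate': [6.0, 5.5, 5.0],
})

# Left-join the external indicators onto the housing data by sale year and month
df = df.merge(trends_df, on=['YrSold', 'MoSold'], how='left')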
6. Statistical Analysis
Step 5: ANOVA (Analysis of Variance)
Purpose: Identify which categorical features have a significant effect on house prices.
Hypothesis:
o Null Hypothesis (H₀): There is no significant difference in sale price
between different categories (e.g., neighborhood).
o Alternative Hypothesis (H₁): There is a significant difference in sale price
between different categories.
Implementation:
o Perform ANOVA tests on categorical variables like Neighborhood, BldgType,
and GarageFinish using scipy.stats.f_oneway.
from scipy import stats

# Compare SalePrice distributions across three example neighborhoods
anova_result = stats.f_oneway(
    df['SalePrice'][df['Neighborhood'] == 'OldTown'],
    df['SalePrice'][df['Neighborhood'] == 'Edwards'],
    df['SalePrice'][df['Neighborhood'] == 'Somerst'],
)
print(anova_result)  # F statistic and p-value
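Rather than hand-picking three neighborhoods, the same test can be run across every level of a categorical column; a sketch using groupby:

# Build one group of SalePrice values per neighborhood and test them jointly
groups = [group['SalePrice'].values for _, group in df.groupby('Neighborhood')]
f_stat, p_value = stats.f_oneway(*groups)
print(f'F = {f_stat:.2f}, p = {p_value:.4g}')

A small p-value (e.g., below 0.05) would lead us to reject H₀ and conclude that mean sale prices differ across neighborhoods.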
7. Dimensionality Reduction
Step 6: Principal Component Analysis (PCA)
Purpose: Reduce the number of features while retaining most of the variance in the data.
Implementation:
o Standardize the features and apply PCA to reduce the number of features.
o Visualize the explained variance ratio to decide how many components to keep.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use numeric (or already-encoded) features only; PCA requires numeric input
# and assumes missing values were imputed in the preprocessing step
features = df.drop('SalePrice', axis=1).select_dtypes(include='number')

scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

explained_variance = pca.explained_variance_ratio_
print(explained_variance)  # share of variance captured by each component
Visualize the first two principal components in a scatter plot.
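A sketch of both visualizations, assuming the X_scaled and X_pca arrays from the block above (a full PCA fit is used only to plot the cumulative explained variance):

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance across all components, to choose how many to keep
pca_full = PCA().fit(X_scaled)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_), marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()

# Scatter plot of the first two principal components, colored by SalePrice
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['SalePrice'], cmap='viridis', s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='SalePrice')
plt.show()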
8. Machine Learning Model
Step 7: Linear Regression
Purpose: Build a regression model to predict house prices.
Implementation:
o Split the dataset into training and testing sets.
o Apply a linear regression model to predict SalePrice from the features.
o Evaluate the model using metrics such as Mean Absolute Error (MAE) and R²
score.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Split data (assumes the features have already been imputed and one-hot encoded
# in Step 2, e.g. using the encoded frame from the preprocessing sketch)
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
print(f'R² Score: {r2}')
Step 8: Feature Importance
Use RandomForestRegressor or Lasso regression to evaluate feature importance.
Plot the feature importances to understand which variables matter most in
predicting house prices; a sketch is shown below.
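A sketch using RandomForestRegressor, reusing the X_train/y_train split from Step 7 (n_estimators is an arbitrary illustrative choice, and impurity-based importances are only one of several ways to rank features):

from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import pandas as pd

# Fit a random forest on the (already numeric, imputed) training data
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance and plot the top 15
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().tail(15).plot(kind='barh')
plt.title('Top 15 feature importances (random forest)')
plt.xlabel('Importance')
plt.show()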
9. Conclusion
Summarize the findings from the data analysis, statistical tests, and predictive modeling.
Discuss which features were most significant in determining house prices.
Evaluate the performance of the regression model and suggest improvements (e.g., feature
engineering, using more complex models like XGBoost).
Provide recommendations based on the analysis, such as which factors homebuyers or sellers
should focus on for price negotiation.
10. Future Work
Implement more advanced machine learning models (e.g., Random Forest, XGBoost) to improve
prediction accuracy.
Explore feature interactions and non-linear relationships using techniques like decision trees or
neural networks.
Extend the project by scraping more economic data or market trends to enhance predictions
further.
References
Kaggle Ames Housing Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Scikit-learn: https://scikit-learn.org/stable/
PCA Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html