Data Analysis Project MAIN

The project aims to analyze a housing dataset to predict housing prices using various data analysis techniques, including web scraping, PCA, ANOVA, and regression modeling. The Kaggle Ames Housing Dataset will be utilized, along with additional data gathered through web scraping, to explore factors influencing house prices. The project will culminate in a predictive model, with an evaluation of its performance and recommendations for future improvements.

Uploaded by

ogbe.619
OGBE STEPHEN AGENE

Data Analysis Class Project: Analyzing and Predicting Housing Prices
Objective
The goal of this project is to explore and analyze a housing dataset, apply various data analysis
techniques, and build predictive models to forecast housing prices. We will use web scraping to
gather additional data, apply Principal Component Analysis (PCA) for dimensionality reduction,
perform ANOVA for feature significance testing, and build regression models to predict house
prices.

1. Introduction
In this project, we will analyze a publicly available housing dataset to study factors influencing
house prices. We will use multiple techniques in data science, including web scraping,
visualization, statistical analysis (ANOVA), and machine learning (PCA, Regression) to gain
insights into the housing market. Our end goal is to predict house prices based on multiple
features and assess the model's performance.

2. Dataset
For this project, we will use the Kaggle Ames Housing Dataset, a popular dataset containing
information on houses sold in Ames, Iowa, with various features like square footage, number of
bedrooms, age of the house, and sale price. We will also use web scraping to gather additional
external data (e.g., real estate trends, interest rates) that might influence house prices.

• Kaggle Ames Housing Dataset: This dataset contains 79 features of houses sold in
Ames, Iowa, including both numerical and categorical attributes.

3. Data Collection
Step 1: Data Acquisition

1. Load Dataset:
o Download the dataset from Kaggle
(https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
o Use Python libraries such as Pandas to load the dataset.
2. Web Scraping for Additional Data:
o We will scrape additional data on economic indicators or housing trends from
websites such as Zillow, or retrieve it from public APIs.
o We can use BeautifulSoup and requests in Python for scraping.

Example URL to scrape: https://www.zillow.com/homes/for_sale/
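
A minimal sketch of the two acquisition steps, assuming the Kaggle training file has been saved locally as train.csv and that the external page exposes a plain HTML table. The helper names, the table id "rates", the column name MortgageRate, and the sample HTML values are hypothetical placeholders, not real data. Check a site's terms of use before scraping it.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def load_ames(path="train.csv"):
    """Load the Kaggle training file (the path is an assumption)."""
    return pd.read_csv(path)


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_rate_table(html):
    """Turn a hypothetical <table id="rates"> of (year, rate) rows into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="rates")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # the header row uses <th>, so it yields no cells
            rows.append({"YrSold": int(cells[0]), "MortgageRate": float(cells[1])})
    return pd.DataFrame(rows)


# Placeholder HTML standing in for a scraped page; the numbers are made up.
SAMPLE_HTML = """
<table id="rates">
  <tr><th>Year</th><th>Rate</th></tr>
  <tr><td>2008</td><td>6.0</td></tr>
  <tr><td>2009</td><td>5.0</td></tr>
</table>
"""

rates = parse_rate_table(SAMPLE_HTML)
print(rates)
```

In a real run, the string passed to parse_rate_table would come from fetch_html(url), and the parsing logic would have to match the actual page structure.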

4. Data Preprocessing
Step 2: Cleaning and Preprocessing

• Handling Missing Data: Investigate any missing values and handle them by imputation
or deletion.
• Feature Engineering: Convert categorical variables into numerical formats (e.g., one-hot
encoding).
• Scaling: Normalize or standardize the data to ensure all features are on the same scale for
PCA and regression.
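
The three preprocessing steps can be sketched as follows. This is one possible approach, not the only one: median imputation for numeric columns and an explicit "Missing" label for categorical ones are assumptions, and the toy rows (using Ames column names) are for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows using Ames column names; values are for illustration only.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "GarageType": ["Attchd", "Attchd", None, "Detchd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

features = df.drop(columns="SalePrice")
num_cols = features.select_dtypes(include="number").columns
cat_cols = features.select_dtypes(exclude="number").columns

# 1. Missing data: median for numeric columns, an explicit label for categorical ones.
numeric = features[num_cols].fillna(features[num_cols].median())
categorical = features[cat_cols].fillna("Missing")

# 2. Feature engineering: one-hot encode the categorical columns.
encoded = pd.get_dummies(categorical, dtype=float)

# 3. Scaling: standardize numeric columns to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(numeric),
    columns=num_cols,
    index=features.index,
)

X = pd.concat([scaled, encoded], axis=1)
print(X.round(2))
```

When a train/test split is involved, fitting the imputation values and the scaler on the training portion only avoids leaking test information into the model.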

5. Exploratory Data Analysis (EDA)


Step 3: Visualize the Data

• Univariate Analysis:
o Histograms and boxplots for numerical features like Sale Price, Square Footage,
Number of Bedrooms, etc.
o Bar plots for categorical variables (e.g., Neighborhood, Garage Type).
• Bivariate Analysis:
o Scatter plots to visualize the relationship between continuous variables (e.g., Sale
Price vs. GrLivArea).
o Correlation heatmap to see how variables are correlated with Sale Price.
• Multivariate Analysis:
o Pair plots to visualize relationships between multiple variables.
o Facet grid plots for categorical variables against Sale Price.
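
A short sketch of a few of these plots using matplotlib and pandas, with a handful of toy rows under Ames column names (illustrative values only). The output file name eda_overview.png is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows using Ames column names; values are for illustration only.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "BedroomAbvGr": [3, 3, 3, 3, 4, 1],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "Mitchel"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Univariate: distribution of the target.
axes[0].hist(df["SalePrice"], bins=5)
axes[0].set_title("SalePrice distribution")

# Univariate (categorical): number of sales per neighborhood.
df["Neighborhood"].value_counts().plot.bar(ax=axes[1], title="Sales per Neighborhood")

# Bivariate: living area against sale price.
axes[2].scatter(df["GrLivArea"], df["SalePrice"])
axes[2].set_xlabel("GrLivArea")
axes[2].set_ylabel("SalePrice")
axes[2].set_title("SalePrice vs. GrLivArea")

fig.tight_layout()
fig.savefig("eda_overview.png")

# Correlation of every numeric feature with the target, strongest first.
corr_with_price = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr_with_price)
```

The same correlation matrix, df.corr(numeric_only=True), is what a heatmap would display.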

Step 4: Web Scraping Data Integration

• If we scrape external data, we will merge it with the original dataset for further analysis.
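
One way such a merge could look, assuming the scraped data is keyed by sale year (the Ames column YrSold). The rates table and its MortgageRate column are hypothetical, and the numbers are made up.

```python
import pandas as pd

# Toy housing rows; YrSold is an Ames column, values are illustrative.
housing = pd.DataFrame({
    "Id": [1, 2, 3],
    "YrSold": [2008, 2007, 2008],
    "SalePrice": [208500, 181500, 223500],
})

# Hypothetical scraped table: one row per year (made-up numbers).
rates = pd.DataFrame({
    "YrSold": [2007, 2008],
    "MortgageRate": [6.3, 6.0],
})

# A left join keeps every house; validate guards against duplicate years in
# the scraped table, which would silently multiply housing rows.
merged = housing.merge(rates, on="YrSold", how="left", validate="many_to_one")

# Houses with no matching external record end up with NaN and need handling.
unmatched = int(merged["MortgageRate"].isna().sum())
print(merged)
print(f"Rows without external data: {unmatched}")
```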
6. Statistical Analysis
Step 5: ANOVA (Analysis of Variance)

• Purpose: Identify which categorical features have a significant effect on house prices.
• Hypothesis:
o Null Hypothesis ($H_0$): The mean sale price is the same across all categories
of the feature (e.g., across neighborhoods).
o Alternative Hypothesis ($H_1$): The mean sale price differs for at least one
category.
• Implementation:
o Perform ANOVA tests on categorical variables like Neighborhood, BldgType,
and GarageFinish using scipy.stats.f_oneway.

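python

from scipy import stats

# One-way ANOVA: compare SalePrice across three neighborhoods.
# df is the DataFrame loaded in Step 1.
anova_result = stats.f_oneway(
    df.loc[df['Neighborhood'] == 'OldTown', 'SalePrice'],
    df.loc[df['Neighborhood'] == 'Edwards', 'SalePrice'],
    df.loc[df['Neighborhood'] == 'Somerst', 'SalePrice'],
)
print(anova_result)  # F statistic and p-value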
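
To test several categorical features without listing each category by hand, the groups can be built with groupby. The sketch below uses toy data, and the helper name anova_by_feature is hypothetical. Categories with a single observation are skipped, which is an assumption about how to treat sparse groups.

```python
import pandas as pd
from scipy import stats

# Toy data using Ames column names; prices are illustrative only.
df = pd.DataFrame({
    "Neighborhood": ["OldTown"] * 3 + ["Edwards"] * 3 + ["Somerst"] * 3,
    "BldgType": ["1Fam", "TwnhsE", "1Fam", "1Fam", "TwnhsE",
                 "1Fam", "TwnhsE", "1Fam", "1Fam"],
    "SalePrice": [100000, 110000, 105000, 125000, 130000,
                  120000, 220000, 230000, 225000],
})


def anova_by_feature(data, feature, target="SalePrice"):
    """Run a one-way ANOVA of target across the categories of feature."""
    groups = [
        group[target].to_numpy()
        for _, group in data.groupby(feature)
        if len(group) > 1  # skip categories with a single observation
    ]
    return stats.f_oneway(*groups)


rows = []
for feature in ["Neighborhood", "BldgType"]:
    result = anova_by_feature(df, feature)
    rows.append({"feature": feature, "F": result.statistic, "p_value": result.pvalue})

# Smallest p-value first: the strongest evidence against H0.
summary = pd.DataFrame(rows).set_index("feature").sort_values("p_value")
print(summary)
```

A small p-value (commonly below 0.05) suggests rejecting the null hypothesis for that feature. When many features are tested, a multiple-comparison correction is worth considering.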

7. Dimensionality Reduction
Step 6: Principal Component Analysis (PCA)

• Purpose: Reduce the number of features while retaining most of the variance in the data.
• Implementation:
o Standardize the numeric features and apply PCA to project them onto a smaller
set of components.
o Visualize the explained variance ratio to decide how many components to keep.

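from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA needs numeric, complete input: keep the numeric columns and fill
# missing values with the column median (one simple choice).
X_num = df.drop('SalePrice', axis=1).select_dtypes(include='number')
X_num = X_num.fillna(X_num.median())

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of total variance captured by each of the two components
explained_variance = pca.explained_variance_ratio_
print(explained_variance)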

• Visualize the first two principal components in a scatter plot.
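
To decide how many components to keep, one common approach is to look at the cumulative explained variance and pick the smallest number that passes a threshold. The sketch below uses synthetic data. The 95% threshold is an assumption, not a rule from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect the full variance profile.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold.
threshold = 0.95  # assumed cut-off
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("Cumulative explained variance:", cumulative.round(3))
print("Components to keep:", n_keep)

# Refit with the chosen number of components.
X_reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
print("Reduced shape:", X_reduced.shape)
```

Plotting cumulative against the component index gives the usual explained-variance curve. scikit-learn also accepts a fraction directly, e.g. PCA(n_components=0.95), which selects the number of components for that variance share.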

8. Machine Learning Model


Step 7: Linear Regression

• Purpose: Build a regression model to predict house prices.
• Implementation:
o Split the dataset into training and testing sets.
o Apply a linear regression model to predict SalePrice from the features.
o Evaluate the model using metrics such as Mean Absolute Error (MAE) and R²
score.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Build the feature matrix. LinearRegression cannot handle strings or NaN,
# so one-hot encode the categorical columns and fill numeric gaps with the
# column median (one simple choice).
X = pd.get_dummies(df.drop('SalePrice', axis=1), dtype=float)
X = X.fillna(X.median())
y = df['SalePrice']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
print(f'R² Score: {r2}')
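
For reference, the two evaluation metrics are defined as follows, where $y_i$ is the observed sale price of test house $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean observed price, and $n$ is the number of test houses:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

MAE is in the same units as SalePrice (dollars), so it reads as the typical size of a prediction error. An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price.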

Step 8: Feature Importance

• Use RandomForestRegressor or Lasso regression to evaluate feature importance.
• Plot the feature importance to understand which variables are the most important in
predicting house prices.
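
A sketch of both options on synthetic data, where the target is constructed so that GrLivArea matters most, OverallQual somewhat, and MoSold not at all. The column names follow the Ames dataset, but the data, the coefficients, and the Lasso alpha value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends on living area and quality, not on month sold.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 500, n),
    "OverallQual": rng.integers(1, 11, n),
    "MoSold": rng.integers(1, 13, n),
})
y = 100 * X["GrLivArea"] + 10_000 * X["OverallQual"] + rng.normal(0, 5_000, n)

# Option 1: impurity-based importances from a random forest (they sum to 1).
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rf_importance = pd.Series(forest.feature_importances_, index=X.columns)

# Option 2: Lasso coefficients on standardized features, so that the
# magnitudes are comparable; weak features are shrunk toward zero.
X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=100.0).fit(X_std, y)
lasso_coef = pd.Series(lasso.coef_, index=X.columns)

print(rf_importance.sort_values(ascending=False))
print(lasso_coef.abs().sort_values(ascending=False))
```

Either Series can be plotted directly, for example with rf_importance.sort_values().plot.barh().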

9. Conclusion
• Summarize the findings from the data analysis, statistical tests, and predictive modeling.
• Discuss which features were most significant in determining house prices.
• Evaluate the performance of the regression model and suggest improvements (e.g., feature
engineering, using more complex models like XGBoost).
• Provide recommendations based on the analysis, such as which factors homebuyers or sellers
should focus on for price negotiation.

10. Future Work


• Implement more advanced machine learning models (e.g., Random Forest, XGBoost) to improve
prediction accuracy.
• Explore feature interactions and non-linear relationships using techniques like decision trees or
neural networks.
• Extend the project by scraping more economic data or market trends to enhance predictions
further.

References
• Kaggle Ames Housing Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
• Web Scraping: BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
• Scikit-learn: https://scikit-learn.org/stable/
• PCA Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
