nhup218/DS202-Final-Project

Final Project Report

Nhu Phan 2025-11-20

Introduction

Housing prices are influenced by a multitude of factors, including the size of the property, its age, quality of construction, and location. Understanding these relationships is important not only for prospective buyers and sellers but also for urban planners, real estate investors, and economists.

This project uses the Ames Housing Dataset from Kaggle, which contains detailed information on 1,460 homes in Ames, Iowa, to analyze which features most strongly affect house prices. By systematically exploring living area, neighborhood, overall quality, and garage size, the goal is to identify the key drivers of housing prices and the patterns that may inform decision-making in real estate markets.

Data

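library(tidyverse)  # read_csv(), dplyr verbs, tidyr, ggplot2
library(scales)     # dollar() for the axis labels in the plots

ames <- read_csv("train.csv")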
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ames_clean <- ames %>%
  select(SalePrice, GrLivArea, OverallQual, YearBuilt, Neighborhood, GarageCars) %>%
  drop_na() %>%
  mutate(Neighborhood = factor(Neighborhood))

glimpse(ames_clean)
## Rows: 1,460
## Columns: 6
## $ SalePrice    <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, 2…
## $ GrLivArea    <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 107…
## $ OverallQual  <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5, …
## $ YearBuilt    <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 193…
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel, Som…
## $ GarageCars   <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, …

The dataset consists of 1,460 homes with 81 variables covering structural, geographic, and qualitative features. For this analysis, the following variables were selected:

SalePrice: Sale price of the house (USD) – target variable.

GrLivArea: Above-ground living area in square feet – continuous predictor.

OverallQual: Overall material and finish quality (1–10) – ordinal predictor.

YearBuilt: Year the house was built – continuous predictor.

Neighborhood: Physical location within Ames – categorical predictor.

GarageCars: Garage size (number of cars it fits) – discrete predictor.

Data cleaning steps included:

Selecting relevant columns.

Removing rows with missing values (drop_na()). None of the six selected columns contained missing values, so all 1,460 rows were retained.

Converting Neighborhood to a factor for categorical analysis.
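
The effect of drop_na() can be checked by counting missing values per column before dropping rows. The following is a minimal sketch on a small made-up tibble standing in for the selected columns; the helper name `count_missing` is hypothetical and is not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: number of NA values in each column of a data frame.
count_missing <- function(df) {
  df %>% summarise(across(everything(), ~ sum(is.na(.x))))
}

# Made-up stand-in for the selected Ames columns (not real data).
toy <- tibble(
  SalePrice  = c(200000, 150000, NA, 180000),
  GrLivArea  = c(1500, 1200, 1700, NA),
  GarageCars = c(2, 1, 2, 2)
)

missing_counts <- count_missing(toy)  # one row, one count per column
toy_clean <- drop_na(toy)             # keeps only complete rows

missing_counts
nrow(toy_clean)
```

On the real data, the same check would be `count_missing(ames_clean)` run before the `drop_na()` step.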

Summary statistics:

summary_stats <- ames_clean %>%
  summarise(
    SalePrice_min = min(SalePrice),
    SalePrice_q1 = quantile(SalePrice, 0.25),
    SalePrice_median = median(SalePrice),
    SalePrice_mean = mean(SalePrice),
    SalePrice_q3 = quantile(SalePrice, 0.75),
    SalePrice_max = max(SalePrice),

    GrLivArea_min = min(GrLivArea),
    GrLivArea_q1 = quantile(GrLivArea, 0.25),
    GrLivArea_median = median(GrLivArea),
    GrLivArea_mean = mean(GrLivArea),
    GrLivArea_q3 = quantile(GrLivArea, 0.75),
    GrLivArea_max = max(GrLivArea),

    OverallQual_min = min(OverallQual),
    OverallQual_q1 = quantile(OverallQual, 0.25),
    OverallQual_median = median(OverallQual),
    OverallQual_mean = mean(OverallQual),
    OverallQual_q3 = quantile(OverallQual, 0.75),
    OverallQual_max = max(OverallQual),

    YearBuilt_min = min(YearBuilt),
    YearBuilt_q1 = quantile(YearBuilt, 0.25),
    YearBuilt_median = median(YearBuilt),
    YearBuilt_mean = mean(YearBuilt),
    YearBuilt_q3 = quantile(YearBuilt, 0.75),
    YearBuilt_max = max(YearBuilt),

    GarageCars_min = min(GarageCars),
    GarageCars_q1 = quantile(GarageCars, 0.25),
    GarageCars_median = median(GarageCars),
    GarageCars_mean = mean(GarageCars),
    GarageCars_q3 = quantile(GarageCars, 0.75),
    GarageCars_max = max(GarageCars)
  )

summary_stats
## # A tibble: 1 × 30
##   SalePrice_min SalePrice_q1 SalePrice_median SalePrice_mean SalePrice_q3
##           <dbl>        <dbl>            <dbl>          <dbl>        <dbl>
## 1         34900       129975           163000        180921.       214000
## # ℹ 25 more variables: SalePrice_max <dbl>, GrLivArea_min <dbl>,
## #   GrLivArea_q1 <dbl>, GrLivArea_median <dbl>, GrLivArea_mean <dbl>,
## #   GrLivArea_q3 <dbl>, GrLivArea_max <dbl>, OverallQual_min <dbl>,
## #   OverallQual_q1 <dbl>, OverallQual_median <dbl>, OverallQual_mean <dbl>,
## #   OverallQual_q3 <dbl>, OverallQual_max <dbl>, YearBuilt_min <dbl>,
## #   YearBuilt_q1 <dbl>, YearBuilt_median <dbl>, YearBuilt_mean <dbl>,
## #   YearBuilt_q3 <dbl>, YearBuilt_max <dbl>, GarageCars_min <dbl>, …
summary_table <- summary_stats %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value")
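| Variable | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|---|
| SalePrice ($) | 34,900 | 129,975 | 163,000 | 180,921 | 214,000 | 755,000 |
| GrLivArea (sq ft) | 334 | 1,130 | 1,464 | 1,515 | 1,777 | 5,642 |
| OverallQual | 1 | 5 | 6 | 6.1 | 7 | 10 |
| YearBuilt | 1872 | 1954 | 1973 | 1971 | 2000 | 2010 |
| GarageCars | 0 | 1 | 2 | 1.77 | 2 | 4 |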
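
The `pivot_longer()` call shown produces a long Statistic/Value listing rather than the one-row-per-variable layout of the table above. One possible way to get that layout is to split each column name at the underscore and pivot back to wide form. This is a sketch, not necessarily how the original table was produced; `toy_stats` is a cut-down stand-in that reuses only the min and max values from the table.

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for summary_stats, using the "Variable_statistic" naming scheme.
toy_stats <- tibble(
  SalePrice_min = 34900, SalePrice_max = 755000,
  GrLivArea_min = 334,   GrLivArea_max = 5642
)

summary_wide <- toy_stats %>%
  # Split "SalePrice_min" into Variable = "SalePrice" and Statistic = "min".
  pivot_longer(
    everything(),
    names_to = c("Variable", "Statistic"),
    names_sep = "_"
  ) %>%
  # One row per variable, one column per statistic.
  pivot_wider(names_from = Statistic, values_from = value)

summary_wide
```

This relies on each column name containing exactly one underscore, which holds for the names used in `summary_stats`.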

Exploration & Findings

Question 1: Does the size of a house (GrLivArea) strongly affect its sale price?

A scatterplot with a fitted regression line shows a clear positive association: larger homes generally sell for higher prices. However, a few outliers exist (very large homes with relatively low prices), which highlights the importance of considering additional factors such as location and quality.

ggplot(ames_clean, aes(GrLivArea, SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Relationship Between Living Area and Sale Price",
    x = "Living Area (sq ft)",
    y = "Sale Price ($)"
  )
## `geom_smooth()` using formula = 'y ~ x'

Observation: GrLivArea has a strong positive linear relationship with SalePrice, though outliers suggest caution in using size alone for predictions.
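
The strength of this relationship can be quantified with a Pearson correlation and the slope of a simple linear regression. The sketch below uses synthetic data with invented numbers, so its output says nothing about the Ames results; it only illustrates the calculation.

```r
# Synthetic stand-in for GrLivArea and SalePrice (invented numbers, not Ames data).
set.seed(1)
toy <- data.frame(GrLivArea = runif(200, 500, 3000))
toy$SalePrice <- 20000 + 100 * toy$GrLivArea + rnorm(200, sd = 30000)

# Pearson correlation between size and price.
r <- cor(toy$GrLivArea, toy$SalePrice)

# Simple linear regression: the same kind of fit geom_smooth(method = "lm") draws.
fit <- lm(SalePrice ~ GrLivArea, data = toy)
slope <- unname(coef(fit)["GrLivArea"])  # estimated dollars per additional square foot

r
slope
```

On the real data, the equivalent calls would be `cor(ames_clean$GrLivArea, ames_clean$SalePrice)` and `lm(SalePrice ~ GrLivArea, data = ames_clean)`.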

Question 2: Do certain neighborhoods have higher average sale prices than others?

Aggregating by neighborhood shows clear differences: NoRidge, NridgtHt, and StoneBr tend to have the highest average sale prices, whereas older or less central neighborhoods like NAmes and OldTown tend to have lower average prices.

ames_clean %>%
  group_by(Neighborhood) %>%
  summarize(AveragePrice = mean(SalePrice)) %>%
  ggplot(aes(x = reorder(Neighborhood, AveragePrice), y = AveragePrice)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Average Sale Price by Neighborhood",
    x = "Neighborhood",
    y = "Average Sale Price ($)"
  )

Observation: Location is a strong predictor of price. Even homes of similar size and quality can vary greatly in price depending on the neighborhood.
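
A mean can be pulled upward by a handful of expensive sales, and some neighborhoods may contain few homes, so it can help to report the count and median next to the average. The sketch below uses a made-up tibble with placeholder neighborhood labels, not Ames data.

```r
library(dplyr)

# Made-up sales in three placeholder neighborhoods (not real data).
toy <- tibble(
  Neighborhood = c("A", "A", "A", "B", "B", "C"),
  SalePrice    = c(100000, 120000, 500000, 200000, 220000, 300000)
)

by_nbhd <- toy %>%
  group_by(Neighborhood) %>%
  summarise(
    n            = n(),               # number of sales in the neighborhood
    mean_price   = mean(SalePrice),   # sensitive to extreme sales
    median_price = median(SalePrice)  # more robust to extreme sales
  ) %>%
  arrange(desc(median_price))

by_nbhd
```

In this toy example, neighborhood A has the highest mean but the lowest median, because a single expensive sale dominates its average.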

Question 3: Does overall quality impact price?

Boxplots of SalePrice by OverallQual indicate a steady, near-linear increase in median price as quality improves. Houses rated 9–10 command substantially higher prices than lower-rated homes.

ggplot(ames_clean, aes(x = as.factor(OverallQual), y = SalePrice)) +
  geom_boxplot(fill = "lightgreen") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price by Overall Quality",
    x = "Overall Quality (1 = Poor, 10 = Excellent)",
    y = "Sale Price ($)"
  )

Insight: Quality is a key factor, and it likely interacts with both size and neighborhood, which matters for predictive modeling.
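
Because OverallQual is an ordinal rating, a rank-based (Spearman) correlation and the median price per quality level are reasonable numeric companions to the boxplots. The sketch below uses a small made-up data frame, not Ames data.

```r
# Made-up quality ratings and prices (not real data).
toy <- data.frame(
  OverallQual = c(3, 4, 5, 5, 6, 7, 8, 9),
  SalePrice   = c(80000, 95000, 120000, 135000, 160000, 210000, 280000, 400000)
)

# Spearman correlation uses ranks, so it does not assume equal spacing between ratings.
rho <- cor(toy$OverallQual, toy$SalePrice, method = "spearman")

# Median sale price at each quality level.
median_by_qual <- aggregate(SalePrice ~ OverallQual, data = toy, FUN = median)

rho
median_by_qual
```

Comparing the step between successive medians gives a rough check on whether the increase is close to linear or accelerates at the top ratings.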

Question 4: Are garages influential in pricing?

Homes with more garage capacity tend to have higher prices, though this effect is smaller than living area or quality. Some homes without garages still sell for high prices if other features (size, location, quality) are strong.

ggplot(ames_clean, aes(x = factor(GarageCars), y = SalePrice)) +
  geom_boxplot(fill = "red") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price by Garage Size",
    x = "Garage Capacity (Cars)",
    y = "Sale Price ($)"
  )

Observation: Garage size is moderately correlated with price, acting as a secondary factor.

Curiosity & Skepticism

Multiple approaches (scatterplots, boxplots, and aggregations) were used to identify relationships.

Outliers and extreme values were investigated to avoid misleading conclusions.

Regression models with combinations of variables (size, quality, location, garage) were tested to confirm intuitions.

Some unexpected trends (e.g., very high-priced small homes) suggested interactions between neighborhood and other features, highlighting the complexity of real estate pricing.
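
The regression models mentioned above are not shown in the report. As an illustration only, the sketch below compares a size-only model with a model using all four predictors by adjusted R². The data are synthetic with invented coefficients, and the model formulas are assumptions rather than the ones actually tested.

```r
# Synthetic data with invented effects (not Ames data).
set.seed(42)
n <- 300
toy <- data.frame(
  GrLivArea    = runif(n, 600, 3000),
  OverallQual  = sample(1:10, n, replace = TRUE),
  GarageCars   = sample(0:3, n, replace = TRUE),
  Neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$SalePrice <- 10000 + 60 * toy$GrLivArea + 15000 * toy$OverallQual +
  5000 * toy$GarageCars + 20000 * (toy$Neighborhood == "C") +
  rnorm(n, sd = 20000)

# Nested models: size alone, then size plus quality, garage, and location.
m_size <- lm(SalePrice ~ GrLivArea, data = toy)
m_full <- lm(SalePrice ~ GrLivArea + OverallQual + GarageCars + Neighborhood,
             data = toy)

# Adjusted R-squared penalizes extra predictors, so the comparison is fairer.
adj_r2 <- c(
  size_only = summary(m_size)$adj.r.squared,
  full      = summary(m_full)$adj.r.squared
)
adj_r2
```

With `ames_clean` in place of `toy`, the same two `lm()` calls would give a first indication of how much the additional variables explain beyond living area.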

Organization of Findings

Size (GrLivArea) strongly influences price.

Location (Neighborhood) has a major impact.

Quality (OverallQual) shows a near-linear relationship with price.

Garage size is a secondary but noticeable factor.

Outliers indicate that no single factor fully predicts price.

Each section has visualizations and summaries supporting these conclusions, with systematic exploration to verify trends.

Conclusions & Future Work

Conclusions: Sale price is most strongly affected by living area, neighborhood, and overall quality, with garage size playing a minor role. Outliers suggest interactions between variables.

Future Questions:

How do remodeling and age of the home affect pricing trends?

Could a predictive model combining all significant features accurately forecast future sales?

How do seasonal factors (month/year sold) interact with other attributes?
