Nhu Phan 2025-11-20
Housing prices are influenced by a multitude of factors, including the size of the property, its age, quality of construction, and location. Understanding these relationships is important not only for prospective buyers and sellers but also for urban planners, real estate investors, and economists.
This project uses the Ames Housing Dataset from Kaggle, which contains detailed information on 1,460 homes in Ames, Iowa, to analyze which features most strongly affect house prices. By systematically exploring variables like living area, neighborhood, overall quality, and garage size, the goal is to determine the key drivers of housing prices and to understand patterns that may inform decision-making in real estate markets.
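library(tidyverse)  # read_csv(), dplyr verbs, tidyr, ggplot2
library(scales)     # dollar() for axis labels
ames <- read_csv("train.csv")
## Rows: 1460 Columns: 81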
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ames_clean <- ames %>%
  select(SalePrice, GrLivArea, OverallQual, YearBuilt, Neighborhood, GarageCars) %>%
  drop_na() %>%
  mutate(Neighborhood = factor(Neighborhood))
glimpse(ames_clean)
## Rows: 1,460
## Columns: 6
## $ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, 2…
## $ GrLivArea <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 107…
## $ OverallQual <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5, …
## $ YearBuilt <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 193…
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel, Som…
## $ GarageCars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, …
The dataset consists of 1,460 homes with 81 variables covering structural, geographic, and qualitative features. For this analysis, the following variables were selected:
SalePrice: Sale price of the house (USD) – target variable.
GrLivArea: Above-ground living area in square feet – continuous predictor.
OverallQual: Overall material and finish quality (1–10) – ordinal predictor.
YearBuilt: Year the house was built – continuous predictor.
Neighborhood: Physical location within Ames – categorical predictor.
GarageCars: Garage size (number of cars it fits) – discrete predictor.
Data cleaning steps included:
Selecting relevant columns.
Removing rows with missing values (drop_na()). None of the selected columns contained missing values, so all 1,460 rows were kept.
Converting Neighborhood to a factor for categorical analysis.
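Before relying on drop_na(), it can help to see how many values are missing in each column and how many rows would be removed. The following is a minimal sketch on a made-up toy table (the `toy` data and its values are assumptions for illustration, not Ames records):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for a few of the selected columns
toy <- tibble(
  SalePrice  = c(200000, 150000, NA, 120000),
  GrLivArea  = c(1500, 1200, 1800, 1000),
  GarageCars = c(2, NA, 2, 1)
)

# Number of missing values in each column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of rows that drop_na() would remove
rows_dropped <- nrow(toy) - nrow(drop_na(toy))

print(na_counts)
print(rows_dropped)
```

Running the same two checks on the selected Ames columns would show whether drop_na() changes the row count at all.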
Summary statistics:
# Six summary statistics per numeric variable; columns are named <variable>_<statistic>
summary_stats <- ames_clean %>%
  summarise(
    across(
      c(SalePrice, GrLivArea, OverallQual, YearBuilt, GarageCars),
      list(
        min    = min,
        q1     = ~ quantile(.x, 0.25),
        median = median,
        mean   = mean,
        q3     = ~ quantile(.x, 0.75),
        max    = max
      )
    )
  )
summary_stats
## # A tibble: 1 × 30
## SalePrice_min SalePrice_q1 SalePrice_median SalePrice_mean SalePrice_q3
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 34900 129975 163000 180921. 214000
## # ℹ 25 more variables: SalePrice_max <dbl>, GrLivArea_min <dbl>,
## # GrLivArea_q1 <dbl>, GrLivArea_median <dbl>, GrLivArea_mean <dbl>,
## # GrLivArea_q3 <dbl>, GrLivArea_max <dbl>, OverallQual_min <dbl>,
## # OverallQual_q1 <dbl>, OverallQual_median <dbl>, OverallQual_mean <dbl>,
## # OverallQual_q3 <dbl>, OverallQual_max <dbl>, YearBuilt_min <dbl>,
## # YearBuilt_q1 <dbl>, YearBuilt_median <dbl>, YearBuilt_mean <dbl>,
## # YearBuilt_q3 <dbl>, YearBuilt_max <dbl>, GarageCars_min <dbl>, …
summary_table <- summary_stats %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value")

| Variable | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|---|
| SalePrice ($) | 34,900 | 129,975 | 163,000 | 180,921 | 214,000 | 755,000 |
| GrLivArea (sq ft) | 334 | 1,130 | 1,464 | 1,515 | 1,777 | 5,642 |
| OverallQual | 1 | 5 | 6 | 6.1 | 7 | 10 |
| YearBuilt | 1872 | 1954 | 1973 | 1971 | 2000 | 2010 |
| GarageCars | 0 | 1 | 2 | 1.77 | 2 | 4 |
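One way to get from a single wide row of `<variable>_<statistic>` columns to a layout with one row per variable, like the table above, is the `".value"` option of pivot_longer(). This is a sketch on a small hypothetical input (`wide` and its numbers are made up), not necessarily how the table above was produced:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide summary using the same "<Variable>_<stat>" naming pattern
wide <- tibble(
  SalePrice_min = 100, SalePrice_max = 300,
  GrLivArea_min = 10,  GrLivArea_max = 30
)

# The part before "_" becomes the Variable column;
# ".value" turns the part after "_" into output column names
by_variable <- wide %>%
  pivot_longer(
    everything(),
    names_to  = c("Variable", ".value"),
    names_sep = "_"
  )

print(by_variable)
```

This relies on the variable names themselves containing no underscore, which holds for the five variables used here.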
A scatterplot with a fitted regression line shows a clear positive correlation: larger homes generally sell for higher prices. However, a few outliers exist (very large homes with lower prices), highlighting the importance of considering additional factors like location or quality.
ggplot(ames_clean, aes(GrLivArea, SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Relationship Between Living Area and Sale Price",
    x = "Living Area (sq ft)",
    y = "Sale Price ($)"
  )
## `geom_smooth()` using formula = 'y ~ x'
Observation: GrLivArea has a strong positive linear relationship with SalePrice, though outliers suggest caution in using size alone for predictions.
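The effect of a few large, low-priced homes on the size and price relationship can be checked by comparing the correlation with and without them. The sketch below uses made-up toy data, and the cutoffs (4,000 sq ft and $300,000) are illustrative assumptions rather than values taken from this analysis:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: seven typical homes plus one very large, low-priced home
toy <- tibble(
  GrLivArea = c(800, 1000, 1200, 1500, 1800, 2000, 2500, 4700),
  SalePrice = c(100000, 120000, 150000, 180000, 210000, 240000, 300000, 180000)
)

# Illustrative rule for "very large but cheap" (assumed thresholds)
is_outlier <- toy$GrLivArea > 4000 & toy$SalePrice < 300000

flagged <- toy %>% filter(is_outlier)
trimmed <- toy %>% filter(!is_outlier)

# Pearson correlation with and without the flagged homes
cor_all     <- cor(toy$GrLivArea, toy$SalePrice)
cor_trimmed <- cor(trimmed$GrLivArea, trimmed$SalePrice)

print(c(all = cor_all, trimmed = cor_trimmed))
```

A large gap between the two correlations would indicate that the fitted line is sensitive to those few homes.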
Aggregating by neighborhood shows clear differences: NoRidge, NridgHt, and StoneBr tend to have the highest average sale prices, whereas older or less central neighborhoods like NAmes and OldTown tend to have lower average prices.
ames_clean %>%
  group_by(Neighborhood) %>%
  summarize(AveragePrice = mean(SalePrice)) %>%
  ggplot(aes(x = reorder(Neighborhood, AveragePrice), y = AveragePrice)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Average Sale Price by Neighborhood",
    x = "Neighborhood",
    y = "Average Sale Price ($)"
  )
Observation: Location is a strong predictor of price. Even homes of similar size and quality can vary greatly in price depending on the neighborhood.
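Because a mean can be pulled upward by one expensive sale, especially in a neighborhood with few homes, it may be worth reporting the median and the number of sales next to it. A minimal sketch on hypothetical data (the neighborhood labels "A", "B", "C" and all prices are made up):

```r
library(dplyr)
library(tibble)

# Hypothetical sales: neighborhood "A" contains one unusually expensive home
toy <- tibble(
  Neighborhood = c("A", "A", "A", "B", "B", "C"),
  SalePrice    = c(100000, 120000, 500000, 200000, 220000, 300000)
)

# Count, mean, and median per neighborhood, ordered by median price
by_nbhd <- toy %>%
  group_by(Neighborhood) %>%
  summarise(
    n            = n(),
    mean_price   = mean(SalePrice),
    median_price = median(SalePrice)
  ) %>%
  arrange(desc(median_price))

print(by_nbhd)
```

In the toy data, "A" ranks second by mean but last by median, which is the kind of reordering this check is meant to reveal.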
Boxplots of SalePrice vs OverallQual indicate a near-linear increase in median price as quality improves. Houses rated 9–10 command substantially higher prices than lower-rated homes.
ggplot(ames_clean, aes(x = as.factor(OverallQual), y = SalePrice)) +
  geom_boxplot(fill = "lightgreen") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price by Overall Quality",
    x = "Overall Quality (1 = Poor, 10 = Excellent)",
    y = "Sale Price ($)"
  )
Insight: Quality is a key factor and interacts with both size and neighborhood for predictive modeling.
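Whether the increase in median price is close to linear can be examined numerically by computing the median per quality level and the step between adjacent levels. The sketch below uses made-up prices for four quality levels; roughly equal steps would support a linear reading, while steadily growing steps would point to curvature:

```r
library(dplyr)
library(tibble)

# Hypothetical prices for quality levels 5 to 8 (three homes per level)
toy <- tibble(
  OverallQual = rep(5:8, each = 3),
  SalePrice   = c(120000, 130000, 140000,
                  150000, 160000, 175000,
                  190000, 205000, 220000,
                  260000, 275000, 300000)
)

# Median price at each quality level
medians <- toy %>%
  group_by(OverallQual) %>%
  summarise(median_price = median(SalePrice))

# Change in median price from one quality level to the next
steps <- diff(medians$median_price)

print(medians)
print(steps)
```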
Homes with more garage capacity tend to have higher prices, though this effect is smaller than living area or quality. Some homes without garages still sell for high prices if other features (size, location, quality) are strong.
ggplot(ames_clean, aes(x = factor(GarageCars), y = SalePrice)) +
  geom_boxplot(fill = "red") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price by Garage Size",
    x = "Garage Capacity (Cars)",
    y = "Sale Price ($)"
  )
Observation: Garage size is moderately correlated with price, acting as a secondary factor.
Multiple approaches were explored: scatterplots, boxplots, and aggregations to identify relationships.
Outliers and extreme values were investigated to avoid misleading conclusions.
Regression models with combinations of variables (size, quality, location, garage) were tested to confirm intuitions.
Some unexpected trends (e.g., very high-priced small homes) suggested interactions between neighborhood and other features, highlighting the complexity of real estate pricing.
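The model comparison described above could look something like the following sketch, which fits a size-only model and a fuller model and compares adjusted R². The data here are simulated with arbitrary, assumed coefficients purely so the example runs on its own; the numbers say nothing about the Ames results, and the exact models tested in this project are not shown in the report:

```r
set.seed(42)
n <- 200

# Simulated stand-in data (assumed effect sizes, not estimates from Ames)
sim <- data.frame(
  GrLivArea    = runif(n, 800, 3000),
  OverallQual  = sample(3:9, n, replace = TRUE),
  Neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  GarageCars   = sample(0:3, n, replace = TRUE)
)
sim$SalePrice <- 20000 +
  60 * sim$GrLivArea +
  20000 * sim$OverallQual +
  15000 * (sim$Neighborhood == "C") +
  5000 * sim$GarageCars +
  rnorm(n, sd = 15000)

# Size-only model versus a model with all four predictors
fit_size <- lm(SalePrice ~ GrLivArea, data = sim)
fit_full <- lm(SalePrice ~ GrLivArea + OverallQual + Neighborhood + GarageCars,
               data = sim)

# Adjusted R-squared penalizes the extra parameters of the fuller model
adj_r2 <- c(
  size_only = summary(fit_size)$adj.r.squared,
  full      = summary(fit_full)$adj.r.squared
)
print(adj_r2)
```

On the real data, the same two calls to lm() with `data = ames_clean` would give the corresponding comparison.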
Size (GrLivArea) strongly influences price.
Location (Neighborhood) has a major impact.
Quality (OverallQual) shows a near-linear relationship with price.
Garage size is a secondary but noticeable factor.
Outliers indicate that no single factor fully predicts price.
Each section has visualizations and summaries supporting these conclusions, with systematic exploration to verify trends.
Conclusions: Sale price is most strongly affected by living area, neighborhood, and overall quality, with garage size playing a minor role. Outliers suggest interactions between variables.
Future Questions:
How do remodeling and age of the home affect pricing trends?
Could a predictive model combining all significant features accurately forecast future sales?
How do seasonal factors (month/year sold) interact with other attributes?



