Model Selection Techniques for Concrete Flow

Ralphie is studying concrete properties for a school project. Using a dataset on concrete ingredients and flow, a full model was created. Backwards selection identified water, cement, ash, coarse aggregate, and fine aggregate as the most important predictors. Checks found the reduced model was not significantly different from the full model, balancing explanatory power and simplicity. Criteria like AIC, BIC, and adjusted R² were calculated for predictor subsets to evaluate the best model, though criteria may not always agree on the optimal number of predictors.


C1M6_peer_reviewed

June 12, 2021

1 Module 6: Peer Reviewed Assignment

1.0.1 Outline:

The objectives for this assignment:


1. Apply the processes of model selection with real datasets.
2. Understand why and how some problems are simpler to solve with some forms of model
selection, and others are more difficult.
3. Be able to explain the balance between model power and simplicity.
4. Observe the difference between different model selection criterion.
General tips:
1. Read the questions carefully to understand what is being asked.
2. This work will be reviewed by another human, so make sure that you are clear and concise
in what your explanations and answers.
[1]: # This cell loads in the necessary packages
library(tidyverse)
library(leaps)
library(ggplot2)

── Attaching packages ─────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.4
✔ tibble  3.0.1     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ──────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

1.1 Problem 1: We Need Concrete Evidence!

Ralphie is studying to become a civil engineer. That means she has to know everything about concrete, including what ingredients go in it and how they affect the concrete's properties. She's currently writing up a project about concrete flow, and has asked you to help her figure out which ingredients are the most important. Let's use our new model selection techniques to help Ralphie out!

Data Source: Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," Cement and Concrete Composites, Vol. 29, No. 6, 474-480, 2007.
[2]: concrete.data = read.csv("[Link]")

concrete.data = concrete.data[, c(-1, -9, -11)]

names(concrete.data) = c("cement", "slag", "ash", "water", "sp", "coarse.agg",
                         "fine.agg", "flow")

head(concrete.data)

A data.frame: 6 × 8

  cement  slag   ash water    sp coarse.agg fine.agg  flow
   <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl> <dbl>
1    273    82   105   210     9        904      680  62.0
2    163   149   191   180    12        843      746  20.0
3    162   148   191   179    16        840      743  20.0
4    162   148   190   179    19        838      741  21.5
5    154   112   144   220    10        923      658  64.0
6    147    89   115   202     9        860      829  55.0

1.1.1 1. (a) Initial Inspections

Sometimes, the best way to start is to just jump in and mess around with the model. So let’s do
that. Create a linear model with flow as the response and all other columns as predictors.
Just by looking at the summary for your model, is there reason to believe that our model could be
simpler?
[3]: # Your Code Here
lm1 = lm(flow ~ ., data=concrete.data)
summary(lm1)

Call:
lm(formula = flow ~ ., data = concrete.data)

Residuals:
Min 1Q Median 3Q Max
-30.880 -10.428 1.815 9.601 22.953

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) -252.87467  350.06649  -0.722   0.4718
cement         0.05364    0.11236   0.477   0.6342
slag          -0.00569    0.15638  -0.036   0.9710
ash            0.06115    0.11402   0.536   0.5930
water          0.73180    0.35282   2.074   0.0408 *
sp             0.29833    0.66263   0.450   0.6536
coarse.agg     0.07366    0.13510   0.545   0.5869
fine.agg       0.09402    0.14191   0.663   0.5092
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.84 on 95 degrees of freedom
Multiple R-squared: 0.5022, Adjusted R-squared: 0.4656
F-statistic: 13.69 on 7 and 95 DF, p-value: 3.915e-12

Among the seven slope coefficients, only that of water is statistically significant (α = 0.05), so there is reason to believe the model could be simpler.

1.1.2 1. (b) Backwards Selection

Our model has 7 predictors. That is not too many, so we can use backwards selection to narrow
them down to the most impactful.
Perform backwards selection on your model. You don’t have to automate the backwards selection
process.
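One round of backward elimination can be done by hand with base R's drop1(), which refits the model leaving each predictor out in turn. A minimal sketch, assuming lm1 from part (a); the variable name lm.reduced is illustrative:

```r
# One manual step of backward selection:
# drop1() reports, for each predictor, the fit of the model without it.
step1 = drop1(lm1, test = "F")
step1
# Remove the predictor whose omission hurts the fit least (slag here,
# per its near-zero t value), then refit and repeat on the smaller model.
lm.reduced = update(lm1, . ~ . - slag)
```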
[4]: # Your Code Here

library(MASS)
stepAIC(lm1, direction = "backward")

Attaching package: ‘MASS’

The following object is masked from ‘package:dplyr’:

select

Start: AIC=533.56
flow ~ cement + slag + ash + water + sp + coarse.agg + fine.agg

             Df Sum of Sq   RSS    AIC
- slag        1      0.22 15672 531.56
- sp          1     33.44 15705 531.78
- cement      1     37.60 15709 531.81
- ash         1     47.45 15719 531.87
- coarse.agg  1     49.04 15720 531.88
- fine.agg    1     72.40 15744 532.03
<none>                    15671 533.56
- water       1    709.69 16381 536.12

Step: AIC=531.56
flow ~ cement + ash + water + sp + coarse.agg + fine.agg

             Df Sum of Sq   RSS    AIC
- sp          1      62.1 15734 529.97
<none>                    15672 531.56
- cement      1    1244.7 16916 537.43
- coarse.agg  1    1679.4 17351 540.05
- ash         1    1759.2 17431 540.52
- fine.agg    1    2292.3 17964 543.62
- water       1   10877.0 26548 583.86

Step: AIC=529.97
flow ~ cement + ash + water + coarse.agg + fine.agg

             Df Sum of Sq   RSS    AIC
<none>                    15734 529.97
- cement      1    1193.1 16927 535.50
- coarse.agg  1    1678.8 17412 538.41
- ash         1    1746.5 17480 538.81
- fine.agg    1    2237.1 17971 541.66
- water       1   11947.4 27681 586.16

Call:
lm(formula = flow ~ cement + ash + water + coarse.agg + fine.agg,
    data = concrete.data)

Coefficients:
(Intercept)      cement         ash       water  coarse.agg    fine.agg
 -249.50866     0.05366     0.06101     0.72313     0.07291     0.09554

1.1.3 1. (c) Objection!

Stop right there! Think about what you just did. You just removed the "worst" features from your model. But we know that a model becomes less powerful when we remove features, so we should check that it is still about as powerful as the original model. Use a test to check whether the model at the end of backward selection is significantly different from the model with all the features.

Describe why we want to balance explanatory power with simplicity.

[5]: # Your Code Here
lm2 = lm(flow ~ cement + ash + water + coarse.agg + fine.agg, data=concrete.data)
anova(lm2, lm1)

A anova: 2 × 6

  Res.Df      RSS    Df Sum of Sq         F    Pr(>F)
   <dbl>    <dbl> <dbl>     <dbl>     <dbl>     <dbl>
1     97 15733.53    NA        NA        NA        NA
2     95 15671.26     2  62.27123 0.1887457 0.8283068
Since the p-value (0.828) is far above α = 0.05, we fail to reject the null hypothesis: the model at the end of backward selection is not significantly different from the model with all the features. We want to balance explanatory power with simplicity because a simpler model is easier to interpret and less prone to overfitting, and here the dropped predictors were contributing essentially no additional explanatory power.
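The F statistic in the table can be reproduced by hand from the two residual sums of squares, a sketch of the partial F-test that anova() performs (RSS and degrees-of-freedom values taken from the output above):

```r
# Partial F-test by hand: reduced model (df = 97) vs full model (df = 95).
rss.reduced = 15733.53
rss.full    = 15671.26
f.stat = ((rss.reduced - rss.full) / (97 - 95)) / (rss.full / 95)
p.val  = pf(f.stat, df1 = 2, df2 = 95, lower.tail = FALSE)
round(f.stat, 4)   # 0.1887, matching the anova table
round(p.val, 4)    # 0.8283
```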

1.1.4 1. (d) Checking our Model

Ralphie is nervous about her project and wants to make sure our model is correct. She's found a function called regsubsets() in the leaps package which allows us to see which subsets of predictors produce the best combinations. Ralphie wrote up the code for you and the documentation for the function can be found here. For each of the subsets of features, calculate the AIC, BIC and adjusted R². Plot the results of each criterion, with the score on the y-axis and the number of features on the x-axis.

Do all of the criteria agree on how many features make the best model? Explain why the criteria will or will not always agree on the best model.

Hint: It may help to look at the attributes stored within the regsubsets summary using names(rs).
[16]: reg = regsubsets(flow ~ cement+slag+ash+water+sp+coarse.agg+fine.agg,
          data=concrete.data)

rs = summary(reg)
rs$which
names(rs)

# Your Code Here
A matrix: 7 × 8 of type lgl

  (Intercept) cement  slag   ash water    sp coarse.agg fine.agg
1        TRUE  FALSE FALSE FALSE  TRUE FALSE      FALSE    FALSE
2        TRUE  FALSE  TRUE FALSE  TRUE FALSE      FALSE    FALSE
3        TRUE  FALSE  TRUE FALSE  TRUE FALSE      FALSE     TRUE
4        TRUE   TRUE  TRUE FALSE  TRUE FALSE      FALSE     TRUE
5        TRUE  FALSE  TRUE  TRUE  TRUE FALSE       TRUE     TRUE
6        TRUE   TRUE FALSE  TRUE  TRUE  TRUE       TRUE     TRUE
7        TRUE   TRUE  TRUE  TRUE  TRUE  TRUE       TRUE     TRUE

1. 'which' 2. 'rsq' 3. 'rss' 4. 'adjr2' 5. 'cp' 6. 'bic' 7. 'outmat' 8. 'obj'

[18]: reg.aic = numeric()

reg.aic[1] = AIC(lm(flow ~ water, data=concrete.data))
reg.aic[2] = AIC(lm(flow ~ slag+water, data=concrete.data))
reg.aic[3] = AIC(lm(flow ~ slag+water+fine.agg, data=concrete.data))
reg.aic[4] = AIC(lm(flow ~ cement+slag+water+fine.agg, data=concrete.data))
reg.aic[5] = AIC(lm(flow ~ slag+ash+water+coarse.agg+fine.agg, data=concrete.data))
reg.aic[6] = AIC(lm(flow ~ cement+ash+water+sp+coarse.agg+fine.agg, data=concrete.data))
reg.aic[7] = AIC(lm(flow ~ ., data=concrete.data))

reg.aic
which.min(reg.aic)
plot(reg.aic, type="l")
points(2, reg.aic[2], col="red", lwd=6)

1. 835.194120716668 2. 819.179876719235 3. 820.356098530211 4. 822.196308265181 5. 824.141642474815 6. 825.862829524366 7. 827.861393942093

2
[10]: reg.bic = rs$bic
reg.bic
which.min(reg.bic)
plot(reg, scale="bic")
plot(reg.bic, type="l")
points(2, reg.bic[2], col="red", lwd=6)

1. -43.2523145621918 2. -56.6318295713953 3. -52.8208787721899 4. -48.3459400489902 5. -43.7658768511257 6. -39.4099608133456 7. -34.7766674073893

2
[11]: reg.r = rs$adjr2
reg.r
which.max(reg.r)
plot(reg, scale="adjr2")
plot(reg.r, type="l")
points(2, reg.r[2], col="red", lwd=6)

1. 0.393510477011073 2. 0.48573319237937 3. 0.484676581258887 4. 0.480225163651389 5. 0.475145293315376 6. 0.47111165564709 7. 0.465551858942537

2
All three criteria select the two-predictor model using slag and water: AIC and BIC are minimized, and adjusted R² is maximized, at two predictors. The criteria agree here, but they will not always agree, because each penalizes complexity differently: for a model with k parameters fit to n observations, AIC adds a penalty of 2k while BIC adds k·log(n), so BIC punishes extra predictors more heavily (here log(103) ≈ 4.6 > 2) and can prefer a smaller model than AIC or adjusted R².
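Since AIC is not among the attributes the regsubsets summary stores, both criteria can also be computed directly from rs$rss. A minimal sketch, assuming rs and concrete.data from the cells above; the penalty terms use the standard Gaussian-likelihood forms, which differ from R's AIC()/BIC() only by a constant that does not change the ranking:

```r
# Compute AIC- and BIC-style scores for the best subset of each size.
# rs$rss holds the residual sum of squares of the best model of each size.
n = nrow(concrete.data)              # 103 observations
k = 1:7                              # number of predictors per subset
fit = n * log(rs$rss / n)            # shared goodness-of-fit term
aic.scores = fit + 2 * (k + 2)       # AIC penalty: 2 per estimated parameter
bic.scores = fit + (k + 2) * log(n)  # BIC penalty: log(n) per parameter
which.min(aic.scores)                # subset size preferred by AIC
which.min(bic.scores)                # subset size preferred by BIC
```

On this dataset both minima land at two predictors, matching the plots above; with a different RSS profile the two can disagree.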
