MCEN3030 Project1 Wine-Chemistry HZ4jcSg

This document outlines a project for MCEN 3030 at the University of Colorado, focusing on using linear regression to analyze a dataset of 1599 red wines to predict wine quality based on 11 quantitative variables. The project involves performing a Variance Inflation Factor Analysis to address multicollinearity, executing linear regression to determine coefficients, and evaluating the model's predictive accuracy through residual analysis. Deliverables include tables of VIF values, regression coefficients, residual plots, and code used in the analysis.

Linear Regression to Predict Wine Quality © 2025 University of Colorado

MCEN 3030 Summer 2025

In an engineering design process we often have quantitative information
about scientific/engineering parameters. Examples: the modulus of
elasticity of the foam used on a steering wheel and the grip size of
that wheel. But what modulus do humans prefer? What size? Conservation
of mass/momentum/energy/etc. can’t tell us! In this project we will use
modeling tools to help us connect qualitative perception and
quantitative measurements.[1]

[1] Our department now offers a “Design of Coffee”, a “Design of
Chocolate”, and a “Design of Beer” course – I think this project is
really relevant!

I found an interesting data set, not about steering wheels but about
wine “quality”. Wine-making is an ancient technology, but we have
modern tools to help us understand it. Why has the wine from a
particular region been so well-regarded for hundreds of years? Maybe it
has something to do with the acidity, or the amount of sugar, or the
alcohol level – all of these variables are included in the data set we
will use in this project.

Included are 11 quantitative measurements as well as an assessment of
“quality” for 1599 red wines. We will focus on this “quality” as an
output value,[2] and it presumably is a function of the other 11
variables. Wine is nice if it is at least a little acidic and obviously
most folks say the alcohol is a positive attribute. Sulfur Dioxide helps
with preservation and with activating the yeast, but some people insist
it negatively impacts the flavor.[3] Can we have too much acidity, too
much alcohol, too much sulfur? Too little? For sure. So what is the
recipe for the best wine? We shall see!

[2] The min value of quality in this data set is 3, the max is 8.
[3] Many people react negatively to sulphates too, e.g. they get
headaches.

Variables/column labels:
1. Fixed Acidity
2. Volatile Acidity
3. Citric Acid
4. Residual Sugar
5. Chlorides
6. Free Sulfur Dioxide
7. Total Sulfur Dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
12. Quality

The central aspect of this project is performing a linear regression
on the data set. We will assume each of the 11 “inputs” (1-11 in the
variable list above) has a linear effect on the quality (Q, variable 12)
such that we can write

    Q = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + ...    (1)

At least that is going to be the starting point of our discussion... it
might be the case that not all of these variables are independent, so
let’s think about that first.

Step 1

(Step 0 is to finish the homework problems! You should be able to carry
over your linear regression code here, and add to it.)

A concern: In this data set, citric acid and pH are two separate
variables, but are we really able to adjust them independently? What
about free sulfur dioxide and total sulfur dioxide, is there a
correlation? We want to make sure we are correctly characterizing what
we have control over, and if we can’t actually control these levers
independently, our model is not going to be meaningful.

Start by performing a Variance Inflation Factor Analysis on the
input variables.[4] We will say our threshold for concern is 5: If any
VIF_n > 5, we will simply toss out the variable x_n with the highest
VIF, and then re-run the analysis. You do not have to write code that
automatically does this in one click... you can calculate, interpret the
results, recalculate, ..., and go from there. Report the VIF scores in the
initial analysis and the follow-up analysis/analyses in a table, and
comment on which variables are removed.

[4] See the reading on Canvas.
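
One VIF pass might be sketched in MATLAB as follows. This assumes the usual definition VIF_n = 1/(1 - R_n^2), where R_n^2 comes from regressing x_n on all the other inputs plus an offset; check it against the Canvas reading, which may present it differently. The matrix X is synthetic stand-in data (not the wine set), and X, names, and vif are placeholder names.

```matlab
% Synthetic stand-in for the input columns (NOT the wine data):
% column 3 is built to be nearly a copy of column 1.
rng(1);                               % repeatable random numbers
M  = 200;                             % number of samples
x1 = randn(M,1);
x2 = randn(M,1);
x3 = x1 + 0.1*randn(M,1);             % strongly correlated with x1
X  = [x1, x2, x3];
names = {'x1','x2','x3'};             % placeholder labels

% One VIF pass: regress each column on the others (plus an offset).
N   = size(X,2);
vif = zeros(1,N);
for n = 1:N
    y      = X(:,n);                  % column being tested
    others = X(:,[1:n-1, n+1:N]);     % all remaining columns
    Z      = [ones(M,1), others];     % offset + remaining inputs
    b      = Z \ y;                   % least-squares fit
    SSres  = sum((y - Z*b).^2);
    SStot  = sum((y - mean(y)).^2);
    R2     = 1 - SSres/SStot;
    vif(n) = 1/(1 - R2);
end

% Report, then flag the largest VIF if it exceeds the threshold of 5.
for n = 1:N
    fprintf('%-4s VIF = %6.2f\n', names{n}, vif(n));
end
[worst, idx] = max(vif);
if worst > 5
    fprintf('Remove %s and re-run on the remaining columns.\n', names{idx});
end
```

In this made-up example x1 and x3 should both come out well above 5 and x2 close to 1. Only the single highest one is dropped before the next pass.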

Step 2

Now that we have eliminated “multicollinearity” from the data set,
let’s go ahead and perform the linear regression. Determine the
coefficients A = [a_0, a_1, ..., a_N]^T in Eq. (1) and report them in a
table. Note that our model includes some baseline offset a_0 that is not
associated with any variable.[5] If a variable was removed in Step 1,
enter “–” for it in the table (remove the variable before running the
analysis). Also report the R^2-value for the fit in the text.

[5] Make sure to be careful about your variable indexing! We include an
a_0 here, but by MATLAB’s convention, you will likely have A(1) = a_0.
And if we have used VIF to eliminate a variable, that is going to throw
off the indexing as well!
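
A minimal sketch of the fit and the index bookkeeping, assuming the inputs are the columns of a matrix X and the quality scores are a column vector Q. The data are synthetic, generated from known coefficients so the result can be checked; the choice of which input to drop is purely illustrative, and keep, Xkeep, and Afull are hypothetical names.

```matlab
% Synthetic data with known coefficients (NOT the wine data).
rng(2);
M = 500;
X = randn(M,4);                        % four made-up inputs
Atrue = [5; 0.8; -0.3; 0; 0.5];        % [a0; a1; a2; a3; a4]
Q = [ones(M,1), X]*Atrue + 0.2*randn(M,1);

% Suppose the VIF step removed input 3 (purely for illustration).
keep  = [1 2 4];                       % indices of retained inputs
Xkeep = X(:,keep);

% Design matrix: a column of ones for a0, then the retained inputs.
Z = [ones(M,1), Xkeep];
A = Z \ Q;                             % equivalent, up to round-off,
                                       % to inv(Z'*Z)*(Z'*Q)

% R^2 of the fit.
Qfit  = Z*A;
SSres = sum((Q - Qfit).^2);
SStot = sum((Q - mean(Q)).^2);
R2    = 1 - SSres/SStot;

% Map back to the original numbering: A(1) is a0, and A(k+1) belongs
% to original input keep(k). Removed inputs stay NaN ("–" in the table).
Afull = nan(size(X,2)+1, 1);
Afull(1)      = A(1);
Afull(keep+1) = A(2:end);
disp(Afull.');
fprintf('R^2 = %.3f\n', R2);
```

Keeping a full-length vector with NaN placeholders is one way to keep row k+1 of the table tied to original variable k, whatever was removed.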
Step 3

Is it a good predictor? What if a wine was predicted to have a quality
of 7.4, yet it actually was 4? Someone is going to be mad![6] We can
plot the residual e_i for each wine, labeling each based on its order
in the list (first row is called wine “1”, second is “2”, etc.) to get an
idea of how the model has done. Comment on whether the data
looks appropriately noisy around e_i = 0, or if it seems we have
missed a dependence.[7] Discuss: What is the worst overprediction
(e.g. maybe wine 97 was predicted to be 7.3, and is actually a 3)?
What is the worst underprediction? How many are overpredicted
and underpredicted by a quality of 1.5 or more?

[6] And someone is going to try to charge a lot of money for mediocre
wine “created by scientists to have the optimum chemistry”.
[7] The evidence: patterns in e_i.
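
The residual checks could be programmed along these lines. The sign convention is an assumption: here e_i = (predicted) - (actual), so a positive e_i is an overprediction; if your notes define e_i the other way, swap max/min and the inequalities. The numbers are made up for illustration, and all variable names are placeholders.

```matlab
% Made-up actual and predicted qualities (NOT the wine data).
Q    = [5; 6; 3; 7; 5; 4; 6; 8];
Qfit = [5.4; 5.1; 5.0; 6.8; 5.2; 5.7; 4.3; 7.1];

% Assumed sign convention: e > 0 means the model overpredicted.
e = Qfit - Q;

% Worst cases and counts, found programmatically.
[eOver,  iOver]  = max(e);             % largest overprediction
[eUnder, iUnder] = min(e);             % largest underprediction
nOver  = sum(e >=  1.5);
nUnder = sum(e <= -1.5);

fprintf('Worst overprediction:  wine %d, e = %.2f\n', iOver,  eOver);
fprintf('Worst underprediction: wine %d, e = %.2f\n', iUnder, eUnder);
fprintf('Overpredicted by >= 1.5:  %d\n', nOver);
fprintf('Underpredicted by >= 1.5: %d\n', nUnder);

% Residual vs. wine number (row order in the list).
wineNo = (1:numel(e)).';
plot(wineNo, e, 'o'); hold on
plot([1 numel(e)], [0 0], 'k--');      % reference line at e = 0
hold off
xlabel('wine number i'); ylabel('e_i');
```

With these eight made-up wines the worst overprediction is wine 3 and the worst underprediction is wine 7.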

Deliverables

• A table that includes the VIF values for a first, and possibly a
second, third, fourth, ..., VIF analysis. In your report, comment on
which variable(s) you remove, if any, and speculate based on your
chemistry knowledge if that removal is reasonable.

• A table that reports the fits for A = [a_0, ...]^T and a comment in the
report about the R^2-value.

• A plot of the residual for each wine, and comments on the
distribution around e_i = 0.

• A plot of the histogram[8] of e_i. Additionally, write a small piece of
code that determines the worst overprediction, worst underprediction,
and the number that are overpredicted by 1.5 or more, and
the number that are underpredicted by 1.5 or more. Program it,
not a manual search!

• All code used in this problem should be included as an appendix.
Your linear regression code may use the built-in inverse inv or
under-divide \.

[8] histogram(e,edges) will do the job, where edges is a vector of the
left side of each “bin”: edges = -3 : 0.5 : 3 should be good.
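
For the histogram deliverable, a short sketch using the suggested bin edges. The residual vector e here is random filler, not output from the wine fit, and counts is a placeholder name.

```matlab
% Made-up residuals (NOT from the wine fit).
rng(3);
e = 0.6*randn(300,1);

edges = -3:0.5:3;                      % bin edges from -3 to 3 in steps of 0.5
histogram(e, edges);
xlabel('residual e_i'); ylabel('number of wines');

% The same bin counts without a plot, e.g. for a quick check.
counts = histcounts(e, edges);
```

Any residual outside the range -3 to 3 is left out of both the plot and counts, so it is worth comparing sum(counts) with numel(e).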
