Name: Niño James A. Tan

Section: BSCpE 3-G

Machine Exercise #2

1. Introduction to Exploratory Data Analysis (EDA)


Before building a machine learning model, you must first properly understand the data-set you are dealing with. This process is called Exploratory Data Analysis (EDA).

You can find python notebooks that explain this process:

Intro to Exploratory data analysis (EDA) in Python


Detailed exploratory data analysis with python
Exploratory Data Analysis with Pandas

Take time to read and understand the given examples above. Complete the python notebook [Link].

a. Import the dataset [Link] using pandas.


In [146]:

# importing packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [147]:
pd.read_csv('Advertising.csv')

Out[147]:

Unnamed: 0 TV radio newspaper sales

0 1 230.1 37.8 69.2 22.1

1 2 44.5 39.3 45.1 10.4

2 3 17.2 45.9 69.3 9.3

3 4 151.5 41.3 58.5 18.5

4 5 180.8 10.8 58.4 12.9

... ... ... ... ... ...

195 196 38.2 3.7 13.8 7.6

196 197 94.2 4.9 8.1 9.7

197 198 177.0 9.3 6.4 12.8

198 199 283.6 42.0 66.2 25.5

199 200 232.1 8.6 8.7 13.4

200 rows × 5 columns

b. Use pandas to save the data into a dataframe.


In [148]:

df = pd.read_csv('Advertising.csv')

c. Use pandas to know the column names and the number of entries (or samples).
In [149]:
print(df.columns)
print(df.index)
print(df.shape)
print(df.size)

Index(['Unnamed: 0', 'TV', 'radio', 'newspaper', 'sales'], dtype='object')


RangeIndex(start=0, stop=200, step=1)
(200, 5)
1000

d. Use pandas to show the first and last 5 entries of the data frame.
In [150]:
df.head()
Out[150]:

Unnamed: 0 TV radio newspaper sales

0 1 230.1 37.8 69.2 22.1

1 2 44.5 39.3 45.1 10.4

2 3 17.2 45.9 69.3 9.3

3 4 151.5 41.3 58.5 18.5

4 5 180.8 10.8 58.4 12.9

In [151]:
df.tail()
Out[151]:

Unnamed: 0 TV radio newspaper sales

195 196 38.2 3.7 13.8 7.6

196 197 94.2 4.9 8.1 9.7

197 198 177.0 9.3 6.4 12.8

198 199 283.6 42.0 66.2 25.5

199 200 232.1 8.6 8.7 13.4

e. Use pandas to know the statistics of each column.


In [152]:
df.describe()
Out[152]:

Unnamed: 0 TV radio newspaper sales

count 200.000000 200.000000 200.000000 200.000000 200.000000


mean 100.500000 147.042500 23.264000 30.554000 14.022500
std 57.879185 85.854236 14.846809 21.778621 5.217457

min 1.000000 0.700000 0.000000 0.300000 1.600000

25% 50.750000 74.375000 9.975000 12.750000 10.375000

50% 100.500000 149.750000 22.900000 25.750000 12.900000

75% 150.250000 218.825000 36.525000 45.100000 17.400000

max 200.000000 296.400000 49.600000 114.000000 27.000000

f. What are the types of the data?


In [153]:
print(df.dtypes)

Unnamed: 0 int64
TV float64
radio float64
newspaper float64
sales float64
dtype: object

g. Sometimes data-sets contain columns that are not needed and you need to drop them from the data frame. For example purposes, drop the Unnamed: 0 column of the Advertising.csv data-set.
In [154]:
df.drop(columns=["Unnamed: 0"], inplace=True)
print(df.columns)

Index(['TV', 'radio', 'newspaper', 'sales'], dtype='object')

h. Sometimes you need to rename columns, for example those with confusing names, spaces, or very long names. For example purposes, rename the column TV to television.
In [155]:
df.rename(columns={"TV": "television"}, inplace=True)
print(df.columns)

Index(['television', 'radio', 'newspaper', 'sales'], dtype='object')

i. Remove rows with duplicates (if there are any). See reference python
notebooks above for example on how to do this.
In [156]:
print(f'Before: {df.shape}')
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
print(f'After: {df.shape}')

Before: (200, 4)
After: (200, 4)

j. Check if there are 'null' or 'missing' values then drop these rows.
In [157]:
hasNA = df.isnull().values.any()
if (hasNA):
    print('DataFrame contains NA values')
    print(df.isnull().sum())
    print(f'Before: {df.shape}')
    df = df.dropna()
    df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
    print(f'After: {df.shape}')
else:
    print('DataFrame does not contain NA values')

DataFrame does not contain NA values

k. Detect if there are outliers and remove these rows.


In [158]:

from scipy import stats

print(f'Before: {df.shape}')
z_scores = np.abs(stats.zscore(df))
df = df[(z_scores < 3).all(axis=1)]
print(f'After: {df.shape}')

Before: (200, 4)
After: (198, 4)
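
For reference, the rule applied above is the common z-score heuristic for outlier removal (an assumed convention, not stated in the exercise): each value is standardized by its column's mean and standard deviation,

$$ z = \frac{x - \mu}{\sigma}, $$

and a row is kept only if $|z| < 3$ for every column.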

l. Use pandas to answer: how many sales values are above 15?
In [159]:
above_15 = df[df["sales"] > 15]["sales"]
print(above_15.count())

74

m. Remove the 10th to 40th entry. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains? Hint: you can use df.drop() and df.index[].
In [160]:
# before
df.describe()
Out[160]:

television radio newspaper sales

count 198.000000 198.000000 198.000000 198.000000

mean 146.688384 23.130808 29.777273 13.980808

std 85.443221 14.862111 20.446303 5.196097

min 0.700000 0.000000 0.300000 1.600000

25% 74.800000 9.925000 12.650000 10.325000

50% 149.750000 22.400000 25.600000 12.900000

75% 218.475000 36.325000 44.050000 17.375000

max 293.600000 49.600000 89.400000 27.000000

In [161]:
df = df.drop(df.index[10:41])  # removes rows at positions 10 through 40 (inclusive of 10, exclusive of 41)
# after
df.describe()
Out[161]:

television radio newspaper sales

count 167.000000 167.000000 167.000000 167.000000

mean 142.950299 23.269461 29.920359 13.832335

std 84.686566 15.119551 21.014340 5.227456

min 0.700000 0.000000 0.900000 1.600000

25% 74.250000 9.750000 12.150000 10.350000

50% 141.300000 22.500000 25.600000 12.900000

75% 215.900000 36.900000 44.700000 17.250000

max 293.600000 49.600000 89.400000 27.000000

n. Using the complete dataset, replicate the figure below (do not include
regression line yet). Check the python notebook for the figure.
In [162]:
sales = df['sales']
tv = df['television']
radio = df['radio']
newspaper = df['newspaper']

In [163]:
# Correlation
Y = sales

In [164]:
# linear regression (manual least squares for one predictor)
def linear_regression(X, Y):
    b1 = np.dot(X - X.mean(), Y - Y.mean()) / np.dot(X - X.mean(), X - X.mean())
    bo = Y.mean() - b1 * X.mean()
    Yhat = bo + b1 * X
    return (bo, b1, Yhat)
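
The helper above implements the closed-form least-squares estimates for a single predictor, which for reference are

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$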

In [165]:
# plotting
structure = {
    'columns'    : [tv, radio, newspaper],
    'colors'     : ['blue', 'red', 'green'],
    'line_colors': ['red', 'green', 'blue'],
    'labels'     : ['TV', 'Radio', 'Newspaper'],
}

fig, ax = plt.subplots(figsize=(8, 3.5), ncols=3, nrows=1)

for i in range(0, 3):
    lr = linear_regression(structure['columns'][i], sales)
    ax[i].scatter(structure['columns'][i], sales, c='none',
                  edgecolor=structure['colors'][i], linewidth=0.7, alpha=0.5,
                  label='data')
    ax[i].plot(structure['columns'][i], lr[2], color=structure['line_colors'][i],
               linewidth=2, alpha=0.9,
               label=f'Y = {lr[0]:.4} + {lr[1]:.4} * X')
    ax[i].set_xlabel(structure['labels'][i])
    ax[i].set_ylabel('Sales')
    ax[i].legend()

plt.tight_layout()
plt.show();

2. Understanding the pitfall of overfitting.

Measuring the quality of fit of an ML model is an integral step in assuring that the model will produce meaningful results when used for prediction. Kindly read Chapter 2.2, Assessing Model Accuracy, of the Introduction to Statistical Learning textbook for more details. This item aims to understand the pitfall of overfitting.

Complete the python notebook [Link]. Provide answers to the following questions included in the python notebook.

Instructions
1. Use the code below to generate synthetic data. Plot the data points and the true (or exact, without noise) function.

seed = 20250404
np.random.seed(seed)
N = 25
X = np.linspace(-np.pi, np.pi, N)
# noise
# eps ~ N(0, 0.01)
eps = np.random.normal(0, 0.1, N)
y = 0.25*np.sin(X) + eps
y_true = 0.25*np.sin(X)

In [166]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [167]:
seed = 20250404
np.random.seed(seed)
N = 25
X = np.linspace(-np.pi, np.pi, N)
eps = np.random.normal(0, 0.1, N)
y = 0.25 * np.sin(X) + eps
y_true = 0.25 * np.sin(X)

In [168]:
X = X.reshape(-1, 1)

In [169]:
# Polynomial regression
degree = 5
poly = PolynomialFeatures(degree)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)

# Predict on a finer grid
X_plot = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)
y_plot = model.predict(X_plot_poly)

# Plotting
plt.figure(figsize=(6, 4))
plt.scatter(X, y, color='red')           # noisy data
plt.plot(X_plot, y_plot, color='blue')   # model prediction
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()
2. Perform regression on the entire data-set. You can use statsmodels, scikit-learn, or manually calculate the coefficients using the OLS method.

Fit a simple linear regression (n=1) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=1,2,3) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=4,5,6) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=13,14,15) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=22) and plot. Include exact (no noise) function and data points.
In [170]:
def plot_poly_fit(degrees, X, y, y_true):
    X_plot = np.linspace(-np.pi, np.pi, 300).reshape(-1, 1)
    y_exact = 0.25 * np.sin(X_plot)

    plt.figure(figsize=(6, 4))
    plt.scatter(X, y, color='red', label='Data')
    plt.plot(X_plot, y_exact, 'b--', label='Exact (no noise)')

    for degree in degrees:
        poly = PolynomialFeatures(degree)
        X_poly = poly.fit_transform(X)
        X_plot_poly = poly.transform(X_plot)

        model = LinearRegression()
        model.fit(X_poly, y)
        y_pred = model.predict(X_plot_poly)

        label = f'n={degree}'
        # Show the fitted equation for a single degree-1 fit
        if degree == 1 and len(degrees) == 1:
            intercept = model.intercept_
            slope = model.coef_[1]
            label = f'y={intercept:.4f}+{slope:.4f}x'

        plt.plot(X_plot, y_pred, label=label)

    plt.xlabel("x")
    plt.ylabel("y")
    plt.grid(True)
    plt.legend()
    plt.show()

In [171]:
# Simple linear regression (n=1)
plot_poly_fit([1], X, y, y_true)

# Polynomial degrees 1 to 3
plot_poly_fit([1, 2, 3], X, y, y_true)

# Polynomial degrees 4 to 6
plot_poly_fit([4, 5, 6], X, y, y_true)

# Polynomial degrees 13 to 15
plot_poly_fit([13, 14, 15], X, y, y_true)

# Polynomial degree 22
plot_poly_fit([22], X, y, y_true)
3. Create a function that will fit the data given a degree-k polynomial. Then, for each fitted model, calculate the Residual Sum of Squares (RSS), $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Plot the calculated RSS with the degree of the polynomial on the x-axis.

In [172]:
def compute_rss_for_degrees(X, y, max_degree=25):
    rss_list = []
    degrees = list(range(1, max_degree + 1))

    for k in degrees:
        poly = PolynomialFeatures(k)
        X_poly = poly.fit_transform(X)

        model = LinearRegression()
        model.fit(X_poly, y)

        y_pred = model.predict(X_poly)
        rss = np.sum((y - y_pred) ** 2)
        rss_list.append(rss)

    return degrees, rss_list

In [173]:

def plot_rss(degrees, rss_list):
    plt.figure(figsize=(8, 4))
    plt.plot(degrees, rss_list, marker='o', linestyle='-', color='purple')
    plt.xlabel('Polynomial Degree')
    plt.ylabel('RSS (Residual Sum of Squares)')
    plt.title('RSS vs. Polynomial Degree')
    plt.grid(True)
    plt.show()

In [174]:
degrees, rss_list = compute_rss_for_degrees(X, y, max_degree=25)
plot_rss(degrees, rss_list)

4. Generate new data sets with N = 100. Use the same seed number as above. Create an OLS model with polynomial degree from 1 to 20. For each model fitting, subdivide the data-set into 70% for training and 30% for testing. For each set, calculate the Mean Squared Error (MSE), $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Plot the MSE vs the polynomial order of the fit. See example plot below.
In [175]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [176]:

def evaluate_mse_poly_models(seed, N=100, max_degree=20):
    np.random.seed(seed)
    X = np.linspace(-np.pi, np.pi, N).reshape(-1, 1)
    eps = np.random.normal(0, 0.1, N)
    y = 0.25 * np.sin(X).flatten() + eps

    degrees = list(range(1, max_degree + 1))

    train_mse_list = []
    test_mse_list = []

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=seed)

    for k in degrees:
        poly = PolynomialFeatures(k)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)

        model = LinearRegression()
        model.fit(X_train_poly, y_train)

        y_train_pred = model.predict(X_train_poly)
        y_test_pred = model.predict(X_test_poly)

        train_mse = mean_squared_error(y_train, y_train_pred)
        test_mse = mean_squared_error(y_test, y_test_pred)

        train_mse_list.append(train_mse)
        test_mse_list.append(test_mse)

    return degrees, train_mse_list, test_mse_list

In [177]:
def plot_mse_vs_degree_separate(degrees, train_mse, test_mse):
    fig, axes = plt.subplots(figsize=(8, 3.5), ncols=2, nrows=1)

    # Plot 1: Train and Test MSE
    axes[0].plot(degrees, train_mse, label='train mse', color='red')
    axes[0].plot(degrees, test_mse, label='test mse', color='blue')
    axes[0].set_title("Train/Test MSE vs Polynomial Degree")
    axes[0].set_xlabel("polynomial fit order (n)")
    axes[0].set_ylabel("Mean Squared Error")
    axes[0].legend()
    axes[0].grid(True)

    # Plot 2: Train MSE only
    axes[1].plot(degrees, train_mse, label='train mse', color='red')
    axes[1].set_title("Train MSE vs Polynomial Degree")
    axes[1].set_xlabel("polynomial fit order (n)")
    axes[1].set_ylabel("Mean Squared Error")
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

In [178]:
degrees, train_mse, test_mse = evaluate_mse_poly_models(seed)
plot_mse_vs_degree_separate(degrees, train_mse, test_mse)

3. Model selection demonstrated using the Ordinary Least Squares method.

A synthetic data-set is provided at [Link]. This data was produced using a linear combination of the predictors x1 and x2. The aim of this item is for you to select a good model (only use a Linear Regression model) for the data-set. Kindly read Chapters 3.2 and 3.3 of the textbook for this purpose.

Complete the python notebook [Link].

Instructions
1. Import the data from here. The data-set has two predictors/features, x1 and x2, and one target value, y.

In [179]:
import pandas as pd

In [180]:
df = pd.read_csv('[Link]')

2. Perform an exploratory data analysis for this data-set.

In [181]:

print(df.columns)
print(df.shape)
print(df.count())

Index(['x1', 'x2', 'y'], dtype='object')


(350, 3)
x1 350
x2 350
y 350
dtype: int64

In [182]:
df.head()

Out[182]:

x1 x2 y

0 2.739702 3.770638 -36.038913

1 4.974939 0.386704 -335.187632

2 7.828024 -1.765389 -861.653094

3 3.345500 -2.861464 -81.880837

4 0.417797 -3.360638 59.936199

In [183]:
df.tail()

Out[183]:

x1 x2 y

345 7.151762 -3.724223 -615.910151

346 7.192358 3.395231 -656.397129

347 0.019028 1.637808 -4.833681

348 9.393014 0.500292 -1635.089200

349 6.177953 -1.157211 -464.780475

In [184]:

df.describe()
Out[184]:

x1 x2 y

count 350.000000 350.000000 350.000000

mean 5.104855 -0.148559 -518.999391

std 3.080878 2.919885 588.819525

min 0.019028 -4.992177 -1979.355855

25% 2.341294 -2.451153 -892.857741

50% 5.348355 -0.457099 -368.847553

75% 7.940074 2.476036 4.020817

max 9.954931 4.999886 177.691608

In [185]:
print(f'Before: {df.shape}')
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
print(f'After: {df.shape}')

Before: (350, 3)
After: (350, 3)

In [186]:
hasNA = df.isnull().values.any()
if (hasNA):
    print('DataFrame contains NA values')
    print(df.isnull().sum())
    print(f'Before: {df.shape}')
    df = df.dropna()
    df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
    print(f'After: {df.shape}')
else:
    print('DataFrame does not contain NA values')

DataFrame does not contain NA values


In [187]:

# Outliers
from scipy import stats
import numpy as np

print(f'Before: {df.shape}')
z_scores = np.abs(stats.zscore(df))
df = df[(z_scores < 3).all(axis=1)]
print(f'After: {df.shape}')

Before: (350, 3)
After: (350, 3)

3. Divide the data into training and testing sets.

In [188]:
from sklearn.model_selection import train_test_split

In [189]:
X = df[['x1', 'x2']]
y = df['y']

In [190]:

# Split the dataset into training and testing sets (e.g., 80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [191]:
# Optional: Combine X and y again if needed
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

print(train_data.shape)
print(test_data.shape)

(280, 3)
(70, 3)

4. Fit a simple Ordinary Least Squares (OLS) model to the data: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. Check the Residual Standard Error, Mean Squared Error, $R^2$, and Root Mean Squared Error for the training and testing sets. Give your observations.
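
For reference, the metrics below are computed with the usual definitions (for n observations and p predictors), matching the evaluate() helper used later:

$$ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \mathrm{MSE} = \frac{\mathrm{RSS}}{n}, \quad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}, \quad R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, $$

where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$.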

In [192]:
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

In [193]:

# Add constant for intercept


X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

In [194]:
# Fit OLS model
ols_model = sm.OLS(y_train, X_train_sm).fit()

# Predict
y_train_pred = ols_model.predict(X_train_sm)
y_test_pred = ols_model.predict(X_test_sm)

In [195]:
# --- Evaluation Metrics ---
def evaluate(y_true, y_pred, n, p):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    rss = np.sum((y_true - y_pred) ** 2)
    rse = np.sqrt(rss / (n - p - 1))  # Residual Standard Error
    return mse, rmse, r2, rse

In [196]:

train_metrics = evaluate(y_train, y_train_pred, n=len(y_train), p=X_train.shape[1])


test_metrics = evaluate(y_test, y_test_pred, n=len(y_test), p=X_test.shape[1])

# Display results
print("TRAINING SET:")
print(f"RSE: {train_metrics[3]:.4f} ")
print(f"MSE: {train_metrics[0]:.4f} ")
print(f"RMSE: {train_metrics[1]:.4f} ")
print(f"R²: {train_metrics[2]:.4f} \n ")

print("TESTING SET:")
print(f"RSE: {test_metrics[3]:.4f}")
print(f"MSE: {test_metrics[0]:.4f}")
print(f"RMSE: {test_metrics[1]:.4f}")
print(f"R²: {test_metrics[2]:.4f}")

TRAINING SET:
RSE: 212.3877
MSE: 44625.2359
RMSE: 211.2469
R²: 0.8707

TESTING SET:
RSE: 235.5388
MSE: 53100.8893
RMSE: 230.4363
R²: 0.8468

5. Is at least one of the predictors x1, x2 useful in predicting the response y? See Chapter 3.2.2 on how to
answer this question.

In [197]:

print(ols_model.summary())

OLS Regression Results


==============================================================================
Dep. Variable: y R-squared: 0.871
Model: OLS Adj. R-squared: 0.870
Method: Least Squares F-statistic: 933.1
Date: Mon, 02 Jun 2025 Prob (F-statistic): 8.57e-124
Time: [Link] Log-Likelihood: -1896.2
No. Observations: 280 AIC: 3798.
Df Residuals: 277 BIC: 3809.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 388.7546 24.680 15.752 0.000 340.171 437.338
x1 -178.0123 4.121 -43.198 0.000 -186.124 -169.900
x2 -18.4485 4.377 -4.215 0.000 -27.065 -9.832
==============================================================================
Omnibus: 25.957 Durbin-Watson: 2.034
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.882
Skew: -0.800 Prob(JB): 1.97e-07
Kurtosis: 2.704 Cond. No. 11.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Predictor Coef P-value Significant?

x1 -178.0123 0.000 Yes

x2 -18.4485 0.000 Yes
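
Following Chapter 3.2.2, the question is answered by testing H0: β1 = β2 = 0 with the F-statistic reported in the summary above,

$$ F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}. $$

Here F = 933.1 with Prob(F-statistic) ≈ 8.57e-124, so the null hypothesis is rejected and at least one predictor is useful; the individual t-tests (both p-values ≈ 0.000) indicate that both x1 and x2 are significant.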

6. Create various ML models that predict the given data-set well. Only use OLS models. You can use higher-order polynomials (e.g. x1^5, x2^5), trigonometric functions (e.g. sin x1), or interaction terms (e.g. x1*x2), etc. You can use scikit-learn or statsmodels, but it is encouraged that you code manually how to perform Ordinary Least Squares regression.

In [198]:
def manual_ols(X, y):
    X = np.asarray(X)
    y = np.asarray(y).reshape(-1, 1)
    XTX_inv = np.linalg.inv(X.T @ X)
    beta = XTX_inv @ X.T @ y
    return beta

def predict_ols(X, beta):
    return X @ beta
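
The manual_ols helper above solves the multiple-regression normal equations directly,

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}, $$

where X includes a leading column of ones for the intercept (added later by prepare_features). In practice, np.linalg.lstsq or np.linalg.solve on the normal equations is numerically safer than forming the explicit inverse; the explicit inverse is kept here because it mirrors the textbook formula.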

In [199]:

X_raw = df[['x1', 'x2']]


y = df['y']

# Split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, y, test_size=0.2, random_state=42)

In [200]:

# Prepare models
def prepare_features(X_df, kind):
    X = X_df.copy()
    if kind == 'linear':
        pass
    elif kind == 'poly2':
        X['x1^2'] = X['x1'] ** 2
        X['x2^2'] = X['x2'] ** 2
    elif kind == 'trig':
        X['sin_x1'] = np.sin(X['x1'])
        X['cos_x2'] = np.cos(X['x2'])
    elif kind == 'interaction':
        X['x1*x2'] = X['x1'] * X['x2']
    elif kind == 'full':
        X['x1^2'] = X['x1'] ** 2
        X['x2^2'] = X['x2'] ** 2
        X['x1*x2'] = X['x1'] * X['x2']
        X['sin_x1'] = np.sin(X['x1'])
        X['cos_x2'] = np.cos(X['x2'])
    else:
        raise ValueError("Unknown feature set kind")

    X.insert(0, 'const', 1)  # Add intercept column
    return X

In [201]:
def evaluate_model(X_train, y_train, X_test, y_test):
    beta = manual_ols(X_train, y_train)
    y_train_pred = predict_ols(X_train, beta)
    y_test_pred = predict_ols(X_test, beta)

    # Metrics
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)

    return {
        'beta': beta.flatten(),
        'mse_train': mse_train,
        'rmse_train': rmse_train,
        'r2_train': r2_train,
        'mse_test': mse_test,
        'rmse_test': rmse_test,
        'r2_test': r2_test
    }

In [202]:

models = ['linear', 'poly2', 'trig', 'interaction', 'full']

results = {}

for model in models:
    X_train = prepare_features(X_train_raw, model)
    X_test = prepare_features(X_test_raw, model)
    results[model] = evaluate_model(X_train.values, y_train, X_test.values, y_test)

In [203]:
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    print(f"Train R²: {metrics['r2_train']:.4f}, Test R²: {metrics['r2_test']:.4f}")
    print(f"Train RMSE: {metrics['rmse_train']:.4f}, Test RMSE: {metrics['rmse_test']:.4f}")
    print(f"Coefficients: {metrics['beta']}")

Model: linear
Train R²: 0.8707, Test R²: 0.8468
Train RMSE: 211.2469, Test RMSE: 230.4363
Coefficients: [ 388.75457107 -178.01225235 -18.4485275 ]

Model: poly2
Train R²: 0.9822, Test R²: 0.9788
Train RMSE: 78.3903, Test RMSE: 85.7292
Coefficients: [ -9.48737165 87.42813242 -11.99750729 -26.73929893 -0.37045565]

Model: trig
Train R²: 0.8716, Test R²: 0.8470
Train RMSE: 210.5177, Test RMSE: 230.3179
Coefficients: [ 391.93096916 -178.08439731 -18.74052173 -21.91722319 -19.04940317]

Model: interaction
Train R²: 0.8716, Test R²: 0.8506
Train RMSE: 210.5255, Test RMSE: 227.5825
Coefficients: [ 392.99730199 -178.4760032 -28.54090877 1.96711537]

Model: full
Train R²: 0.9947, Test R²: 0.9934
Train RMSE: 42.6674, Test RMSE: 47.7519
Coefficients: [-101.55841377 124.29327334 -11.78807654 -30.3214162 -0.68835946
0.3018325 115.15025886 -2.34296153]

7. Select the 'best' model you have created in Item 6. Provide justifications why you chose this model. Take note that the data is synthetic data that I produced.

The best model is the full model. It achieves the highest test R² (0.9934) and the lowest test RMSE (47.75) among the fitted models because it includes a rich set of features (linear, polynomial, interaction, and trigonometric terms), which allows it to capture the nonlinear relationships in the data.

8. Perform a K-fold (10 folds and 4 folds) cross-validation. Record the mean and variance of the R² statistic for both the training and cross-validation (CV) sets. Discuss the goodness-of-fit for the various models you have created.
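
The statistics recorded below are simply the sample mean and (population) variance of the per-fold R² scores, i.e. for k folds

$$ \overline{R^2} = \frac{1}{k} \sum_{j=1}^{k} R^2_{(j)}, \qquad \mathrm{Var}(R^2) = \frac{1}{k} \sum_{j=1}^{k} \left( R^2_{(j)} - \overline{R^2} \right)^2, $$

which is what np.mean and np.var compute by default.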

In [204]:
from sklearn.model_selection import KFold

In [205]:
def k_fold_cv(X_df, y_series, k=10):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    r2_train_scores = []
    r2_val_scores = []

    for train_idx, val_idx in kf.split(X_df):
        X_train, X_val = X_df.iloc[train_idx], X_df.iloc[val_idx]
        y_train, y_val = y_series.iloc[train_idx], y_series.iloc[val_idx]

        X_train_np = X_train.values
        y_train_np = y_train.values
        X_val_np = X_val.values
        y_val_np = y_val.values

        beta = manual_ols(X_train_np, y_train_np)

        y_train_pred = predict_ols(X_train_np, beta)
        y_val_pred = predict_ols(X_val_np, beta)

        r2_train = r2_score(y_train_np, y_train_pred)
        r2_val = r2_score(y_val_np, y_val_pred)

        r2_train_scores.append(r2_train)
        r2_val_scores.append(r2_val)

    return {
        "mean_train_r2": np.mean(r2_train_scores),
        "var_train_r2": np.var(r2_train_scores),
        "mean_val_r2": np.mean(r2_val_scores),
        "var_val_r2": np.var(r2_val_scores),
    }

In [206]:

folds = [4, 10]

cv_results = {}

for model in ['linear', 'poly2', 'trig', 'interaction', 'full']:
    X_model = prepare_features(X_raw, model)  # Use the full data-set
    for k in folds:
        key = f"{model}_k{k}"
        cv_results[key] = k_fold_cv(X_model, y, k)

In [207]:
# Convert results into a DataFrame for easy viewing
cv_summary = []

for model_name, stats in cv_results.items():
    summary = {
        'Model': model_name,
        'Mean Train R²': round(stats['mean_train_r2'], 4),
        'Var Train R²': round(stats['var_train_r2'], 6),
        'Mean CV R²': round(stats['mean_val_r2'], 4),
        'Var CV R²': round(stats['var_val_r2'], 6)
    }
    cv_summary.append(summary)

# Create the DataFrame and sort by cross-validation R²
cv_df = pd.DataFrame(cv_summary)
cv_df = cv_df.sort_values(by='Mean CV R²', ascending=False)

# Display
print(cv_df.to_string(index=False))

Model Mean Train R² Var Train R² Mean CV R² Var CV R²


full_k10 0.9946 0.000000 0.9940 0.000003
full_k4 0.9946 0.000000 0.9940 0.000001
poly2_k4 0.9818 0.000000 0.9810 0.000003
poly2_k10 0.9817 0.000000 0.9805 0.000006
trig_k4 0.8676 0.000011 0.8638 0.000077
linear_k4 0.8665 0.000012 0.8637 0.000080
interaction_k4 0.8683 0.000008 0.8634 0.000042
interaction_k10 0.8680 0.000003 0.8624 0.000217
linear_k10 0.8664 0.000004 0.8618 0.000314
trig_k10 0.8674 0.000004 0.8617 0.000366
