Name: Niño James A. Tan

Section: BSCpE 3-G

Machine Exercise #2

1. Introduction to Exploratory Data Analysis (EDA)


Before building a machine learning model, you must first properly understand the data-set you are dealing with. This process is called Exploratory Data Analysis (EDA).

You can find python notebooks that explain this process:

Intro to Exploratory data analysis (EDA) in Python


Detailed exploratory data analysis with python
Exploratory Data Analysis with Pandas

Take time to read and understand the given examples above. Complete the python notebook [Link].

a. Import the dataset [Link] using pandas.


In [146]:

# importing packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [147]:
pd.read_csv('Advertising.csv')

Out[147]:

Unnamed: 0 TV radio newspaper sales

0 1 230.1 37.8 69.2 22.1

1 2 44.5 39.3 45.1 10.4

2 3 17.2 45.9 69.3 9.3

3 4 151.5 41.3 58.5 18.5

4 5 180.8 10.8 58.4 12.9

... ... ... ... ... ...

195 196 38.2 3.7 13.8 7.6

196 197 94.2 4.9 8.1 9.7

197 198 177.0 9.3 6.4 12.8

198 199 283.6 42.0 66.2 25.5

199 200 232.1 8.6 8.7 13.4

200 rows × 5 columns

b. Use pandas to save the data into a dataframe.


In [148]:

df = pd.read_csv('Advertising.csv')

c. Use pandas to know the column names and the number of entries (or samples).
In [149]:
print(df.columns)
print(df.index)
print(df.shape)
print(df.size)

Index(['Unnamed: 0', 'TV', 'radio', 'newspaper', 'sales'], dtype='object')


RangeIndex(start=0, stop=200, step=1)
(200, 5)
1000

d. Use pandas to show the first and last 5 entries of the data frame.
In [150]:
df.head()
Out[150]:

Unnamed: 0 TV radio newspaper sales

0 1 230.1 37.8 69.2 22.1

1 2 44.5 39.3 45.1 10.4

2 3 17.2 45.9 69.3 9.3

3 4 151.5 41.3 58.5 18.5

4 5 180.8 10.8 58.4 12.9

In [151]:
df.tail()
Out[151]:

Unnamed: 0 TV radio newspaper sales

195 196 38.2 3.7 13.8 7.6

196 197 94.2 4.9 8.1 9.7

197 198 177.0 9.3 6.4 12.8

198 199 283.6 42.0 66.2 25.5

199 200 232.1 8.6 8.7 13.4

e. Use pandas to know the statistics of each column.


In [152]:
df.describe()
Out[152]:

Unnamed: 0 TV radio newspaper sales

count 200.000000 200.000000 200.000000 200.000000 200.000000


mean 100.500000 147.042500 23.264000 30.554000 14.022500
std 57.879185 85.854236 14.846809 21.778621 5.217457

min 1.000000 0.700000 0.000000 0.300000 1.600000

25% 50.750000 74.375000 9.975000 12.750000 10.375000

50% 100.500000 149.750000 22.900000 25.750000 12.900000

75% 150.250000 218.825000 36.525000 45.100000 17.400000

max 200.000000 296.400000 49.600000 114.000000 27.000000

f. What are the types of the data?


In [153]:
print(df.dtypes)

Unnamed: 0 int64
TV float64
radio float64
newspaper float64
sales float64
dtype: object

g. Sometimes data-sets contain columns that are not needed and you need to drop them from the data frame. For example purposes, drop the Unnamed: 0 column of the Advertising.csv data-set.
In [154]:
df.drop(columns=["Unnamed: 0"], inplace=True)
print(df.columns)

Index(['TV', 'radio', 'newspaper', 'sales'], dtype='object')

h. Sometimes you need to rename columns, for example those with confusing names, spaces, or very long names. For example purposes, rename the column TV to television.
In [155]:
df.rename(columns={"TV": "television"}, inplace=True)
print(df.columns)

Index(['television', 'radio', 'newspaper', 'sales'], dtype='object')

i. Remove rows with duplicates (if there are any). See reference python
notebooks above for example on how to do this.
In [156]:
print(f'Before: {df.shape}')
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
print(f'After: {df.shape}')

Before: (200, 4)
After: (200, 4)

j. Check if there are 'null' or 'missing' values then drop these rows.
In [157]:
hasNA = df.isnull().values.any()
if (hasNA):
    print('DataFrame contains NA values')
    print(df.isnull().sum())
    print(f'Before: {df.shape}')
    df = df.dropna()
    df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
    print(f'After: {df.shape}')
else:
    print('DataFrame does not contain NA values')

DataFrame does not contain NA values

k. Detect if there are outliers and remove these rows.


In [158]:

from scipy import stats

print(f'Before: {df.shape}')
z_scores = np.abs(stats.zscore(df))
df = df[(z_scores < 3).all(axis=1)]
print(f'After: {df.shape}')

Before: (200, 4)
After: (198, 4)
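
For reference, the rule applied above is the common z-score heuristic for outlier removal (an assumed convention, not stated in the exercise): each value is standardized by its column's mean and standard deviation,

$$ z = \frac{x - \mu}{\sigma}, $$

and a row is kept only if $|z| < 3$ for every column.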

l. Use pandas to answer: how many sales values are above 15?
In [159]:
above_15 = df[df["sales"] > 15]["sales"]
print(above_15.count())

74

m. Remove the 10th to 40th entry. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains? Hint: you can use df.drop() and df.index[].
In [160]:
# before
df.describe()
Out[160]:

television radio newspaper sales

count 198.000000 198.000000 198.000000 198.000000

mean 146.688384 23.130808 29.777273 13.980808

std 85.443221 14.862111 20.446303 5.196097

min 0.700000 0.000000 0.300000 1.600000

25% 74.800000 9.925000 12.650000 10.325000

50% 149.750000 22.400000 25.600000 12.900000

75% 218.475000 36.325000 44.050000 17.375000

max 293.600000 49.600000 89.400000 27.000000

In [161]:
df = df.drop(df.index[10:41])  # removes rows at positions 10 through 40 (inclusive of 10, exclusive of 41)
# after
df.describe()
Out[161]:

television radio newspaper sales

count 167.000000 167.000000 167.000000 167.000000

mean 142.950299 23.269461 29.920359 13.832335

std 84.686566 15.119551 21.014340 5.227456

min 0.700000 0.000000 0.900000 1.600000

25% 74.250000 9.750000 12.150000 10.350000

50% 141.300000 22.500000 25.600000 12.900000

75% 215.900000 36.900000 44.700000 17.250000

max 293.600000 49.600000 89.400000 27.000000

n. Using the complete dataset, replicate the figure below (do not include
regression line yet). Check the python notebook for the figure.
In [162]:
sales = df['sales']
tv = df['television']
radio = df['radio']
newspaper = df['newspaper']

In [163]:
# Correlation
Y = sales

In [164]:
# linear regression (manual least squares for one predictor)
def linear_regression(X, Y):
    b1 = np.dot(X - X.mean(), Y - Y.mean()) / np.dot(X - X.mean(), X - X.mean())
    bo = Y.mean() - b1 * X.mean()
    Yhat = bo + b1 * X
    return (bo, b1, Yhat)
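
The helper above implements the closed-form least-squares estimates for a single predictor, which for reference are

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$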

In [165]:
# plotting
structure = {
    'columns'    : [tv, radio, newspaper],
    'colors'     : ['blue', 'red', 'green'],
    'line_colors': ['red', 'green', 'blue'],
    'labels'     : ['TV', 'Radio', 'Newspaper'],
}

fig, ax = plt.subplots(figsize=(8, 3.5), ncols=3, nrows=1)

for i in range(0, 3):
    lr = linear_regression(structure['columns'][i], sales)
    ax[i].scatter(structure['columns'][i], sales, c='none',
                  edgecolor=structure['colors'][i], linewidth=0.7, alpha=0.5,
                  label='data')
    ax[i].plot(structure['columns'][i], lr[2], color=structure['line_colors'][i],
               linewidth=2, alpha=0.9,
               label=f'Y = {lr[0]:.4} + {lr[1]:.4} * X')
    ax[i].set_xlabel(structure['labels'][i])
    ax[i].set_ylabel('Sales')
    ax[i].legend()

plt.tight_layout()
plt.show();

2. Understanding the pitfall of overfitting.

Measuring the quality of fit of an ML model is an integral step in assuring that the model will produce meaningful results when used for prediction. Kindly read Chapter 2.2, Assessing Model Accuracy, of the Introduction to Statistical Learning textbook for more details. This item aims to understand the pitfall of overfitting.

Complete the python notebook [Link]. Provide answers to the following questions included in the python notebook.

Instructions
1. Use the code below to generate synthetic data. Plot the data points and the true (or exact, without noise) function.

seed = 20250404
np.random.seed(seed)
N = 25
X = np.linspace(-np.pi, np.pi, N)
# noise
# eps ~ N(0, 0.01)
eps = np.random.normal(0, 0.1, N)
y = 0.25*np.sin(X) + eps
y_true = 0.25*np.sin(X)

In [166]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [167]:
seed = 20250404
np.random.seed(seed)
N = 25
X = np.linspace(-np.pi, np.pi, N)
eps = np.random.normal(0, 0.1, N)
y = 0.25 * np.sin(X) + eps
y_true = 0.25 * np.sin(X)

In [168]:
X = X.reshape(-1, 1)

In [169]:
# Polynomial regression
degree = 5
poly = PolynomialFeatures(degree)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)

# Predict on a finer grid
X_plot = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)
y_plot = model.predict(X_plot_poly)

# Plotting
plt.figure(figsize=(6, 4))
plt.scatter(X, y, color='red')           # noisy data
plt.plot(X_plot, y_plot, color='blue')   # model prediction
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()
2. Perform regression on the entire data-set. You can use statsmodels, scikit-learn, or manually calculate the coefficients using the OLS method.

Fit a simple linear regression (n=1) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=1,2,3) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=4,5,6) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=13,14,15) and plot. Include exact (no noise) function and data points.
Fit a polynomial degree (n=22) and plot. Include exact (no noise) function and data points.
In [170]:
def plot_poly_fit(degrees, X, y, y_true):
    X_plot = np.linspace(-np.pi, np.pi, 300).reshape(-1, 1)
    y_exact = 0.25 * np.sin(X_plot)

    plt.figure(figsize=(6, 4))
    plt.scatter(X, y, color='red', label='Data')
    plt.plot(X_plot, y_exact, 'b--', label='Exact (no noise)')

    for degree in degrees:
        poly = PolynomialFeatures(degree)
        X_poly = poly.fit_transform(X)
        X_plot_poly = poly.transform(X_plot)

        model = LinearRegression()
        model.fit(X_poly, y)
        y_pred = model.predict(X_plot_poly)

        label = f'n={degree}'
        # Show the fitted equation for a single degree-1 fit
        if degree == 1 and len(degrees) == 1:
            intercept = model.intercept_
            slope = model.coef_[1]
            label = f'y={intercept:.4f}+{slope:.4f}x'

        plt.plot(X_plot, y_pred, label=label)

    plt.xlabel("x")
    plt.ylabel("y")
    plt.grid(True)
    plt.legend()
    plt.show()

In [171]:
# Simple linear regression (n=1)
plot_poly_fit([1], X, y, y_true)

# Polynomial degrees 1 to 3
plot_poly_fit([1, 2, 3], X, y, y_true)

# Polynomial degrees 4 to 6
plot_poly_fit([4, 5, 6], X, y, y_true)

# Polynomial degrees 13 to 15
plot_poly_fit([13, 14, 15], X, y, y_true)

# Polynomial degree 22
plot_poly_fit([22], X, y, y_true)
3. Create a function that will fit the data given a degree-k polynomial. Then, for each fitted model, calculate the Residual Sum of Squares (RSS), $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Plot the calculated RSS with the degree of the polynomial on the x-axis.

In [172]:
def compute_rss_for_degrees(X, y, max_degree=25):
    rss_list = []
    degrees = list(range(1, max_degree + 1))

    for k in degrees:
        poly = PolynomialFeatures(k)
        X_poly = poly.fit_transform(X)

        model = LinearRegression()
        model.fit(X_poly, y)

        y_pred = model.predict(X_poly)
        rss = np.sum((y - y_pred) ** 2)
        rss_list.append(rss)

    return degrees, rss_list

In [173]:

def plot_rss(degrees, rss_list):
    plt.figure(figsize=(8, 4))
    plt.plot(degrees, rss_list, marker='o', linestyle='-', color='purple')
    plt.xlabel('Polynomial Degree')
    plt.ylabel('RSS (Residual Sum of Squares)')
    plt.title('RSS vs. Polynomial Degree')
    plt.grid(True)
    plt.show()

In [174]:
degrees, rss_list = compute_rss_for_degrees(X, y, max_degree=25)
plot_rss(degrees, rss_list)

4. Generate new data sets with N = 100. Use the same seed number as above. Create an OLS model with polynomial degree from 1 to 20. For each model fitting, subdivide the data-set into 70% for training and 30% for testing. For each set, calculate the Mean Squared Error (MSE), $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Plot the MSE vs the polynomial order of the fit. See example plot below.
In [175]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [176]:

def evaluate_mse_poly_models(seed, N=100, max_degree=20):
    np.random.seed(seed)
    X = np.linspace(-np.pi, np.pi, N).reshape(-1, 1)
    eps = np.random.normal(0, 0.1, N)
    y = 0.25 * np.sin(X).flatten() + eps

    degrees = list(range(1, max_degree + 1))

    train_mse_list = []
    test_mse_list = []

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=seed)

    for k in degrees:
        poly = PolynomialFeatures(k)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)

        model = LinearRegression()
        model.fit(X_train_poly, y_train)

        y_train_pred = model.predict(X_train_poly)
        y_test_pred = model.predict(X_test_poly)

        train_mse = mean_squared_error(y_train, y_train_pred)
        test_mse = mean_squared_error(y_test, y_test_pred)

        train_mse_list.append(train_mse)
        test_mse_list.append(test_mse)

    return degrees, train_mse_list, test_mse_list

In [177]:
def plot_mse_vs_degree_separate(degrees, train_mse, test_mse):
    fig, axes = plt.subplots(figsize=(8, 3.5), ncols=2, nrows=1)

    # Plot 1: Train and Test MSE
    axes[0].plot(degrees, train_mse, label='train mse', color='red')
    axes[0].plot(degrees, test_mse, label='test mse', color='blue')
    axes[0].set_title("Train/Test MSE vs Polynomial Degree")
    axes[0].set_xlabel("polynomial fit order (n)")
    axes[0].set_ylabel("Mean Squared Error")
    axes[0].legend()
    axes[0].grid(True)

    # Plot 2: Train MSE only
    axes[1].plot(degrees, train_mse, label='train mse', color='red')
    axes[1].set_title("Train MSE vs Polynomial Degree")
    axes[1].set_xlabel("polynomial fit order (n)")
    axes[1].set_ylabel("Mean Squared Error")
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

In [178]:
degrees, train_mse, test_mse = evaluate_mse_poly_models(seed)
plot_mse_vs_degree_separate(degrees, train_mse, test_mse)

3. Model selection demonstrated using the Ordinary Least Squares method.

A synthetic data-set is provided at [Link]. This data was produced using a linear combination of the predictors x1 and x2. The aim of this item is for you to select a good model (only use a Linear Regression model) for the data-set. Kindly read Chapters 3.2 and 3.3 of the textbook for this purpose.

Complete the python notebook [Link].

Instructions
1. Import the data from here. The data-set has two predictors/features, x1 and x2, and one target value, y.

In [179]:
import pandas as pd

In [180]:
df = pd.read_csv('[Link]')

2. Perform an exploratory data analysis for this data-set.

In [181]:

print(df.columns)
print(df.shape)
print(df.count())

Index(['x1', 'x2', 'y'], dtype='object')


(350, 3)
x1 350
x2 350
y 350
dtype: int64

In [182]:
df.head()

Out[182]:

x1 x2 y

0 2.739702 3.770638 -36.038913

1 4.974939 0.386704 -335.187632

2 7.828024 -1.765389 -861.653094

3 3.345500 -2.861464 -81.880837

4 0.417797 -3.360638 59.936199

In [183]:
df.tail()

Out[183]:

x1 x2 y

345 7.151762 -3.724223 -615.910151

346 7.192358 3.395231 -656.397129

347 0.019028 1.637808 -4.833681

348 9.393014 0.500292 -1635.089200

349 6.177953 -1.157211 -464.780475

In [184]:

df.describe()
Out[184]:

x1 x2 y

count 350.000000 350.000000 350.000000

mean 5.104855 -0.148559 -518.999391

std 3.080878 2.919885 588.819525

min 0.019028 -4.992177 -1979.355855

25% 2.341294 -2.451153 -892.857741

50% 5.348355 -0.457099 -368.847553

75% 7.940074 2.476036 4.020817

max 9.954931 4.999886 177.691608

In [185]:
print(f'Before: {df.shape}')
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
print(f'After: {df.shape}')

Before: (350, 3)
After: (350, 3)

In [186]:
hasNA = df.isnull().values.any()
if (hasNA):
    print('DataFrame contains NA values')
    print(df.isnull().sum())
    print(f'Before: {df.shape}')
    df = df.dropna()
    df.reset_index(drop=True, inplace=True)  # re-index after dropping rows
    print(f'After: {df.shape}')
else:
    print('DataFrame does not contain NA values')

DataFrame does not contain NA values


In [187]:

# Outliers
from scipy import stats
import numpy as np

print(f'Before: {df.shape}')
z_scores = np.abs(stats.zscore(df))
df = df[(z_scores < 3).all(axis=1)]
print(f'After: {df.shape}')

Before: (350, 3)
After: (350, 3)

3. Divide the data into training and testing sets.

In [188]:
from sklearn.model_selection import train_test_split

In [189]:
X = df[['x1', 'x2']]
y = df['y']

In [190]:

# Split the dataset into training and testing sets (e.g., 80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [191]:
# Optional: Combine X and y again if needed
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

print(train_data.shape)
print(test_data.shape)

(280, 3)
(70, 3)

4. Fit a simple Ordinary Least Squares (OLS) model to the data: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. Check the Residual Standard Error, Mean Squared Error, $R^2$, and Root Mean Squared Error for the training and testing sets. Give your observations.
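
For reference, the metrics below are computed with the usual definitions (for n observations and p predictors), matching the evaluate() helper used later:

$$ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \mathrm{MSE} = \frac{\mathrm{RSS}}{n}, \quad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}, \quad R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, $$

where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$.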

In [192]:
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

In [193]:

# Add constant for intercept


X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

In [194]:
# Fit OLS model
ols_model = sm.OLS(y_train, X_train_sm).fit()

# Predict
y_train_pred = ols_model.predict(X_train_sm)
y_test_pred = ols_model.predict(X_test_sm)

In [195]:
# --- Evaluation Metrics ---
def evaluate(y_true, y_pred, n, p):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    rss = np.sum((y_true - y_pred) ** 2)
    rse = np.sqrt(rss / (n - p - 1))  # Residual Standard Error
    return mse, rmse, r2, rse

In [196]:

train_metrics = evaluate(y_train, y_train_pred, n=len(y_train), p=X_train.shape[1])


test_metrics = evaluate(y_test, y_test_pred, n=len(y_test), p=X_test.shape[1])

# Display results
print("TRAINING SET:")
print(f"RSE: {train_metrics[3]:.4f} ")
print(f"MSE: {train_metrics[0]:.4f} ")
print(f"RMSE: {train_metrics[1]:.4f} ")
print(f"R²: {train_metrics[2]:.4f} \n ")

print("TESTING SET:")
print(f"RSE: {test_metrics[3]:.4f}")
print(f"MSE: {test_metrics[0]:.4f}")
print(f"RMSE: {test_metrics[1]:.4f}")
print(f"R²: {test_metrics[2]:.4f}")

TRAINING SET:
RSE: 212.3877
MSE: 44625.2359
RMSE: 211.2469
R²: 0.8707

TESTING SET:
RSE: 235.5388
MSE: 53100.8893
RMSE: 230.4363
R²: 0.8468

5. Is at least one of the predictors x1, x2 useful in predicting the response y? See Chapter 3.2.2 on how to
answer this question.

In [197]:

print(ols_model.summary())

OLS Regression Results


==============================================================================
Dep. Variable: y R-squared: 0.871
Model: OLS Adj. R-squared: 0.870
Method: Least Squares F-statistic: 933.1
Date: Mon, 02 Jun 2025 Prob (F-statistic): 8.57e-124
Time: [Link] Log-Likelihood: -1896.2
No. Observations: 280 AIC: 3798.
Df Residuals: 277 BIC: 3809.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 388.7546 24.680 15.752 0.000 340.171 437.338
x1 -178.0123 4.121 -43.198 0.000 -186.124 -169.900
x2 -18.4485 4.377 -4.215 0.000 -27.065 -9.832
==============================================================================
Omnibus: 25.957 Durbin-Watson: 2.034
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.882
Skew: -0.800 Prob(JB): 1.97e-07
Kurtosis: 2.704 Cond. No. 11.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Predictor Coef P-value Significant?

x1 -178.0123 0.000 Yes

x2 -18.4485 0.000 Yes
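
Following Chapter 3.2.2, the question is answered by testing H0: β1 = β2 = 0 with the F-statistic reported in the summary above,

$$ F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}. $$

Here F = 933.1 with Prob(F-statistic) ≈ 8.57e-124, so the null hypothesis is rejected and at least one predictor is useful; the individual t-tests (both p-values ≈ 0.000) indicate that both x1 and x2 are significant.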

6. Create various ML models that predict the given data-set well. Only use OLS models. You can use higher-order polynomials (e.g. x1^5, x2^5), trigonometric functions (e.g. sin x1), or interaction terms (e.g. x1*x2), etc. You can use scikit-learn or statsmodels, but it is encouraged that you code manually how to perform Ordinary Least Squares regression.

In [198]:
def manual_ols(X, y):
    X = np.asarray(X)
    y = np.asarray(y).reshape(-1, 1)
    XTX_inv = np.linalg.inv(X.T @ X)
    beta = XTX_inv @ X.T @ y
    return beta

def predict_ols(X, beta):
    return X @ beta
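
The manual_ols helper above solves the multiple-regression normal equations directly,

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}, $$

where X includes a leading column of ones for the intercept (added later by prepare_features). In practice, np.linalg.lstsq or np.linalg.solve on the normal equations is numerically safer than forming the explicit inverse; the explicit inverse is kept here because it mirrors the textbook formula.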

In [199]:

X_raw = df[['x1', 'x2']]


y = df['y']

# Split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, y, test_size=0.2, random_state=42)

In [200]:

# Prepare models
def prepare_features(X_df, kind):
    X = X_df.copy()
    if kind == 'linear':
        pass
    elif kind == 'poly2':
        X['x1^2'] = X['x1'] ** 2
        X['x2^2'] = X['x2'] ** 2
    elif kind == 'trig':
        X['sin_x1'] = np.sin(X['x1'])
        X['cos_x2'] = np.cos(X['x2'])
    elif kind == 'interaction':
        X['x1*x2'] = X['x1'] * X['x2']
    elif kind == 'full':
        X['x1^2'] = X['x1'] ** 2
        X['x2^2'] = X['x2'] ** 2
        X['x1*x2'] = X['x1'] * X['x2']
        X['sin_x1'] = np.sin(X['x1'])
        X['cos_x2'] = np.cos(X['x2'])
    else:
        raise ValueError("Unknown feature set kind")

    X.insert(0, 'const', 1)  # Add intercept column
    return X

In [201]:
def evaluate_model(X_train, y_train, X_test, y_test):
    beta = manual_ols(X_train, y_train)
    y_train_pred = predict_ols(X_train, beta)
    y_test_pred = predict_ols(X_test, beta)

    # Metrics
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)

    return {
        'beta': beta.flatten(),
        'mse_train': mse_train,
        'rmse_train': rmse_train,
        'r2_train': r2_train,
        'mse_test': mse_test,
        'rmse_test': rmse_test,
        'r2_test': r2_test
    }

In [202]:

models = ['linear', 'poly2', 'trig', 'interaction', 'full']

results = {}

for model in models:
    X_train = prepare_features(X_train_raw, model)
    X_test = prepare_features(X_test_raw, model)
    results[model] = evaluate_model(X_train.values, y_train, X_test.values, y_test)

In [203]:
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    print(f"Train R²: {metrics['r2_train']:.4f}, Test R²: {metrics['r2_test']:.4f}")
    print(f"Train RMSE: {metrics['rmse_train']:.4f}, Test RMSE: {metrics['rmse_test']:.4f}")
    print(f"Coefficients: {metrics['beta']}")

Model: linear
Train R²: 0.8707, Test R²: 0.8468
Train RMSE: 211.2469, Test RMSE: 230.4363
Coefficients: [ 388.75457107 -178.01225235 -18.4485275 ]

Model: poly2
Train R²: 0.9822, Test R²: 0.9788
Train RMSE: 78.3903, Test RMSE: 85.7292
Coefficients: [ -9.48737165 87.42813242 -11.99750729 -26.73929893 -0.37045565]

Model: trig
Train R²: 0.8716, Test R²: 0.8470
Train RMSE: 210.5177, Test RMSE: 230.3179
Coefficients: [ 391.93096916 -178.08439731 -18.74052173 -21.91722319 -19.04940317]

Model: interaction
Train R²: 0.8716, Test R²: 0.8506
Train RMSE: 210.5255, Test RMSE: 227.5825
Coefficients: [ 392.99730199 -178.4760032 -28.54090877 1.96711537]

Model: full
Train R²: 0.9947, Test R²: 0.9934
Train RMSE: 42.6674, Test RMSE: 47.7519
Coefficients: [-101.55841377 124.29327334 -11.78807654 -30.3214162 -0.68835946
0.3018325 115.15025886 -2.34296153]

7. Select the 'best' model you have created in Item 6. Provide justifications why you chose this model. Take note that the data is synthetic data that I produced.

The best model is the full model. It achieves the highest test R² (0.9934) and the lowest test RMSE (47.75) among the fitted models because it includes a rich set of features (linear, polynomial, interaction, and trigonometric terms), which allows it to capture the nonlinear relationships in the data.

8. Perform a K-fold (10 folds and 4 folds) cross-validation. Record the mean and variance of the R² statistic for both the training and cross-validation (CV) sets. Discuss the goodness-of-fit for the various models you have created.
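
The statistics recorded below are simply the sample mean and (population) variance of the per-fold R² scores, i.e. for k folds

$$ \overline{R^2} = \frac{1}{k} \sum_{j=1}^{k} R^2_{(j)}, \qquad \mathrm{Var}(R^2) = \frac{1}{k} \sum_{j=1}^{k} \left( R^2_{(j)} - \overline{R^2} \right)^2, $$

which is what np.mean and np.var compute by default.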

In [204]:
from sklearn.model_selection import KFold

In [205]:
def k_fold_cv(X_df, y_series, k=10):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    r2_train_scores = []
    r2_val_scores = []

    for train_idx, val_idx in kf.split(X_df):
        X_train, X_val = X_df.iloc[train_idx], X_df.iloc[val_idx]
        y_train, y_val = y_series.iloc[train_idx], y_series.iloc[val_idx]

        X_train_np = X_train.values
        y_train_np = y_train.values
        X_val_np = X_val.values
        y_val_np = y_val.values

        beta = manual_ols(X_train_np, y_train_np)

        y_train_pred = predict_ols(X_train_np, beta)
        y_val_pred = predict_ols(X_val_np, beta)

        r2_train = r2_score(y_train_np, y_train_pred)
        r2_val = r2_score(y_val_np, y_val_pred)

        r2_train_scores.append(r2_train)
        r2_val_scores.append(r2_val)

    return {
        "mean_train_r2": np.mean(r2_train_scores),
        "var_train_r2": np.var(r2_train_scores),
        "mean_val_r2": np.mean(r2_val_scores),
        "var_val_r2": np.var(r2_val_scores),
    }

In [206]:

folds = [4, 10]

cv_results = {}

for model in ['linear', 'poly2', 'trig', 'interaction', 'full']:
    X_model = prepare_features(X_raw, model)  # Use the full data-set
    for k in folds:
        key = f"{model}_k{k}"
        cv_results[key] = k_fold_cv(X_model, y, k)

In [207]:
# Convert results into a DataFrame for easy viewing
cv_summary = []

for model_name, stats in cv_results.items():
    summary = {
        'Model': model_name,
        'Mean Train R²': round(stats['mean_train_r2'], 4),
        'Var Train R²': round(stats['var_train_r2'], 6),
        'Mean CV R²': round(stats['mean_val_r2'], 4),
        'Var CV R²': round(stats['var_val_r2'], 6)
    }
    cv_summary.append(summary)

# Create the DataFrame and sort by cross-validation R²
cv_df = pd.DataFrame(cv_summary)
cv_df = cv_df.sort_values(by='Mean CV R²', ascending=False)

# Display
print(cv_df.to_string(index=False))

Model Mean Train R² Var Train R² Mean CV R² Var CV R²


full_k10 0.9946 0.000000 0.9940 0.000003
full_k4 0.9946 0.000000 0.9940 0.000001
poly2_k4 0.9818 0.000000 0.9810 0.000003
poly2_k10 0.9817 0.000000 0.9805 0.000006
trig_k4 0.8676 0.000011 0.8638 0.000077
linear_k4 0.8665 0.000012 0.8637 0.000080
interaction_k4 0.8683 0.000008 0.8634 0.000042
interaction_k10 0.8680 0.000003 0.8624 0.000217
linear_k10 0.8664 0.000004 0.8618 0.000314
trig_k10 0.8674 0.000004 0.8617 0.000366
