
Introduction

In the realm of financial data analysis, ensuring the quality and reliability of data is paramount. The significance of good quality data becomes evident when we examine its implications in
financial modeling and decision-making processes. This assignment is structured to provide a comprehensive understanding of data quality, yield curve modeling, the exploitation of
correlation, and empirical analysis of Exchange-Traded Funds (ETFs), all within the broader context of financial data analysis.
Firstly (Question 1), we will delve into the concept of data quality by providing examples of poor-quality structured and unstructured data, along with methods to identify their shortcomings.
This foundational knowledge is crucial for recognizing the importance of maintaining high standards in financial data.
Next (Question 2), we will explore yield curve modeling by fitting Nelson-Siegel and Cubic-Spline models to government securities from a European Union country (Germany), with maturities ranging from short-term to long-term. Section 2 will highlight the comparative analysis of these models in terms of fit and interpretation, addressing the ethical considerations of data smoothing.
Furthermore, in Question 3 we will investigate the role of correlation and principal components in financial data analysis by generating uncorrelated Gaussian random variables and running
Principal Component Analysis (PCA). The analysis will be extended to real data, collecting daily closing yields for government securities and comparing the results through screeplots.
Lastly, in Question 4, the empirical analysis of ETFs will involve computing daily returns, covariance matrices, PCA, and Singular Value Decomposition (SVD) for a sector ETF, providing a
thorough explanation of each transformation and its implications.

Installing Packages
In [ ]: ! pip install -q kaggle --quiet
! pip install pandas --quiet
! pip install numpy --quiet
! pip install matplotlib --quiet
! pip install seaborn --quiet
! pip install pingouin --quiet
! pip install factor_analyzer --quiet
! pip install gdown --quiet
! pip install requests --quiet
! pip install yfinance --quiet
! pip install pandas-datareader --quiet


Importing packages
In [ ]: import math
import re
import time
import warnings
from datetime import datetime, timedelta

import gdown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pingouin as pg
import requests
import seaborn as sns
import yfinance as yfin
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
from factor_analyzer.factor_analyzer import calculate_kmo
from numpy import linalg as LA
from numpy import linalg as la
from scipy import stats
from scipy.interpolate import CubicSpline
from scipy.optimize import minimize

warnings.filterwarnings('ignore')


Question 1: Data Quality


Data quality is defined by its ability to meet the expectations of data consumers and accurately represent the objects, events, and concepts it is intended to depict (Sebastian xxx). Data can
be classified into structured, unstructured, and semi-structured data (Praveen and Umesh 67). Beyond this classification, data possess dimensions that contribute to its quality, including
accuracy, objectivity, believability, reputation, accessibility, access security, relevancy, value-added, timeliness, completeness, amount of data, interpretability, ease of understanding,
concise representation, and currency (Huh et al. 560; Strong et al. 104). The absence of even one of these dimensions is enough for the data to be labeled as poor quality. Possible causes of poor data quality include missing values, data errors, discrepancies, biases, inconsistencies, static header data, changes in historical data, lack of transparency, reporting time issues, and misuse of data (Liu 34). Other causes include multiple names for the same entity, missing values, incorrect values, and duplicate records for the same customer (Wang and Diane 7; Nadinić and Kalpić 328). We will base our answers to this exercise on these aspects.
a. Provide an example of poor quality structured data
Answer
In [ ]: file_id = "1Rp1Ubva5Q7ReW_zs7EowH3yge8rq-i5w"
url = f"https://drive.google.com/uc?id={file_id}"
output_filename = "financials.zip"
gdown.download(url, output_filename, quiet=False)

import os
if os.path.exists(output_filename):
print(f"File '{output_filename}' downloaded successfully!")
else:
print("File download failed.")
try:
import pandas as pd
datafin = pd.read_csv(output_filename)
# print(datafin.head())
except Exception as e:
print(f"Unable to read the file: {e}")

datafin.head()

Downloading...
From: https://drive.google.com/uc?id=1Rp1Ubva5Q7ReW_zs7EowH3yge8rq-i5w
To: /content/financials.zip
100%|██████████| 22.0k/22.0k [00:00<00:00, 33.8MB/s]
File 'financials.zip' downloaded successfully!

Out[ ]:
   Segment  Country   Product  Discount Band  Units Sold  Manufacturing Price  Sale Price  Gross Sales  Discounts      Sales       COGS     Profit        Date  Month Number  Month Name  Year
0 Government Canada Carretera None $1,618.50 $3.00 $20.00 $32,370.00 $- $32,370.00 $16,185.00 $16,185.00 01/01/2014 1 January 2014
1 Government Germany Carretera None $1,321.00 $3.00 $20.00 $26,420.00 $- $26,420.00 $13,210.00 $13,210.00 01/01/2014 1 January 2014
2 Midmarket France Carretera None $2,178.00 $3.00 $15.00 $32,670.00 $- $32,670.00 $21,780.00 $10,890.00 01/06/2014 6 June 2014
3 Midmarket Germany Carretera None $888.00 $3.00 $15.00 $13,320.00 $- $13,320.00 $8,880.00 $4,440.00 01/06/2014 6 June 2014
4 Midmarket Mexico Carretera None $2,470.00 $3.00 $15.00 $37,050.00 $- $37,050.00 $24,700.00 $12,350.00 01/06/2014 6 June 2014

In [ ]: datafin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Segment 700 non-null object
1 Country 700 non-null object
2 Product 700 non-null object
3 Discount Band 700 non-null object
4 Units Sold 700 non-null object
5 Manufacturing Price 700 non-null object
6 Sale Price 700 non-null object
7 Gross Sales 700 non-null object
8 Discounts 700 non-null object
9 Sales 700 non-null object
10 COGS 700 non-null object
11 Profit 700 non-null object
12 Date 700 non-null object
13 Month Number 700 non-null int64
14 Month Name 700 non-null object
15 Year 700 non-null int64
dtypes: int64(2), object(14)
memory usage: 87.6+ KB

Data description: This dataset, sourced from Atharva Arya, provides detailed sales and profit information across various market segments and countries, including key metrics like units
sold, sale price, gross sales, discounts, and profit. It enables in-depth analysis of sales performance, profitability, and time-based trends to support business insights and decision-making.
In [ ]: datafin.head()

Out[ ]:
   Segment  Country   Product  Discount Band  Units Sold  Manufacturing Price  Sale Price  Gross Sales  Discounts      Sales       COGS     Profit        Date  Month Number  Month Name  Year
0 Government Canada Carretera None $1,618.50 $3.00 $20.00 $32,370.00 $- $32,370.00 $16,185.00 $16,185.00 01/01/2014 1 January 2014
1 Government Germany Carretera None $1,321.00 $3.00 $20.00 $26,420.00 $- $26,420.00 $13,210.00 $13,210.00 01/01/2014 1 January 2014
2 Midmarket France Carretera None $2,178.00 $3.00 $15.00 $32,670.00 $- $32,670.00 $21,780.00 $10,890.00 01/06/2014 6 June 2014
3 Midmarket Germany Carretera None $888.00 $3.00 $15.00 $13,320.00 $- $13,320.00 $8,880.00 $4,440.00 01/06/2014 6 June 2014
4 Midmarket Mexico Carretera None $2,470.00 $3.00 $15.00 $37,050.00 $- $37,050.00 $24,700.00 $12,350.00 01/06/2014 6 June 2014

In [ ]: # Checking if there are any NA values


datafin.isna().any().any()

Out[ ]: False

In [ ]: # Checking if there are duplicated data


datafin.duplicated().sum()

Out[ ]: 0

In [ ]: datafin[' Discount Band '].unique()

Out[ ]: array([' None ', ' Low ', ' Medium ', ' High '], dtype=object)

In [ ]: datafin.columns

Out[ ]: Index(['Segment', 'Country', ' Product ', ' Discount Band ', ' Units Sold ',
' Manufacturing Price ', ' Sale Price ', ' Gross Sales ', ' Discounts ',
' Sales ', ' COGS ', ' Profit ', 'Date', 'Month Number',
' Month Name ', 'Year'],
dtype='object')

b. How would you recognize this poor quality?


Write 3 - 4 sentences that show how the data fails to include properties of good quality data.
Answer
Although the data does not have missing or duplicated values, it includes currency symbols ($), missing or incorrect values such as the placeholder ($ -), thousands separators (commas), and extra spaces before and after the names of some variables as well as their observations. Additionally, quantitative variables are stored as objects rather than as integers (int) or floats, which makes it difficult to perform numerical operations or analysis.
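A minimal cleaning sketch, assuming the column names and the "$ -" placeholder shown above (the variable datafin_clean is introduced here only for illustration), shows how these issues could be repaired:
In [ ]: # Cleaning sketch (illustrative): strip padding, remove '$' and thousands separators,
# and convert the numeric columns to floats; the '$ -' placeholder becomes NaN.
# Parenthesised negatives, if any, would need extra handling.
datafin_clean = datafin.copy()
datafin_clean.columns = datafin_clean.columns.str.strip()

money_cols = ['Units Sold', 'Manufacturing Price', 'Sale Price', 'Gross Sales',
              'Discounts', 'Sales', 'COGS', 'Profit']
for col in money_cols:
    cleaned = (datafin_clean[col].astype(str).str.strip()
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False))
    datafin_clean[col] = pd.to_numeric(cleaned, errors='coerce')

# Day-first date format assumed, consistent with the 'Month Number' column
datafin_clean['Date'] = pd.to_datetime(datafin_clean['Date'], dayfirst=True, errors='coerce')
datafin_clean.dtypes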
c. Provide an example of poor quality unstructured data
In [ ]: file_id = "1NIakNIwqoDgw9CvFWrU-s-uuJHdu6KQu"
url = f"https://drive.google.com/uc?id={file_id}"
output_filename = "Airplane_Crashes_and_Fatalities_Since_1908.csv.zip"
gdown.download(url, output_filename, quiet=False)
import os
if os.path.exists(output_filename):
print(f"File '{output_filename}' downloaded successfully!")
else:
print("File download failed.")

try:
import pandas as pd
data = pd.read_csv(output_filename)
# print(data.head())
except Exception as e:
print(f"Unable to read the file: {e}")
data.head()

Downloading...
From: https://drive.google.com/uc?id=1NIakNIwqoDgw9CvFWrU-s-uuJHdu6KQu
To: /content/Airplane_Crashes_and_Fatalities_Since_1908.csv.zip
100%|██████████| 596k/596k [00:00<00:00, 103MB/s]
File 'Airplane_Crashes_and_Fatalities_Since_1908.csv.zip' downloaded successfully!

Out[ ]:
index Date Time Location Operator Flight# Route Type Registration cn/In Aboard Fatalities Ground Summary
0  0  09/17/1908  17:18  Fort Myer, Virginia  Military - U.S. Army  NaN  Demonstration  Wright Flyer III  NaN  1  2.0  1.0  0.0  During a demonstration flight, a U.S. Army fly...
1  1  07/12/1912  06:30  AtlantiCity, New Jersey  Military - U.S. Navy  NaN  Test flight  Dirigible  NaN  NaN  5.0  5.0  0.0  First U.S. dirigible Akron exploded just offsh...
2  2  08/06/1913  NaN  Victoria, British Columbia, Canada  Private -  NaN  NaN  Curtiss seaplane  NaN  NaN  1.0  1.0  0.0  The first fatal airplane accident in Canada oc...
3  3  09/09/1913  18:30  Over the North Sea  Military - German Navy  NaN  NaN  Zeppelin L-1 (airship)  NaN  NaN  20.0  14.0  0.0  The airship flew into a thunderstorm and encou...
4  4  10/17/1913  10:30  Near Johannisthal, Germany  Military - German Navy  NaN  NaN  Zeppelin L-2 (airship)  NaN  NaN  30.0  30.0  0.0  Hydrogen gas which was being vented was sucked...

Data description: This dataset, sourced from Data Society, compiles information on airplane crashes and fatalities dating back to 1908. It includes various details such as the incident's date and
time, location, and the operator of the aircraft. Additional data includes the flight number, the route taken, and the aircraft's type and registration number. Also recorded are the construction
(serial) number, the total number of people aboard, fatalities, fatalities on the ground, and a summary of each incident.
In [ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 5268 non-null int64
1 Date 5268 non-null object
2 Time 3049 non-null object
3 Location 5248 non-null object
4 Operator 5250 non-null object
5 Flight # 1069 non-null object
6 Route 3561 non-null object
7 Type 5241 non-null object
8 Registration 4933 non-null object
9 cn/In 4040 non-null object
10 Aboard 5246 non-null float64
11 Fatalities 5256 non-null float64
12 Ground 5246 non-null float64
13 Summary 4878 non-null object
dtypes: float64(3), int64(1), object(10)
memory usage: 576.3+ KB

In [ ]: data.columns

Out[ ]: Index(['index', 'Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route',


'Type', 'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground',
'Summary'],
dtype='object')

In [ ]: data.isna().sum()

Out[ ]: 0
index 0
Date 0
Time 2219
Location 20
Operator 18
Flight # 4199
Route 1707
Type 27
Registration 335
cn/In 1228
Aboard 22
Fatalities 12
Ground 22
Summary 390

dtype: int64
In [ ]: data.duplicated().sum()

Out[ ]: 0

In [ ]: groupdata = data.groupby(['Date', 'Time', 'Location', 'Flight #'])['Fatalities'].sum()


groupdata
Out[ ]: Fatalities
Date Time Location Flight #
01/01/1976 05:30 Near Al Qaysumah, Saudi Arabia 438 82.0
01/01/1978 20:15 Off Bandra, Maharashtra, India 855 213.0
01/01/1985 20:30 Near La Paz, Bolivia 980 29.0
01/01/2002 18:02 Hollywood, Florida - 5.0
01/01/2007 14:07 Off Parepare, Indonesia 574 102.0
... ... ... ... ...
12/29/1991 15:05 Wanli, Taiwan 358 5.0
12/29/1994 15:30 Near Van, Turkey 278 57.0
12/30/1945 21:13 New York City, New York 14 1.0
12/30/1951 22:07 Near Fairbanks, Alaska 501 4.0
12/31/1968 08:30 Port Hedland, WA, Australia 1750 26.0
996 rows × 1 columns

dtype: float64
In [ ]: filtered_data=data[(data['Time'].isna() & data['Flight #'].isna()) & (data['Fatalities'] >= 10)]
filtered_data.head()

Out[ ]:
index Date Time Location Operator Flight# Route Type Registration cn/In Aboard Fatalities Ground Summary
7  7  07/28/1916  NaN  Near Jambol, Bulgeria  Military - German Army  NaN  NaN  Schutte-Lanz S-L-10 (airship)  NaN  NaN  20.0  20.0  0.0  Crashed near the Black Sea, cause unknown.
10  10  11/21/1916  NaN  Mainz, Germany  Military - German Army  NaN  NaN  Super Zeppelin (airship)  NaN  NaN  28.0  27.0  0.0  Crashed in a storm.
12  12  03/04/1917  NaN  Near Gent, Belgium  Military - German Army  NaN  NaN  Airship  NaN  NaN  20.0  20.0  0.0  Caught fire and crashed.
13  13  03/30/1917  NaN  Off Northern Germany  Military - German Navy  NaN  NaN  Schutte-Lanz S-L-9 (airship)  NaN  NaN  23.0  23.0  0.0  Struck by lightning and crashed into the Balti...
19  19  05/10/1918  NaN  Off Helgoland Island, Germany  Military - German Navy  NaN  NaN  Zeppelin L-70 (airship)  NaN  NaN  22.0  22.0  0.0  Shot down by British aircraft crashing from a ...

In [ ]: filtered_data.shape

Out[ ]: (986, 14)

d. Unstructured data can be more difficult to assess than structured data. Just as you did in part b, write 3 - 4 sentences that show how this unstructured data fails to meet the requirements of good quality data.
Answer
Although there are no duplicate records, there is an excess of missing values (NaN), even for incidents where fatalities are recorded. In other words, there are missing values in columns like "Flight #" and "cn/In," which makes it difficult to fully assess the incident details and compare across records. For instance, a filter was applied to all rows with missing values in both the "Time" and "Flight #" variables but with 10 or more fatalities, and 986 such rows were found, which reinforces the poor quality. Additionally, variables like "Date" are not stored with a proper datatype, and there is an "index" column that does not contribute to the analysis, further indicating that the data is poorly structured.
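A short assessment sketch along the same lines (assuming the columns listed by data.info(); data_clean is introduced here only for illustration) makes these shortcomings measurable:
In [ ]: # Assessment sketch (illustrative): drop the redundant 'index' column, parse dates,
# and quantify missingness per column.
data_clean = data.drop(columns=['index'])
data_clean['Date'] = pd.to_datetime(data_clean['Date'], format='%m/%d/%Y', errors='coerce')

missing_share = data_clean.isna().mean().sort_values(ascending=False)
print(missing_share.round(3))

# Fatal records that cannot be identified by a flight number
unidentified = data_clean[data_clean['Flight #'].isna() & (data_clean['Fatalities'] > 0)]
print(f"{len(unidentified)} fatal records lack a flight number")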

Question 2: Yield Curve Modeling


a. Pick government securities from a country. The country selected should be one of the countries from your group so that you can fit a Nelson-Siegel model.
Answer
Fitting a Nelson-Siegel model on EU government securities
In [ ]: base_url = 'https://sdw-wsrest.ecb.europa.eu/service/'
params = {
    'startPeriod': '2020-01-01',
    'endPeriod': '2024-12-31',
    'format': 'jsondata'
}

maturities = ['SR_3M', 'SR_1Y', 'SR_2Y', 'SR_5Y', 'SR_10Y', 'SR_30Y']

data_dict = {}

for maturity in maturities:
    url = f"{base_url}data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.{maturity}"
    response = requests.get(url, params=params)
    data = response.json()

    # Extract values (first element of each observation array)
    values = [obs[0] for obs in data['dataSets'][0]['series']['0:0:0:0:0:0:0']['observations'].values()]

    # Get dates from structure
    dates = [date['id'] for date in data['structure']['dimensions']['observation'][0]['values']]

    data_dict[maturity] = pd.Series(values, index=pd.to_datetime(dates))

df = pd.DataFrame(data_dict)
df.columns = ['3M', '1Y', '2Y', '5Y', '10Y', '30Y']

b. Be sure to pick maturities ranging from short-term to long-term (e.g. 6 month maturity to 20 or 30 year maturities).
Answer
In [ ]: df
Out[ ]: 3M 1Y 2Y 5Y 10Y 30Y
2020-01-02 -0.623324 -0.634844 -0.622180 -0.478606 -0.173995 0.358465
2020-01-03 -0.595449 -0.630324 -0.638365 -0.520394 -0.222926 0.306064
2020-01-06 -0.597296 -0.637580 -0.651086 -0.541757 -0.247426 0.283818
2020-01-07 -0.605300 -0.641443 -0.650553 -0.532864 -0.232130 0.304965
2020-01-08 -0.597340 -0.633612 -0.641188 -0.515378 -0.204580 0.341047
... ... ... ... ... ... ...
2024-12-20 2.616587 2.172129 1.960729 2.049615 2.374684 2.445430
2024-12-23 2.632250 2.201549 1.997799 2.084822 2.402344 2.456214
2024-12-24 2.661780 2.211694 2.003148 2.090155 2.405459 2.455847
2024-12-27 2.653268 2.188540 2.007697 2.126310 2.431800 2.494656
2024-12-30 2.575177 2.178646 2.011151 2.130018 2.447304 2.513773
1278 rows × 6 columns
In [ ]: df.to_csv('yield_curves.csv')

In [ ]: def nelson_siegel(t, params):
    beta0, beta1, beta2, tau = params
    # Avoid division by zero
    exp_term = np.exp(-t/tau)
    term = (1 - exp_term)/(t/tau)
    return beta0 + beta1 * term + beta2 * (term - exp_term)

def objective(params, t, y):
    return np.sum((nelson_siegel(t, params) - y)**2)

# Convert maturities to years
maturities = np.array([0.25, 1, 2, 5, 10, 30])

# Fit for each date
results = []
for date in df.index:
    yields = df.loc[date].values

    # Initial parameters [β₀, β₁, β₂, τ]
    initial_guess = [yields[-1], yields[0]-yields[-1], 0, 2]

    # Bounds for parameters
    bounds = [(0, 15), (-15, 15), (-15, 15), (0.1, 10)]

    res = minimize(objective, initial_guess, args=(maturities, yields),
                   bounds=bounds, method='L-BFGS-B')

    # Store parameters and fitted yields
    params = res.x
    fitted_yields = nelson_siegel(maturities, params)
    rmse = np.sqrt(np.mean((yields - fitted_yields)**2))

    results.append({
        'date': date,
        'beta0': params[0],  # Level
        'beta1': params[1],  # Slope
        'beta2': params[2],  # Curvature
        'tau': params[3],    # Decay
        'rmse': rmse
    })

# Create DataFrame of results
results_df = pd.DataFrame(results).set_index('date')

In [ ]: results_df

Out[ ]: beta0 beta1 beta2 tau rmse


date
2020-01-02 0.698999 -1.310728 -1.646603 3.455619 0.001258
2020-01-03 0.640179 -1.215057 -1.799441 3.325815 0.000900
2020-01-06 0.620464 -1.195045 -1.839001 3.329411 0.000919
2020-01-07 0.644782 -1.228894 -1.830838 3.332548 0.000943
2020-01-08 0.683263 -1.259286 -1.862525 3.289445 0.000826
... ... ... ... ... ...
2024-12-20 2.569276 0.315362 -2.540065 1.188862 0.035549
2024-12-23 2.576592 0.318141 -2.435052 1.169326 0.037266
2024-12-24 2.575070 0.360573 -2.474106 1.149855 0.037276
2024-12-27 2.607960 0.329663 -2.519298 1.095611 0.032313
2024-12-30 2.632903 0.187122 -2.373655 1.157329 0.034248
1278 rows × 5 columns
c. Fit a Nelson-Siegel model.
Answer
In [ ]: sns.set()

# Read data
df = pd.read_csv('yield_curves.csv', index_col=0)
df.index = pd.to_datetime(df.index)

# Nelson-Siegel function
def nelson_siegel(t, params):
    beta0, beta1, beta2, tau = params
    exp_term = np.exp(-t/tau)
    return beta0 + beta1 * ((1 - exp_term)/(t/tau)) + beta2 * ((1 - exp_term)/(t/tau) - exp_term)

# Objective function
def objective(params, maturities, rates):
    return np.sum((nelson_siegel(maturities, params) - rates)**2)

# Convert maturities to years


maturities = np.array([0.25, 1, 2, 5, 10, 30])
latest_rates = df.iloc[-1].values

# Fit Nelson-Siegel model


initial_guess = [3, -2, 2, 1] # [beta0, beta1, beta2, tau]
result = minimize(objective, initial_guess, args=(maturities, latest_rates))
fitted_params = result.x

# Generate smooth curve


t_points = np.linspace(0.1, 30, 100)
fitted_curve = nelson_siegel(t_points, fitted_params)

# Plot
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.scatterplot(x=maturities, y=latest_rates, color='blue', s=100, label='Actual Yields')
plt.plot(t_points, fitted_curve, 'r-', label='Nelson-Siegel Fit')
plt.xlabel('Maturity (years)')
plt.ylabel('Yield (%)')
plt.title(f'Figure 2.1: Nelson-Siegel Yield Curve Fit ({df.index[-1].strftime("%Y-%m-%d")})')
plt.legend()
plt.grid(True)

print("Fitted Parameters:")
print(f"β0 (level): {fitted_params[0]:.4f}")
print(f"β1 (slope): {fitted_params[1]:.4f}")
print(f"β2 (curvature): {fitted_params[2]:.4f}")
print(f"τ (decay): {fitted_params[3]:.4f}")

plt.show()

Fitted Parameters:
β0 (level): 2.0494
β1 (slope): -0.3579
β2 (curvature): 1.3981
τ (decay): 0.2822

d. Fit a Cubic-Spline model.


Answer
In [ ]: cs = CubicSpline(maturities, latest_rates)

# Generate smooth curve


t_points = np.linspace(0.1, 30, 100)
fitted_curve = cs(t_points)

# Plot
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.scatterplot(x=maturities, y=latest_rates, color='blue', s=100, label='Actual Yields')
plt.plot(t_points, fitted_curve, 'r-', label='Cubic Spline Fit')
plt.xlabel('Maturity (years)')
plt.ylabel('Yield (%)')
plt.title(f'Figure 2.2: Cubic Spline Yield Curve Fit ({df.index[-1].strftime("%Y-%m-%d")})')
plt.legend()
plt.grid(True)
plt.show()
e. Compare the models in terms of 1) fit and 2) interpretation
Answer
In analyzing the Nelson-Siegel (Figure 2.1) and cubic models (Figure 2.2) for interest rate term structure, both models offer distinct advantages. The Nelson-Siegel model uses three
parameters to shape the yield curve, making it flexible and straightforward to interpret (Wahlstrøm et al. 971). This model excels in its simplicity and ability to clearly represent economic
factors like long-term rates, curve steepness, and mid-maturity humps (Akinyemi et al. 14). The Nelson-Siegel model produces a relatively smooth curve that captures the general trend but
misses some key data points. The fitted parameters (β0 = 2.0494, β1 = -0.3579, β2 = 1.3981, τ = 0.2822) generate a curve that notably underestimates yields at both very short and long maturities. The misfit is particularly visible at the 30-year point, where the actual yield is around 2.5% but the fitted curve predicts closer to 2.1%.
Conversely, the cubic model employs a third-degree polynomial to capture complex, non-linear data trends. Although it provides a more precise fit for intricate patterns, interpreting its
coefficients demands careful analysis. Overall, the Nelson-Siegel model is favored for its ease of use and clarity, while the cubic model is better suited for detailed data fitting but requires
deeper examination of its terms (Nymand 43). Cubic Spline model produces a more flexible fit that passes much closer to the observed data points. It captures both the short-term dynamics
and the long-term behavior more accurately. The curve shows more local variation, particularly evident in how it models the "belly" of the yield curve between 5 and 15 years.
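To make the fit comparison concrete, the in-sample errors at the observed maturities can be computed directly (a small sketch reusing maturities, latest_rates, fitted_params, and cs from the cells above; note that an interpolating cubic spline has zero error at the knots by construction, so this mainly quantifies the Nelson-Siegel misfit):
In [ ]: ns_rmse = np.sqrt(np.mean((latest_rates - nelson_siegel(maturities, fitted_params))**2))
cs_rmse = np.sqrt(np.mean((latest_rates - cs(maturities))**2))
print(f"Nelson-Siegel RMSE at observed maturities: {ns_rmse:.4f}")
print(f"Cubic spline RMSE at observed maturities:  {cs_rmse:.4f}")  # ~0 by construction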
f. Be sure to specify at the levels of model parameters (ex. Alpha1).
Answer
The Nelson-Siegel model offers a clear economic interpretation of its parameters: β0 (2.0494) represents the long-term interest rate level, β1 (-0.3579) controls the slope of the yield curve, and β2 (1.3981) determines the curvature. The decay parameter τ (0.2822) governs how quickly the short-term components decay. These parameters have direct economic meaning and can be used to understand market expectations about future interest rates. The Cubic Spline model, while providing a better fit, lacks such clear economic interpretation. Its parameters are purely mathematical, representing the coefficients of cubic polynomials between knot points. This makes it harder to extract meaningful economic insights about market expectations or risk premiums from the model parameters.
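These interpretations follow from the limiting behaviour of the Nelson-Siegel formula: as maturity grows the loading terms vanish and the yield tends to β0, while at very short maturities it approaches β0 + β1. A quick illustration, reusing the nelson_siegel function and fitted_params from the cells above:
In [ ]: # Long end tends to beta0; short end tends to beta0 + beta1
print("Yield at 30 years:  ", nelson_siegel(np.array([30.0]), fitted_params))
print("Yield at 0.01 years:", nelson_siegel(np.array([0.01]), fitted_params))
print("beta0 =", fitted_params[0], "  beta0 + beta1 =", fitted_params[0] + fitted_params[1])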
g. In Module 2 Lesson 4 (‘Smoothing Data’), we said smoothing data can be unethical. If Nelson-Siegel is smoothing the yield curve, is this considered unethical? Why or why not?
Answer
Using smoothing techniques like the Nelson-Siegel method to estimate yield curves is typically ethical when the goal is to improve the accuracy of models and better understand market
trends. These methods help to remove noise and clarify long-term data patterns. However, it crosses into unethical territory if used to mislead stakeholders or distort data, such as by
inflating asset values or hiding volatility. The key to ethical use lies in being transparent about the methodology and intentions. Problems arise when the smoothing process is used to
misrepresent the true nature of the data (Copeland 103).

3. Exploiting Correlation
Financial data analysis is meant not only to process data but also to understand how meaningful factors can be used to summarize or represent the data. Let's understand the role that correlation and principal components play.
a) Generate 5 uncorrelated Gaussian random variables that simulate yield changes (they can be positive or negative with a mean close to 0 and a standard deviation that is small).
Answer
Table of first five values generated
In [ ]: np.random.seed(432)
num_samples = 100
num_variables = 5
mean=0
std_deviation=0.01
data=np.random.normal(loc=mean,scale=std_deviation, size=(num_samples,num_variables))
dataF=pd.DataFrame(data)
dataF.columns=["Variable 1","Variable 2","Variable 3","Variable 4","Variable 5"]
dataF.head()

Out[ ]: Variable 1 Variable 2 Variable 3 Variable 4 Variable 5


0 0.001869 0.000345 -0.003848 0.004646 0.006009
1 -0.017359 -0.003120 0.003078 -0.002261 -0.000731
2 0.000010 0.003327 0.019509 -0.007947 0.006518
3 -0.000422 -0.003114 0.000570 0.003264 -0.003010
4 0.002769 0.005642 0.005877 0.016587 -0.014441

Correlation matrix
In [ ]: matrix_corr = dataF.corr()
print(matrix_corr)
Variable 1 Variable 2 Variable 3 Variable 4 Variable 5
Variable 1 1.000000 -0.008919 0.039846 0.013316 0.202243
Variable 2 -0.008919 1.000000 0.121684 0.155779 -0.133540
Variable 3 0.039846 0.121684 1.000000 -0.130289 -0.111868
Variable 4 0.013316 0.155779 -0.130289 1.000000 -0.092115
Variable 5 0.202243 -0.133540 -0.111868 -0.092115 1.000000

Correloplot
In [ ]: corr = dataF.corr()

f, ax =plt.subplots(figsize=(9, 6))
mask=np.triu(np.ones_like(corr, dtype=bool))
cmap=sns.diverging_palette(230, 20, n=256, as_cmap=True)
sns.heatmap(dataF.corr(),
mask=mask,
cmap=cmap,
vmax=1,
vmin = -.25,
center=0,
square=True,
linewidths=.5,
annot = True,
fmt='.3f',
annot_kws={'size': 16},
cbar_kws={"shrink": .75})

plt.title('Figure 3.1: Correloplot')


plt.tight_layout()
ax.tick_params(axis = 'x', labelsize = 14)
ax.tick_params(axis = 'y', labelsize = 14)
ax.set_ylim(len(corr))

plt.show()

As we see in the correlation plot (Figure 3.1), all correlation values are near zero, which aligns with what the exercise requires: 5 uncorrelated variables.
b) Run a Principal Components using EITHER the correlation OR covariance matrix.
Answer
Eigenvalues
In [ ]: # Calculating eigenvectors and eigenvalues of the covariance matrix of simulated dataset
dataF_cov = dataF.cov()
eigenvalues, eigenvectors = LA.eig(dataF_cov)
sorted_indices=np.argsort(eigenvalues)[::-1]
eigenvectors=eigenvectors[:,sorted_indices]
eigenvalues=eigenvalues[sorted_indices]
eigenvalues

Out[ ]: array([1.27782260e-04, 1.14523588e-04, 9.47725977e-05, 7.27875167e-05,


6.21821048e-05])

Eigenvectors
In [ ]: eigenvectors

Out[ ]: array([[ 0.1038149 , -0.21881596, -0.68515343, -0.26798617, -0.63252686],


[-0.49630891, 0.41260863, -0.40928584, 0.64272115, -0.053162 ],
[-0.74956086, -0.50816345, -0.1023671 , -0.29393446, 0.28818733],
[-0.01974409, 0.64348683, -0.31144041, -0.59654411, 0.3642456 ],
[ 0.42505132, -0.33101102, -0.50554497, 0.26987143, 0.61754091]])

Principal components
In [ ]: principal_components =dataF.dot(eigenvectors)
principal_components.columns = ["PC_1","PC_2","PC_3","PC_4","PC_5"]
principal_components.head()
Out[ ]: PC_1 PC_2 PC_3 PC_4 PC_5
0 0.005369 0.002689 -0.005512 -0.000298 0.003094
1 -0.002826 -0.000266 0.013929 0.002893 0.010758
2 -0.013346 -0.015815 -0.004186 0.002901 0.006569
3 -0.000269 0.001614 0.002010 -0.004816 -0.000073
4 -0.013384 0.014189 -0.002674 -0.012635 -0.003234

c) Write a paragraph explaining how the variances of each component compare with each other. In this paragraph, you will address the following question: how much variance is explained by
Component 1, Component 2, Component 3?
Answer
In [ ]: # Put data into a DataFrame
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,6))

# Work out explained proportion


df_eigval["Explained variance proportion"] = df_eigval["Eigenvalues"] / np.sum(df_eigval["Eigenvalues"])
df_eigval["Cumulative variance proportion"] = df_eigval["Explained variance proportion"].cumsum()
#Format as percentage
df_eigval.style.format({"Explained variance proportion": "{:.2%}",
"Cumulative variance proportion": "{:.2%}"})

Out[ ]: Eigenvalues Explained variance proportion Cumulative variance proportion


1 0.000128 27.07% 27.07%
2 0.000115 24.26% 51.33%
3 0.000095 20.08% 71.41%
4 0.000073 15.42% 86.83%
5 0.000062 13.17% 100.00%
Components 1, 2, and 3 explain 27.07%, 24.26%, and 20.08% of the total variability of our dataset, respectively. Cumulatively, the first three components explain 71.41% of the total data variability.
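As a cross-check of these proportions, the same decomposition can be reproduced with scikit-learn (a sketch; sklearn is assumed to be available in the environment and is not imported above):
In [ ]: from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca.fit(dataF)
print(pca.explained_variance_ratio_.round(4))  # should match the proportions in the table above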
d. Produce a screeplot (see https://en.wikipedia.org/wiki/Scree_plot) of the variance explained for each component.
Answer
In [ ]: plt.figure(figsize=(8, 5))
plt.plot(range(1,len(eigenvalues)+1), eigenvalues, marker="o",linestyle="--")
plt.title("Figure 3.2: Scree Plot")
plt.xlabel("Principal Components")
plt.ylabel("Eigenvalue")
plt.xticks(range(1,len(eigenvalues)+1))
plt.grid(True)

e. Collect the daily closing yields for 5 government securities, say over 6 months.
Answer
In [ ]: api_key = "ec67a401ae513935b544843d8c308536" # Alex API
series_ids = [
'DGS1',
'DGS2',
'DGS5',
'DGS10',
'DGS30'
]

base_url = "https://api.stlouisfed.org/fred/series/observations"

# Date range for the past 7 months


end_date = datetime.now().date()
start_date = end_date - timedelta(days=7*30) # 7 months ago

# Function to fetch yield data


def get_yield_data(series_id, start_date, end_date):
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": str(start_date),
        "observation_end": str(end_date),
    }

    response = requests.get(base_url, params=params)

    if response.status_code == 200:
        res_data = response.json()
        if 'observations' in res_data:
            obs_data = pd.DataFrame(res_data['observations'])
            obs_data['date'] = pd.to_datetime(obs_data['date'])
            obs_data.set_index('date', inplace=True)
            obs_data['value'] = pd.to_numeric(obs_data['value'], errors='coerce')
            return obs_data
    return None

# Collect yield data for each series


yields_dict = {}
for series_id in series_ids:
    yields_dict[series_id] = get_yield_data(series_id, start_date, end_date)

# Combine data into a single DataFrame


yields_df = pd.DataFrame({series_id: data['value'] for series_id, data in yields_dict.items() if data is not None})
yields_df.columns = ['1 Year', '2 Year', '5 Year', '10 Year', '30 Year']
# Compute daily yield changes using .diff()
daily_yield_changes = yields_df.diff()
daily_yield_changes.dropna(inplace=True)
print(daily_yield_changes)

1 Year 2 Year 5 Year 10 Year 30 Year


date
2024-06-25 0.00 -0.06 -0.02 -0.02 -0.02
2024-06-26 0.03 0.06 0.07 0.09 0.09
2024-06-27 -0.03 -0.01 -0.03 -0.03 -0.02
2024-06-28 -0.01 0.01 0.04 0.07 0.08
2024-07-01 0.01 0.06 0.11 0.12 0.13
... ... ... ... ... ...
2025-01-10 0.09 0.13 0.13 0.09 0.04
2025-01-13 -0.01 0.00 0.02 0.02 0.01
2025-01-14 -0.02 -0.03 -0.02 -0.01 0.01
2025-01-15 -0.03 -0.10 -0.14 -0.12 -0.10
2025-01-16 -0.01 -0.04 -0.06 -0.05 -0.04

[134 rows x 5 columns]

f. Be sure to compute the daily yield changes!


In [ ]: plt.figure(figsize=(12, 6))
for column in daily_yield_changes.columns:
plt.plot(daily_yield_changes.index, daily_yield_changes[column], label=column)

plt.title('Figure 3.3: Daily Yield Changes for U.S. Treasury Securities')


plt.xlabel('Date')
plt.ylabel('Daily Yield Change (%)')
plt.legend(title="Maturities")
plt.grid(True)
plt.show()

g. Re-run the Principal Components using EITHER the correlation or covariance matrix.
Answer of question g
Eigenvalues
In [ ]: yields_df_cov = daily_yield_changes.cov()
eigenvalues, eigenvectors = LA.eig(yields_df_cov)
sorted_indices=np.argsort(eigenvalues)[::-1]
eigenvectors=eigenvectors[:,sorted_indices]
eigenvalues=eigenvalues[sorted_indices]
eigenvalues

Out[ ]: array([1.27432534e-02, 1.86871188e-03, 2.76677987e-04, 1.32902058e-04,


3.25563762e-05])

Eigenvectors
In [ ]: eigenvectors

Out[ ]: array([[-0.366009 , -0.535143 , -0.73188899, 0.20908883, -0.01672587],


[-0.49679164, -0.45651126, 0.40733949, -0.61471393, -0.03157313],
[-0.50391522, 0.02895065, 0.40052634, 0.62880218, 0.43522153],
[-0.45748966, 0.36706634, 0.02217696, 0.15266311, -0.79508973],
[-0.39497866, 0.60798323, -0.37080976, -0.39963788, 0.42087758]])

Principal components
In [ ]: # Note: the eigenvectors come from the covariance of daily_yield_changes, so projecting
# daily_yield_changes (rather than the yield levels in yields_df) would be more consistent.
principal_components = yields_df.dot(eigenvectors)
principal_components.columns = ["PC_1", "PC_2", "PC_3", "PC_4", "PC_5"]
principal_components.dropna(inplace=True)
principal_components.head()
Out[ ]: PC_1 PC_2 PC_3 PC_4 PC_5
date
2024-06-24 -10.032590 -0.532780 -1.633712 -0.245560 0.088697
2024-06-25 -9.975655 -0.525469 -1.659190 -0.216314 0.089371
2024-06-26 -10.128439 -0.479133 -1.660047 -0.225135 0.083761
2024-06-27 -10.075749 -0.482554 -1.647428 -0.240712 0.086957
2024-06-28 -10.160836 -0.406276 -1.648127 -0.245083 0.082232

h. How do the variances of each component compare? In other words, how much variance is explained by Component 1, Component 2, Component 3, etc.?
In [ ]: # Put data into a DataFrame
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,6))

# Work out explained proportion


df_eigval["Explained variance proportion"] = df_eigval["Eigenvalues"] / np.sum(df_eigval["Eigenvalues"])
df_eigval["Cumulative variance proportion"] = df_eigval["Explained variance proportion"].cumsum()
#Format as percentage
df_eigval.style.format({"Explained variance proportion": "{:.2%}",
"Cumulative variance proportion": "{:.2%}"})

Out[ ]: Eigenvalues Explained variance proportion Cumulative variance proportion


1 0.012743 84.65% 84.65%
2 0.001869 12.41% 97.06%
3 0.000277 1.84% 98.90%
4 0.000133 0.88% 99.78%
5 0.000033 0.22% 100.00%
Components 1, 2, 3, 4, and 5 explain 84.65%, 12.41%, 1.84%, 0.88%, and 0.22% of the total variability of our dataset, respectively. Cumulatively, the first two components explain 97.06% of the total data variability.
i. Produce a screeplot of the variance explained for each component.
In [ ]: plt.figure(figsize=(8,5))
plt.plot(range(1,len(eigenvalues)+1), eigenvalues,marker="o",linestyle="--")
plt.title("Figure 3.4: Scree Plot")
plt.xlabel("Principal Components")
plt.ylabel("Eigenvalue")
plt.xticks(range(1,len(eigenvalues)+1))
plt.grid(True)

j. How does the screeplot from the uncorrelated data compare with the screeplot from the government data?
Answer
In the screeplot of uncorrelated data (Figure 3.2), we observe a gradual decline in the dashed line, indicating the need for many principal components to achieve a high proportion of
explained variance. Conversely, in the screeplot of government data (Figure 3.4), the dashed line shows a more pronounced decline. This suggests that in correlated data, the first few
principal components explain a larger proportion of variability compared to the later components.
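Because the eigenvalue scales of the two datasets differ, the contrast is easiest to see when both screeplots are drawn as proportions of variance explained. A sketch that recomputes both sets of eigenvalues (so the variables used above are not overwritten):
In [ ]: eig_sim = np.sort(LA.eig(dataF.cov())[0])[::-1]                  # simulated, uncorrelated changes
eig_gov = np.sort(LA.eig(daily_yield_changes.cov())[0])[::-1]    # government yield changes

plt.figure(figsize=(8, 5))
plt.plot(range(1, 6), eig_sim / eig_sim.sum(), marker="o", linestyle="--", label="Uncorrelated (simulated)")
plt.plot(range(1, 6), eig_gov / eig_gov.sum(), marker="s", linestyle="--", label="Government yield changes")
plt.title("Proportion of variance explained by each component")
plt.xlabel("Principal Components")
plt.ylabel("Proportion of variance explained")
plt.xticks(range(1, 6))
plt.legend()
plt.grid(True)
plt.show()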

4. Empirical Analysis of ETFs
Pick a sector ETF (in the US, for example, XLRE).
a. Find the 30 largest holdings.
Answer
The 30 largest holdings of XLI Industrials are
In [ ]: keys = ['XLI']
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"}
def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        for key in keys:
            r = req.get(url.format(key))
            # print(f"Extracting: {r.url}")
            goal = re.findall(r'etf\\\/(.*?)\\', r.text)
            print(goal[:30])
main("https://www.zacks.com/funds/etf/{}/holding")

['GE', 'CAT', 'RTX', 'HON', 'UNP', 'BA', 'ETN', 'UBER', 'ADP', 'DE', 'LMT', 'UPS', 'GEV', 'TT', 'PH', 'WM', 'TDG', 'EMR', 'MMM', 'ITW', 'GD', 'NOC', 'FDX', 'CT
AS', 'CSX', 'CARR', 'PCAR', 'JCI', 'NSC', 'CPRT']

b. Get at least 6 months of data (~ 120 data points).


Answer
130 recent data points of XLI are
In [ ]: xlf = yfin.download(['GE', 'CAT', 'RTX', 'UBER', 'ETN', 'HON', 'UNP', 'ADP', 'DE', 'BA', 'LMT', 'GEV', 'UPS', 'TT', 'PH', 'WM', 'EMR', 'CTAS', 'ITW', 'TDG', 'M
xlf.index = pd.to_datetime(xlf.index).strftime("%Y-%m-%d")
xlf=xlf.dropna().iloc[-130:]
xlf

[*********************100%***********************] 30 of 30 completed
Out[ ]: Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX
Date
2024-07-15  240.175369 179.110001 66.957771 343.134705 34.364563 178.026718 368.950012 115.904816 325.018585 302.040466 ... 101.046936 541.951538 100.543625 1176
2024-07-16  243.323074 186.050003 69.462715 357.831818 35.089600 180.666565 374.747284 117.787514 330.415192 310.457428 ... 105.241524 565.768982 102.581795 1212
2024-07-17  246.718231 184.839996 67.176453 355.072998 35.208782 179.254501 382.311493 117.173164 309.584106 307.397614 ... 105.550797 555.083984 103.877922 1175
2024-07-18  244.887009 180.229996 66.649628 353.316528 34.781704 189.014389 377.854370 115.478729 307.169037 303.624847 ... 106.294998 544.866577 102.740097 1180
2024-07-19  244.461380 179.669998 65.844460 344.980530 34.453949 188.912292 375.293243 114.626556 309.971710 303.317871 ... 104.197701 541.862000 101.721008 1172
...         ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13  290.200012 170.570007 67.110001 362.500000 31.889999 190.600006 429.910004 118.470001 340.880005 274.589996 ... 108.160004 637.989990 117.739998 1278
2025-01-14  291.690002 167.020004 68.400002 371.570007 32.139999 192.279999 432.309998 119.790001 340.140015 277.619995 ... 110.199997 647.010010 119.470001 1300
2025-01-15  293.369995 166.199997 68.800003 374.890015 32.459999 195.699997 428.880005 120.720001 342.579987 276.589996 ... 109.629997 656.320007 119.089996 1293
2025-01-16  296.230011 168.929993 69.739998 380.549988 32.970001 198.050003 439.109985 123.290001 345.190002 277.369995 ... 109.480003 658.659973 120.459999 1322
2025-01-17  296.179993 171.089996 69.660004 386.019989 32.730000 198.309998 455.440002 124.529999 346.279999 275.100006 ... 110.330002 669.460022 121.110001 1340
130 rows × 30 columns
c. Compute the daily returns.
Answer
The daily returns are
In [ ]: xlf_returns=xlf.dropna().pct_change().dropna()
xlf_returns

Out[ ]: Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT
Date
2024-07-16  0.013106 0.038747 0.037411 0.042832 0.021098 0.014828 0.015713 0.016243 0.016604 0.027867 ... 0.041511 0.043948 0.020271 0.031048 0.015601
2024-07-17  0.013953 -0.006504 -0.032914 -0.007710 0.003397 -0.007816 0.020185 -0.005216 -0.063045 -0.009856 ... 0.002939 -0.018886 0.012635 -0.030480 -0.031416
2024-07-18  -0.007422 -0.024940 -0.007842 -0.004947 -0.012130 0.054447 -0.011658 -0.014461 -0.007801 -0.012273 ... 0.007051 -0.018407 -0.010953 0.003584 -0.015114
2024-07-19  -0.001738 -0.003107 -0.012081 -0.023594 -0.009423 -0.000540 -0.006778 -0.007379 0.009124 -0.001011 ... -0.019731 -0.005514 -0.009919 -0.006301 -0.000696
2024-07-22  0.004130 -0.004286 0.026570 0.004766 -0.005189 0.017784 0.003148 0.018586 0.027029 0.003656 ... 0.011595 0.013715 0.009338 0.014361 0.021082
...         ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13  0.011679 -0.008314 0.009021 0.032764 0.003461 0.006867 0.052231 0.004494 -0.001669 0.014258 ... 0.014825 0.012618 0.022848 0.012954 -0.000371
2025-01-14  0.005134 -0.020813 0.019222 0.025021 0.007839 0.008814 0.005583 0.011142 -0.002171 0.011035 ... 0.018861 0.014138 0.014693 0.017419 0.015200
2025-01-15  0.005760 -0.004910 0.005848 0.008935 0.009956 0.017787 -0.007934 0.007764 0.007173 -0.003710 ... -0.005172 0.014389 -0.003181 -0.005535 0.007421
2025-01-16  0.009749 0.016426 0.013663 0.015098 0.015712 0.012008 0.023853 0.021289 0.007619 0.002820 ... -0.001368 0.003565 0.011504 0.022025 0.008585
2025-01-17  -0.000169 0.012786 -0.001147 0.014374 -0.007279 0.001313 0.037189 0.010058 0.003158 -0.008184 ... 0.007764 0.016397 0.005396 0.013616 0.000849
129 rows × 30 columns
d. Compute the covariance matrix.
Answer
The covariance matrix is
In [ ]: xlf_cov_matrix=xlf_returns.cov()
xlf_cov_matrix
Out[ ]: Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT U
Ticker
ADP 0.000095 0.000023 0.000061 0.000074 0.000085 0.000066 0.000048 0.000079 0.000073 0.000050 ... 0.000078 0.000089 0.000061 0.000069 0.000040 0.000
BA 0.000023 0.000459 0.000100 0.000094 0.000006 -0.000025 0.000086 0.000104 0.000119 0.000056 ... -0.000009 0.000104 0.000065 0.000122 0.000053 0.000
CARR 0.000061 0.000100 0.000312 0.000167 0.000109 0.000096 0.000098 0.000176 0.000216 0.000095 ... 0.000132 0.000184 0.000062 0.000154 0.000172 0.000
CAT 0.000074 0.000094 0.000167 0.000325 0.000150 0.000077 0.000179 0.000226 0.000211 0.000124 ... 0.000185 0.000245 0.000078 0.000126 0.000098 0.000
CSX 0.000085 0.000006 0.000109 0.000150 0.000253 0.000088 0.000093 0.000145 0.000103 0.000064 ... 0.000163 0.000167 0.000068 0.000092 0.000047 0.000
CTAS 0.000066 -0.000025 0.000096 0.000077 0.000088 0.000269 0.000020 0.000083 0.000109 0.000035 ... 0.000106 0.000100 0.000048 0.000082 0.000087 0.000
DE 0.000048 0.000086 0.000098 0.000179 0.000093 0.000020 0.000287 0.000106 0.000079 0.000100 ... 0.000123 0.000113 0.000059 0.000020 0.000016 0.000
EMR 0.000079 0.000104 0.000176 0.000226 0.000145 0.000083 0.000106 0.000327 0.000215 0.000121 ... 0.000162 0.000239 0.000062 0.000152 0.000126 0.000
ETN 0.000073 0.000119 0.000216 0.000211 0.000103 0.000109 0.000079 0.000215 0.000381 0.000089 ... 0.000154 0.000243 0.000044 0.000186 0.000226 0.000
FDX 0.000050 0.000056 0.000095 0.000124 0.000064 0.000035 0.000100 0.000121 0.000089 0.000344 ... 0.000104 0.000097 0.000044 0.000035 0.000053 0.000
GD 0.000071 0.000066 0.000091 0.000099 0.000088 0.000083 0.000047 0.000091 0.000109 0.000030 ... 0.000076 0.000114 0.000104 0.000134 0.000062 0.000
GE 0.000068 0.000106 0.000183 0.000149 0.000088 0.000078 0.000053 0.000149 0.000246 0.000051 ... 0.000084 0.000186 0.000098 0.000225 0.000193 0.000
GEV 0.000076 0.000127 0.000183 0.000227 0.000103 0.000065 0.000068 0.000247 0.000412 0.000099 ... 0.000167 0.000299 0.000064 0.000241 0.000271 0.000
HON 0.000053 0.000051 0.000104 0.000068 0.000074 0.000056 0.000025 0.000084 0.000095 0.000052 ... 0.000067 0.000075 0.000020 0.000090 0.000072 0.000
ITW 0.000059 0.000044 0.000095 0.000121 0.000089 0.000046 0.000088 0.000107 0.000076 0.000081 ... 0.000090 0.000113 0.000054 0.000060 0.000037 0.000
JCI 0.000086 0.000056 0.000179 0.000190 0.000125 0.000110 0.000083 0.000191 0.000237 0.000111 ... 0.000163 0.000211 0.000063 0.000158 0.000151 0.000
LMT 0.000045 0.000023 0.000026 0.000010 0.000025 0.000030 0.000017 0.000008 0.000003 -0.000017 ... 0.000007 0.000031 0.000090 0.000066 0.000035 -0.000
MMM 0.000070 0.000057 0.000151 0.000161 0.000125 0.000073 0.000082 0.000153 0.000119 0.000067 ... 0.000120 0.000155 0.000059 0.000086 0.000101 -0.000
NOC 0.000048 0.000017 0.000009 0.000042 0.000054 0.000031 0.000046 0.000034 0.000007 0.000012 ... 0.000028 0.000051 0.000112 0.000057 0.000013 -0.000
NSC 0.000096 0.000042 0.000130 0.000174 0.000223 0.000112 0.000094 0.000159 0.000124 0.000100 ... 0.000146 0.000161 0.000076 0.000086 0.000061 0.000
PCAR 0.000078 -0.000009 0.000132 0.000185 0.000163 0.000106 0.000123 0.000162 0.000154 0.000104 ... 0.000350 0.000170 0.000052 0.000090 0.000082 0.000
PH 0.000089 0.000104 0.000184 0.000245 0.000167 0.000100 0.000113 0.000239 0.000243 0.000097 ... 0.000170 0.000334 0.000093 0.000159 0.000133 0.000
RTX 0.000061 0.000065 0.000062 0.000078 0.000068 0.000048 0.000059 0.000062 0.000044 0.000044 ... 0.000052 0.000093 0.000168 0.000088 0.000038 -0.000
TDG 0.000069 0.000122 0.000154 0.000126 0.000092 0.000082 0.000020 0.000152 0.000186 0.000035 ... 0.000090 0.000159 0.000088 0.000299 0.000151 0.000
TT 0.000040 0.000053 0.000172 0.000098 0.000047 0.000087 0.000016 0.000126 0.000226 0.000053 ... 0.000082 0.000133 0.000038 0.000151 0.000221 0.000
UBER 0.000019 0.000130 0.000160 0.000172 0.000112 0.000050 0.000072 0.000108 0.000220 0.000077 ... 0.000086 0.000162 -0.000014 0.000160 0.000132 0.000
UNP 0.000070 0.000028 0.000130 0.000125 0.000158 0.000069 0.000077 0.000112 0.000082 0.000093 ... 0.000121 0.000126 0.000056 0.000089 0.000050 0.000
UPS 0.000030 0.000026 0.000046 0.000096 0.000081 0.000049 0.000102 0.000087 0.000052 0.000143 ... 0.000168 0.000070 0.000010 0.000029 0.000018 0.000
URI 0.000098 0.000106 0.000231 0.000303 0.000186 0.000142 0.000169 0.000259 0.000269 0.000135 ... 0.000191 0.000305 0.000131 0.000186 0.000142 0.000
WM 0.000042 -0.000019 0.000062 0.000014 0.000045 0.000070 -0.000004 0.000032 0.000066 0.000038 ... 0.000074 0.000029 -0.000015 0.000050 0.000069 0.000
30 rows × 30 columns
e. Compute the PCA.
Answer
The principal component analysis(PCA) is
In [ ]: xlf_returns_mean=xlf_returns.mean()

xlf_returns_std=xlf_returns.std()

xlf_standardized_data=(xlf_returns-xlf_returns_mean)/xlf_returns_std

xlf_standardized_cov=xlf_standardized_data.cov()

eigenvalues, eigenvectors = la.eig(xlf_standardized_cov)


idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

xlf_principal_components= xlf_standardized_data.dot(eigenvectors)
xlf_principal_components.columns=["PC_1","PC_2","PC_3","PC_4","PC_5","PC_6","PC_7","PC_8","PC_9","PC_10","PC_11","PC_12","PC_13","PC_14","PC_15","PC_16","PC_17
xlf_principal_components
Out[ ]: PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_21 PC_22 PC_23 PC_24 PC_25
Date
2024-07-16  7.924030 0.137422 0.886763 -0.059148 1.805747 0.023687 0.014664 0.273473 -0.186904 -0.061284 ... 0.500468 -0.562224 0.032079 -0.623973 0.031814
2024-07-17  -3.543482 4.703898 3.583108 -1.650724 1.032094 0.083364 1.125461 -1.607877 0.442119 1.205232 ... 0.174183 0.885378 -0.356334 -0.211199 0.133578
2024-07-18  -2.319138 0.407285 -0.737715 -1.692023 -0.957578 -0.698202 -2.062343 -0.898443 -1.145080 -1.306339 ... -0.475764 -0.436102 0.476509 0.148484 0.000695
2024-07-19  -2.762167 -1.247847 -0.884262 -0.052278 -0.624147 -0.217254 0.109672 1.256096 0.010304 -0.435093 ... -0.365277 1.091306 0.653740 0.187338 0.090358
2024-07-22  3.317883 -0.784245 -1.181744 -0.277742 0.015225 -0.492282 -0.304679 -0.970278 -0.283205 -0.771144 ... 0.144122 -0.096397 -0.115491 0.320641 -0.279635
...         ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13  3.924230 3.535143 1.238639 0.974713 0.713740 -1.254833 0.087407 -0.531806 1.612548 -0.358643 ... -0.471550 -0.189333 -0.098235 -0.333291 0.211624
2025-01-14  3.769834 0.396798 -0.001358 0.626505 -1.095203 -0.868700 -0.728517 -0.375547 0.356758 -0.806851 ... 0.534097 -0.694173 -0.401121 -0.351800 -0.730760
2025-01-15  1.327423 -0.907914 0.023650 -0.546966 0.050550 0.662722 -0.957914 0.446743 -0.229821 -0.579190 ... 0.876482 0.589169 -0.393185 -0.431120 -0.698458
2025-01-16  3.884179 0.916024 0.753182 -0.524593 1.023519 0.528138 0.448983 0.508906 -0.124216 0.053304 ... -0.302446 0.109677 0.272910 0.005086 -0.397692
2025-01-17  1.386382 0.169151 -0.446350 1.317249 -0.021273 -0.813384 0.248580 -0.869154 0.785681 0.866992 ... 0.084466 -0.051597 -0.383539 0.269699 0.253224
129 rows × 30 columns

f. Compute the SVD.


Answer
The singular value decomposition (SVD) is
In [ ]: standardized_returns_dvd_sqrt_n=(xlf_standardized_data/math.sqrt(len(xlf_standardized_data)-1))
U_st_return, s_st_return, VT_st_return = np.linalg.svd(standardized_returns_dvd_sqrt_n)

Matrix of left singular vectors


In [ ]: U_st_return

Out[ ]: array([[-0.20930198, 0.0074209 , -0.04893938, ..., -0.01761956,


-0.09612688, -0.01893472],
[ 0.09359602, 0.25401361, -0.19774751, ..., -0.04017135,
0.05278976, 0.07862361],
[ 0.06125673, 0.02199364, 0.04071362, ..., 0.05197161,
0.06485387, -0.00574327],
...,
[-0.035062 , -0.04902796, -0.00130521, ..., 0.83354267,
-0.01081503, 0.01220279],
[-0.10259506, 0.04946591, -0.04156722, ..., 0.00562025,
0.86803635, -0.00854122],
[-0.03661932, 0.00913428, 0.02463354, ..., 0.05158599,
-0.01561285, 0.8748643 ]])

Singular values
In [ ]: s_st_return

Out[ ]: array([3.34632245, 1.63680109, 1.60156236, 1.36331236, 1.12608502,


1.06081308, 1.02599675, 0.90912456, 0.86740454, 0.85508141,
0.82863216, 0.76399463, 0.68735184, 0.67355388, 0.63943669,
0.61713275, 0.5982534 , 0.58915346, 0.54750511, 0.52876127,
0.52132406, 0.47621634, 0.47233618, 0.44350506, 0.42235973,
0.38939778, 0.38502787, 0.3386785 , 0.31066052, 0.25473011])

Eigenvalues With The SVD Method


In [ ]: s_st_return**2

Out[ ]: array([11.19787397, 2.67911782, 2.565002 , 1.85862058, 1.26806748,


1.1253244 , 1.05266932, 0.82650747, 0.75239063, 0.73116421,
0.68663126, 0.58368779, 0.47245256, 0.45367483, 0.40887927,
0.38085283, 0.35790713, 0.3471018 , 0.29976185, 0.27958848,
0.27177878, 0.226782 , 0.22310147, 0.19669673, 0.17838774,
0.15163063, 0.14824646, 0.11470312, 0.09650996, 0.06488743])
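Since the standardized return matrix divided by the square root of n-1 is exactly the matrix whose covariance was decomposed in the PCA step, the squared singular values should reproduce the PCA eigenvalues. A quick check, reusing eigenvalues and s_st_return from the cells above:
In [ ]: # Both arrays are already sorted in descending order, so they can be compared element-wise
print(np.allclose(s_st_return**2, np.real(eigenvalues)))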

Matrix of the right singular vectors


In [ ]: VT_st_return.T
Out[ ]: array([[-2.04561657e-01, 1.98823851e-01, -5.55835898e-02,
1.95703312e-01, -4.11442942e-02, 3.09622025e-02,
-1.13914459e-02, 7.67302825e-02, -6.84252241e-02,
-1.49265844e-01, -1.89567279e-01, 4.04895427e-01,
-2.48803611e-01, -1.27740381e-01, 5.60824078e-02,
1.61507979e-02, 3.11498403e-02, 5.65149820e-01,
-3.30417974e-02, -1.78431265e-01, -1.79512862e-01,
-2.65947621e-01, -5.14301642e-02, 1.66646486e-01,
3.58342229e-02, 1.64777695e-01, 2.12864359e-02,
-4.39880829e-02, -2.00605095e-01, 4.86899024e-02],
[-8.75596753e-02, -1.62552367e-02, 1.33533526e-01,
-3.02663392e-01, -3.66292488e-01, -2.22945878e-01,
-4.22875960e-01, -5.18377077e-02, -3.43113454e-01,
-1.59571956e-01, 3.27669925e-01, 2.00612922e-01,
1.88456955e-01, -1.53049320e-01, 2.48060692e-01,
1.75296532e-02, 1.50254105e-01, 5.16648371e-02,
-7.62257950e-03, 8.66222897e-03, 9.47027828e-02,
1.27065675e-01, -1.26985433e-01, 2.71658549e-02,
-2.13191620e-02, -1.28303416e-01, -8.58528564e-02,
3.54596755e-02, -9.59063920e-02, 5.96915933e-02],
[-2.19220724e-01, -1.31178853e-01, 7.43344686e-02,
2.67186836e-02, -7.77282159e-02, -1.35446673e-01,
-1.00326483e-01, 2.47455411e-01, 1.76274533e-01,
3.06843367e-01, 1.76589638e-01, -3.51063871e-01,
-9.57551770e-02, -2.63589136e-01, 6.32013190e-02,
2.91386908e-01, 8.88583531e-04, 5.95289127e-02,
-1.43869588e-01, -2.21495896e-03, -9.82678472e-02,
-6.17294744e-02, -6.88371457e-02, -1.16340932e-01,
1.28363565e-01, 2.22119515e-01, -1.47011841e-01,
-4.76848980e-01, -3.48505784e-02, 1.16569752e-01],
[-2.36482430e-01, -5.79039101e-02, -1.17079216e-01,
-2.37636866e-01, 3.07116109e-03, -1.44043290e-02,
1.29793623e-01, 1.53783590e-01, 9.65123199e-02,
-9.21236370e-02, 1.76502278e-02, 2.88928546e-02,
1.67104147e-02, 1.91141091e-01, -6.75668859e-02,
-2.03189352e-01, 1.09802083e-01, -2.56474450e-01,
-1.47451140e-02, 2.46539573e-02, -9.82361533e-02,
-1.04639455e-01, 1.00802558e-01, 6.10823840e-01,
1.06869685e-01, -2.69474630e-01, -7.91429705e-02,
-3.28869050e-01, -2.05899971e-01, 6.95707534e-02],
[-2.06471728e-01, 8.98749208e-02, -2.25960499e-01,
1.19540410e-01, 8.61513013e-02, -2.75552939e-01,
1.94902731e-01, -1.98725092e-01, 8.21253027e-02,
-1.65166193e-01, -1.08920352e-01, -5.91850276e-02,
2.34985525e-01, -1.13265797e-01, 1.15016596e-01,
4.94218844e-02, -1.94032047e-01, 7.41560734e-02,
1.90880433e-01, 2.57685369e-03, 2.31067119e-01,
2.03624124e-01, -8.27938976e-02, -1.82310381e-01,
-1.97902441e-01, -1.77810477e-01, 1.95232447e-01,
-3.26350305e-01, -3.64378868e-01, -1.65207948e-01],
[-1.42384977e-01, 2.91392582e-02, -2.03576516e-03,
3.14924169e-01, 9.27019852e-02, 1.19527056e-01,
3.12481738e-01, 1.44507587e-01, -3.98359288e-01,
3.92471001e-01, 3.98005685e-01, 2.15862314e-01,
1.77596493e-01, 1.63191377e-01, 1.51629732e-01,
-1.18180403e-01, -9.74547679e-02, 9.92343287e-02,
-6.06864801e-02, 1.29964814e-01, -2.54336333e-02,
1.04086912e-01, 2.13233570e-01, -5.64591867e-02,
-9.62924340e-03, -8.30309351e-02, -1.19439927e-01,
-5.61997351e-02, -4.73067280e-03, -3.06645544e-02],
[-1.39310180e-01, 5.78051879e-02, -2.37619548e-01,
-2.93482523e-01, -1.98464529e-01, 1.10003009e-01,
-7.87432864e-04, 1.73208382e-01, 4.38547652e-01,
-3.55654308e-02, 3.69531202e-01, 2.27597619e-01,
8.98075476e-02, -4.71468788e-02, -2.21171049e-01,
-1.42186259e-01, -3.48309398e-01, 7.98202468e-02,
5.28853171e-02, -6.32855403e-02, -1.79673474e-01,
-2.21019570e-02, 1.00818655e-01, -2.23210552e-01,
-1.95059142e-01, 1.27706549e-02, 7.63411448e-02,
9.05747243e-02, 1.05013195e-01, 3.57654688e-02],
[-2.28678290e-01, -9.55947369e-02, -4.13407143e-02,
-1.21923614e-01, 4.74387979e-02, 9.94365524e-03,
-2.40600513e-02, 2.15640296e-01, -1.74596705e-01,
-1.94622359e-01, -1.65686339e-01, -1.64526645e-01,
-1.82725760e-02, 2.07692421e-01, 4.59511234e-01,
-4.36287710e-02, -2.62839700e-01, -1.24797945e-01,
3.03612490e-01, -1.23968657e-01, -2.42453296e-01,
-2.16491774e-01, 2.74404231e-02, -2.88219371e-01,
3.04894510e-01, -4.75793005e-02, -1.56389386e-02,
1.06891019e-01, 5.80275717e-02, -3.48138577e-02],
[-2.25824364e-01, -2.43276323e-01, 2.00042897e-01,
-5.02288750e-02, 5.84127444e-02, 9.49781275e-02,
4.93019371e-02, 4.78210633e-02, -4.02815731e-02,
-6.64961218e-02, 9.50484366e-03, 1.45041378e-01,
5.16899645e-02, -4.74051485e-02, -1.66981378e-01,
2.23871948e-01, -2.92316290e-02, 5.18354101e-02,
2.78461881e-01, 1.82147244e-01, -6.69450572e-02,
1.71143528e-02, -2.62634108e-01, 1.86841331e-01,
-8.89276702e-02, 8.58434536e-02, -9.22271546e-02,
-5.66547199e-02, 2.86782828e-01, -6.22799394e-01],
[-1.26969087e-01, -8.45754237e-02, -2.21994626e-01,
-7.33375274e-02, -2.49452427e-01, 2.90954779e-01,
-2.81217879e-01, -6.79256050e-02, -3.78913285e-02,
4.85740089e-01, -4.92283983e-01, 9.44215378e-02,
1.47063602e-01, 2.70014595e-02, 6.76940539e-02,
-1.77835000e-01, 1.34616097e-01, -5.48952633e-02,
1.31945234e-01, 6.37828447e-02, -1.20019250e-01,
1.14673736e-01, -1.30201705e-02, -6.94517217e-02,
-2.23183886e-01, 6.64095010e-02, 3.66102472e-04,
-1.16514820e-01, -3.63321846e-02, 2.62837496e-03],
[-1.99076713e-01, 2.46626307e-01, 1.34481924e-01,
7.84470318e-02, -1.81681359e-01, -3.77343810e-02,
1.33954024e-01, -6.32259648e-02, -2.13025773e-01,
-5.26149716e-02, 5.94463994e-02, -1.92305546e-03,
-2.83401409e-01, -1.07375499e-01, -3.15138549e-01,
-7.26859275e-02, 6.64558256e-02, -3.46076373e-01,
2.42921779e-01, 4.23609675e-01, -1.90962208e-01,
-9.39287059e-02, -1.38687182e-01, -1.99078367e-01,
-2.09614633e-02, 1.12623921e-01, 1.00281967e-01,
4.55634483e-02, -2.20208388e-01, 1.73952476e-01],
[-1.88988595e-01, -9.78581444e-03, 3.39915610e-01,
-6.39030441e-02, 3.83528771e-02, 1.52587557e-02,
1.35043256e-02, -1.48484245e-01, 2.09773492e-01,
1.42582693e-01, -1.78567540e-02, -6.80169563e-02,
-8.46550924e-02, -9.84236016e-02, 3.84921845e-02,
-4.76108269e-01, -6.01006316e-02, 2.90070340e-01,
2.91297449e-01, 1.39686221e-01, 4.41018991e-01,
-8.65904310e-02, -1.42087023e-02, 1.13984367e-02,
1.83432368e-01, -8.50664558e-02, -1.11475338e-01,
-1.22588454e-02, 1.69850399e-01, 1.71978376e-01],
[-1.71947105e-01, -1.90453198e-01, 2.11900954e-01,
-1.18373171e-01, 2.82492470e-01, 2.15197454e-01,
-1.06383605e-01, -2.35931812e-01, 3.26388468e-02,
-2.25839050e-01, -1.43098139e-01, 2.66711999e-01,
9.32397601e-02, -1.43139500e-01, -1.02925290e-01,
9.58393096e-02, -1.01494275e-01, -5.84955567e-02,
-1.97560690e-01, 1.49531119e-01, -1.06840680e-01,
2.80263242e-01, 3.91081117e-01, -1.73889019e-01,
2.59542948e-01, 1.06566458e-01, -1.37641990e-01,
-1.04624585e-01, -1.44886175e-01, 1.07947210e-01],
[-1.53083550e-01, -5.91480454e-02, 2.67892625e-02,
3.41798851e-01, -2.82166795e-01, -1.82789851e-01,
-2.60406976e-01, 2.38619386e-01, 4.81488741e-02,
-1.92080116e-01, -7.60675279e-02, -1.14310072e-01,
2.04852437e-01, 4.69814662e-01, -3.36766748e-01,
5.91316366e-02, 7.41217806e-02, 1.43896736e-01,
9.21074234e-02, 7.56327784e-02, 1.79069605e-01,
6.61167749e-02, 2.75134538e-01, -3.66322005e-02,
8.15919260e-02, 1.33963865e-01, -1.29388103e-02,
3.35060918e-02, -1.10076451e-02, 1.33872909e-02],
[-2.14899992e-01, 9.59846529e-02, -1.89326552e-01,
-1.03093220e-01, -1.22598498e-01, -2.31204750e-02,
1.49973294e-02, 1.58171315e-01, 5.70577855e-04,
7.75680110e-02, -8.06960615e-02, 5.93772075e-02,
-6.29481275e-01, 1.96003027e-02, 1.08869079e-01,
2.18228329e-01, -5.72110147e-02, -4.45417563e-02,
-5.61937150e-02, 4.58434630e-02, 2.88618918e-01,
4.34485704e-01, 1.81457499e-01, 4.16830260e-02,
2.68495629e-02, -1.59884689e-01, 6.24146292e-04,
1.41927066e-01, 1.08167755e-01, -9.88696472e-02],
[-2.52539697e-01, -1.20180303e-01, 4.04953794e-02,
3.75220846e-02, 1.11412102e-01, 1.27979092e-01,
-3.04375322e-02, 8.15217111e-02, -4.08375030e-02,
-3.84248627e-02, -8.95149263e-02, 1.63217436e-02,
2.79134445e-03, 1.58180058e-01, -2.31349066e-01,
4.53742373e-02, 1.17370360e-01, 1.17099833e-01,
-3.51508959e-01, -1.32644395e-02, -2.67992486e-02,
-2.70442033e-02, -4.33076122e-01, -2.92528925e-01,
4.09622709e-03, -5.58483663e-01, 8.52581741e-02,
-1.37985865e-02, 1.29141766e-01, 1.68752958e-01],
[-7.41435166e-02, 4.19799116e-01, 2.47881319e-01,
1.00055992e-01, -4.11912379e-02, 9.72157218e-02,
-9.67990394e-02, -2.19995575e-02, 3.60837736e-01,
-2.67984532e-03, 4.75541414e-02, 4.84810656e-02,
1.23377815e-01, 2.06365299e-01, 3.83815279e-01,
4.95039625e-03, 7.22957868e-02, -5.51080892e-02,
-1.98934424e-01, 2.64510159e-01, -1.50739016e-01,
1.53755809e-01, -1.28158206e-01, 9.15382727e-02,
1.52251119e-01, 7.18597013e-02, 3.78841262e-01,
-3.96300997e-02, 9.57860900e-02, -1.19422561e-01],
[-1.44014401e-01, 4.53247089e-02, -8.54670595e-02,
-2.25795894e-02, 5.25781307e-01, -1.47671044e-01,
-4.52492559e-01, -2.77384163e-02, 6.48226633e-02,
1.38899559e-01, 2.28500988e-01, 4.45788341e-02,
-2.15804413e-01, 1.60998289e-01, -3.00874994e-02,
-1.88083103e-01, 2.24669194e-01, -9.39300541e-02,
-8.11214302e-03, -5.24787001e-02, 5.34114916e-02,
-1.79716108e-01, 5.81899584e-02, -1.54881731e-01,
-1.30280403e-01, 7.02563208e-02, 1.43764132e-02,
4.71480711e-02, -2.12021747e-01, -2.91917932e-01],
[-1.04377302e-01, 4.88605670e-01, 7.35385020e-02,
-6.95850729e-02, 5.17880173e-02, 1.80559677e-01,
-1.73245015e-02, -1.38533077e-01, 4.90661297e-02,
9.92291113e-04, -1.44277837e-02, -1.23916460e-02,
8.52186534e-02, 2.73273008e-01, -2.65472092e-02,
2.37178871e-01, -1.47784639e-01, -1.11324486e-01,
4.99609660e-02, -2.03637261e-01, 1.25340358e-01,
9.78733850e-03, -2.10568423e-01, 2.49502847e-03,
-1.22056071e-01, 8.04173808e-02, -6.08779037e-01,
-1.57635096e-02, -5.44018111e-02, 1.00706845e-01],
[-1.97547406e-01, 7.66694401e-02, -2.48797277e-01,
8.50090798e-02, 2.78820605e-01, -3.20073310e-01,
-1.24487941e-01, -1.71622083e-01, -1.08740575e-01,
8.95255739e-02, 1.96592299e-02, 1.57653274e-01,
1.06050841e-01, -2.86470060e-02, -9.66582175e-02,
9.87794614e-02, -4.89274276e-02, -1.06394686e-01,
1.95342805e-01, -1.14505050e-01, -9.24022821e-02,
1.40164094e-01, -8.63018745e-02, 1.89813601e-01,
1.11102861e-01, 1.36927034e-01, 1.74305806e-01,
-3.93511261e-02, 4.92054802e-01, 3.77883342e-01],
[-1.87807573e-01, -6.30089153e-02, -2.19908875e-01,
9.42306866e-02, 3.00184652e-02, 2.84970455e-01,
1.72337690e-01, -1.12378295e-01, 1.34430974e-01,
-2.61526732e-01, 2.23109605e-01, -2.43286778e-01,
1.25931183e-02, -7.42338173e-02, 1.53056298e-01,
-9.93023917e-03, 6.06212538e-01, 1.33570539e-01,
2.14194559e-01, -6.06081430e-02, -1.38494873e-01,
2.07939176e-01, 5.25760945e-02, -5.66008515e-02,
-7.68021967e-02, 4.47309493e-02, -1.29499547e-01,
1.09623798e-01, 7.28184989e-02, 9.58528698e-02],
[-2.48162724e-01, -4.81933495e-02, 1.54687355e-02,
-1.46342267e-01, 8.11939337e-02, -1.99169501e-02,
1.43751950e-01, 1.15548232e-01, -7.66315458e-02,
-1.72406066e-01, -1.31704424e-01, 9.88241609e-03,
9.40454962e-02, -4.50238531e-03, 2.13941758e-01,
8.00079853e-02, 5.80141981e-02, -8.89810639e-02,
-2.75597013e-01, 2.43131644e-01, 2.92855654e-01,
-3.42618628e-01, 1.56607537e-01, -8.46591721e-03,
-4.98031883e-01, 2.42695780e-01, 6.20428796e-02,
3.00863997e-02, 2.10248927e-01, 1.71345750e-01],
[-1.51700649e-01, 4.09977378e-01, 9.26018987e-02,
-1.20875603e-01, -6.94984520e-02, 1.09243036e-01,
1.26381205e-02, -3.21615901e-02, -1.33923856e-01,
1.18809677e-01, -5.17951398e-02, -1.13881591e-01,
1.91590783e-01, -2.69077039e-01, -1.96730677e-01,
1.73915680e-01, 1.97310212e-01, -4.28632156e-02,
3.74327449e-02, -2.57801318e-01, 7.66605349e-02,
-2.56724090e-01, 4.01848841e-01, -6.65235806e-02,
1.53877023e-01, -2.41432706e-01, 1.97714295e-01,
-2.97583774e-02, 1.53415416e-01, -2.24042663e-01],
[-1.99883241e-01, 7.36309967e-03, 2.81106297e-01,
4.88706935e-02, -1.15106811e-01, -6.23985703e-02,
-1.50694099e-02, -2.37617785e-01, -1.90420829e-01,
-1.09737823e-01, 4.80395204e-03, -3.15340722e-01,
-1.79397499e-01, 7.16075222e-02, -5.97017199e-02,
-3.44650969e-01, -1.94036906e-01, 2.91393848e-02,
-2.24606200e-01, -3.39942144e-01, -2.70092565e-01,
2.13763496e-01, 9.88731207e-02, 7.85880195e-02,
-3.20811708e-01, 7.89066180e-02, 5.88012912e-02,
-1.13883750e-01, 1.45039069e-01, -1.07755461e-01],
[-1.89726874e-01, -2.00645896e-01, 3.06362260e-01,
1.03664238e-01, 9.08834969e-02, 1.50654024e-01,
-9.34164118e-02, 2.21001982e-03, 1.06786888e-01,
1.95216332e-01, 7.53816913e-02, -9.96635871e-02,
4.66076155e-02, -2.86990250e-02, 4.78487007e-02,
3.24094821e-01, -1.56560084e-01, -1.61245557e-02,
2.10641626e-01, -1.45865227e-01, -3.18253309e-02,
-2.28120909e-03, 4.75283305e-02, 2.67317414e-01,
-2.25837964e-01, -1.38059827e-01, 2.48087043e-01,
3.85639028e-01, -3.20463040e-01, 2.36271481e-01],
[-1.10655848e-01, -2.35324818e-01, 8.70816049e-02,
-1.07464259e-01, -2.58547361e-01, -3.11510401e-01,
3.14344926e-01, -4.66469923e-01, 1.89002448e-01,
2.15889006e-01, 2.11074606e-02, 2.37100185e-01,
-8.66949152e-02, 3.20811798e-01, 6.43791899e-02,
1.73642049e-01, 1.90601712e-01, -3.62391114e-02,
-5.31077177e-02, -1.55352629e-01, -5.76460876e-02,
-1.91408259e-01, 6.25639883e-02, -1.56064515e-01,
9.42695254e-02, 3.73819277e-02, -9.25441486e-03,
5.45048695e-02, -1.61140961e-02, -5.29235942e-02],
[-2.13704817e-01, 5.47278840e-02, -2.17551050e-01,
1.63965178e-01, -3.97549664e-02, -2.74472181e-01,
-5.10365649e-02, -1.49643745e-01, 1.17554785e-01,
7.64481701e-02, -1.09275170e-01, -1.94592735e-01,
1.55697848e-01, -2.53974164e-01, 1.94547387e-02,
-7.05850535e-02, -1.20606388e-01, 3.68064253e-02,
-2.60894214e-01, 2.83672951e-01, -2.07850254e-01,
-9.13229681e-02, 3.42378824e-02, 1.44770270e-01,
7.76689692e-02, -1.54100635e-01, -3.39602982e-01,
4.53464400e-01, -4.87552182e-03, -1.50864039e-01],
[-1.04445510e-01, -1.03930485e-01, -3.51353987e-01,
-4.65297430e-03, -1.26362081e-01, 3.97891284e-01,
-1.32714686e-01, -4.04996467e-01, -1.76930155e-01,
-4.17530796e-02, 2.32489559e-01, -2.11332509e-01,
-6.18631772e-02, 1.19987288e-01, -4.77552393e-02,
8.02145893e-02, -2.33852444e-01, 6.69549974e-02,
-1.40461565e-01, 1.06560115e-01, 2.27571947e-01,
-2.08869624e-01, -9.97974603e-02, 1.32310066e-01,
2.32574955e-01, 1.54477416e-01, 2.05018180e-01,
-4.10627189e-02, -4.96404985e-02, -5.40940323e-02],
[-2.38927801e-01, -1.18252394e-02, -1.69451130e-03,
-1.85613351e-01, 3.34573957e-02, -2.30558467e-02,
2.44002789e-01, 2.34080103e-01, -6.64081409e-02,
9.39645831e-02, -6.60376242e-02, -1.73742482e-02,
2.05033545e-01, -7.37659864e-02, -1.26079792e-01,
-1.95043407e-01, 9.41344431e-02, -8.96160809e-02,
-1.93669312e-01, -2.77182941e-01, 2.03192740e-01,
2.13718784e-01, -2.73290112e-01, 2.93250785e-03,
2.54605909e-01, 4.05348423e-01, 1.20116917e-01,
2.95004917e-01, -2.07038143e-01, -8.16965520e-02],
[-9.26843088e-02, -1.43711414e-01, 1.02962326e-02,
5.32817020e-01, -1.88515754e-01, 1.36900630e-01,
-1.00193790e-01, 1.64064697e-02, 1.67420857e-01,
-1.32579642e-01, 4.96911897e-02, 2.64892788e-01,
-3.58782210e-02, -2.51912216e-01, 1.14511692e-01,
-1.30576792e-01, -2.61789839e-02, -4.92368779e-01,
-5.76320838e-02, -2.80122198e-01, 2.06965217e-01,
-1.19696792e-01, -5.69479674e-02, -2.40639658e-02,
2.77121618e-02, -4.37756691e-02, -8.46245379e-02,
-4.64828950e-02, 8.81483650e-02, -5.78901592e-02]])

The proportion of total variance captured by each principal component is tabulated below.


In [ ]: # Tabulate the eigenvalues and the share of total variance each one explains
df_eigval = pd.DataFrame({"Eigenvalues": eigenvalues}, index=range(1, 31))
df_eigval["Explained proportion"] = df_eigval["Eigenvalues"] / np.sum(df_eigval["Eigenvalues"])
df_eigval["Explained proportion"] = df_eigval["Explained proportion"].map('{:.2%}'.format)

df_eigval
Out[ ]: Eigenvalues Explained proportion
1 11.197874 37.33%
2 2.679118 8.93%
3 2.565002 8.55%
4 1.858621 6.20%
5 1.268067 4.23%
6 1.125324 3.75%
7 1.052669 3.51%
8 0.826507 2.76%
9 0.752391 2.51%
10 0.731164 2.44%
11 0.686631 2.29%
12 0.583688 1.95%
13 0.472453 1.57%
14 0.453675 1.51%
15 0.408879 1.36%
16 0.380853 1.27%
17 0.357907 1.19%
18 0.347102 1.16%
19 0.299762 1.00%
20 0.279588 0.93%
21 0.271779 0.91%
22 0.226782 0.76%
23 0.223101 0.74%
24 0.196697 0.66%
25 0.178388 0.59%
26 0.151631 0.51%
27 0.148246 0.49%
28 0.114703 0.38%
29 0.096510 0.32%
30 0.064887 0.22%
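
As a quick visual check of the table above, the explained proportions can also be drawn as a scree plot. The cell below is an illustrative sketch only, not part of the original notebook: it assumes `eigenvalues` is the NumPy array of the 30 covariance eigenvalues computed earlier and that matplotlib is available.

In [ ]: import numpy as np
import matplotlib.pyplot as plt

# Share of total variance explained by each of the 30 principal components
explained = eigenvalues / np.sum(eigenvalues)

plt.figure(figsize=(8, 4))
plt.bar(range(1, len(explained) + 1), explained)
plt.xlabel("Principal component")
plt.ylabel("Explained proportion of variance")
plt.title("Scree plot")
plt.tight_layout()
plt.show()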

Now that you have calculated, presented, and plotted the results of tasks c to f, explain each transformation thoroughly. Write a paragraph of at least 500 words that explains why returns are important, compares and contrasts PCA and SVD, and explains what the eigenvectors, eigenvalues, singular values, etc. reveal about this specific data.
Answer
Returns measure the performance of our investments and allow us to compare performance across different investments. PCA is primarily used for dimensionality reduction, while SVD has broader applications, including matrix approximation as well as dimensionality reduction (Hair 92). Both methods produce orthogonal vectors: PCA computes the eigenvectors and eigenvalues of the covariance matrix, while SVD works directly on the (centred) data matrix and yields singular values. The singular values obtained from SVD can be used to recover the variance of each principal component, since each squared singular value divided by the number of observations minus one equals the corresponding eigenvalue (Schwarz et al. 2256). Each eigenvector transforms the standardized dataset into a principal component, which explains a portion of the variance of the dataset. The eigenvalue associated with a principal component is the amount of the total variance explained by that component, and the sum of all the eigenvalues equals the total variance of the data.
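
To make the PCA-SVD relationship above concrete, the sketch below checks numerically that the eigenvalues of the sample covariance matrix coincide with the squared singular values of the centred data divided by n - 1. The matrix `X` is a randomly generated stand-in for a returns matrix and is purely illustrative; it is not the notebook's ETF data.

In [ ]: import numpy as np

# Illustrative "returns" matrix: 250 days x 5 assets (random stand-in data)
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 5))

# Centre each column, as PCA operates on mean-centred data
Xc = X - X.mean(axis=0)

# PCA route: eigenvalues of the sample covariance matrix (sorted descending)
eigvals_pca = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# SVD route: squared singular values of the centred data, scaled by n - 1
s = np.linalg.svd(Xc, compute_uv=False)
eigvals_svd = s**2 / (Xc.shape[0] - 1)

print(np.allclose(eigvals_pca, eigvals_svd))   # True: both routes give the same variances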

Conclusion
Ensuring data quality in financial analysis is of paramount importance. Poor-quality data, whether structured or unstructured, hampers accurate modeling and reliable decision-making. In the context of this assignment, for instance, the structured data contained currency symbols, placeholders, and improper classifications that hindered numerical operations, while the unstructured data exhibited missing values and formatting issues that complicated assessment.
The comparative analysis of the Nelson-Siegel and Cubic-Spline models highlighted their respective strengths and weaknesses. The Nelson-Siegel model, while easier to interpret, struggled
with certain maturities, whereas the Cubic-Spline model offered a more precise fit but lacked clear economic interpretation. Ethical considerations in data smoothing were addressed,
emphasizing transparency and accuracy over potential misrepresentation.
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) both proved essential for reducing dimensionality and understanding the underlying data structure, offering insights into returns and the variability of investment performance. The screeplots revealed the differences in explained variance between the uncorrelated simulated data and the government securities data, underscoring the significance of correlation for the principal components.

References
Akinyemi, Kemi, et al. "Yield curve extrapolation methods: Methodologies for valuing cash flows that extend beyond the maximum yield curve." Society of Actuaries, March 2019, https://www.soa.org/resources/research-reports/2019/yield-curve-report/
Copeland, Ronald M. "Income smoothing." Journal of Accounting Research (1968): 101-116, https://doi.org/10.2307/2490073
Hair, J. F., et al. Multivariate Data Analysis. 7th ed., Prentice Hall, 2009.
Huh, Y. U., et al. "Data quality." Information and Software Technology 32.8 (1990): 559-565, https://www.sciencedirect.com/science/article/pii/095058499090146I
Liu, Grace. "Data quality problems troubling business and financial researchers: A literature review and synthetic analysis." Journal of Business & Finance Librarianship 25.3-4 (2020): 315-371, https://doi.org/10.1080/08963568.2020.1847555
Nadinić, Berislav, and Damir Kalpić. "Data quality in finances and its impact on credit risk management and CRM integration." International Conference on Software and Data Technologies, Special Session on Applications in Banking and Financing, 2008, https://doi.org/10.5220/0001879103270331
Nymand-Andersen, Per. "Yield curve modelling and a conceptual framework for estimating yield curves: evidence from the European Central Bank's yield curves." ECB Statistics Paper 27, 2018, https://doi.org/10.2866/892636
Praveen, Shagufta, and Umesh Chandra. "Influence of structured, semi-structured, unstructured data on various data models." International Journal of Scientific & Engineering Research 8.12 (2017): 67-69, https://www.researchgate.net/publication/344363081
Schwarz, Christian, et al. "Principal component analysis and singular value decomposition used for a numerical sensitivity analysis of a complex drawn part." The International Journal of Advanced Manufacturing Technology 94 (2018): 2255-2265, https://link.springer.com/article/10.1007/s00170-017-0980-z
Sebastian-Coleman, Laura. Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Newnes, 2012, https://www.sciencedirect.com/book/9780123970336/measuring-data-quality-for-ongoing-improvement
Strong, Diane M., et al. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110, https://doi.org/10.1145/253769.253804
Wahlstrøm, Ranik Raaen, et al. "A comparative analysis of parsimonious yield curve models with focus on the Nelson-Siegel, Svensson and Bliss versions." Computational Economics (2022): 1-38, https://doi.org/10.1007/s10614-021-10113-w
Wang, Richard Y., and Diane M. Strong. "Beyond accuracy: What data quality means to data consumers." Journal of Management Information Systems 12.4 (1996): 5-33, https://doi.org/10.1080/07421222.1996.11518099
Useful links
Data Society
Atharva Arya

In [ ]: %%capture
# Convert this notebook to PDF with the colab_pdf helper script
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('Group_Work_Project 1, group 8072.ipynb')
#set_matplotlib_formats('pdf', 'svg')
