A. Provide An Example of Poor Quality Structured Data
In the realm of financial data analysis, ensuring the quality and reliability of data is paramount. The significance of good quality data becomes evident when we examine its implications in
financial modeling and decision-making processes. This assignment is structured to provide a comprehensive understanding of data quality, yield curve modeling, the exploitation of
correlation, and empirical analysis of Exchange-Traded Funds (ETFs), all within the broader context of financial data analysis.
Firstly (Question 1), we will delve into the concept of data quality by providing examples of poor-quality structured and unstructured data, along with methods to identify their shortcomings.
This foundational knowledge is crucial for recognizing the importance of maintaining high standards in financial data.
Next (Question 2), we will explore yield curve modeling by fitting Nelson-Siegel and Cubic-Spline models to government securities from a European Union country (Germany), ranging from short-
term to long-term maturities. Section 2 highlights the comparative analysis of these models in terms of fit and interpretation, and addresses the ethical considerations of data smoothing.
Furthermore, in Question 3 we will investigate the role of correlation and principal components in financial data analysis by generating uncorrelated Gaussian random variables and running
Principal Component Analysis (PCA). The analysis will be extended to real data, collecting daily closing yields for government securities and comparing the results through screeplots.
Lastly, in Question 4, the empirical analysis of ETFs will involve computing daily returns, covariance matrices, PCA, and Singular Value Decomposition (SVD) for a sector ETF, providing a
thorough explanation of each transformation and its implications.
Installing Packages
In [ ]: ! pip install -q kaggle --quiet
! pip install pandas --quiet
! pip install numpy --quiet
! pip install matplotlib --quiet
! pip install seaborn --quiet
! pip install pingouin --quiet
! pip install factor_analyzer --quiet
! pip install gdown --quiet
# datetime is part of the Python standard library, so no install is needed
! pip install requests --quiet
! pip install yfinance --quiet
! pip install pandas-datareader --quiet
Importing packages
In [ ]: import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
from factor_analyzer.factor_analyzer import calculate_kmo
import pingouin as pg
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as LA
from scipy.interpolate import CubicSpline
from scipy.optimize import minimize
from datetime import datetime, timedelta
import time
import yfinance as yfin
from scipy import stats
from IPython.display import set_matplotlib_formats
import math
import re
import gdown
import warnings
import requests
warnings.filterwarnings('ignore')
import os
# Download the Financial Sample dataset from Google Drive
# (file id taken from the download log shown below)
url = "https://drive.google.com/uc?id=1Rp1Ubva5Q7ReW_zs7EowH3yge8rq-i5w"
output_filename = "financials.zip"
gdown.download(url, output_filename, quiet=False)

if os.path.exists(output_filename):
    print(f"File '{output_filename}' downloaded successfully!")
else:
    print("File download failed.")

try:
    datafin = pd.read_csv(output_filename)
except Exception as e:
    print(f"Unable to read the file: {e}")

datafin.head()
Downloading...
From: https://drive.google.com/uc?id=1Rp1Ubva5Q7ReW_zs7EowH3yge8rq-i5w
To: /content/financials.zip
100%|██████████| 22.0k/22.0k [00:00<00:00, 33.8MB/s]
File 'financials.zip' downloaded successfully!
Out[ ]:
  | Segment    | Country | Product   | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales      | COGS       | Profit     | Date       | Month Number | Month Name | Year
0 | Government | Canada  | Carretera | None          | $1,618.50  | $3.00               | $20.00     | $32,370.00  | $-        | $32,370.00 | $16,185.00 | $16,185.00 | 01/01/2014 | 1            | January    | 2014
1 | Government | Germany | Carretera | None          | $1,321.00  | $3.00               | $20.00     | $26,420.00  | $-        | $26,420.00 | $13,210.00 | $13,210.00 | 01/01/2014 | 1            | January    | 2014
2 | Midmarket  | France  | Carretera | None          | $2,178.00  | $3.00               | $15.00     | $32,670.00  | $-        | $32,670.00 | $21,780.00 | $10,890.00 | 01/06/2014 | 6            | June       | 2014
3 | Midmarket  | Germany | Carretera | None          | $888.00    | $3.00               | $15.00     | $13,320.00  | $-        | $13,320.00 | $8,880.00  | $4,440.00  | 01/06/2014 | 6            | June       | 2014
4 | Midmarket  | Mexico  | Carretera | None          | $2,470.00  | $3.00               | $15.00     | $37,050.00  | $-        | $37,050.00 | $24,700.00 | $12,350.00 | 01/06/2014 | 6            | June       | 2014
In [ ]: datafin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Segment 700 non-null object
1 Country 700 non-null object
2 Product 700 non-null object
3 Discount Band 700 non-null object
4 Units Sold 700 non-null object
5 Manufacturing Price 700 non-null object
6 Sale Price 700 non-null object
7 Gross Sales 700 non-null object
8 Discounts 700 non-null object
9 Sales 700 non-null object
10 COGS 700 non-null object
11 Profit 700 non-null object
12 Date 700 non-null object
13 Month Number 700 non-null int64
14 Month Name 700 non-null object
15 Year 700 non-null int64
dtypes: int64(2), object(14)
memory usage: 87.6+ KB
Data description: This dataset, sourced from Atharva Arya, provides detailed sales and profit information across various market segments and countries, including key metrics like units
sold, sale price, gross sales, discounts, and profit. It enables in-depth analysis of sales performance, profitability, and time-based trends to support business insights and decision-making.
In [ ]: datafin.head()
Out[ ]:
  | Segment    | Country | Product   | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales      | COGS       | Profit     | Date       | Month Number | Month Name | Year
0 | Government | Canada  | Carretera | None          | $1,618.50  | $3.00               | $20.00     | $32,370.00  | $-        | $32,370.00 | $16,185.00 | $16,185.00 | 01/01/2014 | 1            | January    | 2014
1 | Government | Germany | Carretera | None          | $1,321.00  | $3.00               | $20.00     | $26,420.00  | $-        | $26,420.00 | $13,210.00 | $13,210.00 | 01/01/2014 | 1            | January    | 2014
2 | Midmarket  | France  | Carretera | None          | $2,178.00  | $3.00               | $15.00     | $32,670.00  | $-        | $32,670.00 | $21,780.00 | $10,890.00 | 01/06/2014 | 6            | June       | 2014
3 | Midmarket  | Germany | Carretera | None          | $888.00    | $3.00               | $15.00     | $13,320.00  | $-        | $13,320.00 | $8,880.00  | $4,440.00  | 01/06/2014 | 6            | June       | 2014
4 | Midmarket  | Mexico  | Carretera | None          | $2,470.00  | $3.00               | $15.00     | $37,050.00  | $-        | $37,050.00 | $24,700.00 | $12,350.00 | 01/06/2014 | 6            | June       | 2014
Out[ ]: False
Out[ ]: 0
Out[ ]: array([' None ', ' Low ', ' Medium ', ' High '], dtype=object)
In [ ]: datafin.columns
Out[ ]: Index(['Segment', 'Country', ' Product ', ' Discount Band ', ' Units Sold ',
' Manufacturing Price ', ' Sale Price ', ' Gross Sales ', ' Discounts ',
' Sales ', ' COGS ', ' Profit ', 'Date', 'Month Number',
' Month Name ', 'Year'],
dtype='object')
# Download the Airplane Crashes dataset from Google Drive
# (file id taken from the download log shown below)
url = "https://drive.google.com/uc?id=1NIakNIwqoDgw9CvFWrU-s-uuJHdu6KQu"
output_filename = "Airplane_Crashes_and_Fatalities_Since_1908.csv.zip"
gdown.download(url, output_filename, quiet=False)

try:
    data = pd.read_csv(output_filename)
except Exception as e:
    print(f"Unable to read the file: {e}")

data.head()
Downloading...
From: https://drive.google.com/uc?id=1NIakNIwqoDgw9CvFWrU-s-uuJHdu6KQu
To: /content/Airplane_Crashes_and_Fatalities_Since_1908.csv.zip
100%|██████████| 596k/596k [00:00<00:00, 103MB/s]
File 'Airplane_Crashes_and_Fatalities_Since_1908.csv.zip' downloaded successfully!
Out[ ]:
  | index | Date       | Time  | Location                           | Operator               | Flight # | Route         | Type                   | Registration | cn/In | Aboard | Fatalities | Ground | Summary
0 | 0     | 09/17/1908 | 17:18 | Fort Myer, Virginia                | Military - U.S. Army   | NaN      | Demonstration | Wright Flyer III       | NaN          | 1     | 2.0    | 1.0        | 0.0    | During a demonstration flight, a U.S. Army fly...
1 | 1     | 07/12/1912 | 06:30 | AtlantiCity, New Jersey            | Military - U.S. Navy   | NaN      | Test flight   | Dirigible              | NaN          | NaN   | 5.0    | 5.0        | 0.0    | First U.S. dirigible Akron exploded just offsh...
2 | 2     | 08/06/1913 | NaN   | Victoria, British Columbia, Canada | Private -              | NaN      | NaN           | Curtiss seaplane       | NaN          | NaN   | 1.0    | 1.0        | 0.0    | The first fatal airplane accident in Canada oc...
3 | 3     | 09/09/1913 | 18:30 | Over the North Sea                 | Military - German Navy | NaN      | NaN           | Zeppelin L-1 (airship) | NaN          | NaN   | 20.0   | 14.0       | 0.0    | The airship flew into a thunderstorm and encou...
4 | 4     | 10/17/1913 | 10:30 | Near Johannisthal, Germany         | Military - German Navy | NaN      | NaN           | Zeppelin L-2 (airship) | NaN          | NaN   | 30.0   | 30.0       | 0.0    | Hydrogen gas which was being vented was sucked...
Data description: This dataset, sourced from Data Society, compiles information on airplane crashes and fatalities dating back to 1908. It includes details such as the incident's date and
time, location, and the operator of the aircraft, along with the flight number, the route taken, and the aircraft's type and registration number. Also recorded are the construction
(serial) number, the total number of people aboard, fatalities, fatalities on the ground, and a summary of each incident.
In [ ]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 5268 non-null int64
1 Date 5268 non-null object
2 Time 3049 non-null object
3 Location 5248 non-null object
4 Operator 5250 non-null object
5 Flight # 1069 non-null object
6 Route 3561 non-null object
7 Type 5241 non-null object
8 Registration 4933 non-null object
9 cn/In 4040 non-null object
10 Aboard 5246 non-null float64
11 Fatalities 5256 non-null float64
12 Ground 5246 non-null float64
13 Summary 4878 non-null object
dtypes: float64(3), int64(1), object(10)
memory usage: 576.3+ KB
In [ ]: data.columns
In [ ]: data.isna().sum()
Out[ ]: 0
index 0
Date 0
Time 2219
Location 20
Operator 18
Flight # 4199
Route 1707
Type 27
Registration 335
cn/In 1228
Aboard 22
Fatalities 12
Ground 22
Summary 390
dtype: int64
In [ ]: data.duplicated().sum()
Out[ ]: 0
In [ ]: filtered_data=data[(data['Time'].isna() & data['Flight #'].isna()) & (data['Fatalities'] >= 10)]
filtered_data.head()
Out[ ]:
   | index | Date       | Time | Location                      | Operator               | Flight # | Route | Type                          | Registration | cn/In | Aboard | Fatalities | Ground | Summary
7  | 7     | 07/28/1916 | NaN  | Near Jambol, Bulgeria         | Military - German Army | NaN      | NaN   | Schutte-Lanz S-L-10 (airship) | NaN          | NaN   | 20.0   | 20.0       | 0.0    | Crashed near the Black Sea, cause unknown.
10 | 10    | 11/21/1916 | NaN  | Mainz, Germany                | Military - German Army | NaN      | NaN   | Super Zeppelin (airship)      | NaN          | NaN   | 28.0   | 27.0       | 0.0    | Crashed in a storm.
12 | 12    | 03/04/1917 | NaN  | Near Gent, Belgium            | Military - German Army | NaN      | NaN   | Airship                       | NaN          | NaN   | 20.0   | 20.0       | 0.0    | Caught fire and crashed.
13 | 13    | 03/30/1917 | NaN  | Off Northern Germany          | Military - German Navy | NaN      | NaN   | Schutte-Lanz S-L-9 (airship)  | NaN          | NaN   | 23.0   | 23.0       | 0.0    | Struck by lightning and crashed into the Balti...
19 | 19    | 05/10/1918 | NaN  | Off Helgoland Island, Germany | Military - German Navy | NaN      | NaN   | Zeppelin L-70 (airship)       | NaN          | NaN   | 22.0   | 22.0       | 0.0    | Shot down by British aircraft, crashing from a ...
In [ ]: filtered_data.shape
d. Unstructured data can be more difficult to assess than structured data. Just as you did in part b, write 3-4 sentences that show how this unstructured data fails to meet the requirements of
good quality data.
Answer
Although there are no duplicate records, there is an excess of missing values (NaN), even in situations where fatalities are recorded. In particular, there are missing values in columns like
"Flight #" and "cn/In," which makes it difficult to fully assess the incident details and compare across records. For instance, a filter was applied to all rows with missing values in the "Time"
and "Flight #" variables but with fatalities of 10 or more, and 986 such rows were found, which reinforces the poor quality of the data. Additionally, variables like "Date" are not stored with a
proper datatype, and there is an "index" column that does not contribute to the analysis, further indicating that the data is not properly structured.
# data_dict holds the downloaded daily yields for the German government securities
# (the download step itself is not shown in this extract)
df = pd.DataFrame(data_dict)
df.columns = ['3M', '1Y', '2Y', '5Y', '10Y', '30Y']
b. Be sure to pick maturities ranging from short-term to long-term (e.g. 6 month maturity to 20 or 30 year maturities).
Answer
In [ ]: df
Out[ ]: 3M 1Y 2Y 5Y 10Y 30Y
2020-01-02 -0.623324 -0.634844 -0.622180 -0.478606 -0.173995 0.358465
2020-01-03 -0.595449 -0.630324 -0.638365 -0.520394 -0.222926 0.306064
2020-01-06 -0.597296 -0.637580 -0.651086 -0.541757 -0.247426 0.283818
2020-01-07 -0.605300 -0.641443 -0.650553 -0.532864 -0.232130 0.304965
2020-01-08 -0.597340 -0.633612 -0.641188 -0.515378 -0.204580 0.341047
... ... ... ... ... ... ...
2024-12-20 2.616587 2.172129 1.960729 2.049615 2.374684 2.445430
2024-12-23 2.632250 2.201549 1.997799 2.084822 2.402344 2.456214
2024-12-24 2.661780 2.211694 2.003148 2.090155 2.405459 2.455847
2024-12-27 2.653268 2.188540 2.007697 2.126310 2.431800 2.494656
2024-12-30 2.575177 2.178646 2.011151 2.130018 2.447304 2.513773
1278 rows × 6 columns
In [ ]: df.to_csv('yield_curves.csv')
results.append({
'date': date,
'beta0': params[0], # Level
'beta1': params[1], # Slope
'beta2': params[2], # Curvature
'tau': params[3], # Decay
'rmse': rmse
})
In [ ]: results_df
# Read data
df = pd.read_csv('yield_curves.csv', index_col=0)
df.index = pd.to_datetime(df.index)
# Nelson-Siegel function
def nelson_siegel(t, params):
beta0, beta1, beta2, tau = params
exp_term = np.exp(-t/tau)
return beta0 + beta1 * ((1 - exp_term)/(t/tau)) + beta2 * ((1 - exp_term)/(t/tau) - exp_term)
# Objective function
def objective(params, maturities, rates):
return np.sum((nelson_siegel(maturities, params) - rates)**2)
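The loop that builds results and results_df (whose append fragment appears above) and the latest-date fit used in the plots below are not included in this extract. A minimal sketch of how they could be produced, assuming a maturity grid of 0.25, 1, 2, 5, 10 and 30 years to match the columns of df, a Nelder-Mead optimizer, and an ad-hoc starting guess:
# Assumed reconstruction of the omitted fitting step; maturity grid, optimizer and starting
# values are guesses, not necessarily the original settings.
maturities = np.array([0.25, 1, 2, 5, 10, 30])
results = []
for date, row in df.iterrows():
    rates = row.values
    res = minimize(objective, x0=[rates[-1], rates[0] - rates[-1], 1.0, 1.0],
                   args=(maturities, rates), method='Nelder-Mead')
    params = res.x
    rmse = np.sqrt(objective(params, maturities, rates) / len(maturities))
    results.append({'date': date, 'beta0': params[0], 'beta1': params[1],
                    'beta2': params[2], 'tau': params[3], 'rmse': rmse})
results_df = pd.DataFrame(results).set_index('date')

# Latest-date fit that the plotting cells below rely on
latest_rates = df.iloc[-1].values
fit = minimize(objective, x0=[latest_rates[-1], latest_rates[0] - latest_rates[-1], 1.0, 1.0],
               args=(maturities, latest_rates), method='Nelder-Mead')
fitted_params = fit.x
t_points = np.linspace(0.25, 30, 200)
fitted_curve = nelson_siegel(t_points, fitted_params)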
# Plot
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.scatterplot(x=maturities, y=latest_rates, color='blue', s=100, label='Actual Yields')
plt.plot(t_points, fitted_curve, 'r-', label='Nelson-Siegel Fit')
plt.xlabel('Maturity (years)')
plt.ylabel('Yield (%)')
plt.title(f'Figure 2.1: Nelson-Siegel Yield Curve Fit ({df.index[-1].strftime("%Y-%m-%d")})')
plt.legend()
plt.grid(True)
print("Fitted Parameters:")
print(f"β0 (level): {fitted_params[0]:.4f}")
print(f"β1 (slope): {fitted_params[1]:.4f}")
print(f"β2 (curvature): {fitted_params[2]:.4f}")
print(f"τ (decay): {fitted_params[3]:.4f}")
plt.show()
Fitted Parameters:
β0 (level): 2.0494
β1 (slope): -0.3579
β2 (curvature): 1.3981
τ (decay): 0.2822
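The cubic-spline fit itself is not shown in this extract; a minimal sketch of how fitted_curve could be recomputed for the plot below with scipy's CubicSpline, reusing the maturity grid and latest yields from the Nelson-Siegel step:
# Assumed reconstruction of the omitted cubic-spline step
cs = CubicSpline(maturities, latest_rates)
t_points = np.linspace(0.25, 30, 200)
fitted_curve = cs(t_points)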
# Plot
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.scatterplot(x=maturities, y=latest_rates, color='blue', s=100, label='Actual Yields')
plt.plot(t_points, fitted_curve, 'r-', label='Cubic Spline Fit')
plt.xlabel('Maturity (years)')
plt.ylabel('Yield (%)')
plt.title(f'Figure 2.2: Cubic Spline Yield Curve Fit ({df.index[-1].strftime("%Y-%m-%d")})')
plt.legend()
plt.grid(True)
plt.show()
e. Compare the models in terms of 1) fit and 2) interpretation
Answer
In analyzing the Nelson-Siegel (Figure 2.1) and cubic-spline (Figure 2.2) models for the interest-rate term structure, both models offer distinct advantages. The Nelson-Siegel model uses three
factors plus a decay parameter to shape the yield curve, making it flexible and straightforward to interpret (Wahlstrøm et al. 971). It excels in its simplicity and in clearly representing economic
features such as the long-term rate level, the steepness of the curve, and mid-maturity humps (Akinyemi et al. 14). The fitted Nelson-Siegel curve is relatively smooth and captures the general
trend but misses some key data points. The fitted parameters (β0 = 2.0494, β1 = -0.3579, β2 = 1.3981, τ = 0.2822) generate a curve that noticeably underestimates yields at both very short and
very long maturities. The misfit is particularly visible at the 30-year point, where the actual yield is around 2.5% but the fitted curve predicts closer to 2.1%.
Conversely, the cubic-spline model employs piecewise third-degree polynomials to capture complex, non-linear patterns in the data. Although it provides a more precise fit for intricate
shapes, interpreting its coefficients demands careful analysis. Overall, the Nelson-Siegel model is favored for its ease of use and clarity, while the cubic spline is better suited for detailed data
fitting but requires deeper examination of its terms (Nymand 43). The cubic-spline fit passes much closer to the observed data points, capturing both the short-term dynamics and the
long-term behavior more accurately; its extra local variation is particularly evident in how it models the "belly" of the yield curve between 5 and 15 years.
f. Be sure to specify the levels of the model parameters (e.g. Alpha1).
Answer
The Nelson-Siegel model offers a clear economic interpretation of its parameters: β0 (2.0494) represents the long-term interest-rate level, β1 (-0.3579) controls the slope of the yield curve,
and β2 (1.3981) determines the curvature. The decay parameter τ (0.2822) governs how quickly the short-term components decay. These parameters have direct economic meaning and
can be used to understand market expectations about future interest rates. The cubic-spline model, while providing a better fit, lacks such clear economic interpretation. Its parameters are
purely mathematical, representing the coefficients of the cubic polynomials between knot points. This makes it harder to extract meaningful economic insights about market expectations or risk
premiums from the model parameters.
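For reference, the functional form fitted by the nelson_siegel function above, written in terms of these four parameters, is:

y(t) = \beta_0 + \beta_1 \, \frac{1 - e^{-t/\tau}}{t/\tau} + \beta_2 \left( \frac{1 - e^{-t/\tau}}{t/\tau} - e^{-t/\tau} \right)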
g. In Module 2 Lesson 4 (‘Smoothing Data’), we said smoothing data can be unethical. If Nelson-Siegel is smoothing the yield curve, is this considered unethical? Why or why not?
Answer
Using smoothing techniques like the Nelson-Siegel method to estimate yield curves is typically ethical when the goal is to improve the accuracy of models and better understand market
trends. These methods help to remove noise and clarify long-term data patterns. However, it crosses into unethical territory if used to mislead stakeholders or distort data, such as by
inflating asset values or hiding volatility. The key to ethical use lies in being transparent about the methodology and intentions. Problems arise when the smoothing process is used to
misrepresent the true nature of the data (Copeland 103).
3. Exploiting Correlation
Financial data analysis is meant not only to process data but also to understand how meaningful factors can be used to summarize or represent it. Let's understand the role that correlation
and principal components play.
a) Generate 5 uncorrelated Gaussian random variables that simulate yield changes (they can be positive or negative with a mean close to 0 and a standard deviation that is small).
Answer
Table of first five values generated
In [ ]: np.random.seed(432)
num_samples = 100
num_variables = 5
mean=0
std_deviation=0.01
data=np.random.normal(loc=mean,scale=std_deviation, size=(num_samples,num_variables))
dataF=pd.DataFrame(data)
dataF.columns=["Variable 1","Variable 2","Variable 3","Variable 4","Variable 5"]
dataF.head()
Correlation matrix
In [ ]: matrix_corr = dataF.corr()
print(matrix_corr)
Variable 1 Variable 2 Variable 3 Variable 4 Variable 5
Variable 1 1.000000 -0.008919 0.039846 0.013316 0.202243
Variable 2 -0.008919 1.000000 0.121684 0.155779 -0.133540
Variable 3 0.039846 0.121684 1.000000 -0.130289 -0.111868
Variable 4 0.013316 0.155779 -0.130289 1.000000 -0.092115
Variable 5 0.202243 -0.133540 -0.111868 -0.092115 1.000000
Correlation plot
In [ ]: corr = dataF.corr()
f, ax =plt.subplots(figsize=(9, 6))
mask=np.triu(np.ones_like(corr, dtype=bool))
cmap=sns.diverging_palette(230, 20, n=256, as_cmap=True)
sns.heatmap(dataF.corr(),
mask=mask,
cmap=cmap,
vmax=1,
vmin = -.25,
center=0,
square=True,
linewidths=.5,
annot = True,
fmt='.3f',
annot_kws={'size': 16},
cbar_kws={"shrink": .75})
plt.show()
As we see in the correlation plot (Figure 3.1), all correlation values are near zero, which aligns with what the exercise requires: five uncorrelated variables.
b) Run a Principal Components using EITHER the correlation OR covariance matrix.
Answer
Eigenvalues
In [ ]: # Calculating eigenvectors and eigenvalues of the covariance matrix of simulated dataset
dataF_cov = dataF.cov()
eigenvalues, eigenvectors = LA.eig(dataF_cov)
sorted_indices=np.argsort(eigenvalues)[::-1]
eigenvectors=eigenvectors[:,sorted_indices]
eigenvalues=eigenvalues[sorted_indices]
eigenvalues
Eigenvectors
In [ ]: eigenvectors
Principal components
In [ ]: principal_components =dataF.dot(eigenvectors)
principal_components.columns = ["PC_1","PC_2","PC_3","PC_4","PC_5"]
principal_components.head()
Out[ ]: PC_1 PC_2 PC_3 PC_4 PC_5
0 0.005369 0.002689 -0.005512 -0.000298 0.003094
1 -0.002826 -0.000266 0.013929 0.002893 0.010758
2 -0.013346 -0.015815 -0.004186 0.002901 0.006569
3 -0.000269 0.001614 0.002010 -0.004816 -0.000073
4 -0.013384 0.014189 -0.002674 -0.012635 -0.003234
c) Write a paragraph explaining how the variances of each component compare with each other. In this paragraph, you will address the following question: how much variance is explained by
Component 1, Component 2, Component 3?
Answer
In [ ]: # Put data into a DataFrame
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,6))
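The cell that adds the explained-variance column and draws the screeplot (Figure 3.2, referenced in part j) is not in this extract. A minimal sketch of how both could be obtained from the eigenvalues; the bar-plus-line layout is an assumption:
# Proportion of total variance explained by each component
df_eigval["Explained proportion"] = (df_eigval["Eigenvalues"] / df_eigval["Eigenvalues"].sum()).map("{:.2%}".format)

# Screeplot: eigenvalues as bars, cumulative explained proportion as a dashed line
cum_prop = np.cumsum(eigenvalues) / np.sum(eigenvalues)
fig, ax1 = plt.subplots(figsize=(8, 5))
ax1.bar(range(1, 6), eigenvalues, alpha=0.6, label='Eigenvalue')
ax1.set_xlabel('Principal component')
ax1.set_ylabel('Eigenvalue')
ax2 = ax1.twinx()
ax2.plot(range(1, 6), cum_prop, 'k--o', label='Cumulative proportion')
ax2.set_ylabel('Cumulative proportion of variance explained')
plt.title('Figure 3.2: Screeplot of the simulated (uncorrelated) data')
plt.show()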
e. Collect the daily closing yields for 5 government securities, say over 6 months.
Answer
In [ ]: api_key = "ec67a401ae513935b544843d8c308536" # Alex API
series_ids = [
'DGS1',
'DGS2',
'DGS5',
'DGS10',
'DGS30'
]
base_url = "https://api.stlouisfed.org/fred/series/observations"
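The request loop that turns these FRED series into yields_df and daily_yield_changes (used in part g) is not included here. A minimal sketch, assuming missing observations are reported as '.' and that daily changes are simple first differences:
# Assumed reconstruction of the omitted FRED download loop
end_date = datetime.today()
start_date = end_date - timedelta(days=182)          # roughly six months of daily data
frames = {}
for sid in series_ids:
    params = {'series_id': sid, 'api_key': api_key, 'file_type': 'json',
              'observation_start': start_date.strftime('%Y-%m-%d'),
              'observation_end': end_date.strftime('%Y-%m-%d')}
    obs = requests.get(base_url, params=params).json()['observations']
    frames[sid] = pd.Series({o['date']: float(o['value'])
                             for o in obs if o['value'] != '.'}, name=sid)
yields_df = pd.DataFrame(frames)
yields_df.index = pd.to_datetime(yields_df.index)
yields_df.index.name = 'date'
daily_yield_changes = yields_df.diff().dropna()      # first differences feed the covariance matrix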
g. Re-run the Principal Components using EITHER the correlation or covariance matrix.
Answer of question g
Eigenvalues
In [ ]: yields_df_cov = daily_yield_changes.cov()
eigenvalues, eigenvectors = LA.eig(yields_df_cov)
sorted_indices=np.argsort(eigenvalues)[::-1]
eigenvectors=eigenvectors[:,sorted_indices]
eigenvalues=eigenvalues[sorted_indices]
eigenvalues
Eigenvectors
In [ ]: eigenvectors
Principal components
In [ ]: principal_components =yields_df.dot(eigenvectors)
principal_components.columns = ["PC_1","PC_2","PC_3","PC_4","PC_5"]
principal_components.dropna(inplace=True)
principal_components.head()
Out[ ]: PC_1 PC_2 PC_3 PC_4 PC_5
date
2024-06-24 -10.032590 -0.532780 -1.633712 -0.245560 0.088697
2024-06-25 -9.975655 -0.525469 -1.659190 -0.216314 0.089371
2024-06-26 -10.128439 -0.479133 -1.660047 -0.225135 0.083761
2024-06-27 -10.075749 -0.482554 -1.647428 -0.240712 0.086957
2024-06-28 -10.160836 -0.406276 -1.648127 -0.245083 0.082232
h. How do the variances of each component compare? In other words, how much variance is explained by Component 1, Component 2, Component 3, etc.?
In [ ]: # Put data into a DataFrame
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,6))
j. How does the screeplot from the uncorrelated data compare with the screeplot from the government data?
Answer
In the screeplot of uncorrelated data (Figure 3.2), we observe a gradual decline in the dashed line, indicating the need for many principal components to achieve a high proportion of
explained variance. Conversely, in the screeplot of government data (Figure 3.4), the dashed line shows a more pronounced decline. This suggests that in correlated data, the first few
principal components explain a larger proportion of variability compared to the later components.
4. Empirical Analysis of ETFs
Pick a sector ETF (in the US, for example, XLRE).
a. Find the 30 largest holdings.
Answer
The 30 largest holdings of XLI (Industrials) are:
In [ ]: keys = ['XLI']
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"}
def main(url):
with requests.Session() as req:
req.headers.update(headers)
for key in keys:
r = req.get(url.format(key))
#print(f"Extracting: {r.url}")
goal = re.findall(r'etf\\\/(.*?)\\', r.text)
print(goal[:30])
main("https://www.zacks.com/funds/etf/{}/holding")
['GE', 'CAT', 'RTX', 'HON', 'UNP', 'BA', 'ETN', 'UBER', 'ADP', 'DE', 'LMT', 'UPS', 'GEV', 'TT', 'PH', 'WM', 'TDG', 'EMR', 'MMM', 'ITW', 'GD', 'NOC', 'FDX', 'CTAS', 'CSX', 'CARR', 'PCAR', 'JCI', 'NSC', 'CPRT']
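The yfinance download cell that creates the price matrix xlf is not shown in this extract (only its progress bar and output appear below). A minimal sketch, assuming closing prices over roughly six months for the 30 tickers scraped above:
# Assumed reconstruction of the omitted download cell; the exact date range and price field
# (Close vs. Adj Close) are guesses based on the output shown below.
tickers = ['GE', 'CAT', 'RTX', 'HON', 'UNP', 'BA', 'ETN', 'UBER', 'ADP', 'DE', 'LMT', 'UPS',
           'GEV', 'TT', 'PH', 'WM', 'TDG', 'EMR', 'MMM', 'ITW', 'GD', 'NOC', 'FDX', 'CTAS',
           'CSX', 'CARR', 'PCAR', 'JCI', 'NSC', 'CPRT']
xlf = yfin.download(tickers, period='6mo')['Close']
xlf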
[*********************100%***********************] 30 of 30 completed
Out[ ]: Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX
Date
2024-07-15 240.175369 179.110001 66.957771 343.134705 34.364563 178.026718 368.950012 115.904816 325.018585 302.040466 ... 101.046936 541.951538 100.543625 1176
2024-07-16 243.323074 186.050003 69.462715 357.831818 35.089600 180.666565 374.747284 117.787514 330.415192 310.457428 ... 105.241524 565.768982 102.581795 1212
2024-07-17 246.718231 184.839996 67.176453 355.072998 35.208782 179.254501 382.311493 117.173164 309.584106 307.397614 ... 105.550797 555.083984 103.877922 1175
2024-07-18 244.887009 180.229996 66.649628 353.316528 34.781704 189.014389 377.854370 115.478729 307.169037 303.624847 ... 106.294998 544.866577 102.740097 1180
2024-07-19 244.461380 179.669998 65.844460 344.980530 34.453949 188.912292 375.293243 114.626556 309.971710 303.317871 ... 104.197701 541.862000 101.721008 1172
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13 290.200012 170.570007 67.110001 362.500000 31.889999 190.600006 429.910004 118.470001 340.880005 274.589996 ... 108.160004 637.989990 117.739998 1278
2025-01-14 291.690002 167.020004 68.400002 371.570007 32.139999 192.279999 432.309998 119.790001 340.140015 277.619995 ... 110.199997 647.010010 119.470001 1300
2025-01-15 293.369995 166.199997 68.800003 374.890015 32.459999 195.699997 428.880005 120.720001 342.579987 276.589996 ... 109.629997 656.320007 119.089996 1293
2025-01-16 296.230011 168.929993 69.739998 380.549988 32.970001 198.050003 439.109985 123.290001 345.190002 277.369995 ... 109.480003 658.659973 120.459999 1322
2025-01-17 296.179993 171.089996 69.660004 386.019989 32.730000 198.309998 455.440002 124.529999 346.279999 275.100006 ... 110.330002 669.460022 121.110001 1340
130 rows × 30 columns
c. Compute the daily returns.
Answer
The daily returns are
In [ ]: xlf_returns=xlf.dropna().pct_change().dropna()
xlf_returns
Out[ ]: Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT
Date
2024-07-16 0.013106 0.038747 0.037411 0.042832 0.021098 0.014828 0.015713 0.016243 0.016604 0.027867 ... 0.041511 0.043948 0.020271 0.031048 0.015601
2024-07-17 0.013953 -0.006504 -0.032914 -0.007710 0.003397 -0.007816 0.020185 -0.005216 -0.063045 -0.009856 ... 0.002939 -0.018886 0.012635 -0.030480 -0.031416
2024-07-18 -0.007422 -0.024940 -0.007842 -0.004947 -0.012130 0.054447 -0.011658 -0.014461 -0.007801 -0.012273 ... 0.007051 -0.018407 -0.010953 0.003584 -0.015114
2024-07-19 -0.001738 -0.003107 -0.012081 -0.023594 -0.009423 -0.000540 -0.006778 -0.007379 0.009124 -0.001011 ... -0.019731 -0.005514 -0.009919 -0.006301 -0.000696
2024-07-22 0.004130 -0.004286 0.026570 0.004766 -0.005189 0.017784 0.003148 0.018586 0.027029 0.003656 ... 0.011595 0.013715 0.009338 0.014361 0.021082
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13 0.011679 -0.008314 0.009021 0.032764 0.003461 0.006867 0.052231 0.004494 -0.001669 0.014258 ... 0.014825 0.012618 0.022848 0.012954 -0.000371
2025-01-14 0.005134 -0.020813 0.019222 0.025021 0.007839 0.008814 0.005583 0.011142 -0.002171 0.011035 ... 0.018861 0.014138 0.014693 0.017419 0.015200
2025-01-15 0.005760 -0.004910 0.005848 0.008935 0.009956 0.017787 -0.007934 0.007764 0.007173 -0.003710 ... -0.005172 0.014389 -0.003181 -0.005535 0.007421
2025-01-16 0.009749 0.016426 0.013663 0.015098 0.015712 0.012008 0.023853 0.021289 0.007619 0.002820 ... -0.001368 0.003565 0.011504 0.022025 0.008585
2025-01-17 -0.000169 0.012786 -0.001147 0.014374 -0.007279 0.001313 0.037189 0.010058 0.003158 -0.008184 ... 0.007764 0.016397 0.005396 0.013616 0.000849
129 rows × 30 columns
d. Compute the covariance matrix.
Answer
The covariance matrix is
In [ ]: xlf_cov_matrix=xlf_returns.cov()
xlf_cov_matrix
Out[ ]: Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT U
Ticker
ADP 0.000095 0.000023 0.000061 0.000074 0.000085 0.000066 0.000048 0.000079 0.000073 0.000050 ... 0.000078 0.000089 0.000061 0.000069 0.000040 0.000
BA 0.000023 0.000459 0.000100 0.000094 0.000006 -0.000025 0.000086 0.000104 0.000119 0.000056 ... -0.000009 0.000104 0.000065 0.000122 0.000053 0.000
CARR 0.000061 0.000100 0.000312 0.000167 0.000109 0.000096 0.000098 0.000176 0.000216 0.000095 ... 0.000132 0.000184 0.000062 0.000154 0.000172 0.000
CAT 0.000074 0.000094 0.000167 0.000325 0.000150 0.000077 0.000179 0.000226 0.000211 0.000124 ... 0.000185 0.000245 0.000078 0.000126 0.000098 0.000
CSX 0.000085 0.000006 0.000109 0.000150 0.000253 0.000088 0.000093 0.000145 0.000103 0.000064 ... 0.000163 0.000167 0.000068 0.000092 0.000047 0.000
CTAS 0.000066 -0.000025 0.000096 0.000077 0.000088 0.000269 0.000020 0.000083 0.000109 0.000035 ... 0.000106 0.000100 0.000048 0.000082 0.000087 0.000
DE 0.000048 0.000086 0.000098 0.000179 0.000093 0.000020 0.000287 0.000106 0.000079 0.000100 ... 0.000123 0.000113 0.000059 0.000020 0.000016 0.000
EMR 0.000079 0.000104 0.000176 0.000226 0.000145 0.000083 0.000106 0.000327 0.000215 0.000121 ... 0.000162 0.000239 0.000062 0.000152 0.000126 0.000
ETN 0.000073 0.000119 0.000216 0.000211 0.000103 0.000109 0.000079 0.000215 0.000381 0.000089 ... 0.000154 0.000243 0.000044 0.000186 0.000226 0.000
FDX 0.000050 0.000056 0.000095 0.000124 0.000064 0.000035 0.000100 0.000121 0.000089 0.000344 ... 0.000104 0.000097 0.000044 0.000035 0.000053 0.000
GD 0.000071 0.000066 0.000091 0.000099 0.000088 0.000083 0.000047 0.000091 0.000109 0.000030 ... 0.000076 0.000114 0.000104 0.000134 0.000062 0.000
GE 0.000068 0.000106 0.000183 0.000149 0.000088 0.000078 0.000053 0.000149 0.000246 0.000051 ... 0.000084 0.000186 0.000098 0.000225 0.000193 0.000
GEV 0.000076 0.000127 0.000183 0.000227 0.000103 0.000065 0.000068 0.000247 0.000412 0.000099 ... 0.000167 0.000299 0.000064 0.000241 0.000271 0.000
HON 0.000053 0.000051 0.000104 0.000068 0.000074 0.000056 0.000025 0.000084 0.000095 0.000052 ... 0.000067 0.000075 0.000020 0.000090 0.000072 0.000
ITW 0.000059 0.000044 0.000095 0.000121 0.000089 0.000046 0.000088 0.000107 0.000076 0.000081 ... 0.000090 0.000113 0.000054 0.000060 0.000037 0.000
JCI 0.000086 0.000056 0.000179 0.000190 0.000125 0.000110 0.000083 0.000191 0.000237 0.000111 ... 0.000163 0.000211 0.000063 0.000158 0.000151 0.000
LMT 0.000045 0.000023 0.000026 0.000010 0.000025 0.000030 0.000017 0.000008 0.000003 -0.000017 ... 0.000007 0.000031 0.000090 0.000066 0.000035 -0.000
MMM 0.000070 0.000057 0.000151 0.000161 0.000125 0.000073 0.000082 0.000153 0.000119 0.000067 ... 0.000120 0.000155 0.000059 0.000086 0.000101 -0.000
NOC 0.000048 0.000017 0.000009 0.000042 0.000054 0.000031 0.000046 0.000034 0.000007 0.000012 ... 0.000028 0.000051 0.000112 0.000057 0.000013 -0.000
NSC 0.000096 0.000042 0.000130 0.000174 0.000223 0.000112 0.000094 0.000159 0.000124 0.000100 ... 0.000146 0.000161 0.000076 0.000086 0.000061 0.000
PCAR 0.000078 -0.000009 0.000132 0.000185 0.000163 0.000106 0.000123 0.000162 0.000154 0.000104 ... 0.000350 0.000170 0.000052 0.000090 0.000082 0.000
PH 0.000089 0.000104 0.000184 0.000245 0.000167 0.000100 0.000113 0.000239 0.000243 0.000097 ... 0.000170 0.000334 0.000093 0.000159 0.000133 0.000
RTX 0.000061 0.000065 0.000062 0.000078 0.000068 0.000048 0.000059 0.000062 0.000044 0.000044 ... 0.000052 0.000093 0.000168 0.000088 0.000038 -0.000
TDG 0.000069 0.000122 0.000154 0.000126 0.000092 0.000082 0.000020 0.000152 0.000186 0.000035 ... 0.000090 0.000159 0.000088 0.000299 0.000151 0.000
TT 0.000040 0.000053 0.000172 0.000098 0.000047 0.000087 0.000016 0.000126 0.000226 0.000053 ... 0.000082 0.000133 0.000038 0.000151 0.000221 0.000
UBER 0.000019 0.000130 0.000160 0.000172 0.000112 0.000050 0.000072 0.000108 0.000220 0.000077 ... 0.000086 0.000162 -0.000014 0.000160 0.000132 0.000
UNP 0.000070 0.000028 0.000130 0.000125 0.000158 0.000069 0.000077 0.000112 0.000082 0.000093 ... 0.000121 0.000126 0.000056 0.000089 0.000050 0.000
UPS 0.000030 0.000026 0.000046 0.000096 0.000081 0.000049 0.000102 0.000087 0.000052 0.000143 ... 0.000168 0.000070 0.000010 0.000029 0.000018 0.000
URI 0.000098 0.000106 0.000231 0.000303 0.000186 0.000142 0.000169 0.000259 0.000269 0.000135 ... 0.000191 0.000305 0.000131 0.000186 0.000142 0.000
WM 0.000042 -0.000019 0.000062 0.000014 0.000045 0.000070 -0.000004 0.000032 0.000066 0.000038 ... 0.000074 0.000029 -0.000015 0.000050 0.000069 0.000
30 rows × 30 columns
e. Compute the PCA.
Answer
The principal component analysis (PCA) is
In [ ]: # Standardize the daily returns, then project them onto the eigenvectors of their own
# standardized covariance (i.e. correlation) matrix. The eigendecomposition step below is a
# reconstruction, since it is required for the 30 eigenvalues reported further down.
xlf_returns_mean = xlf_returns.mean()
xlf_returns_std = xlf_returns.std()
xlf_standardized_data = (xlf_returns - xlf_returns_mean) / xlf_returns_std
xlf_standardized_cov = xlf_standardized_data.cov()
eigenvalues, eigenvectors = LA.eig(xlf_standardized_cov)
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, sorted_indices]
eigenvalues = eigenvalues[sorted_indices]
xlf_principal_components = xlf_standardized_data.dot(eigenvectors)
xlf_principal_components.columns = [f"PC_{i}" for i in range(1, 31)]
xlf_principal_components
Out[ ]: PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_21 PC_22 PC_23 PC_24 PC_25
Date
2024-07-16 7.924030 0.137422 0.886763 -0.059148 1.805747 0.023687 0.014664 0.273473 -0.186904 -0.061284 ... 0.500468 -0.562224 0.032079 -0.623973 0.031814
2024-07-17 -3.543482 4.703898 3.583108 -1.650724 1.032094 0.083364 1.125461 -1.607877 0.442119 1.205232 ... 0.174183 0.885378 -0.356334 -0.211199 0.133578
2024-07-18 -2.319138 0.407285 -0.737715 -1.692023 -0.957578 -0.698202 -2.062343 -0.898443 -1.145080 -1.306339 ... -0.475764 -0.436102 0.476509 0.148484 0.000695
2024-07-19 -2.762167 -1.247847 -0.884262 -0.052278 -0.624147 -0.217254 0.109672 1.256096 0.010304 -0.435093 ... -0.365277 1.091306 0.653740 0.187338 0.090358
2024-07-22 3.317883 -0.784245 -1.181744 -0.277742 0.015225 -0.492282 -0.304679 -0.970278 -0.283205 -0.771144 ... 0.144122 -0.096397 -0.115491 0.320641 -0.279635
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13 3.924230 3.535143 1.238639 0.974713 0.713740 -1.254833 0.087407 -0.531806 1.612548 -0.358643 ... -0.471550 -0.189333 -0.098235 -0.333291 0.211624
2025-01-14 3.769834 0.396798 -0.001358 0.626505 -1.095203 -0.868700 -0.728517 -0.375547 0.356758 -0.806851 ... 0.534097 -0.694173 -0.401121 -0.351800 -0.730760
2025-01-15 1.327423 -0.907914 0.023650 -0.546966 0.050550 0.662722 -0.957914 0.446743 -0.229821 -0.579190 ... 0.876482 0.589169 -0.393185 -0.431120 -0.698458
2025-01-16 3.884179 0.916024 0.753182 -0.524593 1.023519 0.528138 0.448983 0.508906 -0.124216 0.053304 ... -0.302446 0.109677 0.272910 0.005086 -0.397692
2025-01-17 1.386382 0.169151 -0.446350 1.317249 -0.021273 -0.813384 0.248580 -0.869154 0.785681 0.866992 ... 0.084466 -0.051597 -0.383539 0.269699 0.253224
129 rows × 30 columns
Singular values
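The cell that actually computes the singular values is not included in this extract; a minimal sketch of how s_st_return could be obtained from the standardized return matrix:
# Assumed reconstruction of the omitted SVD cell
u_st_return, s_st_return, vt_st_return = LA.svd(xlf_standardized_data, full_matrices=False)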
In [ ]: s_st_return
df_eigval
Out[ ]: Eigenvalues Explained proportion
1 11.197874 37.33%
2 2.679118 8.93%
3 2.565002 8.55%
4 1.858621 6.20%
5 1.268067 4.23%
6 1.125324 3.75%
7 1.052669 3.51%
8 0.826507 2.76%
9 0.752391 2.51%
10 0.731164 2.44%
11 0.686631 2.29%
12 0.583688 1.95%
13 0.472453 1.57%
14 0.453675 1.51%
15 0.408879 1.36%
16 0.380853 1.27%
17 0.357907 1.19%
18 0.347102 1.16%
19 0.299762 1.00%
20 0.279588 0.93%
21 0.271779 0.91%
22 0.226782 0.76%
23 0.223101 0.74%
24 0.196697 0.66%
25 0.178388 0.59%
26 0.151631 0.51%
27 0.148246 0.49%
28 0.114703 0.38%
29 0.096510 0.32%
30 0.064887 0.22%
Now that you have calculated, presented and plotted tasks from c to f, you must explain each transformation thoroughly. Write a paragraph of 500 words at minimum that explains why
returns are important, compare and contrast PCA and SVD, explain what the eigenvectors, eigenvalues, singular values etc show us for the specific data, etc.
Answer
Returns are a measure of the performance of our investments and allow us to compare the performance of different investments. PCA is primarily used for dimensionality reduction, while
SVD has broader applications, including matrix approximation as well as dimensionality reduction (Hair 92). Both PCA and SVD produce orthogonal vectors and yield the same eigenvalues. PCA
involves computing the eigenvectors and eigenvalues of the covariance matrix, while SVD works directly with singular values; the singular values obtained from SVD can be used to calculate
the variance of each principal component (Schwarz et al. 2256). Each eigenvector transforms the standardized dataset into a principal component, which can explain a portion of the variance
of the dataset. The corresponding eigenvalue for a principal component is the amount of the data's total variance explained by that component, and the sum of all the eigenvalues is the total
variance of the data.
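As a quick numerical illustration of the PCA/SVD relationship described above (a sketch that reuses variables defined earlier, assuming s_st_return holds the singular values of the standardized returns):
# For a column-standardized matrix X with n rows, the eigenvalues of X.cov() equal the
# squared singular values of X divided by (n - 1).
n_obs = len(xlf_standardized_data)
print(np.allclose(np.sort(s_st_return**2 / (n_obs - 1)), np.sort(eigenvalues)))   # expect True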
Conclusion
Ensuring data quality in financial analysis is of paramount importance. Poor-quality data, whether structured or unstructured, hampers accurate modeling and reliable decision-making. For
instance, in the context of this assignment, structured data included currency symbols, placeholders, and improper classifications that hinder numerical operations. Similarly, unstructured
data exhibited missing values and formatting issues that complicated assessment.
The comparative analysis of the Nelson-Siegel and Cubic-Spline models highlighted their respective strengths and weaknesses. The Nelson-Siegel model, while easier to interpret, struggled
with certain maturities, whereas the Cubic-Spline model offered a more precise fit but lacked clear economic interpretation. Ethical considerations in data smoothing were addressed,
emphasizing transparency and accuracy over potential misrepresentation.
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) both proved essential in reducing dimensionality and understanding underlying data structures, offering
insights into returns and variability in investment performance. The screeplots revealed the differences in explained variance between uncorrelated and government data, emphasizing the
significance of correlation in principal components.
References
Akinyemi, Kemi, et al. "Yield curve extrapolation methods: Methodologies for valuing cash flows that extend beyond the maximum yield curve." Society of Actuaries, March 2019,
https://www.soa.org/resources/research-reports/2019/yield-curve-report/
Copeland, Ronald M. "Income smoothing." Journal of accounting research, (1968): 101-116, https://doi.org/10.2307/2490073
Hair, J. F., et al. Multivariate Data Analysis. 7th Edn Prentice Hall International. Prentice Hall, (2009).
Huh, Y. U., et al. "Data quality." Information and Software Technology, 32.8 (1990): 559-565, https://www.sciencedirect.com/science/article/pii/095058499090146I
Liu, Grace. "Data quality problems troubling business and financial researchers: A literature review and synthetic analysis." Journal of Business & Finance Librarianship, 25.3-4 (2020): 315-
371, https://doi.org/10.1080/08963568.2020.1847555
Nadinić, Berislav, and Damir Kalpić. "Data quality in finances and its impact on credit risk management and CRM integration." International Conference on Software and Data Technologies,
Special Session on Applications in Banking and Financing, (3; 2008). 2008, https://doi.org/10.5220/0001879103270331
Nymand-Andersen, Per. "Yield curve modelling and a conceptual framework for estimating yield curves: evidence from the European Central Bank's yield curves." ECB Statistics Paper, 27,
2018, https://doi.org/10.2866/892636
Praveen, Shagufta, and Umesh Chandra. "Influence of structured, semi-structured, unstructured data on various data models." International Journal of Scientific & Engineering Research,
8.12 (2017): 67-69, https://www.researchgate.net/publication/344363081
Schwarz, Christian, et al. "Principal component analysis and singular value decomposition used for a numerical sensitivity analysis of a complex drawn part." The International Journal of
Advanced Manufacturing Technology, 94 (2018): 2255-2265. https://link.springer.com/article/10.1007/s00170-017-0980-z
Sebastian-Coleman, Laura. Measuring data quality for ongoing improvement: a data quality assessment framework. Newnes, 2012,
https://www.sciencedirect.com/book/9780123970336/measuring-data-quality-for-ongoing-improvement
Strong, Diane M., et al. "Data quality in context." Communications of the ACM, 40.5 (1997): 103-110, https://doi.org/10.1145/253769.253804
Wahlstrøm, Ranik Raaen, et al. "A comparative analysis of parsimonious yield curve models with focus on the Nelson-Siegel, Svensson and Bliss versions." Computational Economics (2022):
1-38, https://doi.org/10.1007/s10614-021-10113-w
Wang, Richard Y., and Diane M. Strong. "Beyond accuracy: What data quality means to data consumers." Journal of management information systems, 12.4 (1996): 5-33,
https://doi.org/10.1080/07421222.1996.11518099.
Useful links
Data Society
Atharva Arya
In [ ]: %%capture
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('Group_Work_Project 1, group 8072.ipynb')
#set_matplotlib_formats('pdf', 'svg')