Week 4 - Probability Descriptive Statistics Cont (Post-Class)
Bivariate Distributions
So far we have considered the probability distributions of univariate random variables: Suncor stock return or S&P
TSX composite index return
Now, we extend the definitions and concepts to two random variables, say X and Y:
X=Suncor stock return and Y=S&P TSX composite index return
First, we discuss how to extend the concept of a probability distribution of a single random variable X to a joint
probability distribution of two random variables X and Y
We want to know if there is any relation between X and Y.
In particular, we want to know the probability that X takes on a particular value x and Y takes on a particular value y
That is, we want to determine p(x,y) = Pr(X = x , Y = y)
This joint probability distribution function determines the likelihood that the rv’s X and Y take on values in the joint
sample space for X and Y
Bivariate Distribution - Example
Consider two discrete random variables: the monthly return on Suncor stock (in percent), labelled X, and the monthly
return on Encana, denoted Y.
For simplicity we assume that the sample spaces for X and Y are $S_X = \{0, 1, 2, 3\}$ and $S_Y = \{0, 1\}$, respectively, so that the
random variables X and Y are discrete
Joint Distribution - Example
The joint distribution for X and Y is given by the following table:
    x \ y        0      1    Pr(X = x)
    0          1/8      0      1/8
    1          2/8    1/8      3/8
    2          1/8    2/8      3/8
    3            0    1/8      1/8
    Pr(Y = y)  4/8    4/8        1
Now, we can determine the probability that X takes on a particular
value x and Y takes on a particular value y,
i.e., p(x, y) = Pr(X = x, Y = y) from the values in the table on the right
Example: p(0, 0) = Pr(X = 0, Y = 0) = 1/8; p(1, 1) = Pr(X = 1, Y = 1) = 1/8
This is a joint probability distribution function because it makes a statement about the probability of two events
occurring together
The bivariate distribution is illustrated graphically in the figure below as a 3-dimensional bar chart:
Properties of a Joint pdf p(x,y)
The joint sample space for X and Y: $S_{XY} = \{(x, y) : x \in S_X, y \in S_Y\}$
The joint probability distribution function for X and Y is nonnegative for
all x and y in the joint sample space for X and Y: $p(x, y) \ge 0$ for $(x, y) \in S_{XY}$. In the example:
p(0,0) = 1/8, p(1,0) = 2/8, p(2,0) = 1/8, p(3,0) = 0
p(0,1) = 0, p(1,1) = 1/8, p(2,1) = 2/8, p(3,1) = 1/8
The joint probability distribution function for X and Y is zero for all x and y not in the joint sample space for X and
Y: $p(x, y) = 0$ for $(x, y) \notin S_{XY}$
The joint probabilities for X and Y sum to 1 over all x and y in the joint sample space for X and Y: $\sum_{x \in S_X} \sum_{y \in S_Y} p(x, y) = 1$
Marginal Distribution
The joint probability distribution tells the probability of X and Y occurring together.
What if we only want to know about the probability of X occurring or the probability of Y occurring?
Suppose that we want to find Pr(X=0) and Pr(Y=1) from a given joint distribution.
Consider the joint distribution in the table to the right:
What is Pr(X = 0) regardless of the value of Y?
Now X = 0 can occur if Y = 0 or if Y = 1 and since these two events are
mutually exclusive we have that:
Pr(X=0) = Pr(X=0,Y=0) + Pr(X=0,Y=1) = 1/8+0 = 1/8
Notice that this probability is equal to the horizontal (row) sum of the probabilities in the table at X=0.
Marginal Probability
Similarly, Pr(Y=1) = Pr(X=0,Y=1) + Pr(X=1,Y=1) + Pr(X=2,Y=1) + Pr(X=3,Y=1) = 0 + 1/8 + 2/8 + 1/8 = 4/8
Notice that this probability is equal to the vertical (column) sum of the probabilities in the table at Y=1.
Marginal Probability
The probability Pr(X=x) is the marginal probability distribution function of X and is in general given by $p(x) = \Pr(X = x) = \sum_{y \in S_Y} p(x, y)$
Similarly, the probability Pr(Y=y) is the marginal probability distribution function of Y and is in general given by $p(y) = \Pr(Y = y) = \sum_{x \in S_X} p(x, y)$
It is called a marginal probability distribution function because it depends only on totals found in the margins of the
table.
The marginal probabilities of X=x are given in the last column of the above table.
The marginal probabilities of Y=y are given in the last row of the table.
Notice that these probabilities sum to 1.
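The marginal sums can be checked with a short script. This is a minimal sketch: the joint table values are an assumption chosen to be consistent with the probabilities quoted in these notes (p(0,0) = 1/8, Pr(X=0) = 1/8, Pr(Y=1) = 4/8), and the names `joint` and `marginal` are illustrative only.

```python
from fractions import Fraction as F

# Joint pmf p(x, y), keyed by (x, y). Values are assumed from the example.
joint = {
    (0, 0): F(1, 8), (0, 1): F(0),
    (1, 0): F(2, 8), (1, 1): F(1, 8),
    (2, 0): F(1, 8), (2, 1): F(2, 8),
    (3, 0): F(0),    (3, 1): F(1, 8),
}

def marginal(joint, axis):
    """Sum the joint pmf over the other variable (axis 0 -> X, axis 1 -> Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], F(0)) + p
    return out

p_x = marginal(joint, 0)  # row sums
p_y = marginal(joint, 1)  # column sums

print("Pr(X = x):", {x: str(p) for x, p in p_x.items()})
print("Pr(Y = y):", {y: str(p) for y, p in p_y.items()})
print("marginals sum to 1:", sum(p_x.values()) == 1 and sum(p_y.values()) == 1)
```

Exact fractions are used so that the sums can be compared with 1 without rounding error.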
Conditional Probability
Suppose that we know that Y=0.
How does this knowledge affect the probability that X=0, 1, 2, or 3? That is, how can we use this
information to update the probability that X=0, 1, 2, or 3?
i.e., what are: Pr(X=0|Y=0), Pr(X=1|Y=0), Pr(X=2|Y=0), or Pr(X=3|Y=0) equal to?
Conditional Probability
Pr(X=0|Y=0) = Pr(X=0,Y=0)/Pr(Y=0) = (1/8)/(4/8) = 1/4 > Pr(X=0) = 1/8
Hence, knowledge that Y=0 does increase the likelihood that X=0
Clearly, X depends on Y, i.e., knowing that Y=0 gives us a higher
probability that X=0 (1/4) compared to not knowing that Y=0, in
which case the probability that X=0 is 1/8
In contrast, the marginal probability, Pr(X=0) ignores information about Y.
Now suppose that we know that X=0
How does this knowledge affect the probability that Y=0?
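To find out we compute Pr(Y=0|X=0) = Pr(X=0,Y=0)/Pr(X=0) = (1/8)/(1/8) = 1
Conditional Probability
Similarly, we can calculate: Pr(X=1|Y=0) = (2/8)/(4/8) = 1/2, Pr(X=2|Y=0) = (1/8)/(4/8) = 1/4, Pr(X=3|Y=0) = 0/(4/8) = 0
In general, the conditional probability that X = x given that Y = y (provided that Pr(Y = y) ≠ 0) is $p(x \mid y) = \Pr(X = x \mid Y = y) = \frac{\Pr(X = x, Y = y)}{\Pr(Y = y)}$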
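The conditional probabilities can be computed by dividing one column of the joint table by its column total. This is a minimal sketch: the table values are assumed to match the example, and `conditional_x_given_y` is a hypothetical helper name.

```python
from fractions import Fraction as F

# Joint pmf p(x, y), keyed by (x, y). Values are assumed from the example.
joint = {
    (0, 0): F(1, 8), (0, 1): F(0),
    (1, 0): F(2, 8), (1, 1): F(1, 8),
    (2, 0): F(1, 8), (2, 1): F(2, 8),
    (3, 0): F(0),    (3, 1): F(1, 8),
}

def conditional_x_given_y(joint, y):
    """Return {x: Pr(X = x | Y = y)}; requires Pr(Y = y) != 0."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("Pr(Y = y) must be nonzero")
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_x_given_y(joint, 0)
for x, p in sorted(cond.items()):
    print(f"Pr(X={x} | Y=0) = {p}")
```

A conditional distribution is itself a probability distribution, so the values in `cond` sum to 1.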
Independence
Let X and Y be two discrete random variables with:
pdfs: p(x), p(y)
sample spaces: $S_X$, $S_Y$
Then X and Y are (statistically) independent random variables if and only if the joint pdf of X and Y is the product of
the individual pdfs: $p(x, y) = p(x) \cdot p(y)$ for all x in $S_X$ and y in $S_Y$.
If X and Y are independent random variables, then the conditional pdf of X given Y (or Y given X) is equal to its
respective marginal pdf: $p(x \mid y) = p(x)$ and $p(y \mid x) = p(y)$
Intuition
X and Y are independent if knowledge of X does not influence probabilities associated with Y and knowledge of
Y does not influence probabilities associated with X.
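The product rule gives a direct way to test a joint table for independence. This is a minimal sketch under the assumption that the joint table is the one from the example; the helpers `marginals` and `is_independent` are illustrative names.

```python
from fractions import Fraction as F

def marginals(joint):
    """Return the marginal pmfs of X and Y from a joint pmf keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, F(0)) + p
        p_y[y] = p_y.get(y, F(0)) + p
    return p_x, p_y

def is_independent(joint):
    """True if p(x, y) == p(x) * p(y) for every pair in the joint sample space."""
    p_x, p_y = marginals(joint)
    return all(joint.get((x, y), F(0)) == p_x[x] * p_y[y]
               for x in p_x for y in p_y)

# Joint table assumed from the example (X and Y are dependent here).
dependent = {
    (0, 0): F(1, 8), (0, 1): F(0),
    (1, 0): F(2, 8), (1, 1): F(1, 8),
    (2, 0): F(1, 8), (2, 1): F(2, 8),
    (3, 0): F(0),    (3, 1): F(1, 8),
}

# A joint table built as the product of the same marginals is independent.
p_x, p_y = marginals(dependent)
product = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

print(is_independent(dependent))  # False: p(0,0) = 1/8 but p(0) * p(0) = 1/16
print(is_independent(product))    # True by construction
```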
Bivariate Distributions for Continuous RV
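The joint pdf of continuous rv’s X and Y is a non-negative function f(x, y) such that $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$
The three-dimensional plot of the joint probability distribution gives a probability
surface whose total volume is unity.
Let [x1, x2] and [y1, y2] be intervals on the real line. Then $\Pr(x_1 < X < x_2,\; y_1 < Y < y_2) = \int_{x_1}^{x_2} \int_{y_1}^{y_2} f(x, y)\,dy\,dx$
Example of a bivariate standard normal distribution: $f(x, y) = \frac{1}{2\pi} e^{-\frac{1}{2}(x^2 + y^2)}$ for $-\infty < x, y < \infty$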
Bivariate Standard Normal Distribution
To find Pr(−1 < X < 1, −1 < Y < 1), we need to solve $\int_{-1}^{1} \int_{-1}^{1} \frac{1}{2\pi} e^{-\frac{1}{2}(x^2 + y^2)}\,dx\,dy$
Numerical approximation methods (available in Excel) are required to evaluate the above integral.
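As an alternative to Excel, the integral can be approximated with a simple midpoint rule. This is a minimal sketch that assumes the bivariate standard normal density has uncorrelated components, $f(x, y) = \frac{1}{2\pi} e^{-(x^2 + y^2)/2}$; under that assumption the probability also factors into a product of two univariate normal probabilities, which serves as a check. The function names are illustrative.

```python
import math
from statistics import NormalDist

def density(x, y):
    """Bivariate standard normal density, assuming zero correlation."""
    return math.exp(-0.5 * (x * x + y * y)) / (2 * math.pi)

def midpoint_integral(f, x1, x2, y1, y2, n=200):
    """Approximate the double integral of f over [x1, x2] x [y1, y2]."""
    hx, hy = (x2 - x1) / n, (y2 - y1) / n
    total = 0.0
    for i in range(n):
        x = x1 + (i + 0.5) * hx
        for j in range(n):
            y = y1 + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy

approx = midpoint_integral(density, -1, 1, -1, 1)

# With zero correlation the probability factors: Pr(-1 < X < 1) * Pr(-1 < Y < 1).
phi = NormalDist().cdf
exact = (phi(1) - phi(-1)) ** 2

print(f"midpoint rule: {approx:.4f}, product of marginals: {exact:.4f}")
```

Both numbers should agree to about four decimal places, at roughly 0.466.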
Covariance and Correlation
In panel (a) we see no relationship between X and Y
In panel (b) we see a perfectly positive linear
relationship between X and Y
In panel (c) we see a perfectly negative linear
relationship
In panel (d) we see a positive, but less than perfect,
linear relationship.
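Covariance
Definition: $\sigma_{XY} = \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_{x \in S_X} \sum_{y \in S_Y} (x - \mu_X)(y - \mu_Y)\, p(x, y) = E[XY] - \mu_X \mu_Y$
Covariance - Example
Example: For the joint distribution of X and Y given earlier, $\mu_X = 3/2$, $\mu_Y = 1/2$ and $E[XY] = 1$, so $\mathrm{Cov}(X, Y) = 1 - (3/2)(1/2) = 1/4$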
Properties of Covariances
Let X and Y be random variables and let a and b be constants.
Some important properties of Cov(X, Y) are
Cov(X, X) = Var(X)
Cov(X, Y) = Cov(Y, X)
Cov(aX, bY) = a·b·Cov(X, Y)
If X and Y are independent then Cov(X, Y) = 0 (i.e. no association implies no linear association)
However, if Cov(X, Y) = 0, then X and Y are not necessarily independent (no linear association does not
imply no association)
Correlation
Correlation: measures both the direction and strength of the linear relationship between any two random variables
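The correlation between two random variables X and Y is given by $\rho_{XY} = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
i.e. the correlation coefficient is a scaled/normalized covariance
Example: For the data in the table, we have $\rho_{XY} = \frac{1/4}{\sqrt{(3/4)(1/4)}} = 0.577$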
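The covariance and correlation can be reproduced from a joint table. This is a minimal sketch; the table values are an assumption, chosen because they are consistent with the probabilities quoted in the example and with the reported correlation of 0.577.

```python
import math
from fractions import Fraction as F

# Joint pmf p(x, y), keyed by (x, y). Values are assumed from the example.
joint = {
    (0, 0): F(1, 8), (0, 1): F(0),
    (1, 0): F(2, 8), (1, 1): F(1, 8),
    (2, 0): F(1, 8), (2, 1): F(2, 8),
    (3, 0): F(0),    (3, 1): F(1, 8),
}

# Means of X and Y.
mu_x = sum(x * p for (x, _), p in joint.items())
mu_y = sum(y * p for (_, y), p in joint.items())

# Variances and covariance as probability-weighted sums of deviations.
var_x = sum((x - mu_x) ** 2 * p for (x, _), p in joint.items())
var_y = sum((y - mu_y) ** 2 * p for (_, y), p in joint.items())
cov_xy = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

# Correlation is the covariance scaled by both standard deviations.
rho = float(cov_xy) / math.sqrt(float(var_x * var_y))

print(f"mu_X = {mu_x}, mu_Y = {mu_y}")
print(f"Var(X) = {var_x}, Var(Y) = {var_y}, Cov(X, Y) = {cov_xy}")
print(f"Corr(X, Y) = {rho:.3f}")
```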
Correlation
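Properties of Correlations:
$-1 \le \rho_{XY} \le 1$
If $\rho_{XY} = 1$, then X and Y are perfectly positively linearly related
If $\rho_{XY} = -1$, then X and Y are perfectly negatively linearly related
If $\rho_{XY} = 0$, then X and Y are not linearly related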
Linear Combinations of Two RV (Review)
Let X and Y be random variables
Define a new random variable Z that is a linear combination of X and Y : Z = aX + bY , where a and b are constants
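Then $\mu_Z = E[Z] = a\mu_X + b\mu_Y$
And $\sigma_Z^2 = \mathrm{Var}(Z) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\mathrm{Cov}(X, Y)$
And, equivalently, $\sigma_Z^2 = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\rho_{XY}\sigma_X\sigma_Y$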
Result: A linear combination of two normally distributed random variables is itself a normally distributed random
variable.
Portfolio Returns (Review)
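$R_A$ = return on asset A with $E[R_A] = \mu_A$ and $\mathrm{Var}(R_A) = \sigma_A^2$
$R_B$ = return on asset B with $E[R_B] = \mu_B$ and $\mathrm{Var}(R_B) = \sigma_B^2$
Portfolio
$x_A$ = share of wealth invested in asset A
$x_B$ = share of wealth invested in asset B
$x_A + x_B = 1$
The portfolio return is $R_p = x_A R_A + x_B R_B$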
Portfolio Returns and Risk
How much wealth should be invested in assets A and B?
Portfolio expected return (this is the gain from investing): $\mu_p = E[R_p] = x_A\mu_A + x_B\mu_B$
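The trade-off between expected return and risk can be explored numerically. This is a minimal sketch that applies the linear combination result to a two-asset portfolio; the function name and all input numbers are hypothetical and are not estimates for any real asset.

```python
import math

def portfolio_mean_sd(x_a, mu_a, mu_b, sd_a, sd_b, rho):
    """Expected return and SD of a two-asset portfolio with x_B = 1 - x_A."""
    x_b = 1.0 - x_a
    mean = x_a * mu_a + x_b * mu_b
    var = (x_a ** 2 * sd_a ** 2 + x_b ** 2 * sd_b ** 2
           + 2 * x_a * x_b * rho * sd_a * sd_b)
    return mean, math.sqrt(var)

# Hypothetical inputs: asset A has the higher mean and the higher risk.
mu_a, sd_a = 0.15, 0.25
mu_b, sd_b = 0.05, 0.10
rho = 0.2

for x_a in (0.0, 0.25, 0.5, 0.75, 1.0):
    mean, sd = portfolio_mean_sd(x_a, mu_a, mu_b, sd_a, sd_b, rho)
    print(f"x_A = {x_a:.2f}: expected return = {mean:.4f}, SD = {sd:.4f}")
```

With a correlation below 1, the portfolio SD is less than the weighted average of the two asset SDs, which is the diversification effect.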
Multi-Period Continuously Compounded Return
Let $r_t = \ln(1 + R_t)$ be monthly continuously compounded returns.
Assume that $r_t \sim \text{iid } N(\mu, \sigma^2)$ for all t so that $\mathrm{Cov}(r_t, r_s) = 0$ for $t \ne s$
Then the annual cc return is equal to the sum of the twelve monthly cc returns: $r_{\text{annual}} = r_1 + r_2 + \dots + r_{12}$
Since each monthly return is normally distributed, the annual return is
also normally distributed.
Then the expected annual return: $E[r_{\text{annual}}] = \sum_{t=1}^{12} E[r_t] = 12\mu$
Hence, the expected 12-month (annual) return is equal to 12 times the expected monthly return.
The variance of the annual return: $\mathrm{Var}(r_{\text{annual}}) = \sum_{t=1}^{12} \mathrm{Var}(r_t) = 12\sigma^2$, so that the annual variance is also
equal to 12 times the monthly variance.
Multi-Period Continuously Compounded Return
The SD of the annual return: $\mathrm{SD}(r_{\text{annual}}) = \sqrt{12}\,\sigma$
Hence, the annual standard deviation is $\sqrt{12}$ times the monthly standard deviation (this result is famously known
as the square root of time rule)
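The scaling rules can be written as a small helper. This is a minimal sketch assuming uncorrelated monthly cc returns with constant mean and variance; the function name and the monthly inputs are hypothetical.

```python
import math

def annualize(mu_monthly, sd_monthly, periods=12):
    """Scale monthly cc mean, variance and SD to an annual horizon.

    Assumes the monthly returns are uncorrelated with a constant mean and
    variance, so the mean and variance scale with the number of periods.
    """
    mean = periods * mu_monthly
    var = periods * sd_monthly ** 2
    sd = math.sqrt(periods) * sd_monthly  # square root of time rule
    return mean, var, sd

# Hypothetical monthly inputs: 1% mean, 5% standard deviation.
mean_a, var_a, sd_a = annualize(0.01, 0.05)
print(f"annual mean = {mean_a:.4f}, variance = {var_a:.4f}, SD = {sd_a:.4f}")
```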
Data Analysis – Excel Add-in
We will be using the Data Analysis ToolPak Add-In for Excel in this course extensively!
To activate it:
File -> Options -> Add-Ins on the Left Sidebar
Highlight Analysis ToolPak & hit GO (not OK)
Check the Analysis ToolPak Option and hit OK
To see the Data Analysis tab -> go to DATA tab and at the far right end banner you should see Data Analysis under the
Analysis section
Refer to the Excel Primer pdf on the Learn site throughout the course if you ever need to go back and remember how we
use various Data Analysis features to calculate statistics and figures
Population & Samples
A population is defined as all members of a specified group
A descriptive measure of a population characteristic (mean, variance) is a parameter
Measurement scales:
Nominal scales categorize data – do not rank them
Interval scales: differences between scale values are equal
Data can also be on a ratio scale; example: returns, earnings per share.
Concept of Random Sampling
A random sample is a sequence of (usually an infinite number of) independently and
identically distributed (i.i.d.) random variables with an unknown pdf, p(x)
An observed sample (which we call data) is a (usually finite) set of
observations generated by the random sample
Histogram
A frequency distribution is a tabular display of data summarized in a relatively small number of intervals
A histogram is the graphical equivalent of a frequency distribution
A histogram is used to describe the shape of the distribution of the observed sample (or data):
How to construct a histogram?
Order data from smallest to largest values; min = smallest value, max = largest value, range = max – min
Bin width (Scott’s normal reference rule) = 3.5 × standard deviation / (number of observations)^(1/3)
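The construction steps can be sketched outside Excel as well. This is a minimal sketch: it assumes the sample standard deviation is the one intended by the rule, and the return values and function names are hypothetical.

```python
import math
import statistics

def scott_bin_width(data):
    """Scott's normal reference rule: 3.5 * SD / n^(1/3)."""
    return 3.5 * statistics.stdev(data) / len(data) ** (1 / 3)

def histogram_counts(data, width):
    """Count observations in equal-width bins starting at the minimum."""
    lo = min(data)
    n_bins = max(1, math.ceil((max(data) - lo) / width))
    counts = [0] * n_bins
    for value in data:
        # The maximum can land exactly on the last edge, so clamp the index.
        k = min(int((value - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts

# Hypothetical monthly cc returns (illustration only).
returns = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004,
           0.031, -0.015, 0.067, -0.002, 0.018, -0.041]

width = scott_bin_width(returns)
counts = histogram_counts(returns, width)
print(f"range = {max(returns) - min(returns):.3f}, bin width = {width:.4f}")
print("frequencies:", counts)
```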
Monthly CC Returns - Histogram
[Figure: Suncor Monthly CC Returns Histogram (y-axis: Frequency)]
The histogram has a bell shape like the normal distribution and is centered around values slightly more than zero
The bulk of the Suncor returns are between -5% and 15%.
The histogram for Suncor is slightly skewed left (long left tail)
Take 5 minutes to open “Descriptive Statistics – In-Class Problems” and attempt the S&P TSX Returns Histogram tab
Eliminating gaps between bars in a histogram (Excel primer pp. 4-8)
Right-mouse the column bar. Click Format. Hit Format Selection on the left side.
On the right side under Format Data Series, change Gap Width from 150% to 0%.
Hit ENTER.
Monthly Price Data Time Plot
[Figure: Suncor Adjusted Closing Price (CAD), 2001-2023]
[Figure: S&P TSX Composite Index, 2001-2023]
What do you observe about asset prices in the plots shown?
The prices exhibit random-walk-like behavior with no tendency for the observations on the prices to revert to a
constant (or time independent) mean and, thus, appear to be non-stationary
Both the Suncor stock price and the S&P TSX Composite index show the run-up to the global financial crisis of 2008
and then the sharp drop and the subsequent recovery after the financial crisis. In 2022, there is a variation between
the two - the index has passed the highs of January 2020 while Suncor’s price dropped sharply at the start of the
health crisis and has since recovered.
There is a common trend observed between the two price series.
Monthly CC Returns Time Plot
[Figure: SUNCOR Monthly CC Return, 2001-2023]
[Figure: S&P TSX Composite Monthly CC Returns, 2001-2023]
What do you observe about asset returns in the plots shown?
In contrast to asset prices, asset returns are mean-reverting and the common monthly mean values seem close to zero
The constant mean value assumption of stationarity looks to hold.
However, the volatility (i.e., the fluctuation of returns about the mean) of both series appears to change over time
Both series show higher volatility during the 2008 financial crisis and the 2020 health crisis
This is an indication of time-varying conditional volatility (which is a form of non-stationarity in volatility).
There does not appear to be any evidence of systematic time dependence in the returns
Later on we will see that the estimated autocorrelation coefficients (which is a new concept to be discussed) are very close to zero
The returns for Suncor and the S&P TSX index tend to move together, suggesting a positive correlation.
Monthly CC Returns Time Plot (Another Perspective)
[Figure: Monthly CC Returns 2001-2023; series: Suncor cc return (SU_Ret) and S&P TSX cc return (S&P_TSX_Ret)]
In general, the lower volatility of the S&P TSX index
represents the reduced risk of a large diversified portfolio.
Empirical Quantiles
Mean and variance describe the shape characteristics of a distribution of data such as continuously compounded
returns
Often we are also interested in describing a relative location of a particular measurement within a given data set
When a company XYZ reports that its yearly sales are in the 90th percentile of all companies in the industry – what
does it mean?
It means that 90% of all companies in this industry have yearly sales less than XYZ, and only 10% have yearly
sales greater than XYZ
Percentiles
Empirical percentiles that partition a data set into 4 segments, with each segment containing 25% of the measurements, are
known as quartiles.
The lower (or first) quartile is the 25th percentile,
The middle (or second) quartile is the median or 50th percentile,
The upper (or third) quartile is the 75th percentile.
The second empirical quartile is the sample median and is the data point such that half of the data is less than or equal to its
value.
The distance between the upper (3rd) and lower (1st) quartiles is known as the interquartile range (IQR): $\mathrm{IQR} = Q_3 - Q_1$
IQR shows the spread of the middle 50% of the distribution of the data
Quartiles are useful in finding unusual observations in a data set.
Use PERCENTILE.INC (INC representing inclusive) to calculate percentiles of a dataset
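Quartiles and the IQR can also be computed with the Python standard library. This is a minimal sketch: `statistics.quantiles` with `method="inclusive"` interpolates between order statistics in a way that should correspond to an inclusive percentile calculation, but treat that correspondence as an assumption and check it against your spreadsheet. The data values are hypothetical.

```python
import statistics

def quartiles_inclusive(data):
    """Return (Q1, median, Q3) using the inclusive interpolation method."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q1, q2, q3

# Hypothetical monthly cc returns (illustration only).
returns = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004,
           0.031, -0.015, 0.067, -0.002, 0.018, -0.041]

q1, median, q3 = quartiles_inclusive(returns)
iqr = q3 - q1
print(f"Q1 = {q1:.4f}, median = {median:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
```

For the small data set 1, 2, 3, 4, 5 the inclusive method returns quartiles of 2, 3 and 4.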
Sample Statistics
To calculate sample quantities for the mean, variance (or standard deviation), skewness and kurtosis of our financial
data, two critical assumptions about the data must be met:
1. Data must be covariance (or weakly) stationary, so that the population quantities for the mean, variance (or
standard deviation), skewness and kurtosis of the data are constants and not functions of time. This allows the
sample quantities to be calculated as sample averages
2. Over the sample of observation (t=1,..,T), there must be only one regime/process generating the data, so that
sample quantities can be calculated as one sample average for each moment.
Under these two assumptions, we calculate the sample mean, variance (or standard deviation), skewness and kurtosis
as follows:
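The sample moments can be sketched as plain sample averages. This is a minimal sketch under a stated assumption: the variance, skewness and kurtosis below all use a divisor of T − 1, which is one common textbook convention. Spreadsheet tools may apply different bias adjustments (and often report excess kurtosis), so the numbers need not match them exactly. The function name is illustrative.

```python
import math

def sample_moments(x):
    """Return (mean, variance, SD, skewness, kurtosis) of a sample.

    Assumed convention: divisor T - 1 for the second, third and fourth
    central moments; kurtosis is not reported in excess of 3.
    """
    t = len(x)
    mean = sum(x) / t
    var = sum((v - mean) ** 2 for v in x) / (t - 1)
    sd = math.sqrt(var)
    skew = sum((v - mean) ** 3 for v in x) / ((t - 1) * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in x) / ((t - 1) * sd ** 4)
    return mean, var, sd, skew, kurt

# Hypothetical monthly cc returns (illustration only).
returns = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004,
           0.031, -0.015, 0.067, -0.002, 0.018, -0.041]

mean, var, sd, skew, kurt = sample_moments(returns)
print(f"mean = {mean:.4f}, SD = {sd:.4f}, skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
```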
Outliers
Extremely large or small values are called “outliers”
Outliers can be thought of in two ways:
First, an outlier can be the result of a data entry error - the outlier is not a valid observation and should be
removed from the data sample
Second, an outlier can be a valid data point - the outlier
provides important information and should not be removed from the data sample
For financial market data, outliers are typically extremely large or small values that could be the result of a data entry
error (e.g. price entered as 1 instead of 10) or a valid outcome associated with some unexpected news.
Outliers are problematic for data analysis because they can greatly influence the value of sample statistics: the sample
mean, variance, standard deviation, skewness and kurtosis
Percentile measures are more robust to outliers; outliers do not greatly influence these measures (e.g. median instead
of mean; IQR instead of SD)
IQR (interquartile range) – outlier robust measure of spread
Outliers
To illustrate the impact of outliers on sample statistics, simulated data (i.i.d. N(0,1)) is polluted by a single
large negative outlier
The above table compares the sample statistics of the unpolluted and polluted data.
The sample statistics are influenced by the outliers:
mean
skewness
kurtosis
standard deviation
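The effect of a single outlier can be reproduced with simulated data. This is a minimal sketch: the seed, the sample size of 100 and the outlier value of −10 are arbitrary assumptions, and the output will not match the table from the slides.

```python
import random
import statistics

def summary(data):
    """Mean and SD (moment based) next to median and IQR (percentile based)."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),
        "median": med,
        "iqr": q3 - q1,
    }

random.seed(0)  # arbitrary seed so the run is repeatable
clean = [random.gauss(0, 1) for _ in range(100)]

# Pollute the sample: replace one observation with a large negative outlier.
polluted = clean.copy()
polluted[0] = -10.0

clean_stats = summary(clean)
polluted_stats = summary(polluted)
for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: clean = {clean_stats[name]: .4f}, polluted = {polluted_stats[name]: .4f}")
```

The mean and standard deviation shift noticeably, while the median and IQR should move much less.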
Sample Statistics - Example
Excel – Under Data Analysis, go to Descriptive Statistics -> Let’s Try it Out
Take 5 minutes to open “Descriptive Statistics – In-Class Problems” and attempt the S&P TSX Returns DStats tab
Calculate:
– Descriptive statistics
– 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentiles
– Interquartile range
– Moderate outliers
– Extreme outliers
Additional Measures of Dispersion
Relative Dispersion: Coefficient of Variation = standard deviation/mean
Free of scale – allows comparison of dispersion across datasets - how much dispersion exists relative to the
mean
Empirical CDF
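Recall that the CDF of a rv X is $F_X(x) = \Pr(X \le x)$
Then the empirical CDF of a random sample is $\hat{F}_X(x) = \frac{1}{T} \sum_{t=1}^{T} 1\{x_t \le x\}$, i.e. the fraction of observations less than or equal to x
How to compute and plot the empirical CDF
for a sample of data $\{x_1, \dots, x_T\}$?
Sort data from smallest to largest values in the form of order statistics: $x_{(1)} \le x_{(2)} \le \dots \le x_{(T)}$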
Calculating the Empirical CDF
Question: Does the observed data come from a normal distribution?
Let’s see what this looks like in Excel
To answer this question, we follow the steps given below:
Step 1. Standardize data to have a zero mean and a variance equal to one
Step 2. Sort the standardized data from smallest to largest and compute the empirical CDF at each sorted value
Step 3. Compute the standard normal (also known as Gaussian White Noise – GWN) CDF at each sorted value
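The comparison of the empirical CDF with the standard normal CDF can be sketched as follows. This is a minimal sketch: it assumes the empirical CDF at the i-th sorted value is i/T and that standardization uses the sample standard deviation. The data and the function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ecdf_vs_normal(data):
    """Return rows (z, empirical CDF, standard normal CDF) at each sorted z."""
    m, s = mean(data), stdev(data)
    z_sorted = sorted((x - m) / s for x in data)  # standardize, then sort
    t = len(z_sorted)
    phi = NormalDist().cdf                        # standard normal CDF
    return [(z, (i + 1) / t, phi(z)) for i, z in enumerate(z_sorted)]

# Hypothetical monthly cc returns (illustration only).
returns = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004,
           0.031, -0.015, 0.067, -0.002, 0.018, -0.041]

rows = ecdf_vs_normal(returns)
for z, emp, norm in rows:
    print(f"z = {z: .3f}  empirical CDF = {emp:.3f}  normal CDF = {norm:.3f}")
```

If the data are close to normal, the two CDF columns should track each other closely when plotted against z.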
Value at Risk (Review)
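Let $\{R_t\}_{t=1}^{T}$ denote a sample of T simple monthly returns on an investment.
Let $W_0$ be the initial value of an investment
For $\alpha \in (0, 1)$, the historical $\mathrm{VaR}_\alpha$ is $W_0\,\hat{q}_\alpha^{R}$ for simple returns, where $\hat{q}_\alpha^{R}$ is the empirical α-quantile of $\{R_t\}$
Note: For cc returns $r_t$, we use $\mathrm{VaR}_\alpha = W_0(\exp(\hat{q}_\alpha^{r}) - 1)$, where $\hat{q}_\alpha^{r}$ is the empirical α-quantile of $\{r_t\}$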
Consider investing $10,000 in Suncor for a month. Calculating the VaR at 1% gives VaR0.01 =
10,000*(exp(q0.01) - 1), which corresponds to a loss of $1,854. So we say that a $10,000 monthly investment in Suncor will lose $1,854 or more
with 1% probability (recall from last week!)
If the corresponding VaR at 1% for the S&P TSX is $858, since this is considerably smaller than Suncor’s 1% VaR,
we can say that investing in Suncor is riskier than investing in the S&P TSX index.
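The historical VaR calculation for cc returns can be sketched in a few lines. This is a minimal sketch: the quantile uses linear interpolation between order statistics (an inclusive-style convention, assumed here), the return series is hypothetical, and the function names are illustrative. The formula returns a negative number for a loss, which is usually quoted as a positive dollar amount.

```python
import math

def quantile_inclusive(data, alpha):
    """Empirical alpha-quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    pos = alpha * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def historical_var_cc(cc_returns, w0, alpha):
    """Historical VaR for cc returns: W0 * (exp(q_alpha) - 1)."""
    return w0 * (math.exp(quantile_inclusive(cc_returns, alpha)) - 1)

# Hypothetical monthly cc returns (illustration only).
returns = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004,
           0.031, -0.015, 0.067, -0.002, 0.018, -0.041]

var_1pct = historical_var_cc(returns, w0=10_000, alpha=0.01)
print(f"1% historical VaR on $10,000: {var_1pct:.2f}")
```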
Quantile-Quantile (Q-Q) Plot
Let’s see what this looks like in Excel
A normal probability or Quantile-Quantile (QQ) plot is useful for comparing the quantiles of the data with the quantiles of a specified
reference distribution (usually a normal distribution) that we think is appropriate for the return data, i.e. if we
believe the distribution is normal and want to check it
The QQ-plot is an XY plot with the reference distribution quantiles (normal distribution quantiles) on the x-axis and the
empirical quantiles (Suncor empirical quantiles) on the y-axis.
How to construct a QQ Plot
1. Column C is rank, i ranging from 1: n (n is number of observations in the data series)
2. Column D is the sorted Suncor returns
3. Column E is the cumulative relative frequency: i/n
4. Column F lists the standard normal quantiles: NORMINV(E2,0,1)
5. Column F values are copied and pasted as values in column G
6. Column H is the standardized Suncor returns
7. Highlight columns F, G and H and draw a scatter XY plot.
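The same construction can be sketched in Python. This is a minimal sketch with one deliberate difference from the steps above: it uses plotting positions (i − 0.5)/n instead of i/n, because the normal quantile at probability 1 is not finite and `NormalDist.inv_cdf` rejects it. That substitution is an assumption of this sketch, and the data and function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (normal quantile, standardized empirical quantile) pairs."""
    m, s = mean(data), stdev(data)
    z_sorted = sorted((x - m) / s for x in data)  # standardized, sorted returns
    n = len(z_sorted)
    inv = NormalDist().inv_cdf                    # standard normal quantile function
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    return [(inv((i - 0.5) / n), z) for i, z in enumerate(z_sorted, start=1)]

# Hypothetical monthly cc returns (illustration only).
returns = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004,
           0.031, -0.015, 0.067, -0.002, 0.018, -0.041]

points = qq_points(returns)
for theoretical, empirical in points:
    print(f"normal quantile = {theoretical: .3f}  empirical quantile = {empirical: .3f}")
```

Plotting the pairs as an XY scatter, together with the 45-degree line, gives the QQ plot.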
Q-Q Plot Interpretation
[Figure: Q-Q Plot, Suncor Monthly CC returns]
We can interpret the QQ plot in the following way:
If all of the points are close to a straight line, then the
reference distribution we conjecture is appropriate
If the points do not fall close to a straight line, then the
reference distribution we conjecture is not appropriate and
we should consider a different distribution instead
The closer the red dots are to the blue dots, the more
plausible it is that the data is sampled from a normally
distributed population.
The QQ plot for Suncor’s returns indicates that there are
outliers, indicating deviation from a normal distribution.
Bivariate Descriptive Statistics
[Figure: scatter plot, SUNCOR CC Ret vs. S&P TSX CC Ret]
Sample covariance and sample correlation between Suncor and S&P TSX cc returns
(Use Data/Data Analysis/Analysis Tools/Covariance)
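The sample covariance and correlation can also be computed directly. This is a minimal sketch: it assumes a divisor of T − 1, so a tool that divides by T will report a slightly smaller covariance. The two return series and the function names are hypothetical.

```python
import math

def sample_cov(x, y):
    """Sample covariance with divisor T - 1 (assumed convention)."""
    t = len(x)
    mx, my = sum(x) / t, sum(y) / t
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (t - 1)

def sample_corr(x, y):
    """Sample correlation: covariance scaled by both sample SDs."""
    sd_x = math.sqrt(sample_cov(x, x))
    sd_y = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (sd_x * sd_y)

# Hypothetical monthly cc returns for a stock and an index (illustration only).
stock = [0.021, -0.034, 0.012, 0.055, -0.087, 0.004]
index = [0.010, -0.015, 0.008, 0.020, -0.040, 0.001]

print(f"sample covariance = {sample_cov(stock, index):.6f}")
print(f"sample correlation = {sample_corr(stock, index):.4f}")
```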
Wrap Up & Next Class
Market/Economics Graphics Report &
Presentation
Goal: To link economics & finance to a topic of your choice. Find an article or research a topic that
interests you and your group. Link that topic or article to the concept of economics or finance.
The selection of your topics is pretty open ended – you can really discuss anything as long as it has a relation to
economics and enough content to create a report, graphic and presentation.
Format:
• Max 2-page report (excluding references) + 1 pager graphic (graphic not included in the 2-page count)
• The graphic is meant to be an infographic that the user can read and pick up the key concepts of your report
from
Presentation:
• 10-minute presentation to the class with a 5 min Q&A session
• Q&A team will ask questions to the presenting team and if time permits, we will open up Q&A to the entire
class
• First presentations will take place on October 22 – your report & slide deck are due to the dropbox by
12PM on the day of your presentation
Now What for Week 5?
Week 5 Focus: Constant Expected Return (CER) Model
• What does the CER mean?
• How can we define error terms?
• Estimating regression parameters of the CER model
• Statistical properties of estimators
Problem Sets:
• Problem Set 3 – Descriptive Statistics now available on Learn (attempt to complete it to test your
understanding)
• Review the Probability Review (Part V), Descriptive Stats (Part I) and Descriptive Stats (Part II) excel files
on Learn for the sample calculations
Assignments Due:
• Assignment 2 – Random Variables & Descriptive Statistics due 7PM on October 8
• Please review the assigned stock information (under the Admin folder on Learn) to see what stock you
have been assigned. Note that you will stick with this stock to complete all assignments in the course