
Review of Statistics

1 Random Variables and Key Statistics


Random Variable: A random variable is a variable that takes on numerical values from a sample space, with the values determined by chance according to a probability distribution f(x). For example, the outcome of rolling a fair die is a random variable with possible values 1, . . . , 6, each with probability 1/6. A random variable is discrete if it can assume at most a countable number of values.
Key statistics for a random variable, X:
• Expected value: µ = E(X) = Σ_{all x} x f(x); for example, µ = Σ_{x=1}^{6} (1/6)x for rolling a fair die.

• Variance: measures the dispersion of a random variable, the average squared distance to the mean:

  σ² = V(X) = E[(X − µ)²] = Σ_{all x} (x − µ)² f(x)

  or

  σ² = V(X) = E(X²) − [E(X)]²
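These formulas are easy to check numerically for the fair-die example. Below is a minimal Python sketch (added for illustration, not part of the original notes) that computes E(X) and V(X) directly from the probability distribution f(x):

```python
import numpy as np

# Fair die: values 1..6, each with probability 1/6
x = np.arange(1, 7)
f = np.full(6, 1 / 6)

mu = np.sum(x * f)                      # E(X) = sum of x * f(x)
var = np.sum((x - mu) ** 2 * f)         # V(X) = sum of (x - mu)^2 * f(x)
var_alt = np.sum(x ** 2 * f) - mu ** 2  # equivalent form E(X^2) - [E(X)]^2

print(mu, var, var_alt)                 # 3.5, about 2.9167, about 2.9167
```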

Hypothesis testing: Two-Tailed, Large-Sample Test for the Population Mean


• H0 : µ = µ0
  H1 : µ ≠ µ0

• The significance level of the test: α (usually, we set α = 0.01, 0.05, or 0.1)

• Test statistic: z = (x̄ − µ0)/(s/√n)

• Critical values: ±Zα/2

• The decision rule: Reject the null hypothesis if either z > Zα/2 or z < −Zα/2.
Example 1.1 An insurance company executive believes that, over the last few years, the average liability insurance per board seat in companies defined as “small companies” has been $2,000. A recent survey of small businesses by Growth Resources, Inc., reports that the average liability tab per board seat in their sample is $2,700. Assume that the sample used by Growth Resources contained 100 randomly chosen small firms (as defined by their total annual gross billing) and that the sample standard deviation was $947. Do these sampling results provide evidence to reject the executive’s claim that the average liability per board seat is $2,000, using an α = 0.01 level of significance?
Answer: We set H0 : µ = 2000 and H1 : µ ≠ 2000. Since α = 0.01, the two critical values are ±Zα/2 = ±2.575, while the test statistic is

z = (x̄ − µ0)/(s/√n) = (2700 − 2000)/(947/√100) = 700/94.7 = 7.39 > Zα/2
Thus, we reject the null hypothesis.
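This test is easy to reproduce in code. The Python sketch below (added for illustration) recomputes the test statistic from the summary numbers in Example 1.1 and compares it with the two-tailed critical value; scipy.stats.norm is used only to look up that critical value:

```python
from math import sqrt
from scipy.stats import norm

# Summary statistics from Example 1.1
x_bar, mu_0, s, n, alpha = 2700, 2000, 947, 100, 0.01

z = (x_bar - mu_0) / (s / sqrt(n))      # test statistic, about 7.39
z_crit = norm.ppf(1 - alpha / 2)        # two-tailed critical value, about 2.576

print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
if abs(z) > z_crit:
    print("Reject H0: mu = 2000")       # the conclusion reached in the text
else:
    print("Fail to reject H0")
```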

2 Measures of Association Between Two Variables
In data analysis, we sometimes want to learn the relationship between two variables; for example, does a higher temperature in July lead to higher electricity consumption? The statistics covariance and correlation serve that purpose. They are the building blocks of many advanced multivariate analyses.
• Sample covariance: sxy = Σ(xi − x̄)(yi − ȳ)/(n − 1)

• Population covariance: σxy = cov(x, y) = Σ(xi − µx)(yi − µy)/N

Effect of variable scaling: the covariance depends on the units in which x and y are measured, so rescaling either variable rescales the covariance. The correlation coefficient below removes this dependence on scale.
• Pearson sample correlation coefficient: rxy = sxy/(sx sy), where sx and sy are the sample standard deviations of x and y respectively, sx = √(Σ(xi − x̄)²/(n − 1)) and sy = √(Σ(yi − ȳ)²/(n − 1)). Note that −1 ≤ rxy ≤ 1.

• Pearson population correlation coefficient: ρxy = σxy/(σx σy), where σx and σy are the population standard deviations of x and y respectively, σx = √(Σ(xi − µx)²/N) and σy = √(Σ(yi − µy)²/N).

• Graphic interpretation: scatter plots of the (x, y) pairs (figure not reproduced here).
Example 2.1 The following data set contains 2 variables and 10 observations. For example, the data might come from a survey of 10 female respondents. Variable x represents the number of children the respondent has and variable y records the age of the respondent. We are interested in knowing whether the older generation tends to raise more children than the younger generation. Note that all respondents are either in the late stage of their reproductive period or have passed it. For survey data, we usually arrange the data in rows and columns, with each row corresponding to the answers to all survey questions from one respondent and each column listing the answers to one question from all respondents.
obs. xi yi xi − x̄ yi − ȳ (xi − x̄)(yi − ȳ)
1 2 50 -1 -1 1
2 5 57 2 6 12
3 1 41 -2 -10 20
4 3 54 0 3 0
5 4 54 1 3 3
6 1 38 -2 -13 26
7 5 63 2 12 24
8 3 48 0 -3 0
9 4 59 1 8 8
10 2 46 -1 -5 5
Sum 30 510 0 0 99
Average 3 51 0 0 9.9

Answer:

• sxy = Σ(xi − x̄)(yi − ȳ)/(n − 1) = 99/(10 − 1) = 11

• sx = √(Σ(xi − x̄)²/(n − 1)) = √(20/9) = 1.4907

• sy = √(Σ(yi − ȳ)²/(n − 1)) = √(566/9) = 7.9303

• rxy = sxy/(sx sy) = 11/((1.4907)(7.9303)) = 0.93
When two variables X and Y are positively correlated, a higher value of X usually comes with a higher value of Y, and a smaller value of X is more likely to be associated with a smaller value of Y.
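The calculations in Example 2.1 can be reproduced with a few lines of Python; this sketch is added for illustration, with np.corrcoef shown only as a cross-check:

```python
import numpy as np

# Data from Example 2.1: number of children (x) and age (y) of 10 respondents
x = np.array([2, 5, 1, 3, 4, 1, 5, 3, 4, 2])
y = np.array([50, 57, 41, 54, 54, 38, 63, 48, 59, 46])

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
s_x, s_y = x.std(ddof=1), y.std(ddof=1)                    # sample standard deviations
r_xy = s_xy / (s_x * s_y)                                  # Pearson sample correlation

print(s_xy, s_x, s_y, r_xy)        # 11.0, 1.4907, 7.9303, 0.93
print(np.corrcoef(x, y)[0, 1])     # same correlation from numpy's built-in routine
```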

3 Linear Combinations of Random Variables


We now consider multivariate cases where there are two or more variables. Consider a scenario where we are creating a portfolio of n individual stocks with an initial capital of one million dollars. We must decide what percentage of the initial capital to invest in each stock so that certain goals can be achieved, for example, at least a 10% expected daily return and no more than 15% risk (measured by the standard deviation). To facilitate the decision process, we need to evaluate the portfolio's expected return and risk under various alternatives. Assuming that in one alternative we invest a proportion ai of the total capital in stock i, where 0 ≤ ai ≤ 1 and Σ_{i=1}^{n} ai = 1, we can find the expected return and risk for this alternative if we know the expected daily return and risk of each individual stock. These can be obtained from historical data: for example, the expected daily return of stock i is the mean daily return of stock i over the past three years (or any period for which we have data), and the risk of stock i is the standard deviation of its daily return over the same period. In addition to return and risk, we also need the covariance between every pair of stocks in the portfolio, which can again be obtained from historical data. Once this information about the individual stocks is available, the expected return and risk of the portfolio for a given composition ai, i = 1, . . . , n can easily be calculated using the theorems below. In this example, the daily return of each stock Xi, i = 1, . . . , n is a random variable, and the daily return of the portfolio is also a random variable, namely a linear combination of the n individual random variables (Xp = a1X1 + a2X2 + · · · + anXn).

Theorem 1 Let X1, X2, . . . , Xn be random variables with means µ1, µ2, . . . , µn and variances σ1², σ2², . . . , σn² respectively. Then

E[a1X1 + a2X2 + · · · + anXn] = a1E[X1] + a2E[X2] + · · · + anE[Xn]
                             = a1µ1 + a2µ2 + · · · + anµn                                           (1)

Var[a1X1 + a2X2 + · · · + anXn] = a1²σ1² + a2²σ2² + · · · + an²σn² + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} ai aj Cov(Xi, Xj)    (2)

where

µ = E(X) = Σ_{all x} x f(x)

and

σ² = V(X) = E[(X − µ)²] = Σ_{all x} (x − µ)² f(x)
Theorem 1 shows that the expected value of a linear combination of random variables is the same linear combination of the means of those variables.

Theorem 2 Let X1, X2, . . . , Xn be independent random variables with means µ1, µ2, . . . , µn and variances σ1², σ2², . . . , σn² respectively. Then

Var[a1X1 + a2X2 + · · · + anXn] = a1²σ1² + a2²σ2² + · · · + an²σn²

When two variables Xi and Xj are independent, Cov(Xi, Xj) = 0. Thus, the covariance term in the variance formula of Theorem 1 drops out.

Theorem 3 Let X1, X2, . . . , Xn be independent, identically distributed random variables with mean µ and variance σ². Then

Var[a1X1 + a2X2 + · · · + anXn] = [a1² + a2² + · · · + an²]σ²

Example 3.1 Let X1, X2, . . . , Xn be independent, identically distributed random variables with mean µ and variance σ².

E[(1/n)X1 + (1/n)X2 + · · · + (1/n)Xn] = (1/n)E[X1] + (1/n)E[X2] + · · · + (1/n)E[Xn]
                                       = (1/n)µ + (1/n)µ + · · · + (1/n)µ = n(1/n)µ = µ

Var[(1/n)X1 + (1/n)X2 + · · · + (1/n)Xn] = (1/n)²σ² + (1/n)²σ² + · · · + (1/n)²σ² = n(1/n)²σ² = σ²/n

Note that (1/n)X1 + (1/n)X2 + · · · + (1/n)Xn = (X1 + X2 + · · · + Xn)/n = X̄. So E[X̄] = µ, Var(X̄) = σ²/n and σX̄ = σ/√n; that is, the mean and standard deviation of the sampling distribution of X̄ are µ and σ/√n, respectively. It is clear that as the sample size n increases, the standard deviation of the sampling distribution becomes smaller.
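The result E[X̄] = µ and σX̄ = σ/√n is easy to confirm by simulation. The sketch below is added for illustration, with an exponential distribution chosen arbitrarily as the parent population; it draws many samples of size n and watches the standard deviation of the sample mean shrink like σ/√n:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential draws with mean 2 and standard deviation 2 (an arbitrary choice;
# any i.i.d. distribution with finite variance would illustrate the same point).
mu, sigma = 2.0, 2.0
for n in (4, 16, 64):
    xbar = rng.exponential(scale=mu, size=(50_000, n)).mean(axis=1)
    # xbar.mean() stays near mu; xbar.std() tracks sigma / sqrt(n)
    print(n, round(xbar.mean(), 3), round(xbar.std(), 3), round(sigma / np.sqrt(n), 3))
```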
We now rewrite Theorem 1 in matrix form. Let m = [µ1, µ2, . . . , µn]ᵀ and let C be the variance-covariance matrix, i.e.

    C = [ σ11   σ12   · · ·    σ1n ]
        [ σ21   σ22   · · ·    σ2n ]
        [  ⋮     ⋮      ⋱       ⋮  ]
        [ σn1   · · ·  σn,n−1  σnn ]

X1 \ X2   1    2    3    4    5    6
   1      2    3    4    5    6    7
   2      3    4    5    6    7    8
   3      4    5    6    7    8    9
   4      5    6    7    8    9   10
   5      6    7    8    9   10   11
   6      7    8    9   10   11   12

Table 1: Possible outcomes of Y = X1 + X2 (rows: value of X1; columns: value of X2)

If a = [a1, a2, . . . , an]ᵀ contains the coefficients of the linear combination in Theorem 1, Equations (1) and (2) can be rewritten as

E(Y) = aᵀm

and

Var(Y) = aᵀCa

where Y = a1X1 + a2X2 + · · · + anXn.
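In the portfolio setting, these two matrix products give the expected return and the variance of the portfolio return directly. Below is a small numpy sketch for a hypothetical three-stock portfolio; the weights, returns, and covariance matrix are made-up numbers used only for illustration, not data from the notes:

```python
import numpy as np

a = np.array([0.5, 0.3, 0.2])                 # portfolio weights, summing to 1
m = np.array([0.0008, 0.0005, 0.0010])        # assumed expected daily returns
C = np.array([[0.00040, 0.00012, 0.00010],    # assumed variance-covariance matrix
              [0.00012, 0.00025, 0.00008],
              [0.00010, 0.00008, 0.00060]])

exp_return = a @ m            # E(Y) = a'm
variance = a @ C @ a          # Var(Y) = a'Ca
risk = np.sqrt(variance)      # portfolio standard deviation (risk)

print(exp_return, variance, risk)
```

Changing the weight vector a and recomputing these two quantities is exactly the evaluation of alternatives described at the start of this section.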

Example 3.2 Let X be a discrete uniformly distributed random variable with possible values
1, 2, . . . , 6. Find the mean and standard deviation of the mean of nine randomly chosen observa-
tions.

µ = E(X) = Σ_{all x} x f(x) = (1)(1/6) + (2)(1/6) + · · · + (6)(1/6) = 21/6 = 3.5

Var(X) = E(X − µ)² = Σ_{all x} (x − µ)² f(x) = E(X²) − [E(X)]²

Since E(X²) = (1²)(1/6) + (2²)(1/6) + · · · + (6²)(1/6) = 91/6, Var(X) = E(X²) − [E(X)]² = 91/6 − (21/6)² = 546/36 − 441/36 = 105/36 = 2.9167

σX = 1.7

E(X̄) = E(X) = µ = 3.5

σX̄ = σX/√n = 1.7/√9 = 0.57

If we rolled a set of nine fair dice and averaged the number of dots on the top faces, we would expect this average to fall between 3.5 − 1.96(0.57) and 3.5 + 1.96(0.57), or between 2.38 and 4.62, about 95% of the time, if we believe the Central Limit Theorem applies to a sample this small.
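A quick simulation makes the last claim concrete; this Python sketch is added for illustration and is not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Average of nine fair dice, repeated many times
reps = 100_000
averages = rng.integers(1, 7, size=(reps, 9)).mean(axis=1)

print(averages.mean())   # close to 3.5
print(averages.std())    # close to 1.7078 / 3 = 0.57
coverage = np.mean((averages >= 2.38) & (averages <= 4.62))
print(coverage)          # roughly 0.95 if the normal approximation is adequate
```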

Example 3.3 Let X1, X2 be independent, discrete uniformly distributed random variables with possible values 1, 2, . . . , 6. Find the mean and standard deviation of the random variable Y = X1 + X2.

From Table 1, we have

µ = E(Y) = Σ_{all y} y f(y) = (2)(1/36) + (3)(2/36) + (4)(3/36) + (5)(4/36) + (6)(5/36) + (7)(6/36)
    + (8)(5/36) + (9)(4/36) + (10)(3/36) + (11)(2/36) + (12)(1/36) = 252/36 = 7

This is the same as 2 × E(X) = 2 × 3.5.

Var(Y) = E(Y − µ)² = Σ_{all y} (y − µ)² f(y) = E(Y²) − [E(Y)]²

       = [2²(1/36) + 3²(2/36) + 4²(3/36)
        + 5²(4/36) + 6²(5/36) + 7²(6/36)
        + 8²(5/36) + 9²(4/36) + 10²(3/36)
        + 11²(2/36) + 12²(1/36)] − 7² = 1974/36 − 49 = 5.8333

σY = √5.8333 = 2.415

We can also obtain this from

Var(Y) = Var(X) + Var(X) = 2 Var(X) = 2 × 2.9167
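These values can also be checked by brute-force enumeration of the 36 equally likely outcomes in Table 1; the Python sketch below is added for illustration:

```python
from itertools import product

# Y = X1 + X2 for two independent fair dice: enumerate all 36 equally likely sums
sums = [x1 + x2 for x1, x2 in product(range(1, 7), repeat=2)]

mean_y = sum(sums) / 36
var_y = sum((y - mean_y) ** 2 for y in sums) / 36

print(mean_y, var_y, var_y ** 0.5)   # 7.0, about 5.8333, about 2.415
```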

Example 3.4 A car dealer estimates that she has a 30% chance of selling 3 cars a day, a 40% chance of selling 2 cars a day, a 20% chance of selling 1 car a day, and a 10% chance of no sales in a day.

1. What is the expected number of cars sold by the dealer and what is the standard deviation?

2. If the dealer now owns 3 stores and the distribution of number of cars sold in a day is
identical in all stores, what is the expected total number of cars sold in a day by the 3
stores and what is the standard deviation of the total? (assume the distribution for each
store is the same as the one described for the one store case)

3. In the 3-store case, what is the expected value of the average number of cars sold a day
from the three stores and what is the standard deviation of the average?
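One way to check answers to these questions is to compute the quantities directly from the stated distribution. The Python sketch below is added for illustration; for parts 2 and 3 it additionally assumes the three stores' daily sales are independent, which is what makes Theorems 2 and 3 applicable:

```python
import numpy as np

# Daily sales distribution at one store, from Example 3.4
cars = np.array([0, 1, 2, 3])
probs = np.array([0.10, 0.20, 0.40, 0.30])

mu = np.sum(cars * probs)                  # expected cars sold per day at one store
var = np.sum((cars - mu) ** 2 * probs)
sd = np.sqrt(var)
print(mu, sd)                              # part 1: 1.9 and about 0.94

# Parts 2 and 3: three stores, assumed independent with identical distributions
print(3 * mu, np.sqrt(3 * var))            # mean and standard deviation of the daily total
print(mu, sd / np.sqrt(3))                 # mean and standard deviation of the daily average
```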
