Definition Of Statistics
Statistics is a branch of applied mathematics concerned with collecting, organizing, and interpreting data. The data are
often summarized and presented by means of graphs.
Statistics is also the mathematical study of the likelihood and probability of events occurring, based on known quantitative
data or a collection of data.
Statistics thus attempts to infer the properties of a large collection of data from inspection of a sample of that collection,
thereby allowing educated guesses to be made with a minimum of expense.
There are three kinds of averages commonly used in statistics: the mean, the median, and the mode.
Example of Statistics
A survey was conducted to find the favorite fruit of 100 people. The circle
graph below shows the results of the survey.
Random Variables
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random
phenomenon. There are two types of random variables, discrete and continuous.
Discrete Random Variables
A discrete random variable is one which may take on only a countable number of distinct values, such as
0, 1, 2, 3, 4, .... Discrete random variables are usually (but not necessarily) counts. If a random variable can
take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables
include the number of children in a family, the Friday night attendance at a cinema, the number of patients
in a doctor's surgery, and the number of defective light bulbs in a box of ten.
The probability distribution of a discrete random variable is a list of probabilities associated with each of its
possible values. It is also sometimes called the probability function or the probability mass function.
(Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
Suppose a random variable X may take k different values, with the probability that X = xi defined to be P(X =
xi) = pi. The probabilities pi must satisfy the following:
1: 0 < pi < 1 for each i
2: p1 + p2 + ... + pk = 1.
Example
Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are described by the following table:
Outcome      1    2    3    4
Probability  0.1  0.3  0.4  0.2
The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X = 2) + P(X =
3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1 = 0.9, by
the complement rule.
This distribution may also be described by the probability histogram shown to the right:
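The calculations above are easy to verify directly. Here is a minimal sketch in plain Python (the dictionary name pmf is just illustrative):

    # Probability mass function from the example above.
    pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

    # Requirement 2: the probabilities must sum to 1.
    assert abs(sum(pmf.values()) - 1.0) < 1e-9

    # P(X = 2 or X = 3) = P(X = 2) + P(X = 3)
    print(pmf[2] + pmf[3])   # ≈ 0.7

    # P(X > 1) = 1 - P(X = 1), by the complement rule
    print(1 - pmf[1])        # ≈ 0.9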
Mean, Median, Mode
Mean, median, and mode are different measures of center in a numerical data set. They each try to summarize a dataset with a
single number to represent a "typical" data point from the dataset.
Mean:
The "average" number; found by adding all data points and dividing by the number of data points.
Example:
The mean of 4, 1, and 7 is (4 + 1 + 7)/3 = 12/3 = 4.
Median:
The middle number; found by ordering all data points and picking out the one in the middle
(or if there are two middle numbers, taking the mean of those two numbers).
Example:
The median of 4, 1, and 7 is 4 because when the numbers are put in order (1, 4, 7), the number 4 is in the
middle.
Mode: The most frequent number—that is, the number that occurs the highest number of times.
Example:
The mode of {4, 2, 4, 3, 2, 2} is 2 because it occurs three times, which is more than any other number.
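Python's built-in statistics module computes all three measures; a quick sketch reproducing the examples above:

    # Mean, median, and mode of the small data sets above.
    from statistics import mean, median, mode

    print(mean([4, 1, 7]))           # 4
    print(median([4, 1, 7]))         # 4 (sorted order: 1, 4, 7)
    print(mode([4, 2, 4, 3, 2, 2]))  # 2 (occurs three times)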
Basic Terms of Probability
Definitions
• Experiment: A process by which an observation or outcome is obtained.
• Sample Space: The set S of all possible outcomes of an experiment.
• Event: Any subset E of the sample space S.
Probability of an Event
• The probability of an event is a measure of the likelihood that the event will occur.
• Remember that a probability is a number, not a set.
• Mathematically speaking, when the outcomes in S are equally likely, the probability of an event E, denoted by P(E), is:
P(E) = n(E)/n(S).
• Recall that n(E) is the cardinal number of set E and n(S) is the cardinal number of set S.
Relative frequency
• Tossing a single coin has a sample space of Heads and Tails. That is, S = {H, T}.
• Theoretically speaking, the probability of tossing a head is ½.
• Say you flip a coin 10 times and record the result of each toss.
• Suppose seven of the ten tosses come up heads. This would yield a relative frequency of 7/10, or 0.7.
• We call this type of computation an empirical probability.
• Tossing the coin 100 times might give a relative frequency of 52/100, or 0.52. This number is closer to the theoretical
probability, but it is not exact.
• The relationship between relative frequencies and theoretical probabilities is called the Law of Large Numbers.
• This law states that if you repeat the experiment a large number of times, then the relative frequency of the outcome
will tend to be close to the theoretical probability of that outcome, as the simulation sketch below illustrates.
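A minimal simulation of this behavior in Python (the seed is fixed only to make the run reproducible; actual frequencies will vary):

    # Relative frequency of heads drifts toward the theoretical 0.5
    # as the number of tosses grows.
    import random

    random.seed(0)  # reproducible run; remove for fresh randomness

    for n in (10, 100, 10_000):
        heads = sum(random.random() < 0.5 for _ in range(n))
        print(n, heads / n)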
Example 2
• Three coins are tossed. Answer the following:
1. What is the sample space?
ANSWER
S={HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}.
• Define the following events: E is the event of tossing exactly two heads, F is the event that at least two are heads, and
G is the event that all three are heads.
2. What is P(E)? ANSWER n(E)/n(S)=3/8
3. What is O(E)? ANSWER n(E):n(E’)=3:5
4. What is P(F)? ANSWER n(F)/n(S)=4/8=1/2
5. What is O(F)? ANSWER n(F):n(F’)=4:4=1:1
6. What is P(G)? ANSWER n(G)/n(S)=1/8
7. What is O(G)? ANSWER n(G):n(G’)=1:7
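These answers can be checked by enumerating the sample space; a brief sketch:

    # Enumerate all 8 equally likely outcomes of three coin tosses.
    from itertools import product

    S = list(product("HT", repeat=3))

    E = [s for s in S if s.count("H") == 2]  # exactly two heads
    F = [s for s in S if s.count("H") >= 2]  # at least two heads
    G = [s for s in S if s.count("H") == 3]  # all three heads

    print(len(E) / len(S))               # P(E) = 3/8
    print(len(F) / len(S))               # P(F) = 1/2
    print(len(G) / len(S))               # P(G) = 1/8
    print(len(E), ":", len(S) - len(E))  # O(E) = n(E):n(E') = 3:5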
Mean and Variance of Random Variables
Mean
The mean of a discrete random variable X is a weighted average of the possible values that the random variable can take.
Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable
weights each outcome xi according to its probability, pi. The common symbol for the mean (also known as the expected
value of X) is μ, formally defined by
μ = x1p1 + x2p2 + ... + xkpk.
The mean of a random variable provides the long-run average of the variable, or the expected average outcome over many
observations.
Example
Suppose an individual plays a gambling game where it is possible to lose $1.00, break even, win $3.00, or win $5.00 each time
she plays. The probability distribution for each outcome is provided by the following table:
Outcome      -$1.00  $0.00  $3.00  $5.00
Probability   0.30    0.40   0.20   0.10
The mean outcome for this game is calculated as follows:
μ = (-1)(0.3) + (0)(0.4) + (3)(0.2) + (5)(0.1) = -0.3 + 0 + 0.6 + 0.5 = 0.8.
In the long run, then, the player can expect to win about 80 cents playing this game -- the odds are in her favor.
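The weighted average is one line of Python (a sketch; the variable names are illustrative):

    # Expected value: each outcome weighted by its probability.
    outcomes = [-1.00, 0.00, 3.00, 5.00]
    probs    = [0.30, 0.40, 0.20, 0.10]

    mu = sum(x * p for x, p in zip(outcomes, probs))
    print(mu)  # ≈ 0.8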
For a continuous random variable, the mean is defined by the density curve of the distribution. For a symmetric density curve,
such as the normal density, the mean lies at the center of the curve.
The law of large numbers states that the observed random mean from an increasingly large number of observations of a
random variable will always approach the distribution mean μ. That is, as the number of observations increases, the mean of
these observations will become closer and closer to the true mean of the random variable. This does not imply, however, that
short term averages will reflect the mean.
In the above gambling example, suppose a woman plays the game five times, with the outcomes $0.00, -$1.00, $0.00, $0.00, -
$1.00. She might assume, since the true mean of the random variable is $0.80, that she will win the next few games in order to
"make up" for the fact that she has been losing. Unfortunately for her, this logic has no basis in probability theory. The law of
large numbers does not apply for a short string of events, and her chances of winning the next game are no better than if she
had won the previous game.
Properties of Means
If a random variable X is adjusted by multiplying by the value b and adding the value a, then the mean is affected as follows:
μ(a + bX) = a + bμX
Example
In the above gambling game, suppose the casino realizes that it is losing money in the long term and decides to adjust the
payout levels by subtracting $1.00 from each prize. The new probability distribution for each outcome is provided by the
following table:
Outcome      -$2.00  -$1.00  $2.00  $4.00
Probability   0.30    0.40   0.20   0.10
The new mean is (-2*0.3) + (-1*0.4) + (2*0.2) + (4*0.1) = -0.6 + -0.4 + 0.4 + 0.4 = -0.2. This is equivalent to subtracting $1.00 from
the original value of the mean: 0.8 - 1.00 = -0.2. With the new payouts, the casino can expect to win 20 cents in the long run.
Suppose that the casino decides that the game does not have an impressive enough top prize with the lower payouts, and
decides to double all of the prizes, as follows:
Outcome      -$4.00  -$2.00  $4.00  $8.00
Probability   0.30    0.40   0.20   0.10
Now the mean is (-4*0.3) + (-2*0.4) + (4*0.2) + (8*0.1) = -1.2 + -0.8 + 0.8 + 0.8 = -0.4. This is equivalent to multiplying the
previous value of the mean by 2, increasing the expected winnings of the casino to 40 cents.
Overall, the difference between the original value of the mean (0.8) and the new value of the mean (-0.4) may be summarized by
(0.8 - 1.0)*2 = -0.4.
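Both adjustments together can be checked numerically as one instance of the rule μ(a + bX) = a + bμX, with a = -2 and b = 2 (a sketch continuing the earlier names):

    # Lower each payout by $1.00, then double: new X = 2*(X - 1) = 2X - 2.
    outcomes = [-1.00, 0.00, 3.00, 5.00]
    probs    = [0.30, 0.40, 0.20, 0.10]
    mu = sum(x * p for x, p in zip(outcomes, probs))  # ≈ 0.8

    adjusted = [2 * (x - 1.00) for x in outcomes]     # [-4, -2, 4, 8]
    new_mu = sum(x * p for x, p in zip(adjusted, probs))

    print(new_mu)       # ≈ -0.4
    print(-2 + 2 * mu)  # same value: a + b*mu with a = -2, b = 2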
The mean of the sum of two random variables X and Y is the sum of their
means:
μ(X + Y) = μX + μY
For example, suppose a casino offers one gambling game whose mean winnings are -$0.20 per play, and another game whose
mean winnings are -$0.10 per play. Then the mean winnings for an individual simultaneously playing both games per play are -
$0.20 + -$0.10 = -$0.30.
Variance
The variance of a discrete random variable X measures the spread, or variability, of the distribution, and is
defined by
σ² = (x1 - μ)²p1 + (x2 - μ)²p2 + ... + (xk - μ)²pk.
The standard deviation is the square root of the variance.
Example
In the original gambling game above, the probability distribution was defined to be:
Outcome      -$1.00  $0.00  $3.00  $5.00
Probability   0.30    0.40   0.20   0.10
The variance for this distribution, with mean μ = 0.8, may be calculated as follows:
σ² = (-1 - 0.8)²*0.3 + (0 - 0.8)²*0.4 + (3 - 0.8)²*0.2 + (5 - 0.8)²*0.1
= (-1.8)²*0.3 + (-0.8)²*0.4 + (2.2)²*0.2 + (4.2)²*0.1
= 3.24*0.3 + 0.64*0.4 + 4.84*0.2 + 17.64*0.1
= 0.972 + 0.256 + 0.968 + 1.764 = 3.960, with standard deviation σ ≈ 1.990.
Since there is not a very large range of possible values, the variance is small.
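The same computation in Python (a sketch continuing the earlier variable names):

    # Variance: probability-weighted squared deviations from the mean.
    import math

    outcomes = [-1.00, 0.00, 3.00, 5.00]
    probs    = [0.30, 0.40, 0.20, 0.10]
    mu = sum(x * p for x, p in zip(outcomes, probs))

    var = sum((x - mu) ** 2 * p for x, p in zip(outcomes, probs))
    print(var, math.sqrt(var))  # ≈ 3.96, ≈ 1.990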
Properties of Variances
If a random variable X is adjusted by multiplying by the value b and adding the value a, then the variance is
affected as follows:
σ²(a + bX) = b²σ²(X)
Since the spread of the distribution is not affected by adding or subtracting a constant, the value a is not
considered. And, since the variance is a sum of squared terms, any multiplier value b must also be squared
when adjusting the variance.
Example
As in the case of the mean, consider the gambling game in which the casino chooses to lower each payout by
$1.00, then double each prize. The resulting distribution is the following:
Outcome      -$4.00  -$2.00  $4.00  $8.00
Probability   0.30    0.40   0.20   0.10
The variance for this distribution, with mean μ = -0.4, may be calculated as follows:
σ² = (-4 - (-0.4))²*0.3 + (-2 - (-0.4))²*0.4 + (4 - (-0.4))²*0.2 + (8 - (-0.4))²*0.1
= (-3.6)²*0.3 + (-1.6)²*0.4 + (4.4)²*0.2 + (8.4)²*0.1
= 12.96*0.3 + 2.56*0.4 + 19.36*0.2 + 70.56*0.1
= 3.888 + 1.024 + 3.872 + 7.056 = 15.84, with standard deviation σ ≈ 3.980.
This is equivalent to multiplying the original value of the variance by 4, the square of the multiplying
constant.
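A quick numerical check of σ²(a + bX) = b²σ²(X) on this example (a sketch; the helper name variance is illustrative):

    # The $1.00 shift drops out; the factor of 2 enters squared.
    outcomes = [-1.00, 0.00, 3.00, 5.00]
    probs    = [0.30, 0.40, 0.20, 0.10]

    def variance(values):
        mu = sum(x * p for x, p in zip(values, probs))
        return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

    adjusted = [2 * (x - 1.00) for x in outcomes]  # subtract $1, then double
    print(variance(adjusted))      # ≈ 15.84
    print(4 * variance(outcomes))  # b**2 * Var(X) = 4 * 3.96 = 15.84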
For independent random variables X and Y, the variance of their sum or difference is the sum of their
variances:
σ²(X ± Y) = σ²(X) + σ²(Y)
Variances are added for both the sum and difference of two independent random variables because the
variation in each variable contributes to the variation in each case. If the variables are not independent, then
variability in one variable is related to variability in the other. For this reason, the variance of their sum or
difference may not be calculated using the above formula.
For example, suppose the amount of money (in dollars) a group of individuals spends on lunch is
represented by variable X, and the amount of money the same group of individuals spends on dinner is
represented by variable Y. The variance of the sum X + Y may not be calculated as the sum of the variances,
since X and Y may not be considered as independent variables.
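For independent variables the addition rule is easy to see in simulation; a sketch with two arbitrary independent normal variables (variances 4 and 9; the distributions are an assumption made for illustration):

    # Var(X + Y) for independent X, Y approaches Var(X) + Var(Y).
    import random, statistics

    random.seed(1)
    X = [random.gauss(0, 2) for _ in range(100_000)]  # variance 4
    Y = [random.gauss(0, 3) for _ in range(100_000)]  # variance 9

    print(statistics.pvariance([x + y for x, y in zip(X, Y)]))  # ≈ 13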
Formula
The following is the probability density function of a normal distribution with variate x:
f(x) = (1/(σ√(2π))) e^(-(x - μ)²/(2σ²))
Here, x ranges over the interval (-∞, ∞) as the domain, μ is
the mean, and σ² is the variance.
This is the general normal distribution. When we take μ = 0 and σ² = 1, the general normal
distribution is converted to the standard normal distribution, with probability density function as below:
f(x) dx = (1/√(2π)) e^(-z²/2) dz
where z = (x - μ)/σ, which implies dz = dx/σ.
We more commonly use the standard form of the normal distribution.
This z is more commonly known as the z-score.
The z-score tells us the distance between a value and the mean, measured in standard deviations. If the
value of z is equal to zero, then the x value is equal to the mean. If the z-score is equal to one, then the x
value is one standard deviation above the mean, and if the z-score is equal to -1, then the x value is one
standard deviation below the mean, and so on as the z-score increases or decreases.
Some properties of the standard normal curve are:
1) The total area under the curve is always 1.
2) The curve extends indefinitely in both directions, approaching the horizontal axis.
3) The curve is always symmetric about zero.
Examples
Let us see an example to understand this topic more clearly.
Example:
Given that X is a random variable that is normally distributed with μ = 30 and σ = 4. Determine the following:
1) P(30 < x < 35)
2) P(x > 21)
3) P(x < 40)
Solution:
Here we are simply finding the area under the standard normal curve under the given conditions.
1) Z = (30 - 30)/4 = 0. Also, Z = (35 - 30)/4 = 1.25.
Thus P(30 < x < 35) = P(0 < z < 1.25) = 0.3944.
2) Z = (21 - 30)/4 = -2.25.
Thus P(x > 21) = P(z > -2.25) = 0.9878.
3) Z = (40 - 30)/4 = 2.5.
Thus P(x < 40) = P(z < 2.5) = 0.9938.
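Instead of a table, the same areas can be computed with Python's statistics.NormalDist (a sketch):

    # Probabilities for X ~ Normal(mu=30, sigma=4), without a table.
    from statistics import NormalDist

    X = NormalDist(mu=30, sigma=4)

    print(X.cdf(35) - X.cdf(30))  # P(30 < x < 35) ≈ 0.3944
    print(1 - X.cdf(21))          # P(x > 21)      ≈ 0.9878
    print(X.cdf(40))              # P(x < 40)      ≈ 0.9938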
Formula
The formula that we will use is as follows: z = (x - μ)/ σ
The description of each part of the formula is:
x is the value of our variable.
μ is the value of the population mean.
σ is the value of the population standard deviation.
z is the z-score.
Examples
Now we will consider several examples that illustrate the use of the z-score formula. Suppose that we know
about a population of a particular breed of cats having weights that are normally distributed. Furthermore,
suppose we know that the mean of the distribution is 10 pounds and the standard deviation is 2 pounds.
Consider the following questions:
1. What is the z-score for 13 pounds?
2. What is the z-score for 6 pounds?
3. How many pounds corresponds to a z-score of 1.25?
For the first question we simply plug x = 13 into our z-score formula. The result is:
(13 – 10)/2 = 1.5
This means that 13 is one and a half standard deviations above the mean.
The second question is similar. Simply plug x = 6 into our formula. The result for this is:
(6 – 10)/2 = -2
The interpretation of this is that 6 is two standard deviations below the mean.
For the last question, we now know our z-score. For this problem we plug z = 1.25 into the formula and use
algebra to solve for x:
1.25 = (x – 10)/2
Multiply both sides by 2:
2.5 = (x – 10)
Add 10 to both sides:
12.5 = x
And so we see that 12.5 pounds corresponds to a z-score of 1.25.
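All three answers follow from one small helper function (a sketch; the name z_score is illustrative):

    # z-scores for the cat-weight example: mu = 10 pounds, sigma = 2 pounds.
    def z_score(x, mu, sigma):
        return (x - mu) / sigma

    print(z_score(13, 10, 2))  # 1.5
    print(z_score(6, 10, 2))   # -2.0
    print(10 + 1.25 * 2)       # inverting the formula: x = mu + z*sigma = 12.5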
Finding Areas Using a Table
Once we have the general idea of the Normal Distribution, the next step is to learn how to find areas under
the curve. We'll learn two different ways - using a table and using technology.
Since every normally distributed random variable has a slightly different distribution shape, the only way to
find areas using a table is to standardize the variable - transform our variable so it has a mean of 0 and a
standard deviation of 1. How do we do that? Use the z-score!
Z = (x - μ)/σ
As we noted in Section 7.1, if the random variable X has a mean μ and standard deviation σ, then
transforming X using the z-score creates a random variable with mean 0 and standard deviation 1! With that
in mind, we just need to learn how to find areas under the standard normal curve, which can then be applied
to any normally distributed random variable.
Finding Area under the Standard Normal Curve to the Left
Before we look at a few examples, we need to first see how the table works. Before we start, you need a
copy of the table. You can download a printable copy of this table, or use the table in the back of your
textbook. It should look something like this:
It's pretty overwhelming at first, but if you look at the picture at the top (take a minute and check it out), you
can see that it is indicating the area to the left. That's the key - the values in the middle represent areas to the
left of the corresponding z-value. To determine which z-value an entry refers to, we look at the row label on
the left for the ones and tenths digits, and at the column label above for the hundredths digit. (Z-values with
more accuracy need to be rounded to the hundredths in order to use this table.)
Say we're looking for the area left of -2.84. To do that, we'd start on the -2.8 row and go across until we get to
the 0.04 column. (See picture.)
From the picture, we can see that the area left of -2.84 is 0.0023.
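The table value can be confirmed in Python; NormalDist() with no arguments is the standard normal:

    # Area to the left of z = -2.84 under the standard normal curve.
    from statistics import NormalDist

    print(round(NormalDist().cdf(-2.84), 4))  # 0.0023, matching the table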