Advanced Placement : Statistics 1
Advanced Placement
Statistics
Topic
1 Categorical Variables
2 Quantitative Variables
3 Quartiles
4 Normal Distribution
5 Scatter Plots
6 Probability
7 Random Variables
8 Binomial
9 Sampling Distributions
10 Inferential Statistics
11 Hypothesis Testing
12 Chi-Squared Testing
AP Statistics
Statistics is the study concerned with collecting, analyzing and interpreting
data.
Categorical Variables
Categorical variables are variables whose values fall into a countable number
of distinct groups and can be organized into frequency tables. Grade, eye
colour and religion are a few examples.
Example –
Grade in Maths   Frequency   Relative frequency
A                50          0.5
B                20          0.2
C                30          0.3
This can also be represented by frequency or relative frequency graphs –
[Figure: bar graph of frequency (0 to 60) against grade (A, B, C)]
Quantitative variables
These variables take on numerical values for a measured quantity. Height,
weight, etc. are some examples of this.
*Averages exist only for quantitative variables, not for categorical
variables. Example – there can be an average ‘weight’ of students,
but not an average ‘religion’.
Histograms are generally used to plot such variables.
Example –
[Figure: histogram of a quantitative variable]
Cumulative Frequency –
It is defined as the running sum of the frequencies of all classes up to and
including the current class.
Example –
Weight of students in a class (kg)   Frequency   Cumulative Frequency
50-60                                6           6
60-70                                12          6 + 12 = 18
70-80                                11          18 + 11 = 29
80-90                                8           29 + 8 = 37
90-100                               2           37 + 2 = 39
The cumulative frequency of the last class is the total number of data points
(here, 39).
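The running total in the weight table can be reproduced with a short Python sketch. The list name `frequencies` is chosen here for illustration; the values are the class frequencies from the table.

```python
from itertools import accumulate

# Class frequencies from the weight table (50-60, 60-70, ..., 90-100 kg).
frequencies = [6, 12, 11, 8, 2]

# accumulate() yields the running total after each class.
cumulative = list(accumulate(frequencies))

print(cumulative)       # running totals, one per class
print(cumulative[-1])   # last entry equals the total number of data points
```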
Cumulative Frequency Graphs –
Cumulative Frequency Graphs show the running total of frequencies of the
data classes. Example –
Age     Frequency   Cumulative Frequency
0-10    8           8
10-20   14          22
20-30   8           30
30-40   6           36
40-50   4           40
Type of Distributions of Quantitative Variables
The centre and the spread are the two most important aspects to consider
when describing the pattern of a distribution.
Mean, Median and Mode calculations –
Median is the middle value when the data set is arranged in increasing
order. If the number of data points is even, the median is the average of the
two middle values.
Mean is the sum of the data values divided by the number of values.
Mode is the most frequently occurring value. A data set is bimodal if two
values occur with equal and highest frequency. If there are more than two
modes, the mode is not a good measure of central tendency.
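As a minimal sketch, Python's standard `statistics` module computes all three measures. The data set below is hypothetical and was chosen to have an even number of values and two modes.

```python
from statistics import mean, median, multimode

# Hypothetical data set, already in increasing order.
data = [2, 3, 3, 5, 7, 8, 8, 10]

data_mean = mean(data)        # sum of values divided by their number
data_median = median(data)    # average of the two middle values (5 and 7)
data_modes = multimode(data)  # every value that shares the highest frequency

print(data_mean, data_median, data_modes)
```

Because 3 and 8 both occur twice, `multimode` returns both values, which is the bimodal case described above.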
Variance and Standard Deviation –
Variance ($\sigma^2$) is a measure of the spread of the data about the mean
value.
$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
where $\bar{x}$ is the mean and $n$ is the number of data points.
Standard deviation ($\sigma$) indicates how much the data deviate from the
mean value. Mathematically, it is the square root of the variance.
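The variance formula (dividing by n, not n − 1) can be checked against Python's `statistics.pvariance` and `statistics.pstdev`. This is a sketch on a hypothetical data set.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
x_bar = sum(data) / n

# Variance as defined in the text: mean squared deviation from the mean.
variance = sum((x - x_bar) ** 2 for x in data) / n
std_dev = sqrt(variance)

print(variance, std_dev)
print(pvariance(data), pstdev(data))  # library equivalents (population versions)
```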
Range and Interquartile Range (IQR)
Range is the difference between the largest and smallest values.
Quartiles –
Quartiles are three values: the first quartile (Q1 or lower quartile), the
second quartile (Q2 or median) and the third quartile (Q3 or upper quartile).
When the data are arranged in increasing order –
Q1 is the median of the lower half of the data, i.e. the value halfway
between the lowest value and the median. It is the 25th percentile.
Q2 is the median of the data set. It is the 50th percentile.
Q3 is the median of the upper half of the data, i.e. the value halfway
between the median and the highest value. It is the 75th percentile.
Example –
Data: 2 3 4 6 10 11 14 16 21 22 27
Q1 = 4, Q2 = 11, Q3 = 21
Interquartile Range (IQR) = Q3 – Q1 = 21 – 4 = 17
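The quartiles in this example can be reproduced with a small sketch. It assumes the common convention that the median itself is left out of both halves when the number of values is odd; other quartile conventions exist and can give slightly different answers. The helper name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # values below the median position
    upper = data[half + (n % 2):]   # values above it (median skipped when n is odd)
    return median(lower), median(data), median(upper)

data = [2, 3, 4, 6, 10, 11, 14, 16, 21, 22, 27]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```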
Box plots –
The first, second and third quartiles, along with the IQR and the range, can
also be shown in a box plot, which looks like –
[Figure: box plot marking the minimum value, Q1, Q2, Q3 and the maximum value]
Outliers –
Data points that lie abnormally far from the middle 50% of the data are
called outliers. They are identified as follows –
If a value is below Q1 − 1.5(IQR) or above Q3 + 1.5(IQR), it is
considered an outlier.
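A minimal sketch of the 1.5(IQR) rule follows. It uses Q1 = 4 and Q3 = 21 from the quartile example; the list of candidate values is hypothetical.

```python
def find_outliers(values, q1, q3):
    """Return the values lying outside the 1.5 * IQR fences."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Quartiles from the worked example; candidate values are made up for illustration.
candidates = [-30, 2, 27, 46, 47, 60]
outliers = find_outliers(candidates, q1=4, q3=21)
print(outliers)  # fences are -21.5 and 46.5
```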
Z-score
The Z-score of a data point is simply the number of standard deviations the
data point lies away from the mean.
$Z\text{-score} = \frac{x_i - \bar{x}}{\sigma}$
Z-score can be positive, negative or zero.
The Normal Distribution
The Normal distribution, or Gaussian distribution, is arguably the most
important distribution in statistics. It describes many phenomena
occurring in nature.
It is a bell-shaped curve with its maximum at the mean of the population,
and it is symmetric about the mean.
Consider a normal distribution of a variable X with mean $\mu$ and standard
deviation $\sigma$.
The notation $X \sim N(\mu, \sigma^2)$ is read as “the random variable X is
normally distributed with mean $\mu$ and variance $\sigma^2$ (that is,
standard deviation $\sigma$).”
The line of symmetry of the curve marks the mean on the x-axis, and the
width of the curve represents the standard deviation.
The narrower the curve, the smaller the standard deviation and the more
tightly the data cluster around the mean.
[Figure: normal curves with the same mean and different standard deviations]
[Figure: normal curves with different means and the same standard deviation]
Since a point one standard deviation away from the mean has a Z-value of 1,
the curve can also be drawn against the Z-values –
[Figure: normal curve plotted against Z-values, with the two points of inflection marked]
The curve extends from $-\infty$ to $+\infty$, and the integral of this curve
from some value $x_1$ to some other value $x_2$ gives the probability of a
random selection lying in the range $[x_1, x_2]$.
Area = $P(x_1 \le x \le x_2)$
Hence, the complete area under the curve equals 1, since the probability of
x lying somewhere on the whole number line is 1.
Most of the data (about 95%) lies within Z-values of ±2.
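The area under the normal curve between two values can be computed with `statistics.NormalDist` from the Python standard library. The helper name `normal_area` is illustrative, and the sketch works on the standard normal curve (mean 0, standard deviation 1) by default.

```python
from statistics import NormalDist

def normal_area(x1, x2, mu=0.0, sigma=1.0):
    """P(x1 <= X <= x2) for X ~ N(mu, sigma^2), via the cumulative distribution."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(x2) - dist.cdf(x1)

within_two = normal_area(-2, 2)      # share of data within Z = +/-2
almost_all = normal_area(-10, 10)    # effectively the whole curve

print(round(within_two, 4), round(almost_all, 4))
```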
Double Variable (Categorical)
Data on two categorical variables can be represented in a two-way table.
Look at the following example –
A group of 1000 students were asked their favourite AP subject and their
score in that AP –
Favourite AP     Score 3   Score 4   Score 5
AP Calculus      150       120       80
AP E&M           200       90        100
AP Statistics    140       70        50
Total number of students = sum of all = 1000
A student is selected at random. Find the probability that –
1. His favourite AP is Statistics but he could not get a 5 on it.
$P = \frac{140 + 70}{1000} = 0.21$
2. He got a 5 in his favourite AP.
$P = \frac{80 + 100 + 50}{1000} = 0.23$
3. He didn’t get a 5 and is not from Calculus.
$P = \frac{200 + 90 + 140 + 70}{1000} = 0.50$
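The three answers can be checked with a short sketch that stores the two-way table as a nested dictionary. The names `counts` and `probability` are chosen here for illustration.

```python
from fractions import Fraction

# Two-way table: favourite AP subject -> score -> number of students.
counts = {
    "AP Calculus":   {3: 150, 4: 120, 5: 80},
    "AP E&M":        {3: 200, 4: 90,  5: 100},
    "AP Statistics": {3: 140, 4: 70,  5: 50},
}
total = sum(sum(row.values()) for row in counts.values())

def probability(condition):
    """Probability that a randomly chosen student satisfies condition(subject, score)."""
    favourable = sum(
        n
        for subject, row in counts.items()
        for score, n in row.items()
        if condition(subject, score)
    )
    return Fraction(favourable, total)

p1 = probability(lambda subject, score: subject == "AP Statistics" and score != 5)
p2 = probability(lambda subject, score: score == 5)
p3 = probability(lambda subject, score: subject != "AP Calculus" and score != 5)
print(float(p1), float(p2), float(p3))
```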
Bar graph representation –
[Figure: grouped bar graph of the number of students (0 to 250) for AP Calculus, AP E&M and AP Statistics, with one bar for each score (3, 4, 5)]
Double Variable (Quantitative)
Generally, scatter plots are used to represent a relationship between such
variables.
Example –
Below is a scatter plot of the relation between the weight of a fruit and its
cost. We collected many observations and plotted them such that each point on
the graph represents one observation (the weight and cost of one fruit).
[Figure: scatter plot of cost against weight of fruit]
The correlation can be positive, negative or neutral (no correlation). It can
also be strong, moderate or weak.
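Best Fit Line –
This line is drawn such that the sum of the squared vertical distances
between the data points and the line has its minimum value –
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where $y_i$ is the observed value and $\hat{y}_i$ is the value predicted by
the line.
That’s why it is also called the least squares regression line.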
Residuals –
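When the regression line is plotted, the vertical distance from the
regression line to the actual data point is called the residual.
[Figure: scatter plot with the best fit line and the residual of one point marked]
Residual = observed value − predicted value = $y_i - \hat{y}_i$
A residual can be positive or negative.
The sum of all residuals about the least squares line is zero.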
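The least squares line and its residuals can be sketched as follows, using the standard closed-form slope and intercept. The weights and costs are hypothetical, and the helper name `least_squares_line` is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical fruit data: weight (kg) and cost.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
costs = [12.0, 19.0, 33.0, 38.0, 52.0]

slope, intercept = least_squares_line(weights, costs)

# Residual = observed value minus the value predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(weights, costs)]
print(slope, intercept)
print(residuals, sum(residuals))
```

Some residuals are positive and some negative, and they cancel out, up to floating-point rounding.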
Probability
This branch of mathematics is concerned with calculating how likely an event
is to occur. When all outcomes are equally likely –
Probability (P) = (number of favourable outcomes) / (total number of outcomes)
0 ≤ P ≤ 1
P = 0 means an impossible event.
P = 1 means a sure event.
Law of Large Numbers
This law says that the relative frequency of an event gets closer to its
actual probability as the experiment is performed a large number of times.
For example –
If you toss a coin 10 times, you might end up with heads 8 times and tails
just 2 times.
But if you toss the coin 1 million times, the relative frequency of heads
will get close to 0.5.
This law is the reason why insurance companies and casinos make profit in
the long run.
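The law can be illustrated with a small simulation. This is a sketch that assumes a fair coin modelled by Python's pseudo-random generator; the seed is arbitrary and the exact frequencies depend on it.

```python
import random

def relative_frequency_of_heads(tosses, rng):
    """Toss a simulated fair coin `tosses` times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n, rng))
```

With only 10 tosses the relative frequency can be far from 0.5, and it settles near 0.5 as the number of tosses grows.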
Basic rules of probability
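[Figure: Venn diagram of two overlapping events A and B, where R1 is the part of A outside B, R2 is the overlap, and R3 is the part of B outside A]
A = R1 + R2
B = R2 + R3
A ∪ B (A union B) = R1 + R2 + R3
A ∩ B (A intersection B) = R2
Hence,
A + B = R1 + R2 + R2 + R3
A + B = (R1 + R2 + R3) + (R2)
A + B = (A ∪ B) + (A ∩ B)
(A ∪ B) = A + B − (A ∩ B)
In terms of probabilities, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
This is also called the addition law of probability.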
Conditional Probability
This law gives the probability of an event when it is given that another
event has happened.
P(A|B) = Probability of A, given that B has happened = $\frac{P(A \cap B)}{P(B)}$
[Figure: Venn diagram of events A and B with the overlap A ∩ B highlighted]
When it is given that B has happened, the probability of anything outside B
becomes zero. This shrinks the set of possible outcomes to B alone.
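As a sketch of the formula, take the survey table of favourite AP subjects from earlier: let B be "favourite AP is Statistics" (260 of 1000 students) and A be "scored a 5" (50 of those 260). The function name `conditional` is illustrative.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

p_b = Fraction(140 + 70 + 50, 1000)   # favourite AP is Statistics
p_a_and_b = Fraction(50, 1000)        # favourite is Statistics AND score is 5

p_a_given_b = conditional(p_a_and_b, p_b)
print(p_a_given_b, float(p_a_given_b))
```

The result equals 50/260, which is what you get by restricting attention to the Statistics row of the table alone.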
Independent Events
Two events are called independent when the occurrence or non-occurrence
of one does not affect that of the other.
Multiplication rule –
If A and B are independent events, then the probability of both of them
occurring is the product of their individual probabilities.
𝐏(𝐀 ∩ 𝐁) = 𝐏(𝐀) ∙ 𝐏(𝐁)
In general –
𝐏(𝐀 ∩ 𝐁) = 𝐏(𝐀) ∙ 𝐏(𝐁|𝐀)
Mutually exclusive events –
Events which cannot occur simultaneously are called mutually exclusive
events. Example – getting both heads and tails on a single coin toss, which
is never possible. For such events, P(A ∩ B) = 0.
Probability Tree Problems
A bag has 3 red balls and 5 green balls. A person draws a ball, checks its
colour and drops it back into the bag. He repeats this one more time.
Construct a probability tree for this scenario –
First draw: Red (3/8)
    Second draw: Red (3/8)
    Second draw: Green (5/8)
First draw: Green (5/8)
    Second draw: Red (3/8)
    Second draw: Green (5/8)
Find the probability that both the balls are red.
Since the two draws are independent events, the multiplication law can be
applied –
$P = \frac{3}{8} \times \frac{3}{8} = \frac{9}{64}$
Find the probability that both balls are of different colours.
We need to consider all the possible cases (red then green, or green then red) –
$P = \frac{3}{8} \times \frac{5}{8} + \frac{5}{8} \times \frac{3}{8} = \frac{30}{64} = \frac{15}{32}$
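The probability tree can be enumerated in a few lines. This sketch multiplies the branch probabilities along each path, which is valid here because the ball is put back and the draws are independent.

```python
from fractions import Fraction
from itertools import product

# Probability of each colour on a single draw (3 red, 5 green, with replacement).
p = {"Red": Fraction(3, 8), "Green": Fraction(5, 8)}

# Every path through the two-level tree, with the product of its branch probabilities.
paths = {(first, second): p[first] * p[second] for first, second in product(p, repeat=2)}

both_red = paths[("Red", "Red")]
different = sum(prob for (first, second), prob in paths.items() if first != second)

print(both_red, different, sum(paths.values()))
```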
Random Variables
A random variable X assigns a numerical value to each possible outcome of a
random process.
Consider the following example –
A book store recorded data on book purchases. 24% of customers did not
purchase any book, 11% purchased one book, 28% purchased 2 books,
5% purchased 3 books, and 32% purchased 4 books. Let X be the number
of books purchased by a randomly selected customer.
State the possible values of X.
X = 0,1,2,3,4
Construct a probability table for X.
x 0 1 2 3 4
P(X=x) 0.24 0.11 0.28 0.05 0.32
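Find the mode and median.
Mode: the most probable value is X = 4, with P(X = 4) = 0.32.
Median: P(X = 0) + P(X = 1) = 0.35 and P(X = 0) + P(X = 1) + P(X = 2) = 0.63.
The 50th percentile is crossed at X = 2.
Hence, median = 2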
Expected Value
The expected value of a random variable is the probability-weighted average
of its possible outcomes. It is defined as –
$E(X) = \sum_{i=1}^{n} p_i \cdot x_i$
For the previous question, the expected value of the number of books is –
E = (0 × 0.24) + (1 × 0.11) + (2 × 0.28) + (3 × 0.05) + (4 × 0.32) = 2.1
Hence, the average number of books purchased by a customer is 2.1.
*Note –
Purchasing 2.1 books is not possible, yet the expected value is 2.1.
The expected value may or may not be a possible outcome of a single trial.
It is just the average over the whole distribution.
Variance and Standard Deviation for a random variable –
$\sigma = \sqrt{\mathrm{Variance}(X)} = \sqrt{\sum_{i} (x_i - \mu)^2 \cdot p_i}$
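Applying these formulas to the book store example gives the following sketch. The dictionary `distribution` holds the probability table for X; the function names are illustrative.

```python
from math import sqrt

# Probability table for X, the number of books purchased.
distribution = {0: 0.24, 1: 0.11, 2: 0.28, 3: 0.05, 4: 0.32}

def expected_value(dist):
    """Sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Probability-weighted mean squared deviation from the expected value."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

mu = expected_value(distribution)
var = variance(distribution)
sigma = sqrt(var)
print(round(mu, 4), round(var, 4), round(sigma, 4))
```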
Binomial Distribution
A binomial experiment is one with-
A fixed number of independent trials
Only two possible outcomes – success and failure
Probability of success is same for every trial.
𝑋~𝐵(𝑛, 𝑝) means X is binomially distributed with n trials and probability of
success is p.
Example –
Consider a 4-sided spinner with one side painted white and the rest of the
sides painted black, which is spun 3 times. Let’s consider a white outcome
to be a success, and let X be the random variable that denotes the number of
white outcomes in 3 trials. Hence, the probability of success = 1/4 and the
probability of failure = 3/4.
The probability tree would look like –
[Figure: probability tree for the three spins]
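Possible values of X = 0, 1, 2, 3
$P(X = 0) = \left(\frac{3}{4}\right)^3 = \frac{27}{64}$
$P(X = 1) = \left(\frac{3}{4}\right)^2 \cdot \frac{1}{4} + \frac{3}{4} \cdot \frac{1}{4} \cdot \frac{3}{4} + \frac{1}{4} \cdot \left(\frac{3}{4}\right)^2 = \frac{27}{64}$
$P(X = 2) = \frac{3}{4} \cdot \left(\frac{1}{4}\right)^2 + \frac{1}{4} \cdot \frac{3}{4} \cdot \frac{1}{4} + \left(\frac{1}{4}\right)^2 \cdot \frac{3}{4} = \frac{9}{64}$
$P(X = 3) = \left(\frac{1}{4}\right)^3 = \frac{1}{64}$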
The formula can be generalized as follows –
If X is a binomial variable with n independent trials and the probability of
success is p, then the probability of obtaining exactly x successes from n
trials will be –
$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
x can have the values 0, 1, 2, 3, ..., n
On a graphic display calculator, these functions are available as Binomial P.D
(probability of exactly x successes) and Binomial C.D (cumulative probability
of at most x successes).
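The binomial formula $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$ can be sketched with `math.comb`, here applied to the spinner example (n = 3, p = 1/4). The function name `binomial_pmf` is illustrative and is not a calculator or library function.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 3, Fraction(1, 4)
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Cumulative probability of at most 1 success.
at_most_one = sum(pmf[:2])

print(pmf, at_most_one)
```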
Mean and standard deviation of a Binomial Distribution
Mean is given by, $\mu = np$
Variance, $\sigma^2 = np(1 - p)$
Standard deviation, $\sigma = \sqrt{np(1 - p)}$
*Note
These formulas are strictly limited to binomial random variables. Do not
use them for discrete random variables that are not binomial.
Geometric Distribution
If X is geometrically distributed and the probability of success is p, the
probability that the first success occurs on the nth trial is –
$P(X = n) = p \cdot (1 - p)^{n - 1}$
Also,
For example, in a dice roll –
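A minimal sketch of the geometric formula $P(X = n) = p(1 - p)^{n - 1}$, assuming the event of interest is rolling a six on a fair die (so p = 1/6). This choice of event is an assumption made here for illustration, and the function name `geometric_pmf` is illustrative.

```python
from fractions import Fraction

def geometric_pmf(n, p):
    """Probability that the first success occurs on trial n (n = 1, 2, 3, ...)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (n - 1) failures followed by one success.
    return p * (1 - p) ** (n - 1)

p_six = Fraction(1, 6)  # assumed: success means rolling a six
first_six_on_third_roll = geometric_pmf(3, p_six)
print(first_six_on_third_roll)
```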
Sampling Distribution of sample means –
Given a population with mean $\mu$ and standard deviation $\sigma$, if we take
all possible samples of size n from that population, find the mean of each,
and look at the distribution of those means, the mean of that distribution
will be the same as that of the original population. Also, the distribution
will be approximately normal when n is large (central limit theorem).
Hence, for the distribution of sample means ($\bar{x}$),
$\mu_{\bar{x}} = \mu$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
The size of the sample should not be more than 10% of the population.
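The two results $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ can be verified exactly on a tiny population. This sketch assumes a hypothetical population (the faces of a die) and samples of size 2 drawn with replacement, so that the draws are independent; it enumerates every possible sample.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population
n = 2                             # sample size

def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)

def variance(values):
    values = list(values)
    m = mean(values)
    return sum(((v - m) ** 2 for v in values), Fraction(0)) / len(values)

# Mean of every possible ordered sample of size n (drawn with replacement).
sample_means = [mean(sample) for sample in product(population, repeat=n)]

print(mean(population), mean(sample_means))          # the two means agree
print(variance(population), variance(sample_means))  # variance shrinks by a factor n
```

Comparing variances is the same as comparing squared standard deviations, so the second line confirms $\sigma_{\bar{x}}^2 = \sigma^2 / n$.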
Sampling Distribution of sample proportion –
If p is the proportion of successes for a binomial variable and the number of
trials is n, we know that the mean and standard deviation of the binomial
variable are given as –
$\mu = np$
$\sigma = \sqrt{np(1 - p)}$
The mean and standard deviation of the sample proportion $\hat{p}$ are given
by the previous values divided by n –
$\mu_{\hat{p}} = p$
$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
The sampling distribution can be considered approximately normal, provided
that the expected numbers of successes and failures are both at least 10.
Thus, if np ≥ 10 and n(1 − p) ≥ 10 –
$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)$
Inferential Statistics
So far, we have treated a statistic as a description of a sample and a
parameter as a description of a population. In inferential statistics, we
use the statistic and perform calculations on it to get information about
the parameter.
Confidence Interval
Suppose we have to study the length of a particular species of trees in a
forest. Now, there would be thousands of trees in the forest so it is
practically impossible to survey each and every tree.
Let’s say that we surveyed 100 trees and their mean length came out to be
25 meters. Assume that we know the population standard deviation and that
it is equal to 4 meters.
Now is it possible for us to determine the average length of trees in the
entire forest? Or can we tell with a certain level of surety that the average
length will be in some interval? Let’s find out –
Suppose the lengths in the entire population are normally distributed. If we
take all the possible samples of size 100 from the population and find
their means, then by the central limit theorem, the distribution of sample
means will also be normal, provided some particular conditions are satisfied.
Recall the conditions from the sampling distribution of sample means.
Since the sample size (100) is greater than 30, we can assume the normal
distribution model. Assume that the mean of the original population is $\mu$.
[Figure: distribution of the original population, centred at $\mu$]
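Now, for the distribution of sample means, the mean and standard deviation
will be –
$\mu_{\bar{x}} = \mu$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{100}} = 0.4$
Hence, the plot will be narrow.
[Figure: distribution of sample means centred at $\mu_{\bar{x}} = \mu$, with the axis marked at $-2\sigma$, $-\sigma$, $\sigma$ and $2\sigma$]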
Now the mean of our sample ($\bar{x}$) will lie somewhere on this axis. We
can be 95% sure that it will lie between the z-values −1.96 and 1.96.
These values can be found using the InvNorm function on the calculator, or
using the z-value table.
Thus, if the sample mean lies in this interval, we can be 95% confident that
the population mean is at most 1.96 standard deviations (of the sampling
distribution) away from it.
Thus, the 95% confidence interval is defined as –
Confidence interval = $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
Here, n is the sample size and z* (the z-score) is 1.96 for 95% surety.
If we want a confidence interval with greater surety that the mean lies in
it, we have to choose the z* value accordingly.
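For the tree example (sample mean 25 m, population standard deviation 4 m, n = 100), the interval $\bar{x} \pm z^* \sigma / \sqrt{n}$ can be sketched as below. The function name `z_confidence_interval` is illustrative; `NormalDist().inv_cdf` from the Python standard library plays the role of the calculator's inverse normal function.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal curve.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=25, sigma=4, n=100)
print(round(low, 3), round(high, 3))
```

A higher confidence level gives a larger z* and therefore a wider interval.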
Confidence Interval for a Proportion
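If p is the population proportion and $\hat{p}$ is the sample proportion, we
can say with 95% surety that p will lie in the range –
$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Since p is unknown, the sample proportion $\hat{p}$ is used in its place when
computing the standard deviation.
Conditions must be checked before applying the model: the numbers of
successes and failures must both be at least 10.
That is, $n\hat{p} \ge 10$ and $n(1 - \hat{p}) \ge 10$.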
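A sketch of the 95% interval for a proportion, including the success/failure check. The survey numbers (120 successes out of 400) are hypothetical, and the standard deviation is estimated from the sample proportion.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z_star=1.96):
    """Confidence interval for a population proportion (z* = 1.96 gives 95%)."""
    p_hat = successes / n
    # Condition check: at least 10 successes and at least 10 failures.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 successes in 400 trials, so p_hat = 0.3.
low, high = proportion_confidence_interval(120, 400)
print(round(low, 4), round(high, 4))
```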
Confidence Interval for a Mean when population standard deviation is
not known –
In this case, instead of population standard deviation 𝝈, we know the
sample standard deviation 𝒔. In this case, we use something called a t-score
(t*) instead of a z-score (z*).
The confidence interval will be –
$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
To get the value of t*, we need a number called the degrees of freedom,
which is equal to (n – 1). Then, using the t-distribution table or the InvT
function on the calculator, we can get the t* value.
Estimating difference between two groups –
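If two independent groups are given with their means and standard
deviations, the mean and standard deviation of their sum or
difference are given by –
$\mu_{1 \pm 2} = \mu_1 \pm \mu_2$
$\sigma^2_{1 \pm 2} = \sigma_1^2 + \sigma_2^2$
$\sigma_{1 \pm 2} = \sqrt{\sigma_1^2 + \sigma_2^2}$
Note that the variances add for both the sum and the difference.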
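Hence, for proportions (the subscripts 1 and 2 denote the two groups),
$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$
Confidence interval for proportions –
$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
Similarly, for means, the confidence interval is given by –
$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$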
To get the t-score, the degree of freedom is given by –
The following conditions must always be satisfied –
The samples must be independent
For proportions, the numbers of successes and failures should both be at
least 10
For means, the sample size should be at least 30
Both samples should be less than 10% of their populations
Hypothesis Testing
This is a process for investigating the correctness of a hypothesis using
statistics. We start by assuming a null hypothesis, and set against it an
alternate hypothesis.
Assuming the null hypothesis is true, we evaluate a result from the sample
data. If that result is far from what we would expect under the null
hypothesis, we reject the null hypothesis. Otherwise, if the result is not
significantly far, we fail to reject the null.
P-value
A P-value is the probability, assuming the null hypothesis is true, of
getting a result at least as far from the null assumption as the one
observed. If the P-value is significantly low (P < 0.05 or P < 0.01), the
result is far away from the null assumption and we have sufficient evidence
to reject the null hypothesis.
Example –
In a school, a group of 40 students is selected and they are found to spend
Rs. 102 per day on average. The population standard deviation is Rs. 6.
What is the probability of getting a sample mean of Rs. 102 or more if the
population mean is Rs. 100?
Solution –
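Assuming the normal distribution, we have –
[Figure: normal curve centred at 100, with the area to the right of 102 shaded as the required portion]
Required probability = $P(\bar{x} > 102)$
$= P\left(z > \frac{102 - 100}{6 / \sqrt{40}}\right) = P(z > 2.11) \approx 0.017$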
Since P < 0.05, this is a statistically significant result.
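The P-value in this example can be reproduced with the standard library. The sketch standardises the sample mean using $\sigma / \sqrt{n}$ and takes the upper-tail area; the function name `upper_tail_p_value` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_p_value(sample_mean, mu0, sigma, n):
    """Return (z, P-value) for a one-sided test that the mean exceeds mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)   # area to the right of z
    return z, p_value

z, p_value = upper_tail_p_value(sample_mean=102, mu0=100, sigma=6, n=40)
print(round(z, 2), round(p_value, 4))
print("significant at 0.05:", p_value < 0.05)
```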
Types of alternate hypothesis –
The null hypothesis assumes that the difference of the means is zero, that
is, $\mu_1 = \mu_2$.
Hence, there are three possible alternate hypotheses –
$\mu_1 > \mu_2$ (one-sided alternative)
$\mu_1 < \mu_2$ (one-sided alternative)
$\mu_1 \ne \mu_2$ (two-sided alternative)
Inference for Categorical Data –
Chi – Squared Test
Assume we have the following categorical data –
Grade   Observed frequency   Expected frequency
A       200                  146
B       170                  146
C       90                   146
D       150                  146
E       120                  146
The chi-square statistic sums, over all categories, the squared difference
between the observed and expected values divided by the expected value –
$\chi^2 = \sum \frac{(O - E)^2}{E}$
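For the grade table, the chi-square statistic $\chi^2 = \sum (O - E)^2 / E$ can be computed as in this sketch. It assumes, as the table does, that every grade is expected equally often (730 / 5 = 146).

```python
# Observed grade frequencies from the table.
observed = {"A": 200, "B": 170, "C": 90, "D": 150, "E": 120}

total = sum(observed.values())
expected = total / len(observed)   # equal expected frequency for every grade

# Sum of (observed - expected)^2 / expected over all categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

# Degrees of freedom for a single categorical variable: categories minus one.
degrees_of_freedom = len(observed) - 1

print(expected, round(chi_square, 3), degrees_of_freedom)
```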
The chi-square test is a statistical method for assessing the discrepancy
between observed and expected data. It can also be used to check whether two
categorical variables in the data are associated. In both cases, the test
aims to determine whether a discrepancy between observed and expected data
is the result of random variation or of a relationship between the variables
being examined. This makes the chi-square test a good option for analysing
the relationship between two categorical variables.
A chi-square test is needed to test a hypothesis about the distribution of a
categorical variable. Categorical variables may be nominal or ordinal and
represent categories, such as nations or animals. Since they can only
take a small number of specific values, they cannot have a normal
distribution.
Q. In a study of 160 cities, the goal was to determine whether depression
rates were associated with the temperature of the city.
Temperature   Depression rate: Less   Normal   High
Less          10                      10       20
Normal        30                      10       10
High          20                      30       20
Does the above data provide enough evidence, at the 0.025 level of
significance, that the depression rate is associated with the temperature of
the city?
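One way to approach this question is a chi-square test of independence, sketched below. Each expected count is (row total × column total) / grand total, and the degrees of freedom are (rows − 1)(columns − 1) = 4. The critical value 11.143 is taken from a standard chi-square table for 4 degrees of freedom and an upper-tail area of 0.025; it is entered by hand here, not computed.

```python
# Observed counts: rows are temperature (Less, Normal, High),
# columns are depression rate (Less, Normal, High).
observed = [
    [10, 10, 20],
    [30, 10, 10],
    [20, 30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count if temperature and depression rate were independent.
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_VALUE = 11.143  # chi-square table value for df = 4, upper-tail area 0.025
reject_null = chi_square > CRITICAL_VALUE

print(round(chi_square, 3), df, reject_null)
```

Under these assumptions the statistic exceeds the critical value, so the null hypothesis of no association would be rejected at the 0.025 level.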