DA Unit-2 Probability and Statistical Methods
Sample space is the universal set that consists of all possible outcomes of
an experiment. Sample space is usually represented using the letter ‘S’
and individual outcomes are called the elementary events.
The sample space can be finite or infinite. For example, when a coin is
tossed three times, the sample space is
S = {(T, T, T), (T, T, H), (T, H, T), (T, H, H), (H, T, T), (H, T, H),
(H, H, T), (H, H, H)}
An event is a subset of the sample space. Suppose we want only the outcomes
that have at least two heads; the set of all such outcomes is:
E = {(H, T, H), (H, H, T), (H, H, H), (T, H, H)}
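The sample space and the event above can be enumerated directly. The following is a minimal Python sketch; the variable names are illustrative and not part of the notes.

```python
from itertools import product

# All ordered outcomes of tossing a coin three times (H = head, T = tail).
sample_space = list(product("HT", repeat=3))

# Event E: outcomes with at least two heads.
event = [outcome for outcome in sample_space if outcome.count("H") >= 2]

print(len(sample_space))                 # 8
print(event)
print(len(event) / len(sample_space))    # 0.5
```

Since all eight outcomes are equally likely, the probability of the event is simply the number of outcomes in E divided by the number of outcomes in S.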
For example, if A is the set of students with a CGPA (cumulative grade point average) of more than
3.5 out of 4 and B is the set of students with a CGPA of more than 3.0, then A is a subset of B
and P(A) ≤ P(B)
4. The probability that either event A or event B occurs, or both occur, is given by
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
5. If A and B are mutually exclusive events, so that P(A ∩ B) = 0, then
P(A ∪ B) = P(A) + P(B)
6. If A1, A2, …, An are n events that form a partition of the sample space S,
then their probabilities must add up to 1:
P(A1) + P(A2) + … + P(An) = 1
Joint Probability :
Let A and B be two events in a sample space. Then the joint probability of the two events,
written as P(A ∩ B), is given by
P(A ∩ B) = (Number of observations in A ∩ B) / (Total number of observations)
P(Divorced ∩ Default) = 13/1000 = 0.013
P(Single ∩ Default) = 42/1000 = 0.042
P(Divorced) = 50/1000 = 0.05
P(Single) = 300/1000 = 0.3
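The joint and marginal probabilities above can be combined into conditional probabilities using P(A|B) = P(A ∩ B)/P(B). The sketch below assumes only the four counts shown above (13, 42, 50 and 300 out of 1000); the dictionary layout and names are illustrative.

```python
from fractions import Fraction

total = 1000  # customers in the sample
counts = {
    "Divorced": {"group": 50, "default": 13},
    "Single": {"group": 300, "default": 42},
}

results = {}
for status, c in counts.items():
    joint = Fraction(c["default"], total)    # P(status ∩ Default)
    marginal = Fraction(c["group"], total)   # P(status)
    conditional = joint / marginal           # P(Default | status)
    results[status] = (joint, marginal, conditional)
    print(status, float(joint), float(marginal), float(conditional))
```

With these counts the conditional probability of default is 0.26 for divorced customers and 0.14 for single customers.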
1. Let there be a bag containing 5 white and 4 red balls. Two balls are
drawn from the bag one after the other without replacement. Consider
the following events.
A= Drawing a white ball in the first draw
B= Drawing a red ball in the Second draw.
Sol: P(B/A)= Probability of drawing a red ball in second draw given
that a white ball has already been drawn in the first draw.
P(B/A)= Probability of drawing a red ball from a bag containing 4
white and 4 red balls.
P(B/A)= 4/8 =1/2
For this Random Experiment P(A/B) is not meaningful because A
cannot occur after the occurrence of event B.
2. A die is thrown twice and the sum of the numbers appearing is observed
to be 6. What is the conditional probability that the number 4 has appeared
at least once?
A = The sum of the numbers appearing is 6
B = The number 4 has appeared at least once
Required probability = P(B/A)
Sol: Out of the 36 equally likely outcomes,
A = {(1,5), (2,4), (3,3), (4,2), (5,1)}, so P(A) = 5/36
B = {(1,4), (4,1), (2,4), (4,2), (3,4), (4,3), (4,4), (4,5), (5,4), (4,6), (6,4)}
A ∩ B = {(2,4), (4,2)}, so P(A ∩ B) = 2/36
Required probability = P(B/A)
= P(A ∩ B)/P(A) = (2/36)/(5/36) = 2/5
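The two-dice result can be checked by listing all 36 outcomes. This is a small Python sketch; the names A and A_and_B simply mirror the events in the example.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

A = [r for r in rolls if sum(r) == 6]          # the sum of the two numbers is 6
A_and_B = [r for r in A if 4 in r]             # ... and a 4 appears at least once

p_b_given_a = Fraction(len(A_and_B), len(A))
print(A)
print(p_b_given_a)                             # 2/5
```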
Question 3:
Fifteen cards are numbered from 1 to 15, and two cards are
chosen at random such that the sum of the numbers on both the
cards is even. Find the probability that the chosen cards are
odd-numbered.
Let, A ≡ event of selecting two odd-numbered cards
B ≡ event of selecting cards whose sum is even.
Sol: There are 8 odd-numbered and 7 even-numbered cards. The sum is even
only when both cards are odd or both are even. Then,
n(B) = number of ways of choosing two cards whose sum is even
= 8C2 + 7C2 = 28 + 21 = 49.
n(A ∩ B) = number of ways of choosing two odd-numbered cards (their sum
is always even)
= 8C2 = 28.
Each probability is the count divided by 15C2, which cancels in the ratio.
Now, P(A|B) = P(A ∩ B)/P(B)
= 8C2 / (8C2 + 7C2) = 28/49 = 4/7.
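The counting argument for the card question can be verified by brute force. The sketch below assumes the cards are numbered 1 to 15, which is what the counts 8C2 and 7C2 imply.

```python
from fractions import Fraction
from itertools import combinations

cards = range(1, 16)                     # cards numbered 1 to 15
pairs = list(combinations(cards, 2))     # all unordered pairs of cards

even_sum = [p for p in pairs if sum(p) % 2 == 0]                       # event B
both_odd = [p for p in even_sum if p[0] % 2 == 1 and p[1] % 2 == 1]    # A ∩ B

print(len(even_sum), len(both_odd))              # 49 28
print(Fraction(len(both_odd), len(even_sum)))    # 4/7
```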
Bayes’ theorem is one of the most important concepts in analytics
since several problems are solved using Bayesian statistics. Consider
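two events A and B. We can write the following two conditional
probabilities:
P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(A ∩ B) / P(A)
Eliminating P(A ∩ B) from the two equations gives Bayes’ theorem:
P(B|A) = P(A|B) P(B) / P(A)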
Random variable X (number of heads in two coin tosses):
Outcome   HH   HT   TH   TT
X         2    1    1    0
Random variables can be classified as discrete or continuous depending on the values that
the random variable can take.
Discrete Random Variables :
A random variable which takes a finite or at most countably infinite
number of values is known as a discrete random variable. In other words, a discrete
random variable takes a countable number of possible values.
Ex: i) Marks obtained by a student in a test
ii) Number of defective nuts in a lot
iii) The number of cars that pass through a given intersection in an
hour.
iv) Number of errors on a page of a book
v) Number of accidents taking place on a busy road.
For example, when a die is rolled, the outcome X can take the values X = {1, 2, 3, 4, 5, 6}.
Another popular example of a discrete random variable is the number of heads when
two coins are tossed. In this case, the random variable X can take only one of three
values, i.e., 0, 1, or 2.
Continuous Random variable :
A random variable which can take any value in an interval is called a
continuous random variable.
Example: i) Waiting time for a bus
For the number of heads X in two coin tosses, the probabilities of all the values add up to 1:
P(X=0) + P(X=1) + P(X=2)
= 1/4 + 1/2 + 1/4
= 1
The cumulative distribution function, F(xi), is the probability that the random
variable X takes a value less than or equal to xi. That is, F(xi) = P(X ≤ xi).
From the above problem,
P(X < 2) is the probability that the number of heads is less than two, i.e.,
at most one.
P(X < 2) = F(1) = P(X=0) + P(X=1)
= 1/4 + 1/2
= 0.75
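The probability mass function and cumulative distribution function for the two-coin example can be built from the outcomes themselves. This is a minimal sketch; the helper name cdf is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))            # HH, HT, TH, TT
heads = Counter(o.count("H") for o in outcomes)     # how often each X value occurs

# Probability mass function: P(X = x) for x = 0, 1, 2.
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(heads.items())}

def cdf(x):
    """F(x) = P(X <= x) for the number of heads X."""
    return sum(p for value, p in pmf.items() if value <= x)

print(pmf)
print(cdf(1))   # 3/4
print(cdf(2))   # 1
```

Note that F(1) = 3/4 is the probability of fewer than two heads, while F(2) = 1 because X can never exceed 2.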
Example 2:
The Cumulative Distribution Function (CDF) is another important concept in
probability theory and statistics, especially when dealing with random variables, whether
discrete or continuous. The CDF provides the probability that a random variable X takes
on a value less than or equal to a specific point x.
The cumulative distribution function is denoted by F(x) and its formula is given by:
F(x)=P(X≤x)
Probability Density Function and Cumulative Distribution Function of a
Continuous Random Variable :
What is the probability for the student to fail the test (i.e., to have less
than 6 correct answers)?
Answer:
Binomial Mean and Variance:
Mean= np
Variance=np(1-p)
Binomial Mean E(X) = 10 * 0.25 = 2.5.
Variance V (X) = 10 * (0.25) * (1 − 0.25) = 1.875.
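The binomial quantities above can be computed in a few lines. The sketch assumes n = 10 questions and p = 0.25 per question, the values used in the mean and variance above (for instance, pure guessing among four options, which is an assumption since the question statement is not reproduced here).

```python
from math import comb

n, p = 10, 0.25  # values used in the mean and variance above

def binomial_pmf(k, n, p):
    """P(X = k) = nCk * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

mean = n * p
variance = n * p * (1 - p)
p_fail = sum(binomial_pmf(k, n, p) for k in range(6))   # P(X < 6) = P(X <= 5)

print(mean, variance)        # 2.5 1.875
print(round(p_fail, 4))      # about 0.98
```

Under these assumptions the student fails with a probability of roughly 0.98.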
Poisson Distribution
Poisson Distribution is a Probability distribution that is used to show how many times
an event occurs over a specific period.
It is the discrete probability distribution of the number of events occurring in a given
time period, given the average number of times the event occurs over that time
period. It is the distribution related to probabilities of events that are extremely rare
but have a large number of independent opportunities for occurrence.
Poisson Distribution Definition
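Poisson distribution is used to model the number of events that occur in a fixed
interval of time or space, given the average rate of occurrence, assuming that the
events happen independently and at a constant rate.
Poisson distribution formula
P(X = k) = e^(−λ) λ^k / k!,  k = 0, 1, 2, …
where λ is the average number of events in the interval.
Mean and Variance of Poisson distribution:
The Poisson distribution has only one parameter, called λ, and both its mean and
its variance are equal to λ:
Mean = λ
Variance = λ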
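A minimal Python sketch of the Poisson probability mass function P(X = k) = e^(−λ) λ^k / k! follows. The rate λ = 2 is a hypothetical value chosen only for illustration; it is not taken from the notes.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0  # hypothetical rate, chosen only for illustration

probs = [poisson_pmf(k, lam) for k in range(60)]   # the tail beyond 60 is negligible here
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

print(round(poisson_pmf(0, lam), 4))         # P(no events)
print(round(sum(probs[:6]), 4))              # P(five or fewer events)
print(round(mean, 6), round(variance, 6))    # both equal to lam
```

The numerically computed mean and variance both come out equal to λ, which is the defining property of the distribution.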
Suppose 400 pages of the book are randomly selected. What are the
probabilities for having no typos and for having five or fewer typos?
Sol:
NORMAL DISTRIBUTION (GAUSSIAN DISTRIBUTION) :
The normal distribution is the most widely known and used of all
distributions. Because the normal distribution approximates many natural
phenomena so well, it has developed into a standard of reference for many
probability problems.
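Let X be a continuous random variable; then it is said to follow a normal
distribution if its probability density function is given by
f(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞
where μ is the mean and σ is the standard deviation of the distribution.
Thus, any normal random variable X can be expressed using the standard
normal random variable Z = (X − μ) / σ, which has mean 0 and standard deviation 1.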
Solved Examples
1. Calculate the probability density of the normal distribution with population mean
2 and standard deviation 3 at the value 5 of the random variable.
Solution:
x = 5
Mean = μ = 2
Standard Deviation = σ = 3
We solve the question with the help of the above normal
probability distribution formula:
f(5) = (1 / (3 √(2π))) e^(−(5 − 2)² / (2 × 3²)) = (1 / 7.5199) e^(−0.5) ≈ 0.0807
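The density for this example can be cross-checked in Python. The sketch evaluates the normal density formula by hand and compares it with the standard library's statistics.NormalDist; the z-score and cumulative probability are extra illustrations, not part of the original question.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

mu, sigma, x = 2, 3, 5

# Density from the formula f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)).
density = exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Cross-check with the standard library, and standardize x to a z-score.
dist = NormalDist(mu, sigma)
z = (x - mu) / sigma

print(round(density, 4))            # 0.0807
print(round(dist.pdf(x), 4))        # 0.0807
print(z, round(dist.cdf(x), 4))     # 1.0 0.8413
```

Note that 0.0807 is a density value, not a probability; the probability P(X ≤ 5) is the cumulative value 0.8413.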
SAMPLING
Definition: A portion of the population which is examined with a
view to determining the population characteristics is called a
sample.
In other words, sample is a subset of population. Size of the sample
is denoted by n. The process of selection of a sample is called
Sampling.
There are different methods of sampling
Probability Sampling Methods
Non-Probability Sampling Methods
Probability Sampling Methods :
a) Random Sampling (Probability Sampling): It is the process of drawing a sample from a
population in such a way that each member of the population has an equal chance of being included in
the sample.
Example: A hand of cards from a well shuffled pack of cards is a random sample.
Note: If N is the size of the population and n is the size of the sample, then
The no. of samples with replacement = N^n
The no. of samples without replacement = NCn
b) Stratified Sampling : In this method, the population is first divided into several smaller groups called strata
according to some relevant characteristic.
From each stratum, units are selected at random, and the selected units are combined to form the
stratified sample.
c) Cluster Sampling :
In cluster sampling, the population is divided into mutually exclusive clusters.
For example, assume that a researcher is interested in analyzing life of smart phone batteries from a
specific manufacturer. The manufacturer may have different models (each model in this case will be a
cluster).
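d) Systematic Sampling (Quasi Random Sampling): In this method, all the units of the population
are arranged in some order. If the population size is N and the sample size is n, then we first define the
sampling interval, denoted by k = N/n. One unit is then selected at random from the first k units, and
every k-th unit after it is included in the sample.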
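Two of the ideas in the list above lend themselves to a short Python sketch: the counting rule for samples in (a) and the selection rule for systematic sampling in (d). The population sizes and the function name systematic_sample are hypothetical, and the sketch uses the usual convention of a random start within the first interval.

```python
import random
from itertools import combinations, product
from math import comb

# (a) Counting samples: hypothetical population of N = 6 units, samples of size n = 2.
N, n = 6, 2
units = list(range(1, N + 1))
print(len(list(product(units, repeat=n))), N**n)        # 36 36  (with replacement)
print(len(list(combinations(units, n))), comb(N, n))    # 15 15  (without replacement)

# (d) Systematic sampling: random start in the first k units, then every k-th unit.
def systematic_sample(population, sample_size, rng):
    k = len(population) // sample_size      # sampling interval k = N/n
    start = rng.randrange(k)                # index of the randomly chosen first unit
    return [population[start + i * k] for i in range(sample_size)]

population = list(range(1, 101))            # hypothetical ordered population, N = 100
sample = systematic_sample(population, 10, random.Random(0))
print(sample)                               # ten units spaced 10 apart
```

Passing a seeded random.Random instance keeps the sketch reproducible; in practice the start would be drawn without a fixed seed.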
Non Probability Sampling Methods:
Sample units are selected based on convenience and/or on voluntary basis.
Ex: Assume that a data scientist is interested in studying attrition and factors
influencing attrition. For this study, he/she may collect data from his friends and
colleagues which may not be true representation of the population. Such
sampling procedures come under the category of non-probability sampling.
Convenience Sampling :
Convenience sampling is a non-probability sampling technique in which the sample
units are not selected according to a probability distribution. For example, a
researcher may collect data from his school or the work place and from his/her
friends since the cost of data collection in such cases is minimal. Convenience
sampling is not recommended since it is likely to result in biased estimates.
Voluntary Sampling : Under voluntary sampling, the data is collected from people
who volunteer for such data collection. For example, customer feedback in many
contexts falls under this sampling procedure. There could be bias in voluntary
sampling. Many organizations such as Amazon and Trip Advisor collect customer
feedback. Often the feedback is provided by customers who had a bad
experience with the product/service; many customers who were happy with the
product/service may not give feedback.
Purposive (Judgment) Sampling : In this method, the members constituting the
sample are chosen not according to some definite scientific procedure, but
according to the convenience and personal choice of the individual who selects the
sample. The choice of the individual items of the sample depends entirely on the
judgment of the investigator.
Sequential Sampling: It consists of a sequence of samples drawn one after another
from the population. Whether a further sample is drawn depends on the results of the
previous samples: if the result of the first sample is not acceptable, then a second
sample is drawn, and the process continues until a proper decision can be taken. But if
the first sample is acceptable, then no new sample is drawn.
Classification of Samples:
Large Samples : If the size of the sample n ≥ 30 , then it is said to
be large sample.
Small Samples : If the size of the sample n < 30 ,then it is said to
be small sample or exact sample.
Parameters and Statistics:
Parameter is a statistical measure based on all the units of a
population.
Statistic is a statistical measure based on only the units selected in a
sample.
Note: In this unit, Parameter refers to the population and Statistic
refers to sample.
SAMPLING DISTRIBUTION
Sampling distribution refers to the probability distribution of a
statistic such as sample mean and sample standard deviation
computed from several random samples of same size.
Understanding the sampling distribution is important for
hypothesis testing. Test statistic in hypothesis testing is derived
based on the knowledge of sampling distribution.
In this example, the population is the weight of six pumpkins (in
pounds) displayed in a carnival "guess the weight" game booth. You
are asked to guess the average weight of the six pumpkins by taking
a random sample without replacement from the population.
Since we know the weights from the population, we can find the population
mean.
To demonstrate the sampling distribution, let’s start by obtaining all of the
possible samples of size n=2 from the population, sampling without
replacement. The table below shows all the possible samples, the weights for the
chosen pumpkins, the sample mean and the probability of obtaining each sample.
The mean of the sample means is:
= 9.5(1/15) + 11.5(1/15) + 12(2/15) + 12.5(1/15) + 13(1/15) + 13.5(1/15)
+ 14(1/15) + 14.5(2/15) + 15.5(1/15) + 16(1/15) + 16.5(1/15) + 17(1/15)
+ 18(1/15)
= 14
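The sampling distribution for n = 2 can be reproduced by listing every possible sample. The six weights in the sketch below are an assumption: they are not given in the text above, and were chosen because they reproduce exactly the sample means listed and the population mean of 14.

```python
from fractions import Fraction
from itertools import combinations

# Assumed weights (pounds): not stated in the notes, chosen because they
# reproduce the listed sample means and the population mean of 14.
weights = [19, 14, 15, 9, 10, 17]

population_mean = Fraction(sum(weights), len(weights))
samples = list(combinations(weights, 2))                # all samples of size 2
sample_means = [Fraction(sum(s), 2) for s in samples]
mean_of_means = sum(sample_means) / len(sample_means)   # each sample has probability 1/15

print(len(samples))                                     # 15
print(sorted(float(m) for m in sample_means))
print(population_mean, mean_of_means)                   # 14 14
```

The mean of the sample means equals the population mean, which is the property the example is meant to demonstrate.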
Now, let's do the same thing as above but with sample size n=5
Central Limit Theorem: If x̄ is the mean of a random sample of size n
drawn from a population having mean μ and standard deviation σ, then
the sampling distribution of the sample mean x̄ is approximately a normal
distribution with mean μ and SD = S.E. of x̄ = σ/√n, provided the
sample size n is large.
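The central limit theorem can be illustrated with a small simulation. The sketch below assumes a hypothetical, strongly skewed population (exponentially distributed values) and a fixed random seed; neither comes from the notes.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)                                       # fixed seed for repeatability
population = [rng.expovariate(1.0) for _ in range(20_000)]    # hypothetical skewed population
mu, sigma = mean(population), pstdev(population)

n = 50                                                        # sample size
sample_means = [mean(rng.sample(population, n)) for _ in range(2_000)]

print(round(mu, 3), round(mean(sample_means), 3))                  # both close to mu
print(round(sigma / n**0.5, 3), round(pstdev(sample_means), 3))    # both close to sigma / sqrt(n)
```

Even though the population is far from normal, the sample means cluster around μ with a spread close to the standard error σ/√n.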
Estimate : An estimate is a statement made to find an unknown population
parameter.
Estimator : The procedure or rule to determine an unknown population
parameter is called estimator.
Example: Sample proportion is an estimate of population proportion , because
with the help of sample proportion value we can estimate the population
proportion value.
Types of Estimation:
Point Estimation: If the estimate of the population parameter is given by a
single value, then the estimate is called a point estimate of the parameter.
Interval Estimation: If the estimate of the population parameter is given by
two different values between which the parameter is expected to lie, then the estimate is
called an interval estimate of the parameter.
INTRODUCTION TO HYPOTHESIS TESTING:
A hypothesis is a claim or belief. Hypothesis testing is a statistical process of
either rejecting or retaining a claim, belief, or association related to a
business context, product, service, process, etc.
Hypothesis testing consists of two complementary statements called null
hypothesis and alternative hypothesis, and only one of them is true.
Null hypothesis is the claim that is assumed to be true initially. That is at the
beginning we assume that the null hypothesis is true and try to retain it
unless there is strong evidence against null hypothesis.
Alternative hypothesis, usually denoted as HA (or H1 ), is the complement
of null hypothesis. Alternative hypothesis is what the researcher believes to
be true and would like to reject the null hypothesis.
Hypothesis testing is an integral part of many predictive analytics
techniques such as multiple linear regression and logistic regression.
In business, many claims are made by organizations. A few examples of such
claims are listed below:
1. Children who drink the health drink Complan (a health drink owned by
the company Heinz in India) are likely to grow taller.
2. If you drink Horlicks, you can grow taller, stronger, and sharper (3 in 1).
3. Using Fair and Lovely (Fair and Handsome) cream can make one fair and
lovely (fair and handsome).
4. Wearing perfume (such as Axe) will help to attract the opposite gender
(known as the Axe effect).
5. Women use camera phones more than men (Freier, 2016).
There are many such claims and beliefs; many business rules and strategies
are generated based on these hypotheses. The question is how can we check
whether these are actually true. Hypothesis testing is used for checking the
validity of the claim using evidence found in a sample data.
The main steps are as follows:
Decide the criterion for rejection and retention of the null hypothesis. This is called
the significance value, traditionally denoted by the symbol α. The value of α will
depend on the context; usually 0.1, 0.05, or 0.01 is used.
Calculate the p-value (probability value), which is the conditional probability
of observing a test statistic value at least as extreme as the one computed from
the sample when the null hypothesis is true. In simple terms, the larger the
p-value, the more the sample supports the null hypothesis.
Take the decision to reject or retain the null hypothesis based on the p-value
and the significance value α. The null hypothesis is rejected when the p-value is less
than α, and it is retained when the p-value is greater than or equal
to α.
For example, in a left-tailed test, if the calculated statistic value is less than the
critical value (the p-value will be less than α), then we reject the null hypothesis,
whereas if the statistic value is greater than the critical value (the p-value will be
greater than α), then we retain the null hypothesis.
TYPE I ERROR, TYPE II ERROR
In hypothesis test we end up with the following two decisions:
1. Reject null hypothesis.
2. Fail to reject (or retain) null hypothesis.
Type I Error: Conditional probability of rejecting a null hypothesis
when it is true is called Type I Error or False Positive (falsely believing
that the claim made in alternative hypothesis is true).
A type I error (false-positive) occurs if an investigator rejects a null
hypothesis that is actually true in the population.
The significance value α is the value of Type I error.
Type I Error = α = P(Rejecting null hypothesis | H0 is true)
Probability value (p-value) is the evidence for the null hypothesis
whereas significance value α is the error based on repetitive sampling.
Type II Error: Conditional probability of failing to reject a null
hypothesis (or retaining a null hypothesis) when the alternative hypothesis
is true is called Type II Error or False Negative (falsely believing that there
is no relationship).
A type II error (false-negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population.
Usually Type II error is denoted by the symbol ß.
Type II Error = ß = P(Retain null hypothesis | H0 is false)
The value (1 − ß ) is known as the power of hypothesis test.
Power of the test = 1 − ß = 1 − P(Retain null hypothesis | H0 is false)
Alternatively, the power of the test = 1 − ß = P(Reject null hypothesis | H0 is
false).
False-positive and false-negative results can also occur because of bias.
T-test :
The t-test is used when the population follows a normal distribution and the population standard
deviation σ is unknown and is estimated from the sample. The t-test is robust to violations of
normality as long as the data is close to symmetric and there are no outliers.
Let S be the standard deviation estimated from a sample of size n. Then the statistic
t = (X̄ − μ) / (S / √n)
will follow a t-distribution with (n − 1) degrees of freedom if the sample is drawn from a
population that follows a normal distribution. Here X̄ is the sample mean and μ is the hypothesized
population mean. One degree of freedom is lost since the standard
deviation is estimated from the sample. Thus, we use the t-statistic (hence the test is called t-test) to
test the hypothesis when the population standard deviation is unknown.
The t-test is a statistical test procedure that tests whether there is a
significant difference between the means of two groups.
EX: The two groups could be, for example, patients who received drug
A once and drug B once, and you want to know if there is a difference in
blood pressure between these two groups.
Types of t-test :
There are three different types of t-tests.
One-sample t-test
We use the one-sample t-test when we want to compare the mean of a sample with a known
reference mean.
Example : A manufacturer of chocolate bars claims that its chocolate bars weigh 50 grams on
average. To verify this, a sample of 30 bars is taken and weighed. The mean value of this sample is
48 grams.
Independent-sample t-test
We use the t-test for independent samples when we want to compare the means of two
independent groups or samples. We want to know if there is a significant difference between these
means.
Example : We would like to compare the effectiveness of two painkillers, drug A and drug B.
Paired-sample t-test
The t-test for dependent samples is used to compare the means of two dependent groups.
Example : We want to know how effective a diet is. To do this, we weigh 30 people before the diet
and exactly the same people after the diet.
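The one-sample case (the chocolate bar example) can be sketched with the statistic t = (x̄ − μ0) / (s / √n). The sample mean, claimed mean and sample size come from the example; the sample standard deviation s = 4 grams is hypothetical, since the example does not give one, and the critical value is the usual two-tailed t-table entry for α = 0.05 with 29 degrees of freedom.

```python
from math import sqrt

sample_mean = 48.0   # grams, from the example
mu_0 = 50.0          # claimed mean weight in grams
n = 30               # sample size
s = 4.0              # hypothetical sample standard deviation (not given in the example)

t_stat = (sample_mean - mu_0) / (s / sqrt(n))
df = n - 1
t_critical = 2.045   # two-tailed critical value for alpha = 0.05, df = 29 (from a t-table)

print(round(t_stat, 3), df)    # -2.739 29
print("reject H0" if abs(t_stat) > t_critical else "retain H0")
```

With this assumed standard deviation the claim of a 50 gram average would be rejected at the 5% level; a larger s would weaken the evidence.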
Chi-Square Goodness of Fit Tests
Goodness of fit tests are hypothesis tests that compare the
observed distribution of the data with a specified theoretical distribution. The
test decides whether there is a statistically significant difference between the
two by comparing the observed frequencies in the data with the frequencies
expected if the data followed the theoretical distribution.
The null and alternative hypotheses in chi-square goodness of fit tests are
H0 : There is no statistically significant difference between the observed
frequencies and the expected frequencies from a hypothesized
distribution.
HA: There is a statistically significant difference between the observed
frequencies and the expected frequencies from a hypothesized
distribution.
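Let Z be a standard normal random variable. Then Z² follows a chi-square
distribution with 1 degree of freedom.
If we have k independent standard normal random variables, namely, X1, X2, …, Xk, then a
chi-square distribution with k degrees of freedom is given by
χ²(k) = X1² + X2² + … + Xk²
The chi-square test statistic is
χ² = Σi Σj (Oij − Eij)² / Eij
where Oij is the observed frequency in category (i, j) and Eij is the expected
frequency in category (i, j). Since the statistic is a sum of squared terms, large
values indicate a poor fit; thus, the chi-square test is always a right-tailed
test.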
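A goodness of fit statistic of the form Σ (O − E)² / E can be computed in a few lines. The sketch below uses a single index over categories rather than the double index (i, j), and the observed counts (120 rolls of a die tested against a fair-die distribution) are hypothetical.

```python
observed = [18, 22, 16, 25, 19, 20]    # hypothetical counts from 120 rolls of a die
expected = [sum(observed) / len(observed)] * len(observed)   # fair die: 20 per face

# Chi-square statistic: sum of (observed - expected)^2 / expected over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
critical = 11.070   # chi-square critical value for alpha = 0.05, df = 5 (from a table)

print(round(chi_square, 2), df)    # 2.5 5
print("reject H0" if chi_square > critical else "retain H0")
```

Because the statistic (2.5) is well below the right-tail critical value, these hypothetical counts are consistent with a fair die.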
INTRODUCTION TO ANALYSIS OF VARIANCE (ANOVA)
The objective of ANOVA is to check simultaneously whether the population
means of more than two populations are different.
ANOVA stands for Analysis of Variance. It is a statistical method used to
analyze the differences between the means of two or more groups or
treatments.
It is often used to determine whether there are any statistically significant
differences between the means of different groups.
ANOVA is used to compare treatments, analyze a factor's impact on a
variable, or compare means across multiple groups.
Types of ANOVA include one-way (for comparing means of groups) and
two-way (for examining effects of two independent variables on a
dependent variable).
One-way analysis of variance (ANOVA) : It is a statistical method
for testing for differences in the means of three or more groups.
In statistics, ANOVA also uses a Null hypothesis and an Alternate
hypothesis.
The null hypothesis in ANOVA states that all the group means are
equal, i.e., that they do not differ significantly.
The alternate hypothesis states that at least one of
the group means is different from the rest. In
mathematical form, they can be represented as:
H0 : μ1 = μ2 = … = μk
HA : Not all μi are equal
where μi is the mean of the i-th level of the factor.
Ex for One-way ANOVA:
Suppose you are studying the effectiveness of three different drugs (Drug
A, Drug B, and Drug C) in reducing blood pressure. You randomly assign
90 patients to one of the three drug groups and measure their blood
pressure after one month of treatment. The blood pressure measurements
(in mmHg) for each patient are observed and prepared as a dataset.
In this dataset, each drug group represents a separate treatment or
condition, and the blood pressure measurements for each patient in that
group are recorded.
To analyze this dataset using ANOVA, you would compare the means of
the blood pressure measurements among the three drug groups to
determine if there is a statistically significant difference.
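The comparison of group means in a one-way ANOVA comes down to an F statistic: the between-group variation divided by the within-group variation, each scaled by its degrees of freedom. The sketch below uses hypothetical readings with five patients per group instead of the 30 per group in the example, and the names ssb, ssw and sst are illustrative.

```python
from statistics import mean

# Hypothetical blood pressure readings (mmHg); the study above would have 30 per group.
groups = {
    "Drug A": [120, 122, 118, 121, 119],
    "Drug B": [125, 127, 126, 124, 128],
    "Drug C": [130, 129, 131, 132, 128],
}

values = [x for g in groups.values() for x in g]
grand_mean = mean(values)

# Between-group, within-group and total sums of squares.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ssw = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
sst = sum((x - grand_mean) ** 2 for x in values)

k, n = len(groups), len(values)
f_stat = (ssb / (k - 1)) / (ssw / (n - k))   # F with (k - 1, n - k) degrees of freedom

print(round(ssb, 3), ssw, round(sst, 3))
print(round(f_stat, 2))
```

The total variation splits exactly into the between-group and within-group parts (sst = ssb + ssw), and a large F value, compared against an F-table critical value, indicates that at least one group mean differs.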
Two-Way ANOVA : The two-way ANOVA technique is used
when the data are classified based on two factors.
Ex: The agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of the different varieties of
fertilizers used.
It is a statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable.
A two-way ANOVA test analyzes the effect of the independent variables
on the expected outcome along with their relationship to the
outcome itself.
Ex for Two-way ANOVA:
Two-way (or two factor) analysis of variance tests whether there is a
difference between more than two independent samples split between
two variables or factors.
A factor is, for example, the gender of a person with the characteristics
male and female, the form of therapy used for a disease with therapy A,
B and C or the field of study with, for example, medicine, business
administration, psychology and math.
For example:
In addition to gender, the highest level of education also has an influence
on salary.
Besides therapy, gender also has an influence on blood pressure.
In addition to the field of study, the university attended also has an
influence on the duration of studies.
Now in all three cases you would not have one factor, but two factors
each. And since you now have two factors, you use the two-way
analysis of variance.
Formulas of ANOVA:
Sum of Squares of Total Variation (SST):