Final Correction Basic Statistics Combined Chapter


Basic

Statistics
Chapter -1 Stats
Vision
At our educational institute, our mission is to provide an empowering learning
environment that encourages students to excel and reach their full potential. Our
vision is to equip students with the knowledge, skills and values necessary for them to
become responsible and engaged citizens in the global community.

Mission
Our educational institute has a vision of providing students with an
environment for learning and growth. We offer a wide range of courses,
from basic to advanced, to help students gain knowledge and skills that can
be applied in their daily lives.

Objective
Our educational institute is dedicated to providing a world-class learning
experience. We strive to foster a culture of exploration, creativity, and
collaboration in our students. Our faculty and staff are committed to
providing an environment that encourages academic excellence and
personal growth.
Dear Students,
On behalf of Learning Lab, I am delighted to welcome you to our esteemed institution! As
the Director, I am committed to providing you with the best possible education in the field
of data science, Digital Marketing, Cyber Security & many more.
At our institute, we believe in staying at the forefront of the latest advancements and
industry trends. Our experienced faculty, who are experts in the field, will guide you
through a comprehensive curriculum that covers a wide range of topics, including data
analysis, machine learning, statistical modeling, data visualization, Digital marketing and
more.
Through hands-on projects, case studies, and real-world applications, you will gain practical
experience and develop critical thinking skills that will enable you to excel in any role.
In addition to technical skills, we also emphasize soft skills such as communication,
teamwork, and problem-solving. Our collaborative learning environment encourages
discussions, teamwork, and peer-to-peer learning, providing you with a well-rounded
education.
I am excited to witness your growth and success as you embark on this data science journey
with us. Get ready to unlock your potential and thrive in the data-driven world!

Best regards,
- Prashant Biradar
Founder & CEO
Agenda: Basic Statistics
1 Data types: Continuous, Discrete, Nominal, Ordinal, Interval, Ratio

2 First, second, third and fourth moment business decisions; graphical representation: Barplot, Histogram, Boxplot

3 Random variable, Probability, Probability Distribution

4 Hypothesis Testing and ANOVA

5 Simple Linear Regression

www.learninglabb.com Page 1
Statistics
Statistics is the science of data. It involves
Collecting
Classifying
Summarizing
Analyzing and
Interpreting numerical information.
Statistics is used in several different disciplines (both scientific and non-scientific) to make decisions and draw
conclusions based on data.

Statistics is indeed the science of data that involves various processes such as collecting, classifying, summarizing,
analyzing, and interpreting numerical information. These processes are used to gain insights and make informed
decisions based on the data available. Here's a brief explanation of each step:

Collecting: This involves gathering data from various sources, such as surveys, experiments, observations, or existing
datasets. The data collected should be relevant and representative of the population or phenomenon of interest.

Classifying: Once the data is collected, it needs to be organized and classified into different categories or groups.
This step helps in better understanding the data and identifying patterns or relationships within the dataset.

Summarizing: In this step, the collected data is summarized using various statistical measures such as measures of
central tendency (e.g., mean, median, mode) and measures of dispersion (e.g., range, standard deviation).
Summarizing the data provides a concise representation of its main characteristics.

Analyzing: Statistical analysis involves applying various statistical techniques and methods to explore relationships,
patterns, and trends in the data. This step may involve techniques like hypothesis testing, regression analysis,
correlation analysis, or data mining, depending on the research question or objective.

Interpreting: The final step in the statistical process is interpreting the results obtained from the analysis. This
involves drawing conclusions, making inferences, and providing meaningful insights based on the statistical findings.
Effective interpretation requires an understanding of the context and subject matter expertise.

By following these steps, statisticians can extract valuable information from data and provide evidence-based
insights to inform decision-making in various fields such as business, healthcare, social sciences, and more.

Books and Softwares

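Sample statistics vs. population parameters

| Sample statistics | Population parameters |
| --- | --- |
| A sample is a set of n observations actually obtained. A statistic is a numerical value that describes the sample. | A population is a hypothetical set of N observations from which the sample is drawn. A parameter is a numerical value that describes the population. |
| $\bar{x}$ = sample mean | $\mu$ = population mean |
| $s^2$ = sample variance | $\sigma^2$ = population variance |
| $s$ = sample standard deviation | $\sigma$ = population standard deviation |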

Statistics
There are two types of statistics that are often referred to when making a statistical decision or working on a statistical
problem.

Descriptive Statistics: Descriptive statistics involve using numerical and graphical methods to examine and summarize
a given data set. The main objective is to describe and understand the characteristics and patterns present within the
data. Descriptive statistics provide measures that summarize the central tendency (such as mean, median, and mode)
and variability (such as range, standard deviation, and variance) of the data. Graphical representations like histograms,
bar charts, pie charts, and scatter plots can also be used to visually depict the data. Descriptive statistics help in
providing a concise summary of the data, enabling individuals to comprehend and interpret the information for
decision-making purposes.

Inferential Statistics: Inferential statistics involve drawing conclusions or making inferences about a larger population
based on a sample of data. It allows you to generalize the findings from the sample to the entire population. Inferential
statistics use probability theory and sampling techniques to estimate population parameters, test hypotheses, make
predictions, or determine relationships between variables. Techniques such as confidence intervals and hypothesis
tests (such as z-tests or t-tests) are commonly used in inferential statistics. The goal is to make reliable and valid
inferences about the population based on the observed sample data.

In summary, descriptive statistics summarize and describe the characteristics of a given data set, while inferential
statistics use sample data to draw conclusions and make inferences about a larger population. Both branches of
statistics are crucial in analyzing and interpreting data to gain insights and support decision-making processes.

Inferential Statistics
The main goal of inferential statistics is to draw a conclusion about a population based on a sample of data from that
population. One of the most commonly used inferential techniques is hypothesis testing.

Ex: New drug tests

Data come in many flavors

| Type of data | Definition | Example |
| --- | --- | --- |
| Nominal | Categories | Your previous degree |
| Ordinal | Can be rank-ordered but not measured | Business school rankings |
| Interval scale | Intervals are meaningful, not ratios | Temperature in Fahrenheit or Celsius |
| Ratio scale | Ratios are meaningful | Sales of a new product |

| Source of data | Definition | Example |
| --- | --- | --- |
| Observational | Analyst does not control the data-generating process | Stock returns on BSE |
| Experimental | Analyst has good control over data generation | Drug efficacy in clinical trials |

Data Types Preliminaries

Nominal
Merely labels; name-related information.
Example: "Dell" and "HP".

Ordinal
Conveys only preference information (direction alone).
Example: "I prefer Dell to HP".

Interval
Conveys relative magnitude information, in addition to preference.
Example: "I rate Dell an 8 and HP a 4 on a scale of 10".

Ratio
Conveys information on an absolute scale.
Example: "I paid Rs 1 lakh for HP and Rs 1.25 lakh for Dell".

Statistics that can be used with each data type:

| Nominal | Ordinal | Interval | Ratio |
| --- | --- | --- | --- |
| Mode | Mode | Mode | Mode |
| Frequency | Median | Median | Median |
| Percentage | Frequency | Mean | Mean |
| | Percentage | Percentage | Percentage |
| | Few statistical analyses | Frequency | Frequency |
| | | Variance | Variance |
| | | SD | SD |
| | | Almost all statistical analyses | Ratio of numbers |
| | | | All statistical analyses |

Data types: Continuous & Discrete
Group the following as either discrete or continuous data:

Volume of a cereal box
Speed of a car
Population of a town
Length of a crocodile
Shirts
Number of matches in a box
Number of goals in a season
Temperature in oven

Discrete? Continuous?
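Measures of Central Tendency

| Central tendency | Population | Sample |
| --- | --- | --- |
| Mean/Average | $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Median | Middle value of data | |
| Mode | Most occurring value in the data | |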

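Sample Mean Formula
The sample mean formula is written as

Sample Mean = (Sum of terms) / (Number of terms)

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Where $x_i$ are the observed values and $n$ is the number of values.

Example: Weight = 20, 21, 23, 22, 21, 20, 23

Average = (Sum of values) / (No. of values)

= (20+21+23+22+21+20+23)/7

= 150/7

≈ 21.43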
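As a quick check of the arithmetic in the weight example, here is a minimal Python sketch. The variable names are illustrative and not part of the original slides.

```python
# Weights from the worked example on this slide.
weights = [20, 21, 23, 22, 21, 20, 23]

# Sample mean = (sum of values) / (number of values)
sample_mean = sum(weights) / len(weights)

print(sum(weights))            # 150
print(round(sample_mean, 2))   # 21.43
```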

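Sample Median

The median is the middle value. To calculate the median, sort the given list in ascending or descending order and take
the middle value; if the number of values is even, average the two middle values.

Data: 4 5 2 6 9 3 10 11 3 8
Sorted: 2 3 3 4 5 6 8 9 10 11
Median = (5 + 6)/2 = 5.5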
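The sort-then-pick-the-middle procedure can be sketched in a few lines of Python. The function name `median` is a hypothetical helper written for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [4, 5, 2, 6, 9, 3, 10, 11, 3, 8]
print(sorted(data))   # [2, 3, 3, 4, 5, 6, 8, 9, 10, 11]
print(median(data))   # 5.5
```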

Mode
The mode is the value that occurs more often than any other.
A unimodal data set is a set of data with only one mode.
The mode of data set A = {14, 15, 16, 17, 15, 18, 15, 19}, for example, is 15 because 15 occurs three times and every
other value occurs once. As a result, it's a unimodal data set.
Bimodal Mode
A bimodal data set has two modes: two data values share the highest frequency.
Set A = {2,2,2,3,4,4,5,5,5} has modes 2 and 5, because both 2 and 5 occur three times in the set.
Trimodal Mode
A trimodal data set has three modes: three data values share the highest frequency.
Set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} has modes 2, 5, and 8, since all three values occur three times in the
set. As a result, it's a trimodal data set.
Multimodal Mode
A multimodal data set has four or more modes.
Because all four values in the set occur twice, the modes of data set A = {100, 80, 80, 95, 95, 100, 90, 90} are
80, 90, 95 and 100. As a result, it's a multimodal data set.
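Counting frequencies makes it easy to check how many modes a data set has. The sketch below uses a hypothetical helper `modes`, not taken from the slides, that returns every value sharing the highest frequency. The last call uses a set in which each of four values occurs exactly twice.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([14, 15, 16, 17, 15, 18, 15, 19]))            # [15]             unimodal
print(modes([2, 2, 2, 3, 4, 4, 5, 5, 5]))                 # [2, 5]           bimodal
print(modes([2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8]))     # [2, 5, 8]        trimodal
print(modes([100, 80, 80, 95, 95, 100, 90, 90]))          # [80, 90, 95, 100] multimodal
```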

Sampling Funnel

Population

Sampling Frame

SRS (Simple Random Sampling)

Sample

Sampling Frame
A sampling frame is a list or representation of all the individuals or items in a population from which a sample is drawn.
It serves as a reference or basis for selecting the sample. The sampling frame should ideally include all members of the
population, and each member should have a known and non-zero chance of being included in the sample.

A sample is a subset of individuals from a larger population. Sampling means selecting the group that you will
actually collect data from in your research. For example, if you are researching the opinions of students in your
university, you could survey a sample of 100 students.
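To make the idea concrete, here is a minimal sketch of drawing a simple random sample of 100 students from a sampling frame. The frame of 5,000 student ID numbers and the seed are assumptions made up for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers of 5,000 enrolled students.
sampling_frame = list(range(1, 5001))

rng = random.Random(42)                      # fixed seed so reruns draw the same sample
sample = rng.sample(sampling_frame, k=100)   # every student has the same chance; no repeats

print(len(sample), len(set(sample)))         # 100 100
```

Because `random.sample` draws without replacement, no student can appear twice in the sample.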

Standard deviation
Standard deviation measures the degree of dispersion. In descriptive statistics, it is the scatter of the data points
relative to their mean.
It represents the way in which the values are spread across the data sample.
It is the measure of the variation of the data points from the mean.
The standard deviation of a sample is the square root of its variance.
Example: the numbers of gold coins owned by 5 pirates are 4, 2, 5, 8, 6.

The mean is $\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_n}{n}$; here $\bar{x} = (4 + 2 + 5 + 8 + 6)/5 = 5$.

Variance
Variance can be defined as:

"A measure of how far a data collection is spread out."

If the data values are all identical, the variance is zero. All non-zero variances are positive.

A small variance indicates that the data points are close to the mean, as well as to one another.

If the data points are widely spread from the mean and from each other, the variance is high.

Variance can be expressed as the average of the squared distances from every point to the mean.

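Variance Formula
The Population Variance Formula can be represented by:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

The Sample Variance Formula can be represented by:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$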

Measures of Variability
The mean, mode, and median do a nice job in telling where the center of the data set is, but often we are interested in
more...

For example, a pharmaceutical engineer develops a new drug that regulates sugar in the blood.

Suppose she finds out that the average sugar content after taking the medication is the optimal level.

This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low sugar
content while the other half has dangerously high content. Instead of the drug being an effective regulator, it is a deadly
poison.

What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard
deviation provide.

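Measures of Dispersion

| Dispersion | Population | Sample |
| --- | --- | --- |
| Variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| Standard Deviation | $\sigma = \sqrt{\sigma^2}$ | $s = \sqrt{s^2}$ |
| Range | Max - Min | |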

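We define variance to be

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

and standard deviation to be

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$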

Computing step by step
Variance and Standard Deviation: Step by Step

1. Calculate the mean, $\bar{x}$.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add this column.
5. Divide by n - 1, where n is the number of items in the sample. This is the variance.
6. To get the standard deviation, take the square root of the variance.
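The six steps can be followed literally in code. The sketch below uses the gold-coin counts from the standard deviation slide (4, 2, 5, 8, 6) as sample data; the variable names are illustrative.

```python
import math

coins = [4, 2, 5, 8, 6]                      # gold coins owned by the 5 pirates
n = len(coins)

mean = sum(coins) / n                        # step 1: mean = 5.0
deviations = [x - mean for x in coins]       # step 2: [-1.0, -3.0, 0.0, 3.0, 1.0]
squared = [d ** 2 for d in deviations]       # step 3: [1.0, 9.0, 0.0, 9.0, 1.0]
total = sum(squared)                         # step 4: 20.0
sample_variance = total / (n - 1)            # step 5: 20 / 4 = 5.0
sample_sd = math.sqrt(sample_variance)       # step 6: about 2.236

print(sample_variance, round(sample_sd, 3))  # 5.0 2.236
```

Dividing by n instead of n - 1 in step 5 would give the population variance (4.0 for this data).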

Measures of Variability
Deviation is the distance from the mean.
Deviation score = observation - true mean.
Variance = mean or average of squared deviation scores.
Standard Deviation = Square root of variance.

In class exercise
Find Std and Var
44,50,38,96,42,47,40,39,46,50

Range
The "Range" for a data set is the difference between the largest value and the smallest value contained in the data set.
First reorder the data set from smallest to largest, then subtract the first element from the last element.

Data Set = 2, 5, 9, 3, 5, 4, 7

Reordered = 2, 3, 4, 5, 5, 7, 9

Range=(9-2)=7

Basic
Statistics
Chapter -2 Visualization and Stats
Visualization Techniques
GMAT Scores of an MBA Class

610 730 590 610 - - - 630 630


640 630 540 660 - - - 610 540
690 610 520 640 - - - 720 630
610 650 660 630 - - - 600 730
710 600 760 690 - - - 500 720
610 650 660 710 - - - 480 600
630 610 680 730 - - - 700 690
530 550 730 690 - - - 670 540
630 720 610 710 - - - 600 600
690 600 730 540 - - - 560 770

Pictorial summary of data:
A bar chart
[Figure: bar chart; vertical axis from 0 to 25, horizontal axis from 100 to 500]

Graphical Techniques - Box plot
Box Plot: This graph shows the distribution of data by dividing the data into four groups with the same number of data
points in each group. The box contains the middle 50% of the data points and each of the two whiskers contains 25% of
the data points. It displays two common measures of the variability or spread in a data set:

Range: It is represented on a box plot by the distance between the smallest value and the largest value, including any
outliers. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers.

Inter-quartile Range (IQR): The middle half of a data set falls within the inter-quartile range.

Minimum: The minimum value in the given dataset.
First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given dataset into two equal parts. The median
is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and the first quartile is known as the
interquartile range, i.e. IQR = Q3 - Q1.
Outlier: Data points that fall on the far left or far right side of the ordered data are tested as possible outliers.
Generally, outliers fall more than a specified distance from the first and third quartiles,
i.e. outliers are greater than Q3 + (1.5 × IQR) or less than Q1 - (1.5 × IQR).
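The box plot terms can be computed directly. The sketch below is one possible implementation, assuming the "median of the lower/upper half" definition of the quartiles given on this slide (other quartile conventions exist and give slightly different values). The helper name `box_plot_summary` is hypothetical, and the data are the ten values from the earlier in-class exercise.

```python
from statistics import median

def box_plot_summary(values):
    """Five-number summary, IQR and 1.5 * IQR outliers (median-of-halves quartiles)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                    # lower half of the data
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # upper half (median excluded if n is odd)
    q1, q2, q3 = median(lower), median(ordered), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {"min": ordered[0], "q1": q1, "median": q2, "q3": q3,
            "max": ordered[-1], "iqr": iqr, "outliers": outliers}

summary = box_plot_summary([44, 50, 38, 96, 42, 47, 40, 39, 46, 50])
print(summary)
# Q1 = 40, median = 45, Q3 = 50, IQR = 10; the fences are 25 and 65, so 96 is flagged as an outlier.
```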

Graphical techniques - Histogram
A histogram represents the frequency distribution, i.e., how many observations take the value within a certain interval.

[Figure: histogram; vertical axis (frequency) from 0 to 25, horizontal axis from 600 to 750]
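A histogram is just a count of observations per interval. As a rough sketch, the code below bins the first twelve GMAT scores from the table earlier in this chapter into intervals of width 50; the bin width and the text rendering are assumptions chosen for illustration.

```python
from collections import Counter

# First three rows (first four columns) of the GMAT score table.
scores = [610, 730, 590, 610, 640, 630, 540, 660, 690, 610, 520, 640]

bin_width = 50
# Map each score to the left edge of its interval, e.g. 610 -> 600, then count per interval.
counts = Counter((s // bin_width) * bin_width for s in scores)

for left in sorted(counts):
    print(f"{left}-{left + bin_width - 1}: {'#' * counts[left]}")
# 500-549: ##
# 550-599: #
# 600-649: ######
# 650-699: ##
# 700-749: #
```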
Skewness & Kurtosis
Third and Fourth moments

| Skewness | Kurtosis |
| --- | --- |
| A measure of asymmetry in the distribution | A measure of the "peakedness" of the distribution |
| Mathematically it is given by $E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$ | Mathematically it is given by $E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] - 3$ |
| Negative skewness implies the mass of the distribution is concentrated on the right | For symmetric distributions, negative kurtosis implies a wider peak and thinner tails |
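Both formulas are averages of powers of standardized values, so they can be sketched with the standard library alone. The helper names below are hypothetical, and the code uses the population standard deviation (dividing by n), which is one of several conventions.

```python
from statistics import fmean, pstdev

def standardized_moment(values, k):
    """Average of ((x - mean) / sd) ** k, using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean(((x - mu) / sigma) ** k for x in values)

def skewness(values):
    return standardized_moment(values, 3)

def excess_kurtosis(values):
    return standardized_moment(values, 4) - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]      # one large value pulls the tail to the right

print(skewness(symmetric))           # 0 for symmetric data
print(excess_kurtosis(symmetric))    # about -1.3 (flatter than a normal curve)
print(skewness(right_tailed) > 0)    # True: positive skew, mass concentrated on the left
```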

Basic
Statistics
Chapter -3 Normal distribution
Random Variable
A random variable is a rule that assigns a numerical value to each outcome in a sample space.

Random variables may be either discrete or continuous

A discrete random variable can take only a countable number of distinct values, such as 0, 1, 2, 3, 4, and so on.

Ex: the number of defective light bulbs in a box, the number of children in a family.

A numerically valued variable is said to be continuous if, in any unit of measurement, whenever it can take on the
values a and b, it can also take on every value between a and b.

Ex: with a = 1 and b = 5, it can take values such as 1.2, 3.4 and 4.5.

Ex: the time it takes to complete a race, or the length of time between arrivals at a hospital clinic.

Z-Scores
A z-score makes use of the mean and the standard deviation of the data set in order to specify the relative location
of a measurement: $z = \frac{x - \mu}{\sigma}$.

It represents the distance between a given data point and the mean, expressed in standard deviations. Computing the
z-score is also known as "standardizing" the data point.

A large positive z-score tells us that the measurement is larger than almost all other measurements in the data set.
Similarly, a large negative z-score tells us that the measurement is smaller than almost all other measurements.
If a z-score is 0, then the observation lies on the mean.
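A z-score is the distance from the mean divided by the standard deviation. The minimal sketch below uses a hypothetical helper `z_score`, with the mean of 711 and standard deviation of 29 borrowed from the GMAT exercise later in this chapter.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

print(round(z_score(680, mean=711, sd=29), 2))   # -1.07: about one standard deviation below the mean
print(z_score(711, mean=711, sd=29))             # 0.0: the observation lies on the mean
print(z_score(740, mean=711, sd=29))             # 1.0: one standard deviation above the mean
```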

Probability Distributions
Probability is simply how likely something is to happen. Whenever we're unsure about the outcome of an event,
we can talk about the probabilities of certain outcomes—how likely they are. The analysis of events governed by
probability is called statistics.

What is a Probability Distribution?

A probability distribution is a statistical function that describes the likelihood of obtaining all possible values that
a random variable can take. In other words, the values of the variable vary based on the underlying probability
distribution. Typically, analysts display probability distributions in graphs and tables. There are equations to
calculate probability distributions.

Suppose you draw a random sample and measure the heights of the subjects. As you measure heights, you
create a distribution of heights. This type of distribution is useful when you need to know which outcomes are
most likely, the spread of potential values, and the likelihood of different results.

Discrete vs. Continuous Probability Distributions
A discrete probability distribution is made up of discrete variables. A discrete distribution has a range of values
that are countable. For example, the numbers on birthday cards have a possible range from 0 to 122 (122 is the age of
Jeanne Calment, the oldest person who ever lived).

Probability properties:
The probabilities in a distribution sum to 1.
Each value has a probability between 0 and 1.

Discrete Probability Distribution: Binomial Distribution
A binomial experiment is a statistical experiment that has the following properties:
The experiment consists of n repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a
failure (Yes or No).
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.

Example:
Consider the following statistical experiment.
You flip a coin 2 times and count the number of times the coin lands on heads.
This is a binomial experiment because:

The experiment consists of n repeated trials: we flip a coin 2 times.
Each trial can result in just two possible outcomes: heads or tails.
The probability of success, denoted by P, is the same on every trial: it is constant at 0.5.
The trials are independent; that is, getting heads on one trial does not affect whether we get heads on other trials.
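For a binomial experiment with n trials and success probability p, the probability of exactly k successes is C(n, k) · p^k · (1 - p)^(n - k). The slides do not show this formula, so the sketch below is offered as a standard result rather than the author's own method; the helper name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 2, 0.5                                  # two coin flips, P(heads) = 0.5
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

print(pmf)                 # {0: 0.25, 1: 0.5, 2: 0.25}
print(sum(pmf.values()))   # 1.0, as required of any probability distribution
```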

In class exercise: Probability calculation for a discrete probability distribution
The daily sales of large flat panel TVs at a store (X)

What is the probability of a sale?

What is the probability of selling at least three TVs?

Continuous Probability Distributions
What is a Continuous Probability Distribution?
Probability distributions are either continuous probability distributions or discrete probability distributions. A
continuous distribution has a range of values that are infinite, and therefore uncountable. For example, time is
continuous: between any two instants, such as 1 second and 2 seconds, there are infinitely many possible values.
A continuous probability distribution is made up of continuous variables.

What are the Features of the Probability Density Function?

The features of the probability density function are given below:
The probability density function is never negative.
The total area under the probability density function curve is always equal to 1.
How?
This is because the value of a random variable has to lie somewhere in the sample space.
In other words, the probability that the value of a random variable is equal to 'something' is 1.
Normal Distribution
It is one of the most famous statistical distributions in use. The normal distribution is a continuous probability
distribution. Several phenomena are modelled with the normal distribution. For example, heights of people are
approximately normally distributed, as are blood pressure levels.

The normal distribution is the proper term for a probability bell curve. In the standard normal distribution the mean
is zero and the standard deviation is 1. Every normal distribution has zero skew and a kurtosis of 3 (an excess
kurtosis of 0).
The normal distribution, also known as the Gaussian distribution or bell curve, is a probability distribution that is
widely used in statistics and probability theory. It is characterized by its symmetric bell-shaped curve, with the
mean, median, and mode all being equal and located at the center of the distribution.

Characteristics of the normal distribution:


Shape: The normal distribution has a symmetric shape, with the majority of the data clustered around the mean.
The shape of the curve is determined by the mean and standard deviation of the distribution.
Mean and standard deviation: The mean (μ) of the normal distribution represents the center or average value of
the data. The standard deviation (σ) measures the spread or variability of the data. The shape of the bell curve is
affected by the mean and standard deviation.
Empirical Rule: The normal distribution follows the empirical rule, also known as the 68-95-99.7 rule.
According to this rule, approximately 68% of the data falls within one standard deviation of the mean, about 95%
falls within two standard deviations, and around 99.7% falls within three standard deviations.
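The 68-95-99.7 figures can be checked numerically from the cumulative distribution function of the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```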

Data can be "distributed" (spread out) in different ways. But there are many cases where the data tends to be
around a central value with no bias left or right, and it gets close to a "Normal Distribution".

Probability Calculation for Continuous Distributions
The probability associated with any single value of the random variable is always zero.
Probability of values being in a range = Area under the pdf curve in that range.
Area under the entire curve always equals 1.

Probability calculation for Normal Distribution
Consider a normally distributed random variable X ~ N(mu, sigma^2).
To compute the probability P(X <= x):

from scipy import stats

# P(X <= x) for X ~ N(mean, std**2)
stats.norm.cdf(x, loc=mean, scale=std)

1. x: the value at which the cumulative probability P(X <= x) is evaluated.
2. loc: the location parameter of the distribution. It is the mean for the normal distribution.
3. scale: the scale parameter of the distribution. It is the standard deviation for the normal distribution.

Calculating Probabilities: Normal distribution
Ex: A normally distributed random variable X has a mean of 60 and a standard deviation of 10. Find the probability
that X is less than 70.

from scipy import stats

# P(X < 70) for X ~ N(60, 10**2); z = (70 - 60)/10 = 1, so the result is about 0.8413
stats.norm.cdf(70, loc=60, scale=10)

In class exercise
Suppose GMAT scores can be reasonably modeled using a normal distribution with μ = 711 and σ = 29.
What is P(X ≤ 680)?

Step 1: Calculate the Z-score corresponding to 680

Z = (680 - 711)/29 ≈ -1.07

Step 2: Calculate the probability using Z-tables

P(Z ≤ -1.07) ≈ 0.14

Exercise
What is P(697 ≤ X ≤ 740)?

Step 1: Use P(x1 ≤ X ≤ x2) = P(X ≤ x2) - P(X ≤ x1)

Step 2: Calculate P(X ≤ x2) and P(X ≤ x1) as before

P(X ≤ 740) = P(Z ≤ 1) = 0.84
P(X ≤ 697) ≈ P(Z ≤ -0.5) = 0.31

Step 3: Calculate P(697 ≤ X ≤ 740) = 0.84 - 0.31 = 0.53
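The two GMAT exercises can be checked without Z-tables. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the `scipy.stats.norm.cdf` call shown earlier; the variable names are illustrative. Small differences from the table-based answers come from rounding the Z-scores.

```python
from statistics import NormalDist

gmat = NormalDist(mu=711, sigma=29)

# P(X <= 680)
p_below_680 = gmat.cdf(680)

# P(697 <= X <= 740) = P(X <= 740) - P(X <= 697)
p_between = gmat.cdf(740) - gmat.cdf(697)

print(round(p_below_680, 2))   # 0.14
print(round(p_between, 2))     # 0.53
```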

Stock Price
To understand the normal distribution and its application, let us use daily returns of stocks traded on the BSE (Bombay
Stock Exchange). Imagine a scenario where an investor wants to understand the risks and returns associated with
various stocks before investing in them.

For this analysis, we will evaluate two stocks: BEML and GLAXO. The daily trading data (open and close price) for each
stock is taken for the period from 2010 to 2016 from the BSE site (www.bseindia.com).

What questions can be answered?
1. What is the expected daily rate of return of these stocks?
2. Which stock has higher risk or volatility as far as daily returns are concerned?
3. Which stock has a higher probability of making a daily return of 2% or more?
4. Which stock has a higher probability of making a loss (risk) of 2% or more?

To answer the above questions, we must find out the behavior of daily returns (we will refer to this as gain
henceforward) on these stocks. The gain can be calculated as the fractional change in close price from the previous
day's close price.

The method pct_change() in Pandas gives the fractional change in a column value relative to the value a given number
of periods earlier; that number is passed as the parameter periods.
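To show what the gain calculation does, here is a plain-Python sketch of the same arithmetic, gain = (today's close - previous close) / previous close. The closing prices are made-up numbers for illustration, not BSE data. In pandas, `pct_change(periods=1)` on the close column yields the same values, with a missing value in the first row because the first day has no previous close.

```python
# Hypothetical closing prices on four consecutive trading days.
closes = [100.0, 102.0, 99.96, 104.958]

# Pair each day with the previous day and compute the fractional change.
gains = [(today - previous) / previous
         for previous, today in zip(closes, closes[1:])]

print([round(g, 4) for g in gains])   # [0.02, -0.02, 0.05], i.e. +2%, -2%, +5%
```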

Mean and Variance
Glaxo: Mean: 0.0004, Standard Deviation: 0.0134

BEML: Mean: 0.0003, Standard Deviation: 0.0264

Gain seems to be normally distributed for both stocks, with a mean around 0.00. BEML seems to have a higher variance
than Glaxo.

The expected daily rate of return (gain) is around 0% for both stocks.

Here the variance or standard deviation of gain indicates risk.

So, BEML stock has a higher risk, as the standard deviation of BEML is 2.64% whereas the standard deviation for
Glaxo is 1.34%.

To calculate the probability of a gain of 2% or more, we need to find the total probability that gain takes values of
0.02 (i.e., 2%) or more. Similarly, the probability of a loss of 2% or more is the total probability that gain takes
values of -0.02 or less.

Probability of making 2% loss or higher in Glaxo: 0.063 or 6.3%

Probability of making 2% loss or higher in BEML: 0.2215 or 22.15%

Probability of making 2% gain or higher in Glaxo: 0.0710 or 7.1%

Probability of making 2% gain or higher in BEML: 0.2276 or 22.76%
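These tail probabilities can be approximated from the means and standard deviations quoted on the "Mean and Variance" slide. The sketch below assumes gain is normally distributed and uses `statistics.NormalDist` from the standard library. Because the quoted parameters are rounded to four decimals, the results differ slightly from the figures above.

```python
from statistics import NormalDist

# Mean and standard deviation of daily gain, as quoted (rounded) on the slide.
stocks = {
    "Glaxo": NormalDist(mu=0.0004, sigma=0.0134),
    "BEML": NormalDist(mu=0.0003, sigma=0.0264),
}

results = {}
for name, gain in stocks.items():
    p_loss = gain.cdf(-0.02)        # P(gain <= -2%)
    p_gain = 1 - gain.cdf(0.02)     # P(gain >= +2%)
    results[name] = (p_loss, p_gain)
    print(f"{name}: P(loss of 2% or more) = {p_loss:.3f}, P(gain of 2% or more) = {p_gain:.3f}")
```

Under these assumptions BEML has roughly three times Glaxo's chance of a 2% move in either direction, which matches the conclusion that BEML is the riskier stock.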

Normal Distribution codes

Student t-Distribution codes
Basic
Statistics
Chapter -4 CLT & CI
Inferential Statistics
GMAT Scores of an MBA Class

Sampling: How can it go wrong?
Collecting data only from volunteers (voluntary response sample)
- e.g. online reviews (yelp.com, maps.google.com, tripadvisor.com)
Picking easily available respondents (convenience sample)
- e.g. choosing to survey in In-Orbit mall
A high rate of non-response (more than 70%)
- e.g. CEO / CIO surveys on some industry trends

Sampling variation
The sample mean varies from one sample to another.
The sample mean can be (and most likely is) different from the population mean.
The sample mean is a random variable.

Central Limit Theorem
The distribution of the sample mean
- will be normal when the distribution of data in the population is normal
- will be approximately normal even if the distribution of data in the population is not normal, if the sample size is
"fairly large"

Mean($\bar{X}$) = µ (the same as the population mean of the raw data)

Standard deviation($\bar{X}$) = $\frac{\sigma}{\sqrt{n}}$, where σ is the population standard deviation and n is the sample size

- This is referred to as the Standard Error of the Mean
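A small simulation can illustrate both statements: the mean of the sample means stays close to µ, and their standard deviation is close to σ/√n. The sketch below is a sketch under stated assumptions: the population is a right-skewed exponential distribution with µ = 1 and σ = 1, the sample size is n = 64, and the number of repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)    # fixed seed so the simulation is repeatable
n = 64                    # size of each sample
num_samples = 2000        # number of samples drawn

# Each entry is the mean of one sample of n draws from an exponential population (mu = 1, sigma = 1).
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

standard_error = 1.0 / math.sqrt(n)             # sigma / sqrt(n) = 0.125
print(statistics.fmean(sample_means))           # close to the population mean, 1.0
print(statistics.stdev(sample_means))           # close to the standard error, 0.125
```

Even though individual exponential draws are strongly skewed, a histogram of `sample_means` would look approximately bell-shaped.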

The distribution of the data itself (e.g. distribution of the individual work experience), which may or may not be normal

The distribution of sample means (e.g. the average work experience of 64 students), which is approximately normal due to CLT as long as the sample is "large enough"

Thumb rule to check whether CLT can be applied:
n ≥ 10 × (Skewness)²
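
A short sketch of how the thumb rule could be checked in Python. The data are hypothetical (a simulated right-skewed variable), and scipy.stats.skew is used as the skewness measure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed data (e.g. years of work experience); not from the slides
data = rng.exponential(scale=3.0, size=500)

skewness = stats.skew(data)
min_n = int(np.ceil(10 * skewness ** 2))  # smallest n satisfying n >= 10 * skewness^2

n = 64
print("skewness:", round(skewness, 2))
print("minimum n by the thumb rule:", min_n)
print("n = 64 is large enough:", n >= min_n)
```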

Confidence Intervals
Learning objectives

How to provide an interval estimate (confidence interval) for population parameters such as mean?
How to adjust the interval estimate if the population standard deviation is not known?
How to calculate confidence interval for population proportion?
What should be the sample size to collect for a desired width of the interval estimate?
What is the probability of tomorrow's temperature being exactly 42 degrees?

Probability is '0'

Can it be between -50°C and 100°C?

Confidence Intervals
Sometimes samples don't give quite the right result

Point estimators are valuable, but they may give errors

Because we're not dealing with the entire population, all we're doing is giving a best estimate. If the sample we use is
unbiased, then the estimate is likely to be close to the true value of the population

Specify a range instead of a single point estimate

Credit card
Example: Credit Card Launch

A university with 100,000 alumni is thinking of offering a new affinity credit card to its alumni.
Profitability of the card depends on the average balance maintained by the cardholders.
A market research campaign is launched, in which about 140 alumni accept the card in a pilot launch.
The average balance maintained by these alumni is $1990 and the standard deviation is $2833. Assume that the population standard deviation is $2500 from previous launches.
What can we say about the average balance that will be held after a full-fledged market launch?

Interval estimates of parameters
Based on sample data
- The point estimate for mean balance = $1990
- Can we trust this estimate?
- What do you think will happen if we took another random sample of 140 alumni?
Because of this uncertainty, we prefer to provide the estimate as an interval (range) and associate a level of confidence with it

Interval estimate = Point Estimate ± Margin of Error

Thumb rule to check whether CLT can be applied:
n ≥ 10 × (Skewness)²

Confidence Interval for the
Population Mean
Start by choosing a confidence level 100(1-α)% (e.g. 95%, 99%, 90%)
Then, the population mean will be within

Interval Estimate = Point Estimate ± Margin of error

Margin of error depends on the underlying uncertainty, confidence level and the sample size

Credit card: Mean Balance
Based on the survey and past data
n = 140; σ = $2500; x̅ = $1990

Construct a 95% confidence interval for the mean card balance and interpret it
Construct a 90% confidence interval for the mean card balance and interpret it

from scipy import stats
stats.norm.ppf(.975)

Confidence Interval    Z
80%                    1.282
85%                    1.440
90%                    1.645
95%                    1.960
99%                    2.576
99.5%                  2.807
99.9%                  3.291
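
The two intervals asked for above can be sketched as follows, using n = 140, σ = $2500 and x̅ = $1990 from the slide. The helper name z_interval is only illustrative.

```python
import numpy as np
from scipy import stats

n = 140          # pilot sample size
sigma = 2500.0   # population standard deviation from previous launches
x_bar = 1990.0   # sample mean balance

def z_interval(confidence):
    """Return (lower, upper) for a z-based confidence interval of the mean."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    margin = z * sigma / np.sqrt(n)                # margin of error
    return x_bar - margin, x_bar + margin

ci_95 = z_interval(0.95)
ci_90 = z_interval(0.90)
print("95% CI:", [round(v) for v in ci_95])
print("90% CI:", [round(v) for v in ci_90])
```

The 90% interval is narrower than the 95% one: a lower confidence level gives a smaller Z and therefore a smaller margin of error.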

(Mis) Interpreting the
Confidence Interval
Consider the 95% Confidence Interval for the mean balance: [$1576, $2404]
Is it accurate to say that the mean balance of the population falls within this range?
Can we conclude that the mean balance is within this range 95% of the time?
Does it imply that 95% of the alumni have a balance within this range?

What if We Don't Know σ?
Suppose that the alumni of this university are very different and hence population standard deviation from
previous launches cannot be used.

Population SD (σ) Unknown
We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample:

Calculate T = (x̅ - µ) / (s / √n)

If the underlying population is normally distributed, T is a random variable distributed according to a t-distribution with n-1 degrees of freedom (T_{n-1})

Research has shown that the t-distribution is fairly robust to deviations of the population from the normal model

Student's t-distribution
As n → ∞, t_n → N(0,1)
i.e. as the degrees of freedom increase, the t-distribution approaches the standard normal distribution

Confidence Interval for Mean
with Unknown σ

Back to Credit Card Balance
1-α = 0.95; n = 140
from scipy import stats
stats.t.ppf(0.975,df=139)
Calculate t_{0.95,139} = 1.98

Our estimate of σ is s = $2833

Then the 95% confidence interval for balance is [$1516, $2464]
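
A sketch of the full calculation, assuming the interval is x̅ ± t × s / √n with the sample standard deviation of $2833 from the credit card example.

```python
import numpy as np
from scipy import stats

n = 140         # pilot sample size
x_bar = 1990.0  # sample mean balance
s = 2833.0      # sample standard deviation, used because sigma is treated as unknown

t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
margin = t_crit * s / np.sqrt(n)        # margin of error
lower, upper = x_bar - margin, x_bar + margin

print("t critical value:", round(t_crit, 3))
print("95% CI:", (round(lower, 1), round(upper, 1)))
```

The unrounded limits can differ from the slide's figures by about a dollar, because the slide rounds t to 1.98 before multiplying.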

Basic
Statistics
Chapter -5 Hypothesis
What is hypothesis testing?
Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data.
A hypothesis is basically an assumption that we make about the population parameter.
Ex: You say the average height of citizens of a city is more than 5.8 ft.
Votes for the politician will be more than 60%.
All such assumptions need a statistical way to prove them: we need a mathematical conclusion that whatever we are assuming is true.

Example
It is claimed that a process has been improved in yield by bringing a change in an important factor X. Yield data are collected from old and new processes.
✓ Random samples are drawn from yield data from old process A and improved process B.

Process A    Process B
89.7
81.4         86.1
84.5         83.2
84.8         91.9
87.3         86.3
79.7         79.3
85.1         82.6
81.7         89.1
83.7         83.7
84.5         88.5

Example: Hypothesis Testing
Real Question:
Can we say that the yield of improved Process B is greater than old Process A?

Descriptive Statistics:

Variable    Process    N     Mean     Std. Dev
Yield       A          10    84.24    2.90
Yield       B          10    85.54    3.65

Statistical Question:
Is there a statistically significant difference between mean of Process B (85.54) and mean of Process A (84.24)? Or, is
this difference in mean just due to chance?
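
One way to put a number on the statistical question is a two-sample t-test computed directly from the summary statistics in the table. This is a sketch that assumes equal variances in the two processes (a pooled test); that assumption is not stated on the slide.

```python
from scipy import stats

# Summary statistics from the descriptive-statistics table
mean_a, sd_a, n_a = 84.24, 2.90, 10
mean_b, sd_b, n_b = 85.54, 3.65, 10

# Pooled two-sample t-test computed from the summaries (B minus A)
t_stat, p_two_sided = stats.ttest_ind_from_stats(
    mean_b, sd_b, n_b, mean_a, sd_a, n_a, equal_var=True
)

# One-sided p-value for "B is greater than A"; valid because t_stat is positive here
p_one_sided = p_two_sided / 2

print("t statistic:", round(t_stat, 3))
print("one-sided p-value:", round(p_one_sided, 3))
```

The p-value comes out well above 0.05, so with these samples the higher mean of Process B could plausibly be due to chance.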

Hypothesis Testing
Develop the hypothesis for the population and make a statistical decision by determining the acceptance of the hypothesis using sample data.

Example: Medicine B for treating headache that is newly developed by a pharmaceutical company has 30 minutes longer effect than existing Medicine A.

Null Hypothesis (H0): Argument made so far, or hypothesis saying that there is no change or difference
H0: Medicine A and B have same effect

Alternative Hypothesis (H1): New argument, that is a hypothesis that you want to prove with solid ground obtained from sample
H1: Medicine B has 30 minutes longer effect than Medicine A

Procedure of Hypothesis Testing
Define the null hypothesis and the alternative hypothesis.
Determine the appropriate test statistic for evaluating the validity of the null hypothesis, such as a Z-test or t-test.
Choose the significance level (Alpha), typically set to 0.05.
Calculate the p-value, which represents the probability of obtaining the observed test statistic value (or more extreme)
under the assumption that the null hypothesis is true. Utilize the functions provided in the scipy.stats module for
p-value calculation.
Make a decision to either reject or fail to reject the null hypothesis based on comparing the p-value to the significance
level (Alpha).
P-value: The p-value is a probability value that measures the evidence against the null hypothesis.
A small p-value suggests strong evidence against the null hypothesis, while a large p-value suggests weak evidence.
Researchers compare the p-value to the significance level (α) to make a decision about rejecting or retaining the null
hypothesis. If the p-value is smaller than α, the null hypothesis is rejected. If the p-value is greater than or equal to α,
the null hypothesis is retained.

How to frame hypothesis
Hypothesis is a starting position that is open to a test and rejection in light of strong adverse evidence

The initial belief is called the null hypothesis (H0)
- Generally the status quo
- Action taken: Do nothing

Its negation is called the alternative hypothesis (HA, Ha, H1)
- Often a claim to be tested, or a change to be detected
- Action taken: Do something

The two hypotheses are
- Mutually exclusive
- Collectively exhaustive

1-tail, 2-tail
1-sample, 2-sample

Two Tail Hypothesis Testing

Hypothesis Testing Process
Start with Hypotheses about a Population Parameter
The parameter could be a mean, a proportion or something else

Collect Sample Information
Collect information from a randomly chosen sample and calculate the appropriate sample statistic

Reject/Do Not Reject Hypothesis
Is the sample information strongly inconsistent with the null hypothesis? If yes, then reject the null hypothesis.

Example: Hypermarket Loyalty
Program
The hypermarket intends to introduce a loyalty program provided that it leads to an average spending per
shopper exceeding 120 per week.
In a pilot program, a sample of 80 shoppers who enrolled in the loyalty program spent an average of 120 per
week, with a standard deviation of 40.
Based on the average spending of 130 per week from a sample of 90 shoppers enrolled in the loyalty program,
along with a standard deviation of 40, should the hypermarket proceed with launching the loyalty program or
not?

Right-tailed hypothesis test
(sample mean of 130)
Null Hypothesis: The average spending is less than or equal to $120
H0: µ ≤ µ0 (120)

Acceptable level of type-I error is 0.05 (α = 0.05)

If I reject my null hypothesis that µ ≤ 120, I will be wrong with probability 0.01

Launch the loyalty card
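
The 0.01 figure can be reproduced with a right-tailed one-sample t-test on the summary numbers. This sketch assumes the sample of 90 shoppers with mean 130 and standard deviation 40, tested against the threshold of 120 from the null hypothesis.

```python
import numpy as np
from scipy import stats

mu_0 = 120.0    # threshold in the null hypothesis H0: mu <= 120
x_bar = 130.0   # sample mean weekly spending
s = 40.0        # sample standard deviation
n = 90          # shoppers in the sample

t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)   # right tail only, since H1 is mu > 120

alpha = 0.05
print("t statistic:", round(t_stat, 3))
print("p-value:", round(p_value, 4))
print("reject H0:", p_value < alpha)
```

Because the p-value is below α = 0.05, H0 is rejected, which matches the decision to launch.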

Normal Distribution Codes

Student's t-Distribution Codes

Simple Exercise
A prisoner is on trial for a crime, and you're on the jury. The jury's task is to assume the prisoner is innocent, but if there's
enough evidence against him, they need to convict him.

1. In the trial, what's the null hypothesis?


2. What's the alternate hypothesis?
3. In what ways can the jury make a verdict that's correct?
4. In what ways can the jury make a verdict that's incorrect?

Simple Exercise
In the trial, what's the null hypothesis?
The null hypothesis is that the prisoner is innocent, as that is what we have to assume until there's proof otherwise.

What's the alternate hypothesis?


The alternate hypothesis is that the prisoner is guilty. In other words, if there's sufficient proof that the prisoner is not
innocent, then we'll accept that he's guilty and convict him.

Simple Exercise
In what ways can the jury make a verdict that's correct?
We can make a correct verdict if:
a) The prisoner is innocent, and we find him innocent.
b) The prisoner is guilty, and we find him guilty.

In what ways can the jury make a verdict that's incorrect?


We can make an incorrect verdict if:
a) The prisoner is innocent, and we find him guilty.
b) The prisoner is guilty, and we find him innocent.

The errors we can make when conducting a hypothesis test are the same sort of errors we could make when putting a
prisoner on trial
Hypothesis tests are basically tests where you take a claim and put it on trial by assessing the evidence against it. If there's
sufficient evidence against it, you reject it, but if there's insufficient evidence against it, you accept it.
You may correctly accept or reject the null hypothesis, but even considering the evidence, it's also possible to make an
error. You may reject a valid null hypothesis, or you might accept it when it's actually false

Statisticians have special names for these types of errors.

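A Type I error is:
when you wrongly reject a true null hypothesis (punishing an innocent person), and

A Type II error is:
when you wrongly accept a false null hypothesis (letting a guilty person go free).

Errors
You can make two types of errors!

                      Actual situation
Decision              H0 is true          H0 is false
Do not reject H0      Correct decision    Type II error
Reject H0             Type I error        Correct decision

The p-value is the probability, assuming H0 is true, of seeing evidence at least as extreme as the sample; it is the Type-I error risk you take on if you reject H0 on that evidence

α-value can be interpreted as the acceptable probability of making a Type-I error (also called significance level)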
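
The meaning of α can be illustrated with a simulation in which H0 is true by construction, so every rejection is a Type-I error. The mean, standard deviation and sample size below are hypothetical values chosen for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable
alpha = 0.05
mu_0, sigma, n = 4.0, 3.0, 50     # H0 is true by construction: data really have mean 4
trials = 10_000

# Each row is one sample of size n drawn while H0 holds
samples = rng.normal(loc=mu_0, scale=sigma, size=(trials, n))
result = stats.ttest_1samp(samples, popmean=mu_0, axis=1)

# Every rejection here is a Type-I error, because H0 is true
type_1_rate = np.mean(result.pvalue < alpha)
print("share of false rejections:", type_1_rate)
```

The share of false rejections should come out close to α, which is what "acceptable probability of a Type-I error" means in practice.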

Example: Process Control at a
Call Center
Performance of a call center is monitored by the average call duration
Data from 18 months shows that on the days when the process runs normally
- µ = 4 min, σ = 3 min
Cannot monitor each and every call due to limited resources; so randomly sample 50 calls per day

Daily variation in sample mean
We already know that the sample mean every day will be different: inherent variability

But when should you be alarmed and conclude that the system is not behaving normally: external variability

A pragmatic approach is to say:
- I believe that the process is unchanged, i.e., µ = 4
- Bring strong enough evidence to make me change my mind, i.e., x̅ that is very different from µ
- I am looking for deviations on either side of µ as evidence

Let us solve the problem
Mean =4.0
Standard deviation =3
Sample Size =50
Sample mean =4.6

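import numpy as np
from scipy import stats

# t statistic for H0: µ = 4 with sample mean 4.6, SD 3 and n = 50
t_statistic = (4 - 4.6) / (3 / np.sqrt(50))

# two-tailed p-value with n - 1 = 49 degrees of freedom
p_value = 2 * stats.t.cdf(t_statistic, df=49)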

Two-tailed hypothesis test
(sample mean of 4.6)
Null Hypothesis: The mean call duration is 4 minutes
H0: µ = µ0 (4)

Acceptable level of type-I error is 0.05 (α = 0.05)

Probability of seeing a sample mean of ≥ 4.6 or ≤ 3.4 is 0.16
OR
If I reject my null hypothesis that µ = 4, I will be wrong with probability 0.16

I cannot conclude that the process has changed → Do not investigate

Two-tailed hypothesis test
(sample mean of 5.3)
Null Hypothesis: The mean call duration is 4 minutes
H0: µ = µ0 (4)

Acceptable level of type-I error is 0.05 (α = 0.05)

Probability of seeing a sample mean of ≥ 5.3 or ≤ 2.7 is 0.002
OR
If I reject my null hypothesis that µ = 4, I will be wrong with probability 0.002

I can conclude that the process has changed → Investigate
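
Both call center cases can be reproduced in a few lines. This sketch uses a z-test, treating σ = 3 as known from the 18 months of data; that choice is an assumption, made because it matches the reported 0.16 and 0.002. The helper name two_tailed_p is illustrative.

```python
import numpy as np
from scipy import stats

mu_0 = 4.0    # process mean under H0 (minutes)
sigma = 3.0   # standard deviation from 18 months of normal operation
n = 50        # calls sampled per day
alpha = 0.05

def two_tailed_p(sample_mean):
    """Two-tailed p-value of a z-test for H0: mu = mu_0 with known sigma."""
    z = (sample_mean - mu_0) / (sigma / np.sqrt(n))
    return 2 * stats.norm.sf(abs(z))

for sample_mean in (4.6, 5.3):
    p = two_tailed_p(sample_mean)
    action = "investigate" if p < alpha else "do not investigate"
    print(sample_mean, round(p, 3), action)
```

A sample mean of 4.6 is not unusual enough to act on, while 5.3 falls far enough from 4 to trigger an investigation.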

Let us do it in Python
scipy.stats.ttest_1samp(array, mu)

One-sample and one tail t-tests

Ex. An outbreak of Salmonella-related illness was attributed to ice cream produced at a certain factory. Scientists
measured the level of Salmonella in 9 randomly sampled batches of ice cream. The levels (in MPN/g) were

0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418

Is there evidence that the mean level of Salmonella in the ice cream is greater than 0.3 MPN/g?

Let us try in Python

Let µ be the mean level of Salmonella in all batches of ice cream. Here the hypothesis of interest can be expressed as:

H0: µ <= 0.3
Ha: µ > 0.3

data = pd.Series([0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418])
scipy.stats.ttest_1samp(data, 0.3)
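
A self-contained version of the Salmonella test, written as a sketch. ttest_1samp reports a two-sided p-value by default, so the one-sided value for Ha: µ > 0.3 is derived from it here.

```python
import numpy as np
from scipy import stats

levels = np.array([0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418])
mu_0 = 0.3   # threshold in H0: mu <= 0.3

t_stat, p_two_sided = stats.ttest_1samp(levels, mu_0)

# For Ha: mu > 0.3 the one-sided p-value is half the two-sided one
# when the t statistic is positive, and one minus that half otherwise
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print("t statistic:", round(t_stat, 4))
print("one-sided p-value:", round(p_one_sided, 4))
```

With α = 0.05 the one-sided p-value falls below the threshold, so the data give evidence that the mean level exceeds 0.3 MPN/g.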

Two-sample t-tests

Ex. 6 subjects were given a drug (treatment group) and an additional 6 subjects a placebo (control group). Their reaction
time to a stimulus was measured (in ms). We want to perform a two-sample t-test for comparing the means of the
treatment and control groups.

Control: 91, 87, 99, 77, 88, 91
Treat: 101, 110, 103, 93, 99, 104

Let Mu1 be the mean of the population taking medicine and Mu2 the mean of the untreated population. Here the
hypothesis of interest can be expressed as:

H0: Mu1-Mu2 = 0
Ha: Mu1-Mu2 != 0

import pandas as pd
from scipy import stats

Control = pd.Series([91, 87, 99, 77, 88, 91])
Treat = pd.Series([101, 110, 103, 93, 99, 104])
stats.ttest_ind(Control, Treat)

Ttest_indResult(statistic=-3.445612673536487, pvalue=0.006272124350809803)

Proportion t test
Usecase: Is there a significant difference between the population proportions of state 1 and state 2 who report that they
have been placed immediately after education?

Populations: All students who have completed graduation and post graduation in both states

Parameter of Interest: p1 - p2, where p1= state1 and p2 = state2

Data: 247 students from state 1. 36.8% of students report that they have got the job.
308 students from state 2. 38.9% of students report that they have got the job.

Hypothesis Definition:
Null Hypothesis: p1 - p2 = 0
Alternative Hypothesis: p1 -p2 ≠ 0

The difference in population proportions is tested here with a t-test. Also, each population follows a binomial distribution here. We can
just pass on the two population quantities with the appropriate binomial distribution parameters to the t-test function

We can use the ttest_ind() function from Statsmodels.

The function returns three values: (a) test statistic, (b) p-value of the test, and (c) degrees of freedom used in the t-test

Data Given:
n1 = 247
p1 = .37

n2 = 308
p2 = .39
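
As a cross-check that needs no random sampling, the same hypothesis can be tested with a pooled two-proportion z-test computed by hand. Note this is a different technique from the Statsmodels t-test named above, swapped in here because it works directly on n1, p1, n2 and p2.

```python
import numpy as np
from scipy import stats

n1, p1 = 247, 0.37   # state 1: sample size and share placed
n2, p2 = 308, 0.39   # state 2: sample size and share placed

# Pooled proportion under H0: p1 - p2 = 0
p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(abs(z))   # two-tailed

print("z statistic:", round(z, 3))
print("p-value:", round(p_value, 3))
```

The p-value is far above 0.05, so these samples give no evidence of a difference in placement rates between the two states.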

Basic
Statistics
Chapter -6 ANOVA and Chi square
ANOVA
Analysis of Variance (ANOVA) is a statistical technique which is used to check if the means of two or more groups are statistically different from each other.

It checks the impact of one or more factors by comparing the means of different samples.

Terminologies used
Grand Mean: It is the average of sample means

Hypothesis: H0: µ1 = µ2 = µ3
Ha: At least one mean is different

Variation between groups: For each data point look at the difference between its group mean and the overall mean

Variation Within Groups: Difference between data point and its sample mean
Example:
A research institute wants to determine the best diet plan for weight loss, and for this they have randomly selected 36 volunteers to analyze three diet plans (Atkins, GM, South Beach). Help the research institute in identifying the best diet plan by analyzing the weight lost by the volunteers.

Calculations
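Deviation of group mean from global mean: (µ1 - µ)²

SS between groups: for each group, the squared deviation of its local mean from the global mean multiplied by the no. of data points in the group, summed over all groups

SS within group (Residual): total squared deviation of data points from their local mean

Sum of Squares Total: SS within group + SS between group

F-ratio = (SS between groups / DF between) / (SS within groups / DF within)
where DF between = k - 1 and DF within = N - k for k groups and N data points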
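
The calculations above can be sketched in Python as follows. The weight-loss figures are hypothetical (the slide's data table is not reproduced here), and scipy.stats.f_oneway is used only as a cross-check of the hand calculation.

```python
import numpy as np
from scipy import stats

# Hypothetical weight loss (kg) for three diet plans; illustrative numbers only
groups = {
    "Atkins": np.array([3.0, 4.0, 5.0, 4.0]),
    "GM": np.array([5.0, 6.0, 7.0, 6.0]),
    "South Beach": np.array([8.0, 7.0, 9.0, 8.0]),
}
samples = list(groups.values())
all_data = np.concatenate(samples)
grand_mean = all_data.mean()

# Sum of squares between groups: n_i * (group mean - grand mean)^2
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in samples)
# Sum of squares within groups: squared deviations from each group's own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in samples)

df_between = len(samples) - 1
df_within = len(all_data) - len(samples)
f_ratio = (ss_between / df_between) / (ss_within / df_within)

# Critical F value at alpha = 0.05 and the p-value of the observed ratio
f_critical = stats.f.ppf(0.95, df_between, df_within)
p_value = stats.f.sf(f_ratio, df_between, df_within)

# Cross-check against SciPy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*samples)
print(round(f_ratio, 3), round(f_critical, 3), round(p_value, 5))
```

If f_ratio exceeds f_critical, the hypothesis that all group means are equal is rejected.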

Conclusion
If the calculated F-ratio is greater than the critical F value for the given degrees of freedom, we reject H0.

Chi-Square test
Chi-square tests of independence test whether two qualitative variables are independent, that is, whether
there exists a relationship between two categorical variables

Hypotheses: The Chi-square test of independence is a hypothesis test, so it has a null (H0) and an alternative
hypothesis (H1):

H0: the variables are independent, there is no relationship between the two categorical variables. Knowing
the value of one variable does not help to predict the value of the other variable

H1: the variables are dependent, there is a relationship between the two categorical variables. Knowing the
value of one variable helps to predict the value of the other variable

Chi-Square test
The chi-square test is used when the data is categorical and it detects the difference between observed data
and what we could expect if H0 is true.

O: Observed Frequency in the data for each type
E: Expected Frequency to see of each type if the null hypothesis was true.

How the test works
If the difference between the observed frequencies and the expected frequencies is small, we cannot reject the null
hypothesis of independence and thus we cannot reject the fact that the two variables are not related

If the difference between the observed frequencies and the expected frequencies is large, we can reject the null
hypothesis of independence and thus we can conclude that the two variables are related

We want to determine whether there is a statistically significant association between smoking and being a
professional athlete.

Smoking can only be "yes" or "no" and being a professional athlete can only be "yes" or "no". The two variables of
interest are qualitative variables, so we need to use a Chi-square test of independence, and the data have been
collected on 28 persons

Test statistic
For each subgroup, compute (O - E)² / E:

in the subgroup of athlete and non-smoker
in the subgroup of non-athlete and non-smoker
in the subgroup of athlete and smoker
in the subgroup of non-athlete and smoker

and then we sum them all to obtain the test statistic

Critical value
df = (number of rows - 1) × (number of columns - 1)
Chi-square table (α = 0.05 and df = 1): The critical value is 3.84146
test statistic = 15.56 > critical value = 3.84146

Like for any statistical test, when the test statistic is larger than the critical value, we can reject the null hypothesis at
the specified significance level
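
Instead of a printed table, the critical value can be looked up with scipy.stats.chi2. This sketch takes the test statistic of 15.56 as given above and assumes the 2 × 2 layout (smoker yes/no by athlete yes/no).

```python
from scipy import stats

alpha = 0.05
df = (2 - 1) * (2 - 1)    # 2 x 2 table: smoker yes/no by athlete yes/no
test_statistic = 15.56    # value reported above

critical_value = stats.chi2.ppf(1 - alpha, df)   # upper 5% point of chi-square with df = 1
p_value = stats.chi2.sf(test_statistic, df)      # right-tail probability of the statistic

print("critical value:", round(critical_value, 5))
print("reject H0:", test_statistic > critical_value)
print("p-value:", p_value)
```

Comparing the p-value with α leads to the same decision as comparing the test statistic with the critical value.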
