BRM File
Definition
Central tendency is a statistical measure that identifies a single value as representative of an entire distribution or dataset. It aims to provide an accurate description of the data as a whole.
Mean
The mean represents the average value of the dataset. It is calculated as the sum of all the values in the dataset divided by the number of values. In general usage, it refers to the arithmetic mean.
Median
The median is the middle value of the dataset when the values are arranged in ascending or descending order. When the dataset contains an even number of values, the median is found by taking the mean of the two middle values.
Mode
The mode is the most frequently occurring value in the dataset. A dataset may contain multiple modes, and in some cases it contains no mode at all.
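As a quick illustration, the three measures can be computed with Python's built-in statistics module; the sample values below are invented for the example.

```python
import statistics

data = [4, 7, 7, 9, 12, 15, 18]  # hypothetical sample values

mean = statistics.mean(data)      # sum of values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
```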
Data analysis and interpretation is the next stage after collecting data through empirical methods. The dividing line between the analysis of data and its interpretation is difficult to draw, as the two processes overlap and merge imperceptibly; interpretation is inextricably interwoven with analysis.
MEAN, MEDIAN AND MODE
Measures of Dispersion: Variance, Standard Deviation
1. Range: It is simply the difference between the maximum value and the minimum value in a dataset. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Subtract the mean from each value in the dataset, square each difference, add the squares, and divide the total by the number of values in the dataset. Variance: σ² = Σ(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e. S.D. = σ = √(σ²).
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers
into quarters. The quartile deviation is half of the distance between the third and the first
quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the
arithmetic mean of the absolute deviations of the observations from a measure of central
tendency is known as the mean deviation (also called mean absolute deviation).
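A short Python sketch of these five measures, using NumPy and the same illustrative data as the range example (population formulas, dividing by N as above):

```python
import numpy as np

data = np.array([1, 3, 5, 6, 7])                      # hypothetical data

data_range = data.max() - data.min()                  # 1. range
variance = np.var(data)                               # 2. population variance, sum((x - mean)^2) / N
std_dev = np.sqrt(variance)                           # 3. standard deviation, square root of variance
q1, q3 = np.percentile(data, [25, 75])                # first and third quartiles
quartile_dev = (q3 - q1) / 2                          # 4. quartile deviation, half of Q3 - Q1
mean_dev = np.mean(np.abs(data - data.mean()))        # 5. mean (absolute) deviation from the mean

print(data_range, variance, std_dev, quartile_dev, mean_dev)
```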
MEASURES OF DISPERSION
CORRELATION & REGRESSION
Correlation refers to a process for establishing the relationship between two variables. A simple way to get a general idea of whether two variables are related is to plot them on a scatter plot. While there are many measures of association for variables measured at the ordinal or higher levels of measurement, correlation is the most commonly used approach.
In statistics, correlation studies and measures the direction and extent of the relationship among variables; correlation measures co-variation, not causation. Therefore, we should never interpret correlation as implying a cause and effect relation. If a correlation exists between two variables X and Y, then as the value of one variable changes in one direction, the value of the other variable tends to change either in the same direction (positive correlation) or in the opposite direction (negative correlation). Furthermore, the correlation considered here is linear, i.e. the relative movement of the two variables can be represented by a straight line on graph paper.
Correlation Coefficient
The correlation coefficient, r, is a summary measure that describes the extent of the statistical relationship between two interval or ratio level variables. The correlation coefficient is scaled so that it is always between −1 and +1. When r is close to 0, there is little relationship between the variables; the farther r is from 0, in either the positive or negative direction, the stronger the relationship between the two variables.
The two variables are often given the symbols X and Y. In order to illustrate how the
two variables are related, the values of X and Y are pictured by drawing the scatter
diagram, graphing combinations of the two variables. The scatter diagram is given
first, and then the method of determining Pearson’s r is presented.
Scatter Diagram
A scatter diagram is a diagram that shows the values of two variables X and Y,
along with the way in which these two variables relate to each other. The values of
variable X are given along the horizontal axis, with the values of the variable Y given
on the vertical axis.
Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In
regression, the independent variable X is considered to have some effect or
influence on the dependent variable Y. Correlation methods are symmetric with
respect to the two variables, with no indication of causation or direction of influence
being part of the statistical consideration. A scatter diagram is given in the following
example. The same example is later used to determine the correlation coefficient.
Types of Correlation
The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
Positive Correlation – when the values of the two variables move in the same direction
so that an increase/decrease in the value of one variable is followed by an
increase/decrease in the value of the other variable.
Negative Correlation – when the values of the two variables move in the opposite direction so that an increase/decrease in the value of one variable is followed by a decrease/increase in the value of the other variable.
No Correlation – when there is no linear dependence or no relation between the two
variables.
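To make the idea concrete, here is a small Python sketch that computes Pearson's r for two illustrative variables using scipy.stats.pearsonr; the data are invented for the example.

```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 6, 8, 10, 12])        # hypothetical values of X
y = np.array([1, 3, 7, 9, 12, 13])        # hypothetical values of Y

r, p_value = stats.pearsonr(x, y)         # Pearson correlation coefficient and its p-value
print(f"r = {r:.3f}, p = {p_value:.4f}")  # r near +1: strong positive linear relationship
```

An r near +1 here would indicate that larger X values are systematically paired with larger Y values; an r near −1 would indicate the opposite pattern.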
REGRESSION
Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).
Also called simple regression or ordinary least squares (OLS), linear regression is the most
common form of this technique. Linear regression establishes the linear
relationship between two variables based on a line of best fit. Linear regression is thus
graphically depicted using a straight line with the slope defining how the change in one
variable impacts a change in the other. The y-intercept of a linear regression relationship
represents the value of one variable when the value of the other is zero. Non-linear
regression models also exist, but are far more complex.
Regression analysis is a powerful tool for uncovering the associations between variables
observed in data, but cannot easily indicate causation. It is used in several contexts in
business, finance, and economics. For instance, it is used to help investment managers
value assets and understand the relationships between factors such as commodity
prices and the stocks of businesses dealing in those commodities.
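A minimal sketch of simple linear regression (ordinary least squares) in Python, using scipy.stats.linregress and invented data:

```python
import numpy as np
from scipy import stats

x = np.array([500, 800, 1100, 1500, 2000, 2400])   # hypothetical independent variable X
y = np.array([420, 650, 900, 1250, 1640, 1980])    # hypothetical dependent variable Y

result = stats.linregress(x, y)                    # fits y = intercept + slope * x by least squares
print(f"slope = {result.slope:.3f}")               # change in y per unit change in x
print(f"intercept = {result.intercept:.3f}")       # predicted y when x = 0
print(f"R^2 = {result.rvalue**2:.3f}")             # share of the variation in y explained by the line
```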
[Chart: scatter plot with fitted trend line, f(x) ≈ 0.824x, R² ≈ 0.969]
DISTRIBUTION OF DATA: SKEWNESS, KURTOSIS, KS TEST OF NORMALITY
Skewness is a measure of the asymmetry of a distribution. A distribution is
asymmetrical when its left and right side are not mirror images.
A distribution can have right (or positive), left (or negative), or zero skewness. A
right-skewed distribution is longer on the right side of its peak, and a left-skewed
distribution is longer on the left side of its peak:
Tails are the tapering ends on either side of a distribution. They represent the probability or frequency of values that are extremely high or low compared to the mean. In other words, tails represent how often outliers occur. Kurtosis is the related measure of how heavy a distribution's tails are compared with those of a normal distribution.
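A quick way to inspect skewness and kurtosis in Python is scipy.stats; the data below are generated only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # hypothetical right-skewed sample

print("skewness:", stats.skew(data))           # > 0 indicates a longer right tail
print("kurtosis:", stats.kurtosis(data))       # excess kurtosis; 0 matches a normal distribution
```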
OUTLIERS
In data analytics, outliers are values within a dataset that vary greatly from the others. For example, the average height of a giraffe is about 16 feet. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, far below the general giraffe population. When going through the process of data analysis, outliers can cause anomalies in the results obtained. This means that they require some special attention and, in some cases, will need to be removed in order to analyze data effectively.
Types of outliers
A univariate outlier is an extreme value on a single variable. For example, Sultan Kösen, currently the tallest man alive, is an extreme value for the single variable of height. A multivariate outlier is an unusual combination of values on at least two variables. For example, if you're looking at both the height and weight of a group of adults, you might observe that one person in your dataset is 5ft 9 inches tall, a measurement that would fall within the normal range for this particular variable. You may also observe that this person weighs 110lbs. Again, this observation alone falls within the normal range for the variable of interest: weight. However, when you consider height and weight together, the combination is unusual, which makes this person a multivariate outlier.
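One common, simple way to flag potential univariate outliers is the interquartile range (IQR) rule; this Python sketch uses invented data and is only one of several possible approaches.

```python
import numpy as np

heights_cm = np.array([152, 160, 163, 165, 168, 170, 172, 175, 178, 251])  # hypothetical data

q1, q3 = np.percentile(heights_cm, [25, 75])
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # common 1.5 * IQR fences

outliers = heights_cm[(heights_cm < lower) | (heights_cm > upper)]
print("flagged outliers:", outliers)             # only the 251 cm value is flagged
```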
KS TEST OF NORMALITY
In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
If the data really do follow a normal distribution in the population, the deviation between the sample distribution and the normal curve should probably be quite small. That is, a small deviation has a high probability value or p-value. Conversely, a huge deviation is very unlikely and suggests that the data do not follow a normal distribution in the entire population. So a large deviation has a low p-value.
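A small Python sketch of a one-sample KS test of normality, standardizing invented data before comparing it with the standard normal distribution; note that because the mean and standard deviation are estimated from the same sample, the nominal p-value is only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=500, scale=40, size=200)     # hypothetical measurements

standardized = (sample - sample.mean()) / sample.std(ddof=1)
stat, p_value = stats.kstest(standardized, "norm")   # compare with the standard normal CDF

print(f"D = {stat:.3f}, p = {p_value:.3f}")          # a small p-value suggests non-normality
```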
NORMAL DISTRIBUTION
[Chart: normal distribution curve]
KOLMOGOROV-SMIRNOV TEST
T-TEST
A t test is a statistical test that is used to compare the means of two groups. It is
often used in hypothesis testing to determine whether a process or treatment
actually has an effect on the population of interest, or whether two groups are
different from one another. A t test can only be used when comparing the means of
two groups (a.k.a. pairwise comparison). If you want to compare more than two
groups, or if you want to do multiple pairwise comparisons, use an ANOVA test or a
post-hoc test.
The t test is a parametric test of difference, meaning that it makes the same
assumptions about your data as other parametric tests. The t test assumes your
data:
1. are independent
2. are (approximately) normally distributed
3. have a similar amount of variance within each group being compared (a.k.a.
homogeneity of variance)
If your data do not fit these assumptions, you can try a nonparametric alternative to the t test, such as the Wilcoxon signed-rank test, or a variant such as Welch's t test for data with unequal variances.
When choosing a t test, you will need to consider two things: whether the groups being compared come from a single population or two different populations, and whether you want to test the difference in a specific direction.
One-sample, two-sample, or paired t test?
If the groups come from a single population (e.g., measuring before and after
an experimental treatment), perform a paired t test. This is a within-subjects
design.
If the groups come from two different populations (e.g., two different species,
or people from two separate cities), perform a two-
sample t test (a.k.a. independent t test). This is a between-subjects
design.
If there is one group being compared against a standard value (e.g.,
comparing the acidity of a liquid to a neutral pH of 7), perform a one-
sample t test.
If you only care whether the two populations are different from one another,
perform a two-tailed t test.
If you want to know whether one population mean is greater than or less than
the other, perform a one-tailed t test.
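The three variants map directly onto scipy.stats functions; a compact sketch with invented group data follows.

```python
import numpy as np
from scipy import stats

before = np.array([7.1, 6.8, 7.4, 7.9, 6.5, 7.2])    # hypothetical pre-treatment scores
after = np.array([7.6, 7.0, 7.8, 8.4, 6.9, 7.5])     # same subjects after treatment
group_a = np.array([5.1, 5.8, 6.2, 5.5, 6.0])        # hypothetical independent samples
group_b = np.array([6.4, 6.9, 7.1, 6.5, 7.3])

paired = stats.ttest_rel(before, after)               # paired t test (within-subjects)
independent = stats.ttest_ind(group_a, group_b)       # two-sample / independent t test
one_sample = stats.ttest_1samp(group_a, popmean=7.0)  # one group against a standard value

for name, res in [("paired", paired), ("independent", independent), ("one-sample", one_sample)]:
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

All three return two-tailed p-values by default; recent SciPy versions accept an alternative argument for one-tailed tests.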
PAIRED T-TEST
UNPAIRED T-TEST
F-TEST: TWO-SAMPLE FOR VARIANCES
The F test is a statistical test used in hypothesis testing to check whether the variances of two populations or two samples are equal. In an F test, the test statistic follows an F distribution. The test compares two variances by dividing them, and it can be either one-tailed or two-tailed depending on the parameters of the problem.
The F value obtained after conducting an F test is also used to perform the one-way ANOVA (analysis of variance) test. This section covers the F test, the F statistic, its critical value, its formula, and how to conduct an F test for hypothesis testing.
F Test Definition
The F test can be defined as a test that uses the F statistic to check whether the variances of two samples (or populations) are equal. To conduct an F test, the populations should be approximately normally distributed and the samples must be independent. On conducting the hypothesis test, if the results of the F test are statistically significant, then the null hypothesis is rejected; otherwise it cannot be rejected.
F Test Formula
The F test checks the equality of variances using hypothesis testing. The F statistic is the ratio of the two sample variances, F = s₁² / s₂², and the decision criteria for the different hypothesis tests are as follows:
Left-tailed test: if the F statistic < the F critical value, then reject the null hypothesis.
Right-tailed test: if the F statistic > the F critical value, then reject the null hypothesis.
Two-tailed test: if the F statistic > the F critical value, then the null hypothesis is rejected.
F Test vs T-Test
F Test: a test statistic used to check the equality of the variances of two populations.
T-Test: used when the sample size is small (n < 30) and the population standard deviation is not known.
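A sketch of a two-sample F test for equal variances in Python; the ratio of the sample variances is compared against the F distribution, and the data are invented for illustration.

```python
import numpy as np
from scipy import stats

sample1 = np.array([12.1, 14.3, 13.8, 15.2, 12.9, 14.7, 13.5])   # hypothetical sample 1
sample2 = np.array([11.9, 12.4, 12.1, 12.8, 12.2, 12.6, 12.0])   # hypothetical sample 2

var1 = np.var(sample1, ddof=1)                  # sample variances (n - 1 denominator)
var2 = np.var(sample2, ddof=1)
f_stat = var1 / var2                            # F statistic: ratio of the two variances
df1, df2 = len(sample1) - 1, len(sample2) - 1   # degrees of freedom

p_right = stats.f.sf(f_stat, df1, df2)          # right-tailed p-value
f_crit = stats.f.ppf(0.95, df1, df2)            # right-tailed critical value at the 5% level
print(f"F = {f_stat:.3f}, critical = {f_crit:.3f}, p = {p_right:.4f}")
```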
F-TEST
Z-TEST: TWO SAMPLES FOR MEANS
A z-test is a statistical test used to determine whether two population
means are different when the variances are known and the sample size
is large.
In the output of a two-sample z test (for example, from a spreadsheet data-analysis tool), "P(Z <= z) one tail" should be interpreted as P(Z >= ABS(z)), that is, the probability of observing a z value farther from zero than the absolute value of the observed z value when there is no difference between the population means.
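A minimal two-sample z test sketch in Python, computing the statistic directly from population variances that are assumed to be known; the data and variances are invented for the example.

```python
import numpy as np
from scipy import stats

sample1 = np.array([102.0, 98.5, 101.2, 99.8, 103.1, 100.4, 97.9, 101.7])  # hypothetical sample 1
sample2 = np.array([96.3, 97.8, 95.9, 98.2, 96.7, 97.1, 95.4, 98.0])       # hypothetical sample 2
var1, var2 = 4.0, 3.5                       # population variances, assumed known for a z test

diff = sample1.mean() - sample2.mean()
se = np.sqrt(var1 / len(sample1) + var2 / len(sample2))   # standard error of the difference
z = diff / se

p_one_tail = stats.norm.sf(abs(z))          # P(Z >= |z|), the one-tailed p-value
p_two_tail = 2 * p_one_tail
print(f"z = {z:.3f}, one-tailed p = {p_one_tail:.4f}, two-tailed p = {p_two_tail:.4f}")
```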
Z-TEST
ANOVA: SINGLE AND TWO FACTOR
In research across fields such as business, economics, psychology, sociology, and biology, the Analysis of Variance, commonly known as ANOVA, is an extremely important tool for the analysis of data. It is a technique employed by the researcher to compare more than two populations and to perform simultaneous tests, and its purpose is two-fold.
In one-way ANOVA the researcher considers only one factor. In contrast, in two-way ANOVA the researcher investigates two factors concurrently. For a layman these two concepts may seem synonymous; however, there is a difference between one-way and two-way ANOVA.
Comparison Chart
Meaning: One way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously using variance; two way ANOVA is a statistical technique wherein the interaction between factors influencing the variable can be studied.
Compares: One way ANOVA compares three or more levels of one factor; two way ANOVA compares the effect of multiple levels of two factors.
Number of observations: In one way ANOVA the number need not be the same in each group; in two way ANOVA it needs to be equal in each group.
Design of experiments: One way ANOVA needs to satisfy only two principles; two way ANOVA requires all three principles to be satisfied.
One-way Analysis of Variance (ANOVA) is a hypothesis test in which only one categorical variable, or single factor, is considered. It is a technique that enables us to compare the means of three or more samples with the help of the F-distribution, and it is used to find out whether there is a difference among the categories of that factor.
The null hypothesis (H0) is that all population means are equal, while the alternative hypothesis (H1) is that at least one mean differs. Its assumptions are:
Normal distribution of the population from which the samples are drawn.
Measurement of the dependent variable is at interval or ratio level.
Two or more than two categorical independent groups in an independent variable.
Independence of samples
Homogeneity of the variance of the population.
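A compact one-way ANOVA sketch in Python with three invented groups, using scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

# hypothetical scores for three independent groups (one factor, three levels)
group1 = np.array([23, 25, 27, 22, 26])
group2 = np.array([30, 31, 29, 33, 28])
group3 = np.array([24, 26, 25, 27, 23])

f_stat, p_value = stats.f_oneway(group1, group2, group3)  # tests H0: all group means are equal
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")              # small p-value: at least one mean differs
```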
Definition of Two-Way ANOVA
Two-way ANOVA, as its name signifies, is a hypothesis test wherein the classification of data is based on two factors. For instance, the sales made by a firm can be classified first by salesperson and second by region. It is a statistical technique used by the researcher to compare several levels (conditions) of two independent variables involving multiple observations at each level.
Two-way ANOVA examines the effect of the two factors on the continuous dependent variable. It also studies the interaction between the independent variables, if any, in influencing the values of the dependent variable. Its assumptions are:
Normal distribution of the population from which the samples are drawn.
Measurement of dependent variable at continuous level.
Two or more than two categorical independent groups in two factors.
Categorical independent groups should have the same size.
Independence of observations
Homogeneity of the variance of the population.
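A sketch of a two-factor ANOVA with replication in Python using statsmodels; the salesperson and region factors and all values are invented for illustration.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# hypothetical sales, cross-classified by salesperson and region, with replication
data = pd.DataFrame({
    "sales": [20, 22, 19, 25, 27, 24, 30, 29, 31, 18, 21, 20],
    "salesperson": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "region": ["N", "N", "N", "S", "S", "S", "N", "N", "N", "S", "S", "S"],
})

# fit main effects plus the interaction term, then build the ANOVA table
model = ols("sales ~ C(salesperson) + C(region) + C(salesperson):C(region)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```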
ANOVA: TWO FACTOR WITH REPLICATION
CHI-SQUARE TEST
A Pearson’s chi-square test is a statistical test for categorical data. It is used to determine whether
your data are significantly different from what you expected. There are two types of Pearson’s chi-
square tests:
The chi-square goodness of fit test is used to test whether the frequency distribution of a
categorical variable is different from your expectations.
The chi-square test of independence is used to test whether two categorical variables are
related to each other.
Chi-square is often written as Χ2 and is pronounced “kai-square” (rhymes with “eye-square”). It is also
called chi-squared.
Pearson’s chi-square (Χ2) tests, often referred to simply as chi-square tests, are among the most
common nonparametric tests. Nonparametric tests are used for data that don’t follow
the assumptions of parametric tests, especially the assumption of a normal distribution.
If you want to test a hypothesis about the distribution of a categorical variable you’ll need to use a
chi-square test or another nonparametric test. Categorical variables can be nominal or ordinal and
represent groupings such as species or nationalities. Because they can only have a few specific
values, they can’t have a normal distribution.
Frequency distributions are often displayed using frequency distribution tables. A frequency
distribution table shows the number of observations in each group. When there are two categorical
variables, you can use a specific type of frequency distribution table called a contingency table to
show the number of observations in each combination of groups.
The chi-square statistic is calculated as Χ² = Σ (O − E)² / E, where O is an observed frequency and E is the corresponding expected frequency.
The larger the difference between the observations and the expectations (O − E in the equation), the bigger the chi-square will be. To decide whether the difference is big enough to be statistically significant, you compare the chi-square value to a critical value.
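Both flavours of the test are available in scipy.stats; the frequencies below are invented for illustration.

```python
import numpy as np
from scipy import stats

# goodness of fit: observed counts vs. expected counts for one categorical variable
observed = np.array([45, 35, 20])
expected = np.array([40, 40, 20])          # expected frequencies must sum to the same total
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"goodness of fit: chi2 = {chi2:.3f}, p = {p:.4f}")

# test of independence: contingency table of two categorical variables
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected_table = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```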
CHI-SQUARE TEST
WILCOXON SIGNED-RANK TEST
The Wilcoxon signed rank test is a nonparametric hypothesis test that can do the following:
Evaluate the median difference between two paired samples.
Compare a one-sample median to a reference value.
Statisticians often use the Wilcoxon signed rank test when their data do not follow the normal distribution. However, it has other advantages over t-tests, including the ability to analyze ordinal data and to reduce the impact of outliers. While the data don't need to be normally distributed, they must follow a symmetrical distribution. When using the paired form, the distribution of the differences between the paired values must be symmetrical. If the distribution is asymmetric, consider using the sign test. This nonparametric test is like the Wilcoxon signed rank test but can handle asymmetric distributions. However, the sign test is less powerful.
Now, let's delve into the hypotheses of the Wilcoxon signed rank test. There are two sets of hypotheses. Choosing the correct set depends on whether you perform the paired or one-sample test.
Paired Test
The following are the hypotheses for the paired Wilcoxon signed rank test:
Null hypothesis (H0): the median of the differences between the paired values equals zero.
Alternative hypothesis (H1): the median of the differences does not equal zero.
One-Sample Test
The following are the hypotheses for the one-sample Wilcoxon signed rank test:
Null hypothesis (H0): the population median equals the reference value.
Alternative hypothesis (H1): the population median does not equal the reference value.
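A brief Python sketch of both forms using scipy.stats.wilcoxon and invented data:

```python
import numpy as np
from scipy import stats

before = np.array([8.2, 7.9, 9.1, 8.5, 7.7, 8.8, 9.0, 8.3])   # hypothetical paired measurements
after = np.array([7.8, 7.5, 8.6, 8.4, 7.2, 8.1, 8.7, 7.9])

# paired test: is the median of the differences between the pairs zero?
stat, p = stats.wilcoxon(before, after)
print(f"paired: W = {stat:.1f}, p = {p:.4f}")

# one-sample test: is the median of 'before' equal to a reference value of 8.0?
stat, p = stats.wilcoxon(before - 8.0)
print(f"one-sample: W = {stat:.1f}, p = {p:.4f}")
```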
WILCOXON T-TEST