Introduction to Biostatistics Concepts
Topics covered
The mere mention of the word statistics can ring alarm bells in many minds, and more so in the minds of medics. Medical professionals often believe they are duty bound not to touch anything mathematical; they opted for Biology instead of Mathematics at FSc level, so it would be a sin even to think of it. They erroneously regard Statistics as a branch of Mathematics. No doubt mathematics is used extensively in statistics, but so is the case with other sciences such as Physics, Chemistry, Agriculture, Engineering and many more. The subject is further defamed, for political reasons, as when Mark Twain attributed to the Victorian-era British Prime Minister Disraeli the saying "there are lies, damned lies and statistics". We readers immediately implicate the subject of statistics as the worst type of lie. The facts are to the contrary: what Disraeli meant by statistics was the facts and figures presented by the government, whereas the subject of statistics is an entirely different thing; it also deals with facts and figures, but in a scientific way.
The kind of thinking involved in statistics will not be entirely new to you. Indeed, you will find that many of your day-to-day assumptions and decisions already depend on it. Suppose you are told that two adults are sitting in the next room. One is five feet tall and the other is six feet tall. What would be your best guess as to each one's sex, based on that information alone? You may be fairly confident in assuming that the six-foot person is a man and the five-footer a woman. You could be wrong, of course, but experience tells you that five-foot men and six-foot women are somewhat rare. You have noticed that, by and large, males tend to be taller than females. Of course you have not seen all men or all women, and you recognize that many women are taller than many men; nevertheless you feel reasonably confident about generalizing from the particular men and women you have known to men and women as a whole.
The above is a simple, everyday example of statistical thinking. There are many others. Any time you use phrases like 'on average, I sleep 52 hours a week', 'we can expect a lot of rain at this time of year' or 'the earlier you start revising, the better you are likely to do in the annual exam', you are making a statistical statement, even though you may have performed no calculations.
There are many more things that are not known to us, and we may need information on some of them. Without conducting a proper investigation we may remain oblivious to many important things, and to conduct such an investigation we need knowledge of the subject STATISTICS. Medical science cannot progress without making use of statistics; hence all medical graduates must possess a first-hand knowledge of it. Just think for a moment: how do we know that the hemoglobin level is 12-15 g/dl in adult males? Have we measured the hemoglobin level of all men and women? Certainly not, yet we still confidently diagnose a man with a hemoglobin level of 10 g/dl as anemic. How do we come to feel confident about that diagnosis? You will learn how as we work through the following pages.
Statistics:
A subject that deals with the collection, compilation, presentation, analysis and interpretation of data
Statistics is of two kinds:
a). Descriptive Statistics: methods used to summarize or describe our observations.
b). Inferential Statistics: using those observations as a basis for making estimates or predictions, i.e. inferences about a situation that has not yet been observed. Appropriate tests of significance are applied in inferential statistics.
* Note that the word statistics used in everyday language means facts and figures, not the subject statistics. The word statistic means a figure computed from actual observations (a sample).
Data:
Record of observations – facts and figures – any piece of information.
Information:
When data is processed and made meaningful
Sample:
A part taken out of the population for actual study
Census:
When all members of a population are examined
Sampling:
The procedure by which some members of the population are drawn for examination
You will appreciate that populations are usually very big, and at times of infinite size. Time and other resources are most often scarce; therefore researchers almost always opt for sampling rather than conducting a census. To be able to draw meaningful information from our samples, we would like them to be representative of the population they are drawn from. But since we do not know the population, how do we know that our sample is representative of it? We have no foolproof method; to be reasonably sure that our samples are representative of the population they are drawn from, we must ensure that:
i. they are drawn randomly
ii. they are of adequate size
We can ensure that samples are drawn randomly, but the size of a sample is usually dictated by the availability of resources, i.e. time, men, money and material, rather than by statistical requirements.
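As an aside, here is a minimal Python sketch of drawing a simple random sample, assuming a numbered sampling frame is available; the frame size and sample size below are only illustrative, echoing the Abbottabad example that follows:

    import random

    # Hypothetical sampling frame: serial numbers of 300,000 adult males
    frame = range(1, 300001)

    # Simple random sample of 1000 drawn without replacement;
    # every member of the frame has an equal chance of selection
    sample_ids = random.sample(frame, k=1000)
    print(len(sample_ids))   # 1000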
Sampling unit:
The individual entity, such as a person or a household, that is actually studied.
Sampling Frame:
The list of sampling units is called sampling frame.
Sampling Techniques:
In an ideal world we would have a list of all the members of the population and then draw a sample by, say, the lottery method. But imagine that we want to know about the heights of adult males in district Abbottabad and we want to draw a sample of 1000 adult males. There may be more than 300,000 adult males in district Abbottabad. Do we have any way to obtain a list of all of them? We surely do not have a complete list of the adult males of district Abbottabad, and likewise complete lists do not exist for most of the populations we encounter. Therefore, we need to look for other sampling techniques.
Random Samples:
A random sample is one in which each and every member of the population has an equal chance of selection. Sampling techniques fall into two broad groups:
a. Probability Sampling
b. Non-probability Sampling
a. Probability samples are those in which members of the population have a known, though not necessarily equal, chance of being selected as sample members. With this technique of sampling, inferential statements can be made based on samples. In the case of probability sampling the sampling frame is available in some shape. Some such techniques are as under:
Cluster Sampling: Cluster sampling is akin to multi-stage sampling in so far as the area is consecutively subdivided. It is adopted when there is no sampling frame from which the final sample can be selected. In this type the researcher combs the area meticulously to find the items needed to form that area's sample. For example, to know the vaccination status of children under two years, a researcher first selects a district, then a Union Council, then villages and then a Mohallah, all randomly. He then looks for seven or more houses located together that have children less than two years of age. Such a group of seven or more adjacent houses is known as a cluster.
Note: if the sample is not randomly selected it cannot be representative of the population under study. The resulting distortion is called SYSTEMATIC ERROR or BIAS. Therefore, to decrease bias, the sampling technique is the important consideration; to decrease the play of chance, sample size is the consideration.
Variable:
An attribute or characteristic that varies from one individual to another.
Types of Data:
Data consists of variables. We need to know different types of variables because different statistical
techniques are employed to analyze different variables.
Data falls into two broad types:
Category: Nominal or Ordinal
Quantity: Discrete (counting) or Continuous (measuring); continuous data may be on an Interval or a Ratio scale
Category (Nominal):
Observations have names only, for example male/female or black/white/yellow/brown. There are no orders or ratios. If nominal data has only two groups, e.g. male/female, it is called dichotomous data.
Category (Ordinal):
When data can be placed in a meaningful order. Students may be ranked as 1st, 2nd, 3rd and so on; however, the interval between ranks is not certain.
Quantity (Discrete):
When items can be counted, e.g. the number of children a woman has given birth to. The count can be 1, 2, 3, 4 or even more, but it cannot be 2.6.
Quantity/Continuous (Interval):
Such data can be placed in a meaningful order and, in addition, the intervals between values are meaningful. Temperature on the Celsius scale is interval data, but a temperature of 10°C is not twice as hot as 5°C, because the Celsius scale has no absolute zero.
Quantity/Continuous (Ratio):
In such data the intervals have meaningful ratios e.g. a student weighing 80kg is twice as heavy as a student
of 40kg.
Another classification of variables is into independent and dependent variables. These terms are used when we compare variables: independent variables are presumed causes and dependent variables are presumed effects. The incidence of common cold in different seasons illustrates this: season is the independent variable and common cold the dependent variable.
Compilation of Data:
Once data is collected it can be organized for further processing.
Frequency Distribution:
After organizing the collected data, it can be presented in tabular or graphic form showing the frequencies of the different observations. It can also be organized into groups, which is called a grouped frequency distribution. Suppose we collect data on the pulse rates/min of 15 students as under:
72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75
A grouped frequency distribution of these observations is:

Pulse rate (beats/min)   Frequency   Tally
61-70                    4           ////
71-80                    8           //// ///
81-90                    2           //
91-100                   0
101-110                  1           /

Note: For tallying observations the FIVE BAR GATE or tallying method is used, shown in the third column of the above table. //// is called a five bar gate.
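The tallying can be checked with a short Python sketch (a minimal illustration, not part of the original text):

    from collections import Counter

    pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

    # Ungrouped frequency distribution: each distinct value and its tally
    for value, count in sorted(Counter(pulse).items()):
        print(value, "/" * count)          # e.g. 73 ///

    # Grouped frequency distribution using classes 61-70, 71-80, ...
    groups = Counter((x - 61) // 10 for x in pulse)
    for g in sorted(groups):
        print(f"{61 + 10 * g}-{70 + 10 * g}: {groups[g]}")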
Description of Data:
Once the data is collected we need to know the following;
Mean (Arithmetic Mean): It is defined as the sum of observations divided by the number of observations.

For a sample:  Mean (X̄) = Σx / n
  Σx = sum of observations in the sample
  n  = number of observations in the sample

For a population:  Mean (µ) = ΣX / N
  ΣX = sum of values in the population
  N  = number of values in the population
Using our pulse-rate data:

X̄ = Σx / n = 1140 / 15 = 76 beats/minute

For grouped data (see the grouped frequency table above) the class midpoints are used: the midpoint of class 61-70 is (61 + 70)/2 = 65.5, of class 71-80 it is (71 + 80)/2 = 75.5, and so on. Each midpoint is multiplied by the frequency of its class and the products are summed, giving Σfx = 1142.5, so that

X̄ = 1142.5 / 15 = 76.16 beats/minute

The mean calculated in this way is not exactly the same as that calculated by adding the individual observations, but it is close to it. With very large data sets it comes even nearer to the actual value.
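Both calculations of the mean can be verified with a few lines of Python (a sketch for checking the arithmetic; the class midpoints and frequencies are those from the grouped table above):

    pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

    # Mean from raw observations: sum of observations / number of observations
    print(sum(pulse) / len(pulse))                    # 76.0 beats/minute

    # Grouped-data approximation: class midpoints weighted by frequencies
    midpoints = [65.5, 75.5, 85.5, 95.5, 105.5]       # classes 61-70, 71-80, ...
    freqs     = [4, 8, 2, 0, 1]
    print(sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs))   # 76.166...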
Advantages of Mean:
i. It takes into account all the values of a distribution
ii. It is useful in further statistical computation
Disadvantages:
i. It is affected by extreme values
ii. Sometimes it can give a ridiculous figure, e.g. 2.35 children, 1.13 eggs, etc.
Median: The value that divides an ordered distribution into two equal halves.

Position of Median = (n + 1) / 2

Using the same data, arranged in ascending order:
62, 66, 67, 69, 72, 73, 73, 73, 75, 76, 78, 80, 82, 86, 108

Position of Median = (15 + 1) / 2 = 16 / 2 = 8, so the 8th value is the median, which is 73.

You can see that there are seven values below 73 and an equal number, i.e. seven, above the median. In the data shown n = 15, which is an odd number. If n = 16, an even number, then

Position of Median = (16 + 1) / 2 = 8.5

A position of 8.5 means the average of the 8th and 9th observations. If, for example, the 8th value were 73 and the 9th were 75, then

Median = (73 + 75) / 2 = 148 / 2 = 74 beats/minute

In this case the median may not be an actually observed value.
Advantages of Median:
It is not affected by extreme values; therefore it is used for data that is skewed, i.e. that has extreme observations.
Disadvantages:
1. It does not take into account all the values of a distribution
2. It is of limited value in further statistical computation
Mode: The most frequently observed value in a distribution is known as the mode. In the aforementioned data the mode is 73 beats per minute, which appears three times.
Some distributions have two modes; they are called bimodal distributions. If there are more than two modes, the distribution is known as multimodal. The mode can be used for all types of data.
Note: Mean, Median and Mode carry the same units as the observations, and the units must be stated with the resulting value, e.g. the Mean is 76 beats per minute.
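Python's standard statistics module reproduces the median and mode directly (a minimal sketch):

    import statistics

    pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

    # Median: the middle value of the ordered data; for an even number of
    # observations the mean of the two middle values is returned
    print(statistics.median(pulse))       # 73

    # Mode: the most frequently observed value
    print(statistics.mode(pulse))         # 73 (appears three times)

    # multimode lists every mode, useful for bimodal/multimodal data
    print(statistics.multimode(pulse))    # [73]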
Measures of Dispersion:
1. Range
2. Variance
3. Standard Deviation
4. Co-efficient of Variation
(The mean deviation is another measure of dispersion.)

1. Range: The difference between the highest and lowest observations of a distribution.
72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75
The range is 62-108 beats per minute, i.e. 108 − 62 = 46 beats per minute.
Range is a good measure of dispersion when we want to know immediately how the data spread but it takes
into account only the lowest and highest values of a distribution. Therefore, it is not a good measure of
dispersion of data.
2. Variance: The variance is equal to the sum of the squared deviations of the observations from the mean of the distribution, divided by the number of observations.

Variance = Σ(X − X̄)² / n

X     X̄    X − X̄   (X − X̄)²
62    76    −14      196
66    76    −10      100
67    76    −9       81
69    76    −7       49
72    76    −4       16
73    76    −3       9
73    76    −3       9
73    76    −3       9
75    76    −1       1
76    76    0        0
78    76    +2       4
80    76    +4       16
82    76    +6       36
86    76    +10      100
108   76    +32      1024
n = 15       Σ(X − X̄) = 0    Σ(X − X̄)² = 1650

Variance = Σ(X − X̄)² / n = 1650 / 15 = 110
We square the deviations to get rid of the negative signs, but by squaring the values we lose the units. Therefore, variance is of limited value in measuring the dispersion of data.
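The table above reduces to a few lines of Python (a sketch to verify the arithmetic):

    pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]
    mean = sum(pulse) / len(pulse)              # 76.0

    # Sum of squared deviations from the mean, then divide by n
    ss = sum((x - mean) ** 2 for x in pulse)    # 1650.0
    print(ss / len(pulse))                      # 110.0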
3. Standard Deviation: The square root of the variance; it measures dispersion in the original units.

SD = √( Σ(X − X̄)² / n )

The table of deviations is the same as that used for the variance, giving Σ(X − X̄)² = 1650.
By squaring the deviations we get rid of the negative signs, but we lose the original units; applying the square root takes care of this and restores the original units. Thus:

SD = √( 1650 / 15 ) = √110 = 10.5 beats/minute

Two alternative computational formulas are:

A. SD = √( (Σx² − (Σx)² / n) / n )

B. SD = √( Σx² / n − X̄² )
Formula A:

SD = √( (Σx² − (Σx)² / n) / n )

Σx² : all the observations are first squared and then added.
(Σx)² : all the observations are first added and then squared.

Using the same data we can calculate the standard deviation by this formula:

Observations (X)   Observations squared (X²)
62                 3844
66                 4356
67                 4489
69                 4761
72                 5184
73                 5329
73                 5329
73                 5329
75                 5625
76                 5776
78                 6084
80                 6400
82                 6724
86                 7396
108                11664
Σx = 1140          Σx² = 88290
SD = √( (88290 − (1140)² / 15) / 15 )
   = √( (88290 − 1299600 / 15) / 15 )
   = √( (88290 − 86640) / 15 )
   = √( 1650 / 15 )
   = √110
   = 10.5 beats/min
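Formula A can be checked in Python (a minimal sketch of the computation above):

    import math

    pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]
    n = len(pulse)

    sum_x  = sum(pulse)                   # 1140: add first, then square
    sum_x2 = sum(x * x for x in pulse)    # 88290: square first, then add

    # Formula A: SD = sqrt((Σx² − (Σx)²/n) / n)
    sd = math.sqrt((sum_x2 - sum_x ** 2 / n) / n)
    print(round(sd, 1))                   # 10.5 beats/minute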
The first formula, SD = √( Σ(x − X̄)² / n ), is usually used for small data sets; the other formulas are used for large data sets. If your data consist of fewer than 30 observations, the two formulas can be amended as under for correction:

1. SD = √( Σ(x − X̄)² / (n − 1) )

2. SD = √( (Σx² − (Σx)² / n) / (n − 1) )

The last formula, SD = √( Σx² / n − X̄² ), does not admit such a correction and should be used only for data of more than 30 observations.
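Python's statistics module offers both divisors, so the effect of the n − 1 correction on our small sample can be seen directly (a sketch):

    import statistics

    pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

    # Divisor n: treats the 15 observations as a complete population
    print(round(statistics.pstdev(pulse), 2))   # 10.49

    # Divisor n - 1: the small-sample correction described above
    print(round(statistics.stdev(pulse), 2))    # 10.86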
The use of standard deviation in statistical data is explained with the Normal Distribution.
Note: The Standard Deviation of a sample is denoted by the symbol SD and the Standard Deviation of a population is denoted by the Greek small letter sigma (σ).
4. Co-efficient of Variation:
Measures variability in relation to the mean and offers a method by which one can compare the
relative dispersions of one type of data with the relative dispersion of another type of data.
Our data of heart beats per minute has the following co-efficient of variation:

CV = (SD / Mean) × 100 = (10.5 / 76) × 100 = 13.8%

If we had also recorded the systolic blood pressures of the same individuals, with a mean systolic BP of 130 mmHg and a standard deviation of 13 mmHg, the co-efficient of variation would have been

CV = (13 / 130) × 100 = 10%
Now we can compare and conclude that, among the persons whose pulse rates and systolic blood pressures were recorded, pulse rate is more variable than systolic blood pressure, since the co-efficient of variation of pulse rate is 13.8% and that of systolic BP is 10%.
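The comparison can be expressed as a small Python function (a sketch using the summary figures quoted above):

    # Co-efficient of variation: SD as a percentage of the mean, which
    # lets us compare the dispersion of data measured on different scales
    def cv(sd, mean):
        return sd / mean * 100

    print(round(cv(10.5, 76), 1))    # 13.8  (pulse rates, beats/minute)
    print(round(cv(13, 130), 1))     # 10.0  (systolic BP, mmHg)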
Inferential Statistics means going beyond the actual observations and stating something (based on the collected data) about what has not actually been observed. Here the theory of probability comes in.
Probability:
The probability of an event is the number of ways that event can occur divided by the total number of possible outcomes. If we flip a fair coin, the probability of getting a head is 1/2, or 50%, or 0.5. The probability of getting either a head or a tail is 1/1, or 100%, or 1. Two simple rules of probability need to be remembered.
Addition Rule:
For two or more mutually exclusive events, the probability that one or the other occurs is the sum of their individual probabilities, and the probabilities of all possible mutually exclusive outcomes add up to ONE, or 100%. For example, there are two possibilities when we flip a fair coin: either a head or a tail. We cannot have both on one flip. The probability of a head is 0.5 or 50%; therefore, according to the addition rule, the probability of a head or a tail is 0.5 + 0.5 = 1, or 50% + 50% = 100%.
Example: If the infant mortality rate in Pakistan is 80 per 1000, then the probability of an infant dying is 80 per 1000, or 8 per 100, or 8%, or .08. The probability of an infant surviving is 920 per 1000, i.e. 1000 − 80 = 920; it can also be said that the probability of an infant surviving is 92% or .92. As a child can either survive or die, and these are mutually exclusive phenomena, according to the addition rule the probability of either dying or surviving is .08 + .92 = 1. (If the statement has OR in it, the addition rule is applied.)
Multiplication Rule: For two or more independent, randomly occurring phenomena the probabilities multiply. When we flip a fair coin, each flip is an independent event, whether we flip twice, thrice or more. If the probability of getting a head on one flip is .5, then the probability of heads on both of two flips is .5 × .5 = .25.
Example: If we know that 10% of patients visiting a medical OPD suffer from hypertension, the probability of any one patient having hypertension is .1. So the probability that the first two patients entering the OPD both suffer from hypertension is .1 × .1 = .01, or 1%. This is the multiplication rule. (If the statement has AND in it, the multiplication rule is applied.)
Note: A probability of 1 is called unity. The probability that one will eventually die is 1; the probability of staying alive for ever is 0. Between 0 and 1 lie fractions, which may run to many decimal places for different events.
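Both rules can be checked numerically (a minimal Python sketch using the figures from the examples above):

    # Addition rule (mutually exclusive events): P(A or B) = P(A) + P(B)
    p_die = 0.08                  # infant mortality of 80 per 1000
    p_survive = 1 - p_die         # 0.92
    print(p_die + p_survive)      # 1.0: the infant must either die or survive

    # Multiplication rule (independent events): P(A and B) = P(A) x P(B)
    p_htn = 0.1                        # 10% of OPD patients are hypertensive
    print(round(p_htn * p_htn, 2))     # 0.01, i.e. 1%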
NORMAL DISTRIBUTION
NORMAL CURVE
1. It is bell shaped.
2. It is perfectly symmetrical.
3. Mean, Median and Mode are in the centre of the curve i.e. the dome of the curve.
4. Half the values (50%) lie on each side when it is cut into half at the highest point.
5. It has two determinants: the Mean (µ) and the Standard Deviation (σ).
6. 68.26% of the values lie within the range Mean ± 1×SD (µ − 1σ to µ + 1σ). In other words the probability of occurrence of values in the range µ − 1σ to µ + 1σ is 68.26%, or .6826. This also implies that 31.74% of values are either below Mean − 1×SD (µ − 1σ) or above Mean + 1×SD (µ + 1σ); in other words the probability of occurrence of values below µ − 1σ or above µ + 1σ is 31.74%, or .3174.
7. 95.45% of the values lie within the range Mean ± 2×SD (µ − 2σ to µ + 2σ). In other words the probability of occurrence of values in the range µ − 2σ to µ + 2σ is 95.45%, or .9545. This also implies that 4.55% of values are either below Mean − 2×SD or above Mean + 2×SD; in other words the probability of occurrence of values below µ − 2σ or above µ + 2σ is 4.55%, or .0455.
8. 99.73% of the values lie within the range Mean ± 3×SD (µ − 3σ to µ + 3σ). In other words the probability of occurrence of values in the range µ − 3σ to µ + 3σ is 99.73%, or .9973. This also implies that 0.27% of values are either below Mean − 3×SD or above Mean + 3×SD; in other words the probability of occurrence of values below µ − 3σ or above µ + 3σ is 0.27%, or .0027.
To elaborate further and make it useful, remember the following landmarks also:
a. 95% of the values lie within the range Mean ± 1.96×SD (µ − 1.96σ to µ + 1.96σ). In other words the probability of occurrence of values in this range is 95%, or .95. This also implies that 5% of values are either below Mean − 1.96×SD or above Mean + 1.96×SD; in other words the probability of occurrence of values below µ − 1.96σ or above µ + 1.96σ is 5% (2.5% on each side), or .05 (.025 on each side).
b. 99% of the values lie within the range Mean ± 2.58×SD (µ − 2.58σ to µ + 2.58σ). In other words the probability of occurrence of values in this range is 99%, or .99. This also implies that 1% of values are either below Mean − 2.58×SD or above Mean + 2.58×SD; in other words the probability of occurrence of values below µ − 2.58σ or above µ + 2.58σ is 1%, or .01.
[Diagrams: normal curves marking the ranges µ ± 1σ, µ ± 2σ (47.725% of values between µ and 2σ on each side), µ ± 3σ, µ ± 1.96σ and µ ± 2.58σ.]
Note: When we say that a certain percentage of observations lie between Mean ± Z x SD, the Z in the case
of 68.26% is 1, in the case of 95%, Z is 1.96, in the case of 95.45% Z is 2, in the case of 99% Z is 2.58 and
in the case of 99.73% Z is 3.
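The percentages in the properties above come from the normal curve itself and can be reproduced with the error function in Python's math module (a sketch; erf(z/√2) gives the proportion of a normal distribution lying within z standard deviations of the mean):

    import math

    # Percentage of a normal distribution within mean ± z·SD
    def coverage(z):
        return math.erf(z / math.sqrt(2)) * 100

    for z in (1, 1.96, 2, 2.58, 3):
        print(z, round(coverage(z), 2), "%")
    # 1 -> 68.27 %, 1.96 -> 95.0 %, 2 -> 95.45 %, 2.58 -> 99.01 %, 3 -> 99.73 %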
The diagrams given below give an account of hypothetical data on the heights in cm of 1000 males. The second diagram is a histogram upon which a normal curve is superimposed. Look carefully at Mean ± Z × SD.
The curve may not be symmetrical in some instances. It may take many shapes, but two are important to remember.
Diagram A:
Diagram A shows a distribution skewed to the right (positively skewed); diagram B shows a distribution skewed to the left (negatively skewed).
Now we have three types of curves.
1. Symmetrical Curve (Standard Normal Curve)
2. Curve Skewed on the right
3. Curve Skewed on the left
1. Symmetrical Curve:
Suppose students appear in a test on subject K. The data show that very few students score less than 10%. As the scores increase, the number of students increases, until a stage is reached where the scores are around 50%, and most students score around that; that is the mode of the data. By the properties of the normal curve it is also the mean and the median. As scores increase further, the number of students keeps decreasing until we reach the students scoring around 90%, who, you can appreciate, will be very few. This type of distribution of scores is a normal distribution. Most biological values are distributed like this, e.g. pulse rate, blood pressure, hemoglobin value, etc.
2. Curve Skewed to the Right (Positively Skewed):
Suppose students appear in a test on subject L. We observe that many students have scores on the lower side; that will be the mode of the distribution. (We know that the mode is not affected by extreme values at all.) Next to the mode, to its right, will be the median, as it is less affected by extreme values. The mean will be furthest to the right, where the few extreme values lie. On the right of the curve will be the students with higher scores, fewer in number. This distribution is skewed to the right, which means that most students scored low and a few scored high marks. Wealth is generally distributed like this.
3. Curve Skewed to the Left (Negatively Skewed):
Suppose students appear in a test on subject M. Very few students scored low and most scored high; therefore the mode of the distribution falls on the extreme right. To its left is the median, and further left still the mean. This implies that most of the students did well in the test on subject M and only a few lagged behind. The distribution of hemoglobin values in children is skewed to the left.
Note: Skew means tail. Skew is said to be to the side where the tail of the distribution is.
One main purpose of statistics is to estimate. Estimation means generalizing to a bigger phenomenon by actually looking at a part of it; that is, making statements about a population (which is not fully examined) on the basis of a part of it that is actually examined. In other words, we extrapolate our sample data to the population from which the sample is drawn. Do not forget that, to be able to make such statements, the sample has to be representative of the population it is drawn from, and for a sample to be representative the data have to be collected in a random manner.
From our data of pulse rates, for which we have calculated the mean and standard deviation:
72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75
Mean = 76
SD = 10.5
n = 15
1. 68.26% of the values are within Mean ± 1SD, i.e. between 65.5 and 86.5
2. 95% of the values are within Mean ± 1.96SD, i.e. between 55.4 and 96.6
3. 95.45% of the values are within Mean ± 2SD, i.e. between 55 and 97
4. 99% of the values are within Mean ± 2.58SD, i.e. between 48.9 and 103.1
5. 99.73% of the values are within Mean ± 3SD, i.e. between 44.5 and 107.5
These are confidence limits for the sample. Number 1 gives the 68.26% confidence limits; number 2 the 95% confidence limits; number 3 the 95.45% confidence limits; number 4 the 99% confidence limits; and number 5 the 99.73% confidence limits. This means that you can state, with a certain percentage of confidence, in what range the values within your sample fall. But do not forget that these confidence limits are for your sample, not for the population from which the sample is drawn.
The upper limit of the range is the upper confidence limit; the lower limit is the lower confidence limit. Between the upper and lower confidence limits lies the CONFIDENCE INTERVAL. The 95% confidence limits for a sample imply that 95% of the observations in the sample lie within this range, which in the case of our data is 55.4 to 96.6. It also means that 5% of observations may lie outside these limits, either below the lower confidence limit or above the upper confidence limit.
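These limits are plain arithmetic and can be verified in a loop (a sketch with the sample figures above):

    mean, sd = 76, 10.5

    # Sample confidence limits: mean ± z × SD
    for level, z in [("68.26%", 1), ("95%", 1.96), ("95.45%", 2),
                     ("99%", 2.58), ("99.73%", 3)]:
        print(level, round(mean - z * sd, 1), "to", round(mean + z * sd, 1))
    # 68.26% 65.5 to 86.5 ... 99.73% 44.5 to 107.5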
Such calculations are of little use as long as we do not know the population mean (µ) and population standard deviation (σ); and if, after all, we did know the population mean and standard deviation, what would be the need of all this exercise? Therefore, we have to estimate the population parameters, especially the standard deviation (σ) of the population. This is where the Standard Error comes in.
STANDARD ERROR:
The Standard Error of the mean is the standard deviation of the distribution of sample means; it tells us how far a sample mean is likely to lie from the population mean. It is neither the true standard deviation of the population nor an error in the literal sense.
To understand the concept of Standard Error, take the example of our data of pulse rates, which have a mean of 76 per minute. If we drew repeated samples from the same population and computed the means of all the samples, we would have a distribution of sample means, just as one sample has a distribution of individual pulse-rate values. The standard deviation of that distribution of means is the standard error.
Statistics provides us with a formula to calculate the SE without going through that cumbersome exercise:

Standard Error (of the Mean) = SD / √n = 10.5 / √15 = 2.7

Now we have the standard error of the mean, 2.7. Based on it we can calculate confidence limits for the population in exactly the same way as for the sample, but substituting the standard error of the mean for the standard deviation of the sample.
Confidence Limits based on standard error of a mean are confidence limits for population and hence
an estimation of population situation based on sample. But remember that the actual explanation of
confidence limits calculated on the basis of standard error of mean is a little bit different from the
explanation of confidence limits calculated on the basis of actual standard deviation and mean of
population if known.
CONFIDENCE LIMITS BASED ON STANDARD ERROR OF MEAN: 95% confidence limits mean that if we draw many samples from the same population, 95% of the time the sample means will fall within these limits. But in practice we interpret them much as if we knew the actual mean and standard deviation of the population.
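A sketch of the calculation, using the large-sample (Z) form for the 95% limits (the text further below uses t for this small sample, which widens the limits slightly):

    import math

    mean, sd, n = 76, 10.5, 15

    se = sd / math.sqrt(n)        # standard error of the mean
    print(round(se, 1))           # 2.7

    # 95% confidence limits for the population mean, large-sample form
    print(round(mean - 1.96 * se, 1), "to", round(mean + 1.96 * se, 1))
    # 70.7 to 81.3 beats/minute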
CONFIDENCE LIMITS FOR A PROPORTION: We can also calculate confidence limits for a proportion, using the formula for the standard error of a proportion.
Example: If 55 people out of a randomly selected sample of 440 persons in district Kohistan have iodine deficiency, the 95% and 99% confidence limits are found as under:

SE of Proportion = √( p × q / n )

where p is the sample proportion with the attribute = 55/440 = 12.5%, and q = 100 − p = 87.5%.

SE of Proportion = √( 12.5 × 87.5 / 440 ) = 1.57%

95% confidence limits: 12.5 ± 1.96 × 1.57 = 12.5 ± 3.08, i.e. 9.42% to 15.58%
99% confidence limits: 12.5 ± 2.58 × 1.57 = 12.5 ± 4.05, i.e. 8.45% to 16.55%

This means that if we drew repeated samples from the population of Kohistan, 99% of the samples would have a proportion of iodine-deficient people between 8.45% and 16.55%.
Confidence limits for a proportion imply the same as confidence limits for a mean. If we increase the sample size, the standard error decreases and consequently the confidence interval contracts.
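A sketch of the same calculation; computed without intermediate rounding, the standard error is 1.58 rather than 1.57, which shifts the limits by a few hundredths of a percent:

    import math

    p = 55 / 440 * 100            # 12.5% iodine-deficient
    q = 100 - p                   # 87.5%
    n = 440

    se = math.sqrt(p * q / n)     # standard error of the proportion
    print(round(se, 2))           # 1.58

    for level, z in [("95%", 1.96), ("99%", 2.58)]:
        print(level, round(p - z * se, 2), "to", round(p + z * se, 2))
    # 95% 9.41 to 15.59,  99% 8.43 to 16.57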
95% Confidence Interval (CI) is defined as: The range of mean values or proportions within which
there are 95 chances out of 100 that the true population mean or proportion will fall
If we calculate the standard error for our data of pulse rates using the n − 1 correction, it is 10.5 / √14 = 10.5 / 3.74 = 2.8.
The 95% confidence limits are Mean ± t × SE. To find the value of t we refer to the t table. First we calculate the degrees of freedom (DF), which are n − 1; our n is 15, hence DF = 15 − 1 = 14. Referring to the t table at 14 DF, we find that the value of t at 0.05 (which corresponds to 95% confidence limits) is 2.14.
95% confidence limits: Mean ± t × SE = 76 ± 2.14 × SE = 76 ± 2.14 × 2.8 = 76 ± 5.99
In the same way we can calculate the 99% confidence limits. Referring to the t table at 14 DF and 0.01 (which means 99% confidence limits), we find the value of t to be 2.98. We substitute 2.98 for 2.14 in the previous calculation to get the 99% confidence limits. (Do it yourself.)
Note: t is larger than Z (2.14 > 1.96 in the case of the 95% CL, and 2.98 > 2.58 for the 99% CL), but beyond 120 DF t becomes practically equal to Z.
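A sketch of the t-based limits; the t values are taken from the t table as quoted above rather than computed:

    import math

    mean, sd, n = 76, 10.5, 15
    df = n - 1                          # 14 degrees of freedom

    se = round(sd / math.sqrt(df), 1)   # 10.5 / 3.74 = 2.8, as in the text

    # t-table values at 14 DF: 2.14 at the 0.05 level, 2.98 at the 0.01 level
    for label, t in [("95%", 2.14), ("99%", 2.98)]:
        print(label, round(mean - t * se, 2), "to", round(mean + t * se, 2))
    # 95% 70.01 to 81.99 (76 ± 5.99),  99% 67.66 to 84.34 (76 ± 8.34)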
We may wish to compare two or more populations and determine whether, with regard to some observations, they differ significantly or the differences are merely due to chance, or more precisely to sampling error. We know that means of samples, even from the same population, may differ; to what extent remains the question, and it has to be answered through significance testing, or hypothesis testing.
While comparing two or more samples we may have a hypothesis, called the research hypothesis. Such a hypothesis may state that there is a difference, or otherwise; we have to test it on the basis of the collected data.
NULL HYPOTHESIS: It states that the different sets of data belong to one population and the observed
differences are by chance. In other words;
A=B
ALTERNATIVE HYPOTHESIS: It states that the different sets of data belong to different populations and
the differences are statistically significant and are not due to chance. In other words it means;
A≠B
SIGNIFICANCE TESTING:
To test hypotheses, i.e. to assess significance, we perform different statistical tests in different situations. For data which are normally distributed we use tests such as the Z-test and Student's t-test.
Tests applied to normally distributed data are called parametric tests, because they are applied to data that have parameters such as a Mean and a Standard Deviation. Parametric data consist of continuous variables.
For data which are not normally distributed, i.e. data that are not parametric, we use non-parametric tests. Non-parametric data consist of nominal or ordinal variables.
The following tests, among many more, are used for non-parametric data and hence are called non-parametric tests:
i. Chi-square (X²) test
ii. Fisher's exact probability test
iii. Wilcoxon Rank Sum and Signed Rank tests
iv. Mann-Whitney U test
NOTE: Parametric tests are more sensitive than non-parametric tests. It is also important to note that the data must be collected randomly for the tests to be meaningful.
5% LEVEL OF SIGNIFICANCE (p = 0.05): A level of probability at which the Null hypothesis is rejected
if an obtained sample difference occurs by chance only 5 times or less out of 100.
1% LEVEL OF SIGNIFICANCE (p = 0.01): A level of probability at which the Null hypothesis is rejected
if an obtained sample difference occurs by chance only 1 time or less out of 100.
We will discuss Normal distribution test (Z-test) and Chi-square tests only.
Example:
If we want to compare the weights of 1st-year and final-year girl students, we collect data randomly. After collection and computation we have summary figures for the two classes; the standard deviations (4 and 5) and sample sizes (32 and 27) enter the calculation as under:
By using the Z-test:

Z = (X̄1 − X̄2) / SE

SE (of the difference between two means) = √( SD1²/n1 + SD2²/n2 )

SE = √( (4)²/32 + (5)²/27 ) = √( 16/32 + 25/27 ) = 1.2
Remember that the difference will be statistically significant if Z is more than 1.96 at the 5% level, and more than 2.58 at the 1% level (please refer to the properties of the normal distribution).
Our data show a significant difference at both the 5% and the 1% level. Hence we can state that there is a statistically significant difference between the 1st-year and final-year girl students with regard to their weights, at both the 5% and the 1% significance level.
Note: One has to determine the significance level during the planning stage of the study.
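The summary table for this example has not survived in the text above, so the sketch below uses HYPOTHETICAL means of 52 kg and 48 kg purely to make the arithmetic concrete; only the standard deviations (4 and 5) and the sample sizes (32 and 27) come from the example:

    import math

    # Means of 52 and 48 are hypothetical stand-ins (the original summary
    # table is not reproduced above); the SDs and n's are from the example
    m1, sd1, n1 = 52, 4, 32
    m2, sd2, n2 = 48, 5, 27

    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (m1 - m2) / se
    print(round(se, 1), round(z, 2))   # 1.2  3.35 -> |Z| > 2.58: significant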
For comparing two proportions:

Z = (p1 − p2) / SE

SE (of the difference between two proportions) = √( p1×q1/n1 + p2×q2/n2 )
Example:
If we collect some data randomly with the following observations:
13 out of a sample of 63 fourth-year students are obese, and 17 out of 61 third-year students are obese. Is there a statistically significant difference between 4th- and 3rd-year students with regard to the frequency of obesity, or are the observed differences due to chance?
p1 = 13/63 = 20.6%, q1 = 79.4%; p2 = 17/61 = 27.9%, q2 = 72.1%

SE = √( p1×q1/n1 + p2×q2/n2 ) = √( 20.6×79.4/63 + 27.9×72.1/61 ) = 7.7

Z = (p1 − p2) / SE = (20.6 − 27.9) / 7.7 = −7.3 / 7.7 = −0.94
As the absolute value of Z is less than 1.96, we can say at the 5% significance level that there is no statistically significant difference between 4th- and 3rd-year students with regard to obesity, and that the observed differences are due to chance.
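The whole comparison is a few lines of Python (a sketch reproducing the arithmetic above):

    import math

    p1, n1 = 13 / 63 * 100, 63    # 20.6% of 4th-year students obese
    p2, n2 = 17 / 61 * 100, 61    # 27.9% of 3rd-year students obese
    q1, q2 = 100 - p1, 100 - p2

    se = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)
    z = (p1 - p2) / se
    print(round(se, 1), round(z, 2))   # 7.7  -0.94 -> |Z| < 1.96: not significant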
CHI-SQUARE TEST:
The Chi-square Test (X²) is applied to non-parametric data, but the data must be collected randomly. It shows association between two or more variables. There are many ways to compute X², but we will discuss only the 2×2 contingency table (two-way chi-square test); the one-way chi-square is discussed later as an example of hypothesis testing.
Suppose a researcher has invented a new vaccine for measles and claims that it prevents measles. He randomly selects two groups of children; one group is inoculated with the vaccine, the other is left as such, and the outcome is observed. His observations are recorded in the table below.

Group            Developed measles   Did not develop measles   Total
Vaccinated       18 (1)              118 (2)                   136
Not vaccinated   22 (3)              208 (4)                   230
Total            40                  326                       366

Note: The numbers given in brackets are the cell numbers of the table, from 1 to 4.
Chi-square (X²) = Σ (O − E)² / E

O = observed frequencies
E = expected frequencies, where E = (Row Total × Column Total) / Grand Total

Cell No   Observed (O)   Expected (E)             O − E   (O − E)²   (O − E)²/E
1         18             136×40/366  = 14.9       +3.1    9.61       0.64
2         118            136×326/366 = 121.1      −3.1    9.61       0.07
3         22             230×40/366  = 25.1       −3.1    9.61       0.38
4         208            230×326/366 = 204.9      +3.1    9.61       0.05

X² = Σ (O − E)² / E = 1.14
Like t table, X2 table also has got degrees of freedom (DF). To calculate degrees of freedom (DF) we have
to multiply number of rows minus 1 with number of columns minus 1.
DF = (c–1) (r–1)
In our case there are two columns and two rows (excluding captions and totals)
DF = (2–1) (2–1)
DF = (1) (1)
DF = 1
At 1 DF and 0.05 (5% significance level), the table value of X² is 3.84. Our computed X² is 1.14, which is less than 3.84. Hence we can say, at the 5% significance level, that there is no statistically significant difference between group 1 and group 2; in other words, the vaccine had no different effect on the vaccinated children compared with those who were not vaccinated against measles.
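The computation generalizes to any 2×2 table (a sketch; note that carrying full precision in the expected values gives X² = 1.18 rather than the hand-rounded 1.14, with the same conclusion):

    # Observed 2x2 table from the vaccine example
    #                 measles   no measles
    observed = [[18, 118],    # vaccinated   (row total 136)
                [22, 208]]    # unvaccinated (row total 230)

    row_totals = [sum(row) for row in observed]         # 136, 230
    col_totals = [sum(col) for col in zip(*observed)]   # 40, 326
    grand = sum(row_totals)                             # 366

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed[i][j] - expected) ** 2 / expected

    print(round(chi2, 2))   # 1.18 < 3.84 (critical value at 1 DF): not significant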
In hypothesis testing we state two hypotheses:
1. Null Hypothesis
2. Alternative Hypothesis
We either accept the Null Hypothesis or reject it. When we accept the Null Hypothesis we reject the Alternative Hypothesis; when we reject the Null Hypothesis we accept the Alternative Hypothesis.
STEPS:
Example: A forensic specialist collects data at random on medicolegal cases of injury, classified by the kind of weapon used, in a district over a period of one year.
1. State the Null Hypothesis: There is no difference between the types of weapons used in causing injuries.
2. Select the decision criterion α (or "level of significance"): We select a 5% significance level (p = 0.05). Conventionally a 5% level of significance is chosen; it can be more stringent, i.e. less than 5% (p < 0.05), but it is never more than 5%.
3. Establish the critical values: X2 table at p = 0.05 with degrees of freedom as will be
calculated.
4. Draw a random sample from the population and compute the required summary figures: here, a sample randomly drawn from the district.
5. Select appropriate statistical test and compute the value of the test statistic Z or t or X2 (as the
case may be).
Expected frequencies (E) are calculated by dividing the total frequency by the number of categories. The number of categories is 4 and the total is 508, so all expected frequencies equal 508/4 = 127.
6. Compare the calculated value of test statistic with the critical values of Z/t/X2, and then
accept or reject the null hypothesis.
Our calculated X² equals 108.5. The degrees of freedom in this case equal the number of categories minus one; there are four categories of weapons, therefore DF = 4 − 1 = 3. At 3 DF the table value of X² at 0.05 is 7.81. As our calculated value is greater than the table value, the difference among the weapons used in causing injuries is statistically significant and cannot be due to chance alone. Therefore we reject the Null Hypothesis and accept the Alternative Hypothesis.
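The original table of weapon counts is not reproduced above, so the sketch below uses HYPOTHETICAL counts that merely sum to the stated total of 508; the mechanics of the one-way test are the point, not the particular X² value:

    # Hypothetical observed counts for the four weapon types; only their
    # total (508) is taken from the text, so chi2 will not equal 108.5
    observed = [260, 120, 80, 48]

    expected = sum(observed) / len(observed)   # 508 / 4 = 127 per category
    chi2 = sum((o - expected) ** 2 / expected for o in observed)

    df = len(observed) - 1                     # 3
    print(round(chi2, 1), df)   # compare with the table value 7.81 at 3 DF, p = 0.05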
TEST RESULT vs ACTUAL SITUATION:

                             Ho True                              Ho False
Ho Accepted                  Correct                              Type II error (β), a false negative
Ho Rejected                  Type I error (α), a false positive   Correct
To avoid a Type I error we may lower our significance level, but that increases the chance of committing a Type II error. It is easy to avoid a Type I error, but avoiding a Type II error is not so simple; one way is to increase the sample size and so reduce sampling variation. The Power of a statistical test is defined as: the ability of the test to reject the null hypothesis when it is actually false and should be rejected.