Introduction to Biostatistics Concepts

The document discusses biostatistics and statistics. It defines key terms like population, sample, descriptive statistics, inferential statistics, sampling techniques. It explains that statistics is used extensively in medical science and all medical graduates must have knowledge of biostatistics. Common sampling techniques like simple random sampling, stratified random sampling and cluster sampling are described.


Topics covered

  • p-value
  • Research Methodology
  • Hypothesis Testing
  • Research Design
  • Sampling Techniques
  • Statistical Software
  • Significance Testing
  • Sample Size
  • Epidemiology
  • Medical Statistics

BIOSTATISTICS

Directorate of Research & Development

Khyber Medical University


Peshawar

Dr. Muhammad Salim Wazir


MBBS(Pesh), MSc(UK), [Link]
Assistant Professor
WE ARE ALL STATISTICIANS TO SOME EXTENT; BUT DO WE KNOW?

The mere mention of the word statistics can ring alarm bells in many minds, and more so in the minds of
medics. Medical professionals believe they are duty bound not to touch anything mathematical: they opted
for Biology instead of Mathematics at FSc level, so it would be a sin even to think of it. They erroneously
think of Statistics as a branch of Mathematics. No doubt mathematics is used extensively in statistics, but
so it is in other sciences like Physics, Chemistry, Agriculture, Engineering and many more. The subject is
further defamed for political reasons; for example, Mark Twain attributed to the Victorian-era British
Prime Minister Disraeli the saying "there are lies, damned lies and statistics". We readers immediately take
this to mean that the subject statistics is the worst type of lie. The facts are to the contrary: what Disraeli
meant by statistics was the facts and figures presented by governments, whereas the subject statistics is an
entirely different thing; it too deals with facts and figures, but in a scientific way.

The kind of thinking involved in statistics will not be entirely new to you. Indeed, you will find that many
of your day-to-day assumptions and decisions already depend on it. Suppose you are told that two adults are
sitting in the next room; one is five feet tall and the other is six feet tall. What would be your best guess as
to each one's sex, based on that information alone? You may be fairly confident in assuming that the
six-foot person is a man and the five-footer a woman. You could be wrong, of course, but experience tells
you that five-foot men and six-foot women are somewhat rare. You have noticed that, by and large, males
tend to be taller than females. Of course you have not seen all men or all women, and you recognize that
many women are taller than many men; nevertheless, you feel reasonably confident about generalizing from
the particular men and women you have known to men and women as a whole.

The above is a simple, everyday example of statistical thinking. There are many others. Any time you use
phrases like 'on average, I sleep 52 hours a week', 'we can expect a lot of rain at this time of year' or 'the
earlier you start revising, the better you are likely to do in the annual exam', you are making a statistical
statement, even though you may have performed no calculations.

There are many more things that are not known to us, and we may need information on some of them.
Without conducting a proper investigation we may remain oblivious to many important things, and to
conduct such an investigation we need knowledge of the subject STATISTICS. Medical science cannot
progress without making use of statistics; hence all medical graduates must possess first-hand knowledge
of it. Just think for a moment: how do we know that the hemoglobin level is 12-15 g/dl in adult males?
Have we measured the hemoglobin level of all men and women? Certainly not, yet we confidently diagnose
a man with a hemoglobin level of 10 g/dl as anemic. How do we feel confident about that diagnosis? You
will learn how as we discuss it in the following pages.

Statistics:
A subject that deals with the collection, compilation, presentation, analysis and interpretation of data

Statistics is of two kinds:
a). Descriptive Statistics: methods used to summarize or describe our observations.
b). Inferential Statistics: using those observations as a basis for making estimates or predictions, i.e.
inferences about a situation that has not yet been observed. Appropriate tests of significance are applied
in inferential statistics.
* Note that the word statistics used in everyday language means facts and figures, not the subject
statistics. The word statistic means a figure computed from actual observations (a sample).
Data:
Record of observations – facts and figures – any piece of information.
Information:
When data is processed and made meaningful

Dr. M. Salim Wazir, Asstt. Prof. (Biostatistics Notes) 1


Population:
Defined as “the whole set of things about which we want to know” In statistics population can be human
beings, potatoes, tomatoes, rice, chairs, tables, ECG machines, paracetamol tablets etc.

Sample:
A part taken out of the population for actual study

Census:
When all members of a population are examined

Sampling:
The procedure by which some members of the population are drawn for examination.

You will appreciate that populations are usually very big, and at times of infinite size. Time and other
resources are usually scarce; therefore, researchers almost always opt for sampling rather than conducting
a census. To draw meaningful information from our samples we would like them to be representative of the
population they are drawn from. We do not know the population, so how do we know that our sample is
representative of it? We have no foolproof method, but to be reasonably sure that our samples are
representative of the population they are drawn from, we must ensure that:
i. they are drawn randomly
ii. they are of adequate size

We can ensure samples are drawn randomly, but the size of a sample is usually dictated by the availability
of resources (time, men, money and material) rather than by statistical requirements.

Note: For a valid research paper an adequate sample size is an important prerequisite.

Sampling unit:
The individuals or households that are actually studied.

Sampling Frame:
The list of sampling units is called sampling frame.

Sampling Techniques:
In an ideal world we need to have a list of all the members of the population and then draw a sample by
method of say lottery. But just imagine that if want to know about the heights of adult males in district
Abbottabad and we want to draw a sample of 1000 adult males. There may be more than 300000 adult
males in district Abbottabad. Do we have any method where we can obtain a list of all adult males of
district Abbottabad? We surely don’t have a complete list of adult males of district Abbottabad. Similarly,
most of the populations we encounter don’t have complete lists. Therefore, we need to look for other
sampling techniques.

Random Samples:
Samples in which each and every member of the population has an equal chance of selection.

There are two types of sampling:

a. Probability Sampling
b. Non-probability Sampling
a. Probability samples are those in which members of the population have a known, though not
necessarily equal, chance of being selected as sample members. With this technique, inferential
statements can be made from samples. In probability sampling a sampling frame is available in
some shape. Some such techniques are as under:


Simple Random Sample: Drawing a sample from a population by a random method, e.g. lottery,
which gives every individual in the population an equal and independent chance of appearing in
the sample; for example, selecting 20 students at random from the population of students of Ayub
Medical College.
Stratified Random Sampling: Drawing a sample from a population which has first been divided
into sub-groups or strata. From each sub-group a sample is drawn by a random method which
gives every individual in the sub-group an equal and independent chance of appearing in the
sample, e.g. the students of Ayub Medical College are first divided into five classes (1st, 2nd, 3rd,
4th and 5th years) and then, say, four students (or a specific percentage) from each class are
randomly selected.

Multi-stage Sampling: A process of sampling a population in a series of consecutive steps, e.g. a
town may be divided into a number of areas and a number of those areas drawn by a random
process; within these drawn areas the schools may be listed and a number of these schools drawn
by a random process. The pupils within these schools are then the sample to be examined (or a
further stage of sampling can be applied by randomly selecting a sample of the pupils).

Cluster Sampling: Cluster sampling is akin to multi-stage sampling in so far as the town is
consecutively subdivided. Cluster sampling is adopted when there is no sampling frame from
which the final sample can be selected. In this type the researcher combs the area meticulously to
find the items needed to form that area's sample, e.g. to know the vaccination status of children
under two years, a researcher first selects a district, then a Union Council, then villages and then a
Mohallah, all randomly. He then looks for seven or more houses located together that have
children less than two years of age; such seven or more houses located together are known as
clusters.

Systematic Sampling: Drawing a sample from a population by a systematic procedure, e.g.
selecting every 4th student entering the classroom.
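The probability techniques above can be sketched with Python's random module. This is an illustrative sketch only: the 100-student roll-number frame and the five classes of 20 are made-up stand-ins for a real sampling frame.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical sampling frame: roll numbers 1..100.
students = list(range(1, 101))

# Simple random sample: each student has an equal, independent chance.
simple = random.sample(students, 10)

# Systematic sample: every 4th student "entering the classroom".
systematic = students[3::4]  # the 4th, 8th, 12th, ... students

# Stratified random sample: divide the frame into five strata (classes)
# of 20 students each, then draw four randomly from every stratum.
strata = [students[i:i + 20] for i in range(0, 100, 20)]
stratified = [s for stratum in strata for s in random.sample(stratum, 4)]

print(len(simple), len(systematic), len(stratified))
```

Multi-stage sampling would simply repeat `random.sample` at each level (areas, then schools, then pupils).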

Non-probability Sampling: Such samples are putatively non-representative, and no probability
statements can be made from them because, unlike probability sampling, the population members
have no known chance of being selected as sample members.
a. Purposive Sampling: Sampling on the basis of a pre-determined idea, for example selection of
all diabetic patients.
b. Convenience Sampling: Also known as accidental sampling. Members are selected according
to the convenience of the researcher; for example, it is convenient to ask the frontbenchers,
and hence they become the sample.
c. Quota Sampling: Similar to cluster sampling, or sometimes to stratified random sampling,
except that, compared with cluster sampling, the researcher does not have to comb the whole
area: only a predetermined number, termed the quota, is examined. Compared with stratified
random sampling, the strata are defined, but instead of selecting members through simple
random sampling they are selected conveniently until the quota is filled.
d. Snow Ball Sampling: If we wish to have a sample of drug addicts we won't be able to find
many. Hence we investigate the first addict and through him/her reach other addicts; thus the
sample is accumulated like a snowball.

Note: if the sample is not randomly selected it cannot be representative of the population under
study; this produces SYSTEMATIC ERROR or BIAS. Therefore, to decrease bias, the sampling
technique is an important consideration; to decrease the play of chance, the sample size is the
consideration.

Variable:
An attribute or characteristic that varies from one individual to another.

Types of Data:
Data consists of variables. We need to know different types of variables because different statistical
techniques are employed to analyze different variables.


Variables
├── Category
│   ├── Nominal
│   └── Ordinal
└── Quantity
    ├── Discrete (counting)
    └── Continuous (measuring)
        ├── Interval
        └── Ratio

Category (Nominal):
Observations have names only, for example male/female or black/white/yellow/brown. There is no order or
ratio. If nominal data has only two groups, e.g. male/female, it is called dichotomous data.

Category (Ordinal):
When data is placed into a meaningful order. Students may be ranked 1st, 2nd, 3rd etc.; however, the
interval between ranks is not certain.

Quantity (Discrete):
When items can be counted, e.g. the number of children a woman gives birth to: there can be 1, 2, 3, 4 or
even more, but not 2.6.

Quantity/Continuous (Interval):
Such data can be placed in meaningful order and, in addition, the intervals between values are equal and
meaningful. Temperature on the Celsius scale is interval data; but a temperature of 10°C is not twice as hot
as a temperature of 5°C, because the Celsius scale has no absolute zero.

Quantity/Continuous (Ratio):
In such data the intervals have meaningful ratios e.g. a student weighing 80kg is twice as heavy as a student
of 40kg.

Another classification of variables is into independent and dependent variables. They are used when we
compare variables: independent variables are presumed causes and dependent variables are presumed
effects. The incidence of common cold in different seasons illustrates this: season is the independent
variable and common cold the dependent variable.

Compilation of Data:
Once data is collected it can be organized for further processing.

Frequency Distribution:
The collected data can be presented in tabular or graphic form after organizing it to show the frequencies
of the different observations. Data can also be organized into groups, called a grouped frequency
distribution. Suppose we collect data on the pulse rates/min of 15 students as under:

72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75


The frequency distribution is as under:

Pulse Rate No. (Frequency)


62 1
66 1
67 1
69 1
72 1
73 3
75 1
76 1
78 1
80 1
82 1
86 1
108 1
Total 15

Frequency Distribution of the grouped data could be as under:

Pulse Rate (Class Interval)   Frequency   Five-Bar Gate (tally)
61-70                         4           ////
71-80                         8           //// ///
81-90                         2           //
91-100                        0
101-110                       1           /
Total                         15

Note: For tallying observations the FIVE-BAR GATE (tallying) method shown in the third column can be
used; //// (four strokes crossed by a fifth) is called a five-bar gate.
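The tallying above can be reproduced in a few lines of Python, using the same 15 pulse-rate observations; the `(p - 61) // 10` class-index trick is our own shorthand for the 61-70, 71-80, ... intervals.

```python
from collections import Counter

pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

# Ungrouped frequency distribution: how often each exact value occurs.
freq = Counter(pulse)

# Grouped frequency distribution: class index 0 -> 61-70, 1 -> 71-80, ...
grouped = Counter((p - 61) // 10 for p in pulse)
for k in range(5):
    lo, hi = 61 + 10 * k, 70 + 10 * k
    print(f"{lo}-{hi}: {grouped.get(k, 0)}")
```

Running this reproduces the table's counts of 4, 8, 2, 0 and 1 for the five class intervals.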

Description of Data:
Once the data is collected we need to know the following:

i. Central tendencies of data
ii. Dispersion of data

Central Tendencies of Data:

Measured by Mean, Median and Mode.

Mean (Arithmetic Mean): It is defined as the sum of observations divided by the number of observations:

    Mean:  X̄ = Σx / n

    Σx = sum of observations in a sample
    n  = number of observations in a sample


Remember that the mean of a sample is denoted by X̄ (pronounced "X bar") and the mean of a population
is denoted by µ (pronounced "mew"):

    Mean:  µ = ΣX / N

    µ  = mean of a population
    ΣX = sum of values in a population
    N  = number of values in a population

To calculate the Mean from the data of pulse rates of students:

    Sum of all observations   Σx = 1140
    Number of observations    n  = 15

    X̄ = Σx / n = 1140 / 15 = 76 beats/minute

To calculate the mean of grouped data, the following method is adopted:

Pulse Rate   Frequency (F)   Midpoint (M)   F x M
61-70        4               65.5           262
71-80        8               75.5           604
81-90        2               85.5           171
91-100       0               95.5           0
101-110      1               105.5          105.5
Total        n = 15                         Σx = 1142.5

    X̄ = Σx / n = 1142.5 / 15 = 76.17 beats/min

In the above table the midpoints are calculated as (61 + 70)/2 = 65.5, (71 + 80)/2 = 75.5, and so on.

The mean calculated in this way is not exactly the same as that calculated by adding the individual
observations, but it is close; in the case of very large data sets it comes even nearer to the actual value.
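Both calculations can be checked with a short Python sketch; the grouped frequencies and midpoints are the ones worked out above.

```python
pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

# Mean from raw observations: sum of observations over their number.
mean = sum(pulse) / len(pulse)

# Mean from the grouped table: sum of (frequency x class midpoint) over n.
freqs = [4, 8, 2, 0, 1]                  # classes 61-70, 71-80, ..., 101-110
midpoints = [65.5, 75.5, 85.5, 95.5, 105.5]
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

print(mean, round(grouped_mean, 2))  # 76.0 and approximately 76.17
```

The small gap between 76 and 76.17 is exactly the grouping error discussed in the text.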

Advantages of Mean:

i. It represents all the values in a distribution
ii. It can be used in further statistical computations

Disadvantages:
i. It is affected by extreme values
ii. Sometimes it can give a ridiculous figure e.g. 2.35 children, 1.13 eggs etc

Mean is used for continuous data


Median: The centre value of a series of observations when the observations are ranked in order from the
lowest value to the highest. The median divides the distribution into two equal halves.

n 1
Position of Median =
2
Using the same data

We first arrange the observations into an order from lowest to highest

62, 66, 67, 69, 72, 73, 73, 73, 75, 76, 78, 80, 82, 86, 108

    Position of Median = (n + 1)/2 = (15 + 1)/2 = 16/2 = 8

The 8th value is the median, which is 73. You can see that there are seven values below 73 and an equal
number, seven, above it. In the data shown n = 15, an odd number. If n = 16, an even number, then:

    Position of Median = (16 + 1)/2 = 8.5

A position of 8.5 means the average of the 8th and 9th observations. If, for example, the 8th value were 73
and the 9th were 75, then:

    Median = (73 + 75)/2 = 148/2 = 74 beats/minute

In this case the median may not be an actually observed value.
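A small Python sketch of the procedure just described, covering both the odd-n case (pick the middle value) and the even-n case (average the two middle values):

```python
pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

ordered = sorted(pulse)            # rank from lowest to highest
n = len(ordered)
position = (n + 1) / 2             # 8.0 here: the 8th ranked value

if position.is_integer():
    median = ordered[int(position) - 1]      # lists index from 0
else:
    lo = ordered[int(position) - 1]          # value just below the midpoint
    hi = ordered[int(position)]              # value just above the midpoint
    median = (lo + hi) / 2

print(median)  # 73
```

Python's standard `statistics.median` implements the same rule.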

Advantages of Median:
It is not affected by extreme values; therefore it is used for data that is skewed, i.e. contains extreme
observations.

Disadvantages:
1. It does not take into account all the values of a distribution
2. It is of limited value in further statistical computation

Median can be used in Ratio, Interval and Ordinal data.

Mode: The most frequently observed value in a distribution is known as the mode. In the aforementioned
data the Mode is 73 beats per minute, which appears three times.

1. Mode can be used for all types of data.
2. Mode is not affected at all by extreme values.
3. Mode is of no value in further statistical computations.
4. Mode does not take into account all the values in a distribution.

Some distributions may have two modes; these are called bimodal distributions. If there are more than two
modes, the distribution is known as multimodal.
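A Python sketch that finds the mode, and detects bimodal or multimodal data by collecting every value tied for the highest count:

```python
from collections import Counter

pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

counts = Counter(pulse)
top = max(counts.values())   # the highest frequency observed
# One value at the top count -> unimodal; two -> bimodal; more -> multimodal.
modes = sorted(v for v, c in counts.items() if c == top)

print(modes)  # [73]
```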

Note: Mean, Median and Mode have the same units as the observations, and the unit must be quoted with
the resultant value, e.g. the Mean is 76 beats per minute.

Measures of Dispersion/Variation

1. Range
2. Mean Deviation

3. Variance
4. Standard Deviation
5. Coefficient of Variation

1. Range: It is the difference between the highest and lowest observations.

In the data

72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75

the range is 108 − 62 = 46 beats per minute; it may also be quoted as 62-108 beats per minute.

The range is a good measure of dispersion when we want to know immediately how the data are spread, but
it takes into account only the lowest and highest values of a distribution. Therefore, it is not a good
measure of the dispersion of data.

3. Variance: equal to the sum of squared deviations of the observations from the mean of the distribution,
divided by the number of observations:

    Variance = Σ(X − X̄)² / n

We arrange the data in the following table:

Heart Rate (X)   Mean (X̄)   Deviation from Mean (X − X̄)   Squared Deviation (X − X̄)²
62               76          -14                             196
66               76          -10                             100
67               76          -9                              81
69               76          -7                              49
72               76          -4                              16
73               76          -3                              9
73               76          -3                              9
73               76          -3                              9
75               76          -1                              1
76               76          0                               0
78               76          +2                              4
80               76          +4                              16
82               76          +6                              36
86               76          +10                             100
108              76          +32                             1024
n = 15                       Σ(X − X̄) = 0                  Σ(X − X̄)² = 1650

    Variance = Σ(X − X̄)² / n = 1650 / 15 = 110

We square the deviations to get rid of the negative signs, but by squaring the values we lose the original
units. Therefore, variance is of limited value in measuring the dispersion of data.
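The variance computation can be verified in Python with the same 15 heart-rate observations:

```python
pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

n = len(pulse)
mean = sum(pulse) / n                          # 76 beats/min

# Sum of squared deviations from the mean, then divide by n.
sum_sq_dev = sum((x - mean) ** 2 for x in pulse)
variance = sum_sq_dev / n

print(sum_sq_dev, variance)  # 1650.0 110.0
```

Note that the raw deviations sum to zero (the negatives cancel the positives), which is exactly why they are squared first.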


4. Standard Deviation:
The most useful measure of dispersion, which can also be used in further statistical computations. It is the
square root of the sum of squared deviations of the observations from the mean of the distribution, divided
by the number of observations:

    SD = √Variance        Variance = (Standard Deviation)² = SD²

    SD = √( Σ(x − x̄)² / n )

Using the table already prepared for calculating the variance:

We know that Σ(x − x̄)² = 1650 and n = 15, so:

    SD = √( Σ(x − x̄)² / n ) = √(1650 / 15) = √110 = 10.5 beats/min

By squaring the deviations we got rid of the negative signs but lost the original units; applying the square
root restores the original units.


Direct Formulas:

    A.  SD = √( (Σx² − (Σx)² / n) / n )

    B.  SD = √( Σx²/n − x̄² )

Formula A:

Using the same data we can calculate the standard deviation by the above formula. Note that:

    Σx²   means the observations are first squared, then added;
    (Σx)² means the observations are first added, then the sum is squared.
Observations (X)   Observations Squared (X²)
62                 3844
66                 4356
67                 4489
69                 4761
72                 5184
73                 5329
73                 5329
73                 5329
75                 5625
76                 5776
78                 6084
80                 6400
82                 6724
86                 7396
108                11664
Σx = 1140          Σx² = 88290

    SD = √( (Σx² − (Σx)²/n) / n )
       = √( (88290 − (1140)²/15) / 15 )
       = √( (88290 − 1299600/15) / 15 )
       = √( (88290 − 86640) / 15 )
       = √( 1650 / 15 )
       = √110 = 10.5 beats/min


Using Direct Formula B, which is

    SD = √( Σx²/n − x̄² )

    SD = √( 88290/15 − (76)² ) = √( 5886 − 5776 ) = √110

    SD = 10.5 beats/min

The first formula, SD = √( Σ(x − x̄)²/n ), is usually used for small data sets; the other formulas are used
for large ones. If your data consist of fewer than 30 observations, the two formulas can be amended as
under for correction:

    1.  SD = √( Σ(x − x̄)² / (n − 1) )

    2.  SD = √( (Σx² − (Σx)²/n) / (n − 1) )

The last formula, SD = √( Σx²/n − x̄² ), has no such correction and should be used only for data of more
than 30 observations.
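All three formulas can be checked against each other in Python, here in the population form (dividing by n) used in the worked example:

```python
import math

pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]
n = len(pulse)
mean = sum(pulse) / n

# Definition: square root of the mean squared deviation.
sd_def = math.sqrt(sum((x - mean) ** 2 for x in pulse) / n)

# Direct formula A: sqrt((sum of x^2 minus (sum of x)^2 / n) / n).
sum_x, sum_x2 = sum(pulse), sum(x * x for x in pulse)
sd_a = math.sqrt((sum_x2 - sum_x ** 2 / n) / n)

# Direct formula B: sqrt(mean of squares minus square of the mean).
sd_b = math.sqrt(sum_x2 / n - mean ** 2)

print(round(sd_def, 1), round(sd_a, 1), round(sd_b, 1))  # all 10.5
```

Replacing `/ n` with `/ (n - 1)` in the first two gives the small-sample corrected versions described above.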

The use of standard deviation in statistical data is explained with the Normal Distribution.

Note: the Standard Deviation of a sample is denoted by SD, and the Standard Deviation of a population is
denoted by the Greek small letter sigma, σ.
5. Co-efficient of Variation:
Measures variability in relation to the mean and offers a method by which one can compare the relative
dispersion of one type of data with the relative dispersion of another type of data.

Our data of heart beats per minute have a co-efficient of variation as under:

    Co-efficient of Variation = (SD / Mean) x 100 = (10.5 / 76) x 100 = 13.8%

If we had also recorded the systolic blood pressures of the same individuals, with a mean systolic BP of
130 mmHg and a Standard Deviation of 13 mmHg, the co-efficient of variation would have been:

    Co-efficient of Variation of Systolic BP = (SD / Mean) x 100 = (13 / 130) x 100 = 10%

Now we can compare and conclude that, among the persons whose pulse rates and systolic blood pressures
were recorded, pulse rates are more variable than systolic blood pressure, since the co-efficient of variation
of the pulse rates is 13.8% and that of the systolic BP is 10%.
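A small sketch of this comparison; the systolic BP mean and SD are the illustrative figures from the text, and the helper name is ours:

```python
def coefficient_of_variation(sd, mean):
    """Express the SD as a percentage of the mean, so data sets measured
    in different units can be compared for relative spread."""
    return sd / mean * 100

cv_pulse = coefficient_of_variation(10.5, 76)   # pulse rate data
cv_bp = coefficient_of_variation(13, 130)       # systolic BP data

print(round(cv_pulse, 1), round(cv_bp, 1))  # 13.8 10.0
```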


INFERENTIAL STATISTICS

Inferential statistics means going beyond the actual observations and stating something, based on the
collected data, about what has not actually been observed. Here the theory of probability comes in.

Probability:
The number of events of interest occurring out of the total possible number of events is called probability.
If we flip a fair coin, the probability of a head is 1/2, or 50%, or 0.5. The probability of having either a
head or a tail is 1/1, or 100%, or 1. Two simple rules of probability need to be remembered.

Addition Rule:
For two or more mutually exclusive events, the probabilities of the individual outcomes add, and across all
possible outcomes the collective probability equals ONE, or 100%. For example, there are two possibilities
when we flip a fair coin: head or tail; we cannot have both on one flip. The probability of a head is 0.5, or
50%; therefore, according to the addition rule, the probability of a head or a tail is 0.5 + 0.5 = 1, or
50% + 50% = 100%.

Example: If the infant mortality rate is 80 per 1000 in Pakistan, then the probability of an infant dying is
80 per 1000, or 8 per 100, or 8%, or .08. The probability of an infant surviving is 920 per 1000 (i.e.
1000 − 80 = 920); it can also be said that the probability of an infant surviving is 92%, or .92. As a child
can either survive or die, and these are mutually exclusive phenomena, according to the addition rule the
probability of either dying or surviving is .08 + .92 = 1. (If the statement has OR in it, the addition rule is
applied.)

Multiplication Rule: For two or more independent and randomly occurring phenomena the probabilities
multiply. When we flip a fair coin it is an independent event, and flips of a fair coin two, three or more
times are all independent events. If the probability of getting a head on one flip is .5, then the probability
of heads on both of two flips is .5 x .5 = .25.

Example: If we know that 10% of patients visiting a medical OPD suffer from hypertension, the probability
of a patient having hypertension is .1. So the probability that the first two patients entering the OPD both
suffer from hypertension is .1 x .1 = .01, or 1%. This is called the multiplication rule.

Note: A probability of 1 is called unity. For the statement that one has to die eventually, the probability is
1; for one to stay alive forever, the probability is 0. Between 0 and 1 lie fractions of 1, which may run to
many decimal places for different events.

NORMAL DISTRIBUTION

Also known as the Gaussian Distribution.

The French-born mathematician Abraham de Moivre first conceived it, in the eighteenth century. It is an
important statistical distribution and a mathematical model of the frequency distribution of most biological
values in nature. Shown diagrammatically, the Standard Normal Distribution is drawn as a curve known as
the Normal Curve or Gaussian Curve.

NORMAL CURVE


On the X-axis are the values and on the Y-axis the frequencies of those values, as in a frequency
distribution. It is important to remember that the normal distribution is a probability distribution and
describes an idealized world. If our collected data tend to conform to the normal distribution, we make use
of it in statistical inference. The total probability of the frequencies of values under the curve is equal to 1,
or 100%. All the individual values under the curve have probabilities of occurrence (frequencies) ranging
between 0 and 1 (0% to 100%) and totalling 1.

PROPERTIES OF STANDARD NORMAL CURVE

1. It is bell shaped.
2. It is perfectly symmetrical.
3. Mean, Median and Mode coincide at the centre of the curve, i.e. the dome of the curve.
4. Half the values (50%) lie on each side when it is cut into half at the highest point.
5. It has two determinants: Mean (µ) and Standard Deviation (σ).
6. 68.26% of the values lie within the range Mean ± 1 SD (µ − 1σ to µ + 1σ); in other words, the
probability of a value occurring in this range is 68.26%, or .6826. This also implies that 31.74% of
values lie either below µ − 1σ or above µ + 1σ, i.e. the probability of a value occurring outside the
range is 31.74%, or .3174.
7. 95.45% of the values lie within Mean ± 2 SD (µ − 2σ to µ + 2σ), a probability of .9545; 4.55% of
values lie outside this range, a probability of .0455.
8. 99.73% of the values lie within Mean ± 3 SD (µ − 3σ to µ + 3σ), a probability of .9973; 0.27% of
values lie outside this range, a probability of .0027.

To elaborate further and make it useful, remember the following landmarks also:

a. 95% of the values lie within Mean ± 1.96 SD (µ − 1.96σ to µ + 1.96σ), a probability of .95; 5% of
values lie outside (2.5% on each side), a probability of .05 (.025 on each side).
b. 99% of the values lie within Mean ± 2.58 SD (µ − 2.58σ to µ + 2.58σ), a probability of .99; 1% of
values lie outside (0.5% on each side), a probability of .01.
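These landmark percentages can be verified with Python's `statistics.NormalDist` (available since Python 3.8); the helper name `within` is ours. Note that the 68.26% quoted in texts is the truncated value of 68.2689...%.

```python
from statistics import NormalDist

z = NormalDist(0, 1)   # standard normal: mean 0, SD 1

def within(k):
    """Probability that a value falls within mean +/- k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2, 2.58, 3):
    print(k, round(within(k) * 100, 2), "%")
```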

Figure: Normal Distribution Showing Mean ± 1x SD
[Curve: 68.26% of values lie inside the limits µ − 1σ to µ + 1σ (34.13% on each side of the mean);
15.87% lie outside the limits on each side]


Figure: Normal Distribution Showing Mean ± 2x SD
[Curve: 95.45% of values lie inside µ − 2σ to µ + 2σ (47.725% on each side); 2.275% lie outside on each
side]

Figure: Normal Distribution Showing Mean ± 3x SD
[Curve: 99.73% of values lie inside µ − 3σ to µ + 3σ (49.865% on each side); 0.135% lie outside on each
side]

Figure: Normal Distribution Showing Mean ± 1.96x SD
[Curve: 95% of values lie inside µ − 1.96σ to µ + 1.96σ (47.5% on each side); 2.5% lie outside on each
side]

Figure: Normal Distribution Showing Mean ± 2.58x SD
[Curve: 99% of values lie inside µ − 2.58σ to µ + 2.58σ (49.5% on each side); 0.5% lie outside on each
side]


The several diagrams are there for you to understand exactly the area under the curve covered by
observations, depending on the multiple of the standard deviation taken on both sides of the mean. The
multiple of the standard deviation is called Z, which ranges from 0 to infinity. The area under the normal
curve is also referred to as the area under Z.

Note: When we say that a certain percentage of observations lie between Mean ± Z x SD, Z is 1 in the case
of 68.26%, 1.96 in the case of 95%, 2 in the case of 95.45%, 2.58 in the case of 99% and 3 in the case of
99.73%.

The diagrams given below give an account of hypothetical data on the heights in cm of 1000 males. The
second diagram is a histogram on which a normal curve is superimposed. Look carefully at Mean ± Z x SD.



OTHER SHAPES OF FREQUENCY DISTRIBUTION

The curve may not be symmetrical in some instances. It may have many shapes but two shapes are
important to remember.

Diagram A: Positively (right) skewed distribution (order along the horizontal axis: Mode, Median, Mean)



Diagram B: Negatively (left) skewed distribution (order along the horizontal axis: Mean, Median, Mode)

Diagram A is a distribution skewed to right (Positively Skewed); and diagram B shows a distribution that is
skewed to the left (Negatively Skewed).
Now we have three types of curves.
1. Symmetrical Curve (Standard Normal Curve)
2. Curve Skewed on the right
3. Curve Skewed on the left

1. Symmetrical Curve:
Suppose students appear in a test on subject K. The data show that very few students score less
than 10%. As the scores increase, the number of students increases until a stage is reached where
the scores are around 50%, and most students score around that. That is the mode of the data.
According to the properties of the normal curve, it is also the mean and median. As scores increase
further, the number of students keeps decreasing until we reach students scoring around 90%; you
can appreciate that they will be very few. This type of distribution of scores is a normal
distribution. Most biological values are distributed like this, e.g. pulse rate, blood pressure,
hemoglobin value etc.
2. Curve Skewed to the Right (Positively Skewed):
Suppose students appear in a test on subject L. We observe that many students have scores on the
lower side; that will be the mode of the distribution (we know that the mode is not affected by
extreme values at all). Next to the mode, to the right, will be the median, as it is less affected
by extreme values. The mean will be furthest to the right, where the few extreme values lie. On the
right of the curve will be the students with higher scores, fewer in number. This distribution is
skewed to the right, which means that most students scored low and a few scored high marks.
Wealth is generally distributed like this.
3. Curve Skewed to the Left (Negatively Skewed):
Suppose students appear in a test on subject M. We can appreciate that very few students scored
low. Most students scored high; therefore, the mode of the distribution will fall on the extreme
right. To its left will be the median, and further left the mean. This implies that most of the
students did well in the test on subject M and only a few lagged behind. The distribution of
hemoglobin values in children is skewed to the left.

Note: Skew means tail. Skew is said to be to the side where the tail of the distribution is.



ESTIMATION

One main purpose of statistics is estimation. Estimation means generalizing to a bigger phenomenon by
actually looking at a part of it: that is, making statements about a population (which is not fully
examined) on the basis of a part of it that is actually examined. In other words, we extrapolate our
sample data to the population from which the sample is drawn. Do not forget that to be able to make such
statements the sample has to be representative of the population it is drawn from. For a sample to be
representative, the data have to be collected in a random manner.

From our data of pulse rates for which we have calculated Mean and Standard Deviation

72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75

Mean = 76
SD = 10.5
n = 15

By using normal distribution we can say that:

1. 68.26% of the values are within Mean ± 1 x SD i.e. between 65.5 and 86.5
2. 95% of values are within Mean ± 1.96 x SD i.e. between 55.4 and 96.6
3. 95.45% of values lie within Mean ± 2 x SD i.e. between 55 and 97
4. 99% of values lie within Mean ± 2.58 x SD i.e. between 48.9 and 103.1
5. 99.73% of values lie within Mean ± 3 x SD i.e. between 44.5 and 107.5
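These ranges can be reproduced with a short Python sketch (Python is not part of the original notes; it is used here only to check the arithmetic, using the population form of the standard deviation as the notes do):

```python
import statistics

pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]
mean = statistics.mean(pulse)    # 76
sd = statistics.pstdev(pulse)    # ~10.5 (population formula, dividing by n)

# Mean ± Z x SD for the Z multiples used in the notes
for z, pct in [(1, 68.26), (1.96, 95), (2, 95.45), (2.58, 99), (3, 99.73)]:
    print(f"{pct}% of sample values lie between "
          f"{mean - z * sd:.1f} and {mean + z * sd:.1f}")
```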

These are confidence limits for the Sample. Number 1 is 68.26% confidence limits; number 2 is 95%
confidence limits; number 3 is 95.45% confidence limits; number 4 is 99% confidence limits; and number 5
is 99.73% confidence limits. This means that you can state with a certain percent of confidence in what
range the values within your sample fall. But do not forget that these confidence limits are for your sample
and not the population from which the sample is drawn.
The upper limit of the range is upper confidence limit; the lower limit of the range is lower confidence
limit. In between the upper and lower confidence limits is the CONFIDENCE INTERVAL. 95%
confidence limits for a sample imply that 95% of the observations in the sample will lie within this range,
which in the case of our data are 55.4 to 96.6. It also means that 5% of observations may lie outside these
limits either below lower confidence limit or above the upper confidence limit.

Such calculations are of little use as long as we do not know the population mean (µ) and population
standard deviation (σ). And if we already knew the population mean and standard deviation, there would
be no need for all this exercise.

Therefore, we have to estimate how far the sample mean may lie from the population mean. This is where
the Standard Error comes in.

STANDARD ERROR:

Standard Error is the standard deviation of the sampling distribution of the mean; it tells us how
precisely a sample mean estimates the population mean. It is neither the true standard deviation of the
population nor an error in the literal sense.

To understand the concept of Standard Error let’s take the example of our data of pulse rates. We have a
mean of the data, which is 76 per minute. If we draw repeated samples from the same population and
compute means of all the samples then we’ll have a distribution of means of the samples like individual
values of pulse rates in one sample.



The central limit theorem states that the means of many samples from the same population are normally
distributed. The standard deviation of this distribution of sample means is known as the Standard Error
(SE). But do you really think anyone could actually carry out the exercise of drawing repeated samples
from a population, and enough of them to construct a meaningful distribution? Only an eccentric would be
prepared to do that.

Statistics provides us with a formula to calculate the SE without going through this cumbersome exercise.

Standard Error (of Mean) = SD / √n

SD = Standard Deviation of the sample
n = Number of observations in the sample

If we apply this formula to our data we get:

SE = 10.5 / √15 = 10.5 / 3.87

SE = 2.7

Now we have the standard error (SE) of the mean, which is 2.7. Based on this we can calculate confidence
limits for the population mean in exactly the same way as for the sample, substituting the standard error
of the mean for the standard deviation of the sample.

1. 68.26% confidence limits are: Mean ± 1 x SE i.e. from 73.3 to 78.7
2. 95% confidence limits are: Mean ± 1.96 x SE i.e. from 70.7 to 81.3
3. 95.45% confidence limits are: Mean ± 2 x SE i.e. from 70.6 to 81.4
4. 99% confidence limits are: Mean ± 2.58 x SE i.e. from 69 to 83
5. 99.73% confidence limits are: Mean ± 3 x SE i.e. from 67.9 to 84.1
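The standard error and these population confidence limits can likewise be checked in Python (an illustrative sketch, not part of the original notes):

```python
import math
import statistics

pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]
mean = statistics.mean(pulse)
sd = statistics.pstdev(pulse)            # ~10.5, as in the notes
se = sd / math.sqrt(len(pulse))          # standard error of the mean, ~2.7

# Mean ± Z x SE gives confidence limits for the population mean
for z, pct in [(1, 68.26), (1.96, 95), (2, 95.45), (2.58, 99), (3, 99.73)]:
    print(f"{pct}% confidence limits: {mean - z * se:.1f} to {mean + z * se:.1f}")
```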

Confidence limits based on the standard error of a mean are confidence limits for the population, and
hence an estimation of the population situation based on the sample. But remember that the exact
interpretation of confidence limits calculated from the standard error of the mean differs a little
from that of confidence limits calculated from the actual standard deviation and mean of the
population, if known.

CONFIDENCE LIMITS BASED ON ACTUAL STANDARD DEVIATION AND MEAN: 95% confidence limits mean that 95% of
the values of that particular observation in the population lie within Mean ± 1.96 x σ.

CONFIDENCE LIMITS BASED ON STANDARD ERROR OF MEAN: 95% confidence limits mean that if we draw many
samples from the same population, 95% of the time the sample means will fall within these limits. But
practically we interpret them much as if we knew the actual mean and standard deviation of the
population.

CONFIDENCE LIMITS FOR A PROPORTION: We can also calculate confidence limits for a proportion
using standard error of a proportion formula.



SE of Pr oportion  
pxq
n

p = proportion (percent or fraction of 1) of an event occurring


q = proportion of an event not occurring i.e. q = 100-p (percentage) OR 1-p (fraction of 1)
n = number of observations (sample size)

Example: If the number of people with iodine deficiency is 55 out of a randomly selected sample of 440
persons in district Kohistan, the 95% and 99% confidence limits will be as under:

SE of Pr oportion  
pxq
n

The number of persons with Iodine deficiency = 55 out of 440

Then p = 55/440 x 100 = 12.5%

So q = 100 – p = 100 – 12.5 = 87.5%

Sample size = n = 440

Therefore, using the formula:

SE of Proportion = √(p x q / n) = √(12.5 x 87.5 / 440)

SE (of proportion) = 1.57

95% Confidence limits are p ± 1.96 x SE = 12.5 ± 3.07

95% Confidence limits are 12.5 – 3.07 to 12.5 + 3.07 = [9.43% to 15.57%]
This means that if we draw repeated samples from the population of Kohistan, 95% samples will have
Iodine deficient people between 9.43% and 15.57%.

99% Confidence Limits are p ± 2.58 x SE = p ± 2.58 x 1.57 = 12.5 ± 4.05

99% Confidence Limits are 12.5 – 4.05 to 12.5 + 4.05 = [8.45% to 16.55%]

This means that if we draw repeated samples from the population of Kohistan, 99% samples will have
Iodine deficient people between 8.45% and 16.55%.
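The proportion calculation can be sketched the same way (illustrative Python, using the figures of the Kohistan example):

```python
import math

p = 55 / 440 * 100        # 12.5% of the sample are iodine deficient
q = 100 - p               # 87.5% are not
n = 440

se = math.sqrt(p * q / n)   # standard error of the proportion, ~1.57
for z, pct in [(1.96, 95), (2.58, 99)]:
    print(f"{pct}% confidence limits: {p - z * se:.2f}% to {p + z * se:.2f}%")
```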

Confidence limits for proportion imply the same as is the case with confidence limits for mean. If we
increase the sample size the standard error decreases and consequently the confidence interval will contract.

95% Confidence Interval (CI) is defined as: The range of mean values or proportions within which
there are 95 chances out of 100 that the true population mean or proportion will fall



99% Confidence Interval (CI) is defined as: The range of mean values or proportions within which
there are 99 chances out of 100 that the true population mean or proportion will fall
t-Distribution: When the sample size is small we use the t-distribution instead of the Z (normal)
distribution. While calculating the confidence limits we substitute t for Z. Another alteration to the
method is the computation of the standard error, given as under:

Standard Error of mean (t-distribution) = SD / √(n – 1)

If we calculate the standard error by this method for our data of pulse rates, it will be 10.5 / √14 = 10.5 / 3.74 = 2.8

The 95% confidence limits will be Mean ± t x SE. To find the value of t we refer to the t table. First
we calculate the degrees of freedom (DF), which are n – 1. Our n is 15, hence DF = 15 – 1 = 14.
Referring to the t table at 14 DF, we find that the value of t at 0.05 (which corresponds to 95%
confidence limits) is 2.14.

95% confidence limits are: Mean ± t x SE = 76 ± 2.14 x 2.8 = 76 ± 5.99

Therefore, 95% confidence limits are: 70.01 to 81.99

In the same way we can calculate the 99% confidence limits. Referring to the t table at 14 DF and 0.01
(which corresponds to 99% confidence limits), we find the value of t to be 2.98. We substitute 2.98 for
2.14 in the previous calculation to obtain the 99% confidence limits. (Do it yourself)
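As a check on the calculation above, a Python sketch (illustrative only, not part of the original notes; the critical t values are taken from the t table as in the text):

```python
import math
import statistics

pulse = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]
mean = statistics.mean(pulse)
sd = statistics.pstdev(pulse)
n = len(pulse)

se_t = sd / math.sqrt(n - 1)     # t-based standard error, ~2.8

# Critical t values read from the t table at n - 1 = 14 degrees of freedom
t_95, t_99 = 2.14, 2.98
print(f"95% CI: {mean - t_95 * se_t:.2f} to {mean + t_95 * se_t:.2f}")
print(f"99% CI: {mean - t_99 * se_t:.2f} to {mean + t_99 * se_t:.2f}")
```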

Note: t is higher than Z (2.14 > 1.96 in the case of 95% CL and 2.98 > 2.58 for 99% CL), but beyond
120 DF t tends to equal Z.

The table for t distribution is given as under:

SIGNIFICANCE TESTING
HYPOTHESIS TESTING

We may be interested in comparing two or more populations to determine whether, with regard to some
observations, they differ significantly or whether the differences are just by chance, or more
precisely, due to sampling error. We know that means of samples even from the same population may
differ, but to what extent remains the question, which has to be answered through significance testing
or hypothesis testing.

While comparing two or more samples we may have a hypothesis, called the research hypothesis. Such a
hypothesis may state that there is a difference, or otherwise. We have to test it against the collected
data.

NULL HYPOTHESIS: It states that the different sets of data belong to one population and the observed
differences are by chance. In other words;
A=B

ALTERNATIVE HYPOTHESIS: It states that the different sets of data belong to different populations and
the differences are statistically significant and are not due to chance. In other words it means;
A≠B

SIGNIFICANCE TESTING:
To test hypothesis or know about significance we perform different statistical tests in different situations.
For data, which are normally distributed, we use the following tests:

i. Z-test for difference between two means


ii. Z-test for difference between two proportions
iii. t-test
iv. ANOVA – Analysis of variance for more than two samples (F-ratio)

Tests applied to normally distributed data are called parametric tests because such data have
parameters like the mean and standard deviation. Parametric data consist of continuous variables.

For data that are not normally distributed, i.e. non-parametric data, we use non-parametric tests.
Non-parametric data consist of nominal or ordinal variables.

The following tests are used for Non-parametric data and hence they are called Non-parametric tests.

i. Chi-square X2 test
ii. Fisher’s exact probability test
iii. Wilcoxon Rank Sum and Signed Rank Tests
iv. Mann Whitney U Test
And many more

NOTE: Parametric tests are more sensitive compared to Non-Parametric tests. It is also important to note
that the data has to be collected randomly to enable the tests to be meaningful.

5% LEVEL OF SIGNIFICANCE (p = 0.05): A level of probability at which the Null hypothesis is rejected
if an obtained sample difference occurs by chance only 5 times or less out of 100.

1% LEVEL OF SIGNIFICANCE (p = 0.01): A level of probability at which the Null hypothesis is rejected
if an obtained sample difference occurs by chance only 1 time or less out of 100.

We will discuss Normal distribution test (Z-test) and Chi-square tests only.



I. Z-test: difference between two means
Pre-requisites:
i. Data is normally distributed
ii. Data is randomly collected

Z = (X̄1 – X̄2) / SE

X̄1 = Mean of sample 1
X̄2 = Mean of sample 2

where SE is the Standard Error of the difference between two means:

SE (diff between two means) = √(SD1²/n1 + SD2²/n2)

SD1 = Standard Deviation of sample 1
SD2 = Standard Deviation of sample 2
n1 = Number of observations in sample 1
n2 = Number of observations in sample 2

Example:
If we want to compare the weights of girl students of 1st year and Final year, we collect data
randomly. After collection and computation we have the following figures:

1st year (Sample 1)


Number of girls = n1 = 32
Mean weight in Kg = X1 = 54
Standard Deviation =SD1 = 04

Final year (Sample 2)


Number of girls = n2 = 27
Mean weight in Kg = X2 = 62
Standard Deviation =SD2 = 05

Using the formulas:

SE = √(SD1²/n1 + SD2²/n2) = √((4)²/32 + (5)²/27) = √(16/32 + 25/27)

SE = 1.2



X1  X 2
Z
SE
54  62
Z  6 .7
1 .2

Remember that the difference will be statistically significant if |Z| is more than 1.96 for the 5%
level, and more than 2.58 for the 1% level (please refer to the properties of the normal distribution).

Our data show a significant difference at both the 5% and 1% levels. Hence, we can state that there is
a statistically significant difference between the girl students of 1st and final year with regard to
their weights, at both the 5% and 1% significance levels.

(We will reject null hypothesis)
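The two-mean Z-test above can be reproduced with a short Python sketch (Python is not part of the original notes; the figures are those of the example):

```python
import math

# 1st-year girls (sample 1) and final-year girls (sample 2), from the example
n1, mean1, sd1 = 32, 54, 4
n2, mean2, sd2 = 27, 62, 5

# Standard error of the difference between two means
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)   # ~1.2
z = (mean1 - mean2) / se                    # ~-6.7

# Compare |Z| against the critical values 1.96 (5%) and 2.58 (1%)
if abs(z) > 2.58:
    print(f"|Z| = {abs(z):.1f}: significant at both the 5% and 1% levels")
```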

Note: One has to determine the significance level during the planning stage of the study.

ii. Z-test: for difference between two proportions

Z = (p1 – p2) / SE

p1 = percentage (proportion) of occurrence in sample 1
p2 = percentage (proportion) of occurrence in sample 2

SE (Standard Error of the difference between two proportions) = √(p1 x q1 / n1 + p2 x q2 / n2)

q1 = percentage (proportion) of non-occurrence in sample 1 (100 – p1)
q2 = percentage (proportion) of non-occurrence in sample 2 (100 – p2)
n1 = Number of observations in sample 1
n2 = Number of observations in sample 2

Example:
If we collect some data randomly with the following observations:
13 out of a sample of 63 fourth-year students are obese, and 17 out of 61 third-year students are
obese. Is there any statistically significant difference between 4th and 3rd year students with regard
to the frequency of obesity, or are the observed differences due to chance?

To know the answer we apply Z test for two proportions.

4th year students (sample 1)

Percentage of obese = p1 = 13/63 x 100 = 20.6%
Percentage of non-obese = q1 = 100 – p1 = 100 – 20.6 = 79.4%
Number of observations = n1 = 63



3rd year students (sample 2)

Percentage of obese = p2 = 17/61 x 100 = 27.9%
Percentage of non-obese = q2 = 100 – p2 = 100 – 27.9 = 72.1%
Number of observations = n2 = 61

SE = √(p1 x q1 / n1 + p2 x q2 / n2) = √(20.6 x 79.4 / 63 + 27.9 x 72.1 / 61)

SE = 7.7

Z = (p1 – p2) / SE = (20.6 – 27.9) / 7.7 = –7.3 / 7.7 = –0.94
As |Z| is less than 1.96, we can say at the 5% significance level that there is no statistically
significant difference between 4th and 3rd year students with regard to obesity, and the observed
differences are due to chance.

(In this case we accept Null Hypothesis)
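The two-proportion Z-test can be sketched in Python the same way (illustrative only, using the example's figures):

```python
import math

# Obesity counts from the example: 13/63 fourth-year, 17/61 third-year
p1 = 13 / 63 * 100        # ~20.6%
p2 = 17 / 61 * 100        # ~27.9%
q1, q2 = 100 - p1, 100 - p2
n1, n2 = 63, 61

# Standard error of the difference between two proportions
se = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)   # ~7.7
z = (p1 - p2) / se                            # ~-0.94

if abs(z) < 1.96:
    print(f"|Z| = {abs(z):.2f}: not significant at the 5% level")
```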

II. TESTS FOR NON- PARAMETRIC DATA

CHI-SQUARE TEST:
The Chi-square test (X²) is applied to non-parametric data, but the data have to be collected randomly.
It also shows association between two or more variables. There are many ways to compute X², but we will
discuss only the 2x2 contingency table (two-way chi-square test). The one-way chi-square is discussed
as an example of hypothesis testing.
Suppose a researcher has invented a new vaccine for measles and claims that it prevents measles. He
randomly selects two groups of children. One group is inoculated with the vaccine and the other group is
left as such and the outcome is observed. His observations are recorded in the table below.

Developed Measles No Measles Total


Inoculated 18 118 136
(1) (2)
Not inoculated 22 208 230
(3) (4)
Total 40 326 366

Note: The numbers given in brackets are the cell numbers of table from 1 to 4.

(O  E ) 2
Chi-Square (X2) = 
E

O = Observed frequencies

E = Expected frequencies



To calculate the expected frequency for each cell (1–4):

Expected Frequency (E) = (Column Total x Row Total) / Grand Total

Cell No   Observed (O)   Expected (E) = (Column Total x Row Total)/Grand Total   O – E   (O – E)²   (O – E)²/E
1         18             136 x 40/366 = 14.9                                      3.1     9.61       0.64
2         118            136 x 326/366 = 121.1                                   –3.1     9.61       0.07
3         22             230 x 40/366 = 25.1                                     –3.1     9.61       0.38
4         208            230 x 326/366 = 204.9                                    3.1     9.61       0.05

Σ (O – E)²/E = 1.14

Our completed X2 = 1.14

Now we have to refer to the X² table.

Like the t table, the X² table also has degrees of freedom (DF). To calculate the degrees of freedom
(DF) we multiply the number of rows minus 1 by the number of columns minus 1.

DF = (c–1) (r–1)

In our case there are two columns and two rows (excluding captions and totals)

DF = (2–1) (2–1)

DF = (1) (1)

DF = 1

At 1 DF and 0.05 (5% significance level), X² = 3.84. Our computed X² = 1.14, which is less than 3.84.
Hence we can say at the 5% significance level that there is no statistically significant difference
between group 1 and group 2. In other words, the vaccine had no different effect on the inoculated
children compared to those who were not vaccinated against measles.

(We accept Null Hypothesis)
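The 2x2 computation can be sketched in Python (illustrative only; the notes get X² = 1.14 because the expected frequencies were rounded to one decimal, whereas unrounded arithmetic gives about 1.18, and the conclusion is the same either way):

```python
# 2x2 contingency table from the vaccine example:
# rows = inoculated / not inoculated; columns = measles / no measles
observed = [[18, 118], [22, 208]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand   # expected frequency for the cell
        chi2 += (o - e) ** 2 / e

# DF = (2-1)(2-1) = 1; the critical X² value at p = 0.05 is 3.84
if chi2 < 3.84:
    print(f"chi-square = {chi2:.2f}: not significant at the 5% level")
```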

HYPOTHESIS TESTING

We deal with two hypotheses which are:

1. Null Hypothesis
2. Alternative Hypothesis

We either accept Null Hypothesis or reject it. When we accept Null Hypothesis, we reject the Alternative
Hypothesis. When we reject Null Hypothesis we accept the Alternative Hypothesis.

STEPS:

1. State the null and alternative hypotheses, Ho and HA.


2. Select the decision criterion α (or “level of significance”).
3. Establish the critical values
4. Draw a random sample from the population, and calculate the mean of that sample
5. Select appropriate statistical test and compute the value of the test statistic Z or t or X2 (as the case
may be).
6. Compare the calculated value of test statistic with the critical values of Z/t/X 2, and then accept or
reject the null hypothesis.

Example: A forensic specialist collects data at random on medicolegal cases of injuries, classified by
the kind of weapon used, in a district over a period of one year.

The data is given as under:

Type of Weapon Used      Number of injured persons

Sharp                    184
Blunt                    168
Firearm                  123
Corrosives               33
Total                    508

We follow the steps as given above:

1. State the null and alternative hypotheses, Ho and HA.

Null Hypothesis: There is no difference between the types of weapons used in causing injuries.

Alternative Hypothesis: There seems to be preference for weapons in inflicting injuries.

2. Select the decision criterion α (or “level of significance”). We select a 5% significance level
(p = 0.05). Conventionally a 5% level of significance (p = 0.05) is selected. It can be more
stringent, i.e. less than 5% (p < 0.05), but it is never more than 5%.

3. Establish the critical values: X2 table at p = 0.05 with degrees of freedom as will be
calculated.

4. Draw a random sample from the population, and calculate the mean of that sample: Sample
randomly drawn from a district.

5. Select appropriate statistical test and compute the value of the test statistic Z or t or X2 (as the
case may be).



We select X2 test as the data is not continuous and do computations as under (One-way Chi-square test):

Type of Weapon Used   Observed          Expected          O – E   (O – E)²   (O – E)²/E
(Categories)          Frequencies (O)   Frequencies (E)
Sharp                 184               127                57     3249       3249/127 = 25.58
Blunt                 168               127                41     1681       1681/127 = 13.23
Firearm               123               127               –4      16         16/127 = 0.125
Corrosives            33                127               –94     8836       8836/127 = 69.57
Total                 508               508                                  Σ (O – E)²/E = 108.5

Expected Frequencies (E) are calculated by dividing the total frequency by the number of categories.
There are 4 categories and the total is 508, so all expected frequencies equal 127.

6. Compare the calculated value of test statistic with the critical values of Z/t/X2, and then
accept or reject the null hypothesis.

Our calculated X2 is equal to 108.5. The degrees of freedom in this case are equal to the number of
categories minus one. There are four categories of weapons, therefore, DF = 4-1 = 3. At 3 DF X2 is equal
to 7.81 at 0.05. As our calculated value is more than the table value, therefore, the difference among the use
of weapons in causing injuries is statistically significant and cannot be due to chance alone. Therefore, we
reject the Null Hypothesis and accept the alternative hypothesis.
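The one-way chi-square above can be checked with a few lines of Python (illustrative; the observed frequencies and the critical value 7.81 are taken from the example):

```python
observed = [184, 168, 123, 33]            # sharp, blunt, firearm, corrosives
expected = sum(observed) / len(observed)  # 508 / 4 = 127 for every category

chi2 = sum((o - expected) ** 2 / expected for o in observed)   # ~108.5

# DF = 4 - 1 = 3; the critical X² value at p = 0.05 is 7.81
if chi2 > 7.81:
    print(f"chi-square = {chi2:.1f}: reject the null hypothesis")
```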

While testing a hypothesis we may be liable to commit errors, which are:

Type I Error: Rejecting a true hypothesis (α)

Type II Error: Accepting a false hypothesis. (β)

                                    ACTUAL SITUATION

TEST RESULT       Ho True                         Ho False

Ho Accepted       Correct                         Type II error (β)
                                                  False negative

Ho Rejected       Type I error (α)                Correct
                  False positive

To avoid a Type I error we may decrease our significance level, but that will increase the chance of
committing a Type II error. It is easy to avoid a Type I error, but avoiding a Type II error is not so
simple. One way is to increase the sample size and reduce sampling variation. The Power of a
statistical test is defined as the ability of a statistical test to reject the null hypothesis when it
is actually false and should be rejected.



p-value: Some statisticians report a p-value instead of performing significance testing against a fixed
critical value. The p-value, read from the relevant statistical table, gives the exact probability that
the observed sample difference arose by chance (sampling variation). If p = 0.0001, it means that the
obtained sample difference would occur by chance one time out of 10,000. Statistical packages calculate
the p-value to several decimal places. A p-value of 0.05 means that the observed difference would occur
by chance 5 times out of 100.

Further Reading:

1. High Yield Biostatistics by A N Glaser
2. A Short Textbook of Medical Statistics by A B Hill
3. Elementary Statistics in Social Research by J Levin, J A Fox
4. Statistics without Tears by D Rowntree
5. Statistics at Square One (BMJ) by T D V Swinscow
6. Statistics with Confidence (BMJ)

