Understanding Inferential Statistics
Prof. MOHAMMADI Ahlam
The main objective is to understand to what extent the results observed on a well-chosen
sample can be used to deduce the characteristics of the original population. This involves
measuring the reliability of the conclusions drawn, taking into account uncertainties and possible
errors. These uncertainties are managed using tools such as point estimates, confidence
intervals and hypothesis testing, which determine the level of confidence that can be placed in
the results.
Estimation
● Point estimate
● Confidence interval estimation
Population (size N): parameters m, σ, σ², p
Sample (size n): characteristics x̄, s, s², f
The sample must be representative of the population.
Course outline
Chapter 1: Sampling Theory
Sampling Methods
● Empirical Methods
● Probabilistic Methods
Determining Sample Size
● Factors influencing sample size.
● Calculation methods for optimal sample size.
● Practical considerations in the study context.
Use of the Bienaymé-Chebyshev Inequality
● Introduction to the inequality and its importance in sampling.
● Application to estimate the variability of estimates.
Chapter 2: Estimation
Point Estimate
● Definition and importance of point estimate.
● Examples of common point estimates (mean, proportion).
Confidence Interval Estimation
● Concept and interpretation of confidence intervals.
● Construction of confidence intervals for means and proportions.
● Impact of sample size on interval precision.
Chapter 3: Hypothesis Testing
The head of a political party wants to estimate the proportion of militants in favor
of Mr X's candidacy for the next presidential election.
How do you calculate a candidate's popularity within a population?
It's too expensive to interview all the militants!
Importance of sampling
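When it comes to collecting information on a population, there are two
possibilities: a complete survey (census) covering every unit of the population, or
a partial survey covering only a sample of it.
It should be noted that the second solution, i.e. the partial survey,
is the most common practice.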
Population & Sample
● Population:
In statistics, a population includes every possible element that you are interested
in measuring, or the entire dataset that you want to draw conclusions about. A
statistical population can refer to any type of data, including people,
organizations, objects, events, and more.
● The sample:
A sample is defined as a finite subset of a population, which constitutes a portion
of the group under investigation. The sample size denotes the total number of
items selected to form the sample.
● Sampling fraction (or sampling rate):
The proportion of population units included in the sample. It is the ratio of the sample
size n to the population size N: f = n/N.
Advantages of Sampling
● Reduced Costs: Collecting data from an entire population can be expensive
and time-consuming. Sampling allows researchers to obtain insights at a
fraction of the cost.
● Faster Data Collection: Sampling facilitates quicker data collection and
analysis, enabling timely decision-making.
● Practical Approach: In many cases, it's impractical or impossible to study an
entire population (e.g., large-scale surveys or rare events). Sampling provides
a feasible alternative.
● Simplified Analysis: Working with a smaller sample makes data management
and analysis more straightforward, allowing for more detailed examination of
the data.
Back to the examples
● For the lamp manufacturer:
He takes a sample consisting of 130 lamps.
For each lamp, he measures the operating time.
The sample average is 36,000 hours.
An estimate for all lamps is 36,000 hours.
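The lamp example can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical list of five operating times chosen so that their mean is 36,000 hours; the real sample contains 130 lamps, whose individual values are not given here.

```python
from statistics import mean, stdev

# Hypothetical operating times in hours (illustrative values only;
# the manufacturer's actual sample has 130 lamps).
operating_hours = [35200, 36400, 35900, 36800, 35700]

# The sample mean is used as the point estimate of the population mean m.
point_estimate = mean(operating_hours)

# The sample standard deviation s describes the spread around that mean.
sample_sd = stdev(operating_hours)

print(f"Point estimate of mean lifetime: {point_estimate} hours")
print(f"Sample standard deviation: {sample_sd:.1f} hours")
```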
Sampling errors
Sampling errors may distort the results, leading users to draw incorrect
conclusions. When analysts do not select samples that represent the entire
population, the sampling errors are significant.
Causes:
• Sample Size: Smaller samples are more susceptible to sampling errors, as they
may not capture the full variability of the population.
• Sampling Method: The method used to select the sample (e.g., random
sampling, stratified sampling) can influence the extent of sampling error.
Examples:
• If a survey estimates the average income of a community based on a sample of
50 individuals, the sample mean might differ from the true population mean.
• In a study of consumer preferences, if only young adults are surveyed, the
results may not accurately reflect the views of older adults, leading to a
sampling error.
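The effect of sample size on sampling error can be illustrated by simulation. The sketch below assumes a hypothetical population of 10,000 incomes drawn from a normal distribution (mean 3,000, standard deviation 800); these figures and the function name are illustrative and not taken from the course.

```python
import random
from statistics import mean

def mean_abs_sampling_error(population, n, repeats, rng):
    """Average absolute gap between the sample mean and the population mean."""
    true_mean = mean(population)
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)  # simple random sample without replacement
        errors.append(abs(mean(sample) - true_mean))
    return mean(errors)

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical population of incomes (assumption for illustration).
population = [rng.gauss(3000, 800) for _ in range(10_000)]

error_small = mean_abs_sampling_error(population, n=50, repeats=200, rng=rng)
error_large = mean_abs_sampling_error(population, n=2000, repeats=200, rng=rng)

print(f"Average error with n=50:   {error_small:.1f}")
print(f"Average error with n=2000: {error_large:.1f}")
```

On average, the larger sample lands much closer to the true population mean, which is the sense in which smaller samples are more susceptible to sampling error.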
Non-Sampling errors
Non-sampling errors refer to inaccuracies that arise from factors other than the
sampling process itself. These errors can occur at any stage of data collection and
analysis, affecting the validity of the results.
Impact: Non-sampling errors can lead to biased results that are not easily
quantifiable. Unlike sampling errors, they do not decrease with larger sample sizes
and can significantly affect the reliability and validity of research findings.
Causes:
• Measurement Error: Errors that occur when collecting data, such as using
poorly worded survey questions or faulty measuring instruments.
• Response Bias: When respondents do not provide truthful or accurate
answers, often due to social desirability or misunderstanding questions.
• Non-Response Error: When individuals selected for the sample do not
respond, leading to a potential bias if non-respondents differ from
respondents.
• Processing Error: Mistakes made during data entry, coding, or analysis can
introduce inaccuracies.
Difference between sampling error and non-sampling error
        | Sampling error | Non-sampling error
What?   | Occurs because the selected sample does not perfectly represent the population of interest. | Arises from sources other than sampling.
Why?    | Deviation between the sample mean and the population mean. | Scarcity of data, miscalculation, or misinterpretation of the analysis.
Where?  | Only during sample selection. | From the beginning to the end of the study.
Conclusion
Both sampling errors and non-sampling errors are important to consider when
conducting research. Sampling errors are inherent to the process of taking
samples and can often be managed statistically, while non-sampling errors can
stem from various factors throughout the research process and may require
careful design and implementation strategies to minimize their impact.
Understanding both types of errors helps researchers ensure the accuracy and
reliability of their findings.
Sampling Methods
If the results of a sample survey are to be extrapolated to the entire population
under study, it is essential that the survey is conducted according to well-defined
rules, and that the calculations leading to these extrapolations are consistent with
the sampling procedure used.
A sampling frame is a list or set containing all the units (individuals, companies,
households, etc.) in the population from which a sample will be selected for a
survey or study. This frame serves as a reference for identifying and selecting the
members of the future sample, ensuring that each unit has a known, non-zero
chance of being selected.
Key Characteristics of Probability-Based Sampling:
• Known probability of selection: Each member of the population has a
measurable chance of being chosen.
• Random selection: The process of choosing individuals for the sample is
random, which helps to minimize bias.
• Representative sample: Probability-based sampling aims to create a sample
that accurately reflects the characteristics of the broader population.
Types of Probability-Based Sampling
(1) Simple Random Sampling
Simple Random Sampling is a fundamental probability-based sampling method
where every individual in a population has an equal and independent chance of
being selected for the sample. It is one of the simplest and most widely used
sampling techniques.
Key Features:
• Equal Probability: Each member of the population has an equal chance of
being included in the sample.
• Independence: The selection of one individual does not affect the selection of
another; each selection is independent.
• Randomness: The selection process is entirely random, ensuring no bias in the
sample choice.
How It Works:
• Population Definition: First, define the population from which the sample will
be drawn.
• Assign Numbers: Each individual in the population is assigned a unique
number or identifier.
• Random Selection: Using a random number generator, lottery system, or
another randomization method, select the required number of individuals for
the sample.
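The three steps above can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the population of 200 numbered individuals and the function name `simple_random_sample` are assumptions made for the example. The draw is made without replacement, so no individual is selected twice.

```python
import random

def simple_random_sample(population_ids, n, seed=None):
    """Draw n distinct identifiers, each with the same chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population_ids, n)

# Steps 1 and 2: define the population and assign a unique number to each individual.
# (Hypothetical population of 200 individuals numbered 1 to 200.)
population_ids = list(range(1, 201))

# Step 3: random selection of the required number of individuals.
chosen = simple_random_sample(population_ids, n=10, seed=42)
print(chosen)
```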
Example 1:
Question:
A high school has a boarding school with 90 boarders. Each week, 5 of these
boarders are drawn at random to clear the tables in the dining hall after each
meal. Each week, the random number generator below is used to draw these 5
students.
92200 99401 54473 34336 82786
What is the list of numbers resulting from this draw?
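One common way to answer this question can be sketched in code. The reading convention is an assumption, since the exercise does not state it: boarders are numbered 01 to 90, the digits are read left to right in consecutive pairs, and any pair outside 01 to 90 or already drawn is skipped. The function name is hypothetical.

```python
def read_random_table(digits, population_size, sample_size, width=2):
    """Read fixed-width numbers from a string of random digits, skipping invalid ones."""
    stream = "".join(ch for ch in digits if ch.isdigit())  # drop the spaces
    selected = []
    for i in range(0, len(stream) - width + 1, width):
        number = int(stream[i:i + width])
        # Keep the number only if it identifies a boarder not yet drawn.
        if 1 <= number <= population_size and number not in selected:
            selected.append(number)
        if len(selected) == sample_size:
            break
    return selected

table_row = "92200 99401 54473 34336 82786"
draw = read_random_table(table_row, population_size=90, sample_size=5)
print(draw)  # [20, 9, 1, 54, 47]
```

Under this convention, 92 and 94 are rejected because they exceed 90, and the boarders drawn are numbers 20, 09, 01, 54 and 47. A different reading rule would give a different list.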
(2) Stratified sampling
Stratified sampling is a technique that involves subdividing a heterogeneous
population of size N into p more homogeneous subpopulations, or "strata", of sizes
$N_1, N_2, \dots, N_p$ such that:
$N = N_1 + N_2 + \cdots + N_p$
A sample of size $n_i$ is then taken independently from each stratum, using a
sampling plan of the user's choice. In most cases, Simple Random Sampling is
used within each stratum.
There are two possible approaches to distributing the total sample size, 𝑛, among
the different strata:
i. Proportional Allocation:
In proportional allocation, the sampling fraction (the proportion of the population in each
stratum that is selected) remains the same across all strata. This means that the size of the
sample from each stratum is proportional to the size of the stratum in the population.
Formula
If the total population size is N, and the population of stratum i is $N_i$, then the sample size for
stratum i, $n_i$, is given by:
$\frac{n_i}{n} = \frac{N_i}{N} \iff n_i = n \times \frac{N_i}{N}$
Proportional Allocation: Example
In a population of 10,000 companies, divided into 5,000 small companies, 3,000 medium-
sized companies and 2,000 large companies, we want to have a sample of 500 companies.
With the proportional allocation principle, the sampling fraction is constant:
$f = \frac{500}{10000} = 0.05 = 5\%$
Stratum   Stratum size   Sample size
Small     5000           5000 × 0.05 = 250
Medium    3000           3000 × 0.05 = 150
Large     2000           2000 × 0.05 = 100
Total     N = 10000      n = 500
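The table above can be reproduced with a short function. This is a minimal sketch; the function name `proportional_allocation` is hypothetical, and in general the resulting sizes may need rounding to whole units, which this example does not require.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n so that n_i = n * N_i / N."""
    N = sum(stratum_sizes.values())
    return {name: n * size / N for name, size in stratum_sizes.items()}

# Strata from the example: 10,000 companies, total sample of 500.
strata = {"Small": 5000, "Medium": 3000, "Large": 2000}
allocation = proportional_allocation(strata, n=500)

for name, n_i in allocation.items():
    print(f"{name}: {n_i:.0f}")
```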
Advantages of Proportional Allocation:
• Simplicity: This method is straightforward and easy to implement.
• Representativeness: Since each stratum’s sample size is proportional to its
representation in the population, the sample reflects the population’s overall
structure.
ii. Optimal Allocation:
In optimal allocation, the sample size for each stratum is determined based on
both the size of the stratum and the variability within the stratum. This method
seeks to minimize the sampling variance or maximize the precision of the
estimates while considering the survey budget.
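Formula
The sample size for stratum i, $n_i$, in optimal allocation is calculated as:
$n_i = n \times \frac{N_i \sigma_i}{\sum_{j=1}^{p} N_j \sigma_j}$
where $\sigma_i$ is the standard deviation of the studied variable within stratum i.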
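The formula can be applied to the same three strata as in the proportional allocation example. This is only a sketch: the within-stratum standard deviations below are hypothetical values chosen for illustration, and the function name is an assumption. The results are left unrounded.

```python
def optimal_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n across strata in proportion to N_i * sigma_i."""
    weights = {name: stratum_sizes[name] * stratum_sds[name] for name in stratum_sizes}
    total_weight = sum(weights.values())
    return {name: n * w / total_weight for name, w in weights.items()}

sizes = {"Small": 5000, "Medium": 3000, "Large": 2000}

# Hypothetical standard deviations (assumption: large companies vary the most).
sds = {"Small": 10, "Medium": 20, "Large": 50}

allocation = optimal_allocation(sizes, sds, n=500)
for name, n_i in allocation.items():
    print(f"{name}: {n_i:.1f}")
```

With these assumed values, the Large stratum receives the biggest share of the sample even though it is the smallest stratum, because its variability is the highest.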
Advantages of Optimal Allocation:
• Precision: Optimal allocation provides more precise estimates by assigning
larger sample sizes to strata that have greater variability.
• Efficient Resource Use: It takes into account the survey budget, allocating
resources where they will have the greatest impact on reducing uncertainty.
Disadvantages of Optimal Allocation:
• Complexity: This method requires knowledge of the variability in each stratum,
which may not always be available or easy to estimate.
• Higher Costs: It may lead to higher costs if larger samples are needed in strata
with more variability or smaller populations.
(3) Multistage sampling
Multistage sampling draws the sample in several successive stages.
At each stage, a sample is drawn, starting from larger, more general groups and
moving toward more specific, smaller units. For instance, if you're surveying a
country's population, the first stage might involve selecting regions, the second
stage might focus on cities within those regions, and subsequent stages could
narrow the focus to specific neighborhoods and then households.
Process steps of multistage sampling:
• First stage: The population is divided into large groups or subsets
(called primary units or first-stage units), and a sample of these groups is
selected. These groups may be geographic regions, schools, companies, or
other sub-groups depending on the survey.
• Second stage: A sample of sub-units is selected from the selected primary
units. For example, after selecting regions, a number of cities or
neighborhoods within these regions could be selected.
• Third stage (if necessary): If the secondary unit is still too large or
heterogeneous, we continue to subdivide into smaller and smaller units,
selecting a sample at each stage. This can continue over several stages until
more reasonable unit sizes, such as households or individuals, are reached.
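The stages above can be sketched with a three-level draw. Everything in this example is hypothetical: the frame of 4 regions, 3 cities per region and 20 households per city, the function name, and the use of simple random sampling at every stage.

```python
import random

def multistage_sample(frame, n_regions, n_cities, n_households, seed=None):
    """Select regions, then cities within them, then households within those cities."""
    rng = random.Random(seed)
    selected = []
    # First stage: sample the primary units (regions).
    for region in rng.sample(sorted(frame), n_regions):
        # Second stage: sample cities inside each selected region.
        for city in rng.sample(sorted(frame[region]), n_cities):
            # Third stage: sample households inside each selected city.
            for household in rng.sample(frame[region][city], n_households):
                selected.append((region, city, household))
    return selected

# Hypothetical nested sampling frame (illustrative only).
frame = {
    f"region_{r}": {
        f"city_{r}_{c}": [f"household_{r}_{c}_{h}" for h in range(20)]
        for c in range(3)
    }
    for r in range(4)
}

sample = multistage_sample(frame, n_regions=2, n_cities=2, n_households=5, seed=1)
print(len(sample))  # 2 regions x 2 cities x 5 households = 20
```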
(4) Systematic sampling
Systematic sampling is particularly popular for its simplicity and efficiency. It is often used
when the population is organized in the form of a list or file. For example, if you
have an ordered list of 1,000 people and you wish to select 100, you would choose
a random starting point, then select every k-th person until you reach the desired
sample size. The step k is determined by dividing the total population size by the
desired sample size.
If the elements of the population are arranged randomly (with no particular
pattern), systematic sampling produces a sample equivalent to that obtained by
simple random sampling. In other words, both methods have the same
probability of selecting representative samples, since the absence of order in the
population guarantees that each unit has an equal chance of being selected,
regardless of the method.
• Define the sample size: Start by determining the required sample size (n) and
the total population size (N).
• Calculate the sampling interval: The sampling interval k is calculated as
follows: k = N/n. This interval represents how many elements to skip between
two selections.
• Select a random starting point: Choose a random starting point within the
first k elements of the population.
• Select the elements: From the chosen starting point, select every k-th
element until the required sample size is obtained.
Example:
Imagine a school has a list of 1,000 students, and the administration wants to
survey 100 of them about their satisfaction with the cafeteria services.
Desired sample size (n): 100 students; total population size (N): 1,000 students.
k = N/n = 1000/100 = 10
This means every 10th student will be selected.
Let’s say the administration randomly selects the 4th student from the list as the
starting point.
Starting from the 4th student, the administration will select every 10th student:
4th, 14th, 24th, 34th, 44th, ..., up to 994th.
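The school example can be reproduced with a short function. This is a minimal sketch assuming that N is a multiple of n, so that the interval k is a whole number; the function name `systematic_sample` is hypothetical. The starting point is passed explicitly here to match the example, and is drawn at random among the first k positions when omitted.

```python
import random

def systematic_sample(population, n, start=None, seed=None):
    """Select every k-th element, beginning at a start position between 1 and k."""
    N = len(population)
    k = N // n  # sampling interval (assumes N is a multiple of n)
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start among the first k
    # Positions are 1-based, as in the example (4th, 14th, 24th, ...).
    return [population[position - 1] for position in range(start, N + 1, k)][:n]

students = list(range(1, 1001))  # students numbered 1 to 1,000
selected = systematic_sample(students, n=100, start=4)

print(selected[:5], "...", selected[-1])
```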
Advantages of Systematic Sampling:
• Simplicity: Systematic sampling is easy to implement. Once the sampling
interval is determined and a random starting point is selected, the process of
selecting the sample is straightforward.
• Efficiency: It requires less time and effort compared to random sampling
methods, as the selection follows a regular pattern. This makes it faster to gather
data.
Limitations:
• Requires a Complete List: This method necessitates having a complete and
ordered list of the population, which may not always be available.
• Less Randomness: While systematic sampling is efficient, it is less random than
simple random sampling: once the starting point is chosen, every other selection
is determined. If the list follows a periodic pattern that coincides with the
interval k, the sample can be unrepresentative.
Non-Probability-Based Sampling
Sampling process
• Choice of location: The researcher goes to the cafeteria, a place frequented by
many students, to maximize the number of potential participants.
• Participant selection: The researcher begins by approaching nearby students,
asking them questions about their coffee consumption. He makes no effort to
ensure that participants represent different age groups, genders or fields of
study. He simply focuses on those who are present and available at the time.
• Advantages of convenience sampling:
Convenience sampling, also called accidental sampling, has several advantages. Firstly, it is simple to implement,
enabling data to be collected quickly, especially in situations where time is
limited. Secondly, it is cost-effective, as it reduces the costs associated with
research, such as travel or remuneration of participants. In addition, it can be
useful for exploratory studies where initial hypotheses are being tested.
A researcher conducts a study on the use of fitness apps among young adults.
Before starting data collection, the researcher defines specific inclusion criteria:
• Participants must be between the ages of 18 and 30.
• Participants must be residents of a specific city to control for geographical
factors.
The researcher then uses social media platforms, inviting members who match
the inclusion criteria to participate in the study.
The resulting sample is made up of individuals who meet the a priori criteria,
enabling the researcher to gather specific information.
Once quotas have been set, individuals are selected at the interviewer's
convenience.
The criteria used to define quotas should not be too numerous. Beyond 3 criteria,
the process becomes complex.
Example of Quota Sampling:
[Figure: population structure and the corresponding sample distribution]
Exercise: Quota sampling method
A company wants to conduct a survey among its clients to evaluate their
satisfaction. The total population of clients is divided according to two main
criteria: gender and age group, but the age group proportions differ for each
gender.
The company wants to form a representative sample of 500 clients using the
quota sampling method.
Questions:
● How many men and how many women should be included in the sample?
● Calculate the distribution of men and women by age group.
● How many clients in total will be in each age group, regardless of gender?
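The calculation behind these three questions can be sketched as follows. The percentages used here are hypothetical placeholders, not the data of the exercise, and the function name is an assumption; the same steps apply once the actual population structure is substituted.

```python
def quota_table(n, gender_pct, age_pct_by_gender):
    """Quota for each (gender, age group) cell: n x gender share x age share."""
    quotas = {}
    for gender, g_pct in gender_pct.items():
        for age_group, a_pct in age_pct_by_gender[gender].items():
            # Shares are given in percent, hence the division by 100 * 100.
            quotas[(gender, age_group)] = round(n * g_pct * a_pct / 10_000)
    return quotas

# Hypothetical population structure in percent (placeholders for illustration).
gender_pct = {"Men": 60, "Women": 40}
age_pct_by_gender = {
    "Men": {"18-34": 30, "35-54": 50, "55+": 20},
    "Women": {"18-34": 40, "35-54": 40, "55+": 20},
}

quotas = quota_table(500, gender_pct, age_pct_by_gender)

# Question 1: totals by gender.
by_gender = {g: sum(q for (gender, _), q in quotas.items() if gender == g) for g in gender_pct}

# Question 3: totals by age group, regardless of gender.
by_age = {}
for (_, age_group), q in quotas.items():
    by_age[age_group] = by_age.get(age_group, 0) + q

print(by_gender)
print(quotas)
print(by_age)
```

Because the age structure differs between men and women, the age-group totals must be obtained by adding the cells of the table rather than by applying a single set of age percentages to the whole sample.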