RSU - Statistics - Lecture 1 - Final - myRSU
The related term data science (or data analysis) refers to the study of
processes and systems that extract knowledge or insights from data
in various forms, either structured or unstructured. Data science
builds on fields such as statistics, data mining, and predictive
analytics.
Key Terms
Observation unit is the unit described by the data that one analyzes.
The unit of observation might be an individual, university, country,
etc.
Data are the actual values of a variable. They may be numeric values
or text (string) values.
Types of data: primary and secondary.
Four broad kinds of research designs are used in the behavioral and
social sciences:
• survey,
• experimental,
• comparative,
• and ethnographic.
Population: may be too big and/or expensive to study.
Sample: We can learn nearly as much by studying a suitably large,
correctly specified sample of a population as we can from studying
the entire population.
Random Selection
1. Bains, W. (2008) Random number generation and creativity, Medical Hypotheses, 70(1), pp. 186-190
Determining Sample Size
One crucial aspect of study design is deciding how big your sample
should be. If you increase your sample size you increase the precision
of your estimates, which means that, for any given estimate / size of
effect, the greater the sample size the more “statistically significant”
the result will be. Conversely, if your analysis is based on a small
number of observations, it may fail to detect effects that really exist
in the population.
However, a larger sample also costs more, so one should aim for an
optimal sample size.
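The trade-off between sample size and precision can be illustrated with a short sketch. It assumes simple random sampling and the normal approximation for a proportion; the function name `margin_of_error` and the defaults (p = 0.5, 95% confidence) are illustrative choices, not values taken from the lecture.

```python
import math
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

# Each step quadruples the sample size.
for n in (100, 400, 1600):
    print(f"n = {n:5d}: margin of error = {margin_of_error(n):.3f}")
```

Under these assumptions the margin of error shrinks with the square root of n, so quadrupling the sample only halves the margin. This is why the cost of extra precision rises quickly.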
Three main criteria need to be specified to determine the appropriate
sample size:
1. the level of precision
2. the level of confidence
3. the degree of variability of parameters measured
1. The level of precision
The level of precision, or sampling error, is the range within which the
true population value is estimated to lie.
This range is often expressed in percentage points (e.g. ±5%). Thus,
if we find that 40% of students in our sample have read “Introduction
to Statistics” from A to Z, and our level of precision is ±5 percentage
points, then we may conclude that between 35% and 45% of all students
have read the entire book.
2. The confidence level
NB! The sample size refers to the number of respondents, not to the
number of people invited to participate in a survey. These sample
sizes also assume that a truly random sample is used. If you need to
reflect differences in gender, age, or geographic distribution, then
you have to use stratified sampling and a larger sample size.
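As a rough illustration of stratified sampling, the sketch below splits a population into strata and draws from each stratum in proportion to its size. The function `stratified_sample`, the proportional allocation rule, and the 600/400 population are all hypothetical; other allocation rules exist, and the sketch does not address how much larger the total sample should be.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n, seed=0):
    """Draw about n units, allocating draws to strata in proportion to their size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Proportional allocation: each stratum's share of n matches its population share.
        k = round(n * len(members) / len(units))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical population of 600 women ("F") and 400 men ("M").
population = [("F", i) for i in range(600)] + [("M", i) for i in range(400)]
sample = stratified_sample(population, stratum_of=lambda u: u[0], n=100)
print(len(sample), sum(1 for u in sample if u[0] == "F"))
```

Because each stratum's share is rounded separately, the total can differ from n by a unit or two when the shares are not whole numbers.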
3. Using Formulae
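$$ n = \frac{z^2 \, p\,(1 - p)}{e^2} = \frac{1.96^2 \times 0.5 \times (1 - 0.5)}{0.05^2} = 384.16 \approx 385 $$
where z is the z-score for the chosen confidence level, p is the
estimated proportion, and e is the level of precision.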
278
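The 384.16 ≈ 385 figure matches the standard sample size formula for a proportion, n = z² p (1 − p) / e², with z = 1.96 (95% confidence), p = 0.5 and e = 0.05. These inputs are inferred from the result rather than stated in the text, so the sketch below should be read under that assumption. The finite population correction and the population size N = 1000 are likewise assumptions added for illustration.

```python
import math

def sample_size_proportion(e=0.05, p=0.5, z=1.96):
    """Sample size for estimating a proportion: n0 = z^2 * p * (1 - p) / e^2."""
    return z**2 * p * (1 - p) / e**2

def finite_population_correction(n0, population_size):
    """Shrink n0 when the population is small: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population_size)

n0 = sample_size_proportion()
print(round(n0, 2), math.ceil(n0))

# N = 1000 is a hypothetical population size, not a value given in the text.
n_small = finite_population_correction(n0, population_size=1000)
print(math.ceil(n_small))
```

Rounding up rather than to the nearest integer keeps the achieved precision at least as good as the one requested.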
The use of tables and formulas to determine sample size in the above
discussion employed proportions that assume a dichotomous
response for the attributes being measured. There are two methods
to determine sample size for variables that are polytomous or
continuous. One method is to combine responses into two categories
and then use a sample size based on proportion. The second method
is to use the formula for the sample size for the mean. The formula of
the sample size for the mean is similar to that of the proportion,
except for the measure of variability. The formula for the mean
employs σ² instead of p × (1 − p).
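A minimal sketch of the sample size formula for a mean, n = z² σ² / e², is given below. It assumes the population standard deviation σ is known or estimated from a pilot study; the function name and the example values (σ = 15, e = 2) are hypothetical.

```python
import math

def sample_size_mean(sigma, e, z=1.96):
    """Sample size for estimating a mean to within +/- e: n = z^2 * sigma^2 / e^2."""
    return math.ceil(z**2 * sigma**2 / e**2)

# Hypothetical example: scores with a standard deviation of 15 points,
# to be estimated to within 2 points at 95% confidence (z = 1.96).
n = sample_size_mean(sigma=15, e=2)
print(n)
```

Here e is expressed in the same units as the variable itself, not in percentage points as in the formula for a proportion.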
Sample selection bias is the bias that results from the failure to
ensure the proper randomization of a population sample.
Flaws in the sample selection process lead to situations where
some groups or individuals in the population are less likely to be
included in the sample, while others are more likely to participate.
The presence of sample selection bias may distort the statistical
analysis of a sample and affect the statistical significance reported
by the chosen tests.
Types of Sample Selection Bias
1. Self-selection
Self-selection happens when the participants have some control over
whether they take part in the study. Since the participants may decide
whether to participate in the research or not, the selected sample may
not represent the entire population.
2. Selection from a specific area
The participants of the study are selected from certain areas only,
while other areas are not represented in the sample.
3. Exclusion
Some groups in the population are excluded from the study.
4. Survivorship bias
Survivorship bias occurs when a sample includes only subjects that
passed a selection process and ignores those that did not. Survivorship
bias tends to produce overly optimistic findings.
5. Pre-screening of participants
The participants of the study are recruited only from particular
groups. Thus, the sample will not represent the entire population of
the study.
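Two of the bias types above, self-selection and survivorship bias, can be illustrated with a small simulation. Everything in this sketch is hypothetical: the satisfaction scores, the assumption that happier people respond more often, the fund returns, and the −10% survival cut-off are invented only to show the direction of the bias.

```python
import random
from statistics import mean

rng = random.Random(42)

# Self-selection: hypothetical satisfaction scores from 1 to 5, where people
# with higher scores are assumed to be more likely to answer the survey.
population = [rng.randint(1, 5) for _ in range(10_000)]
respondents = [s for s in population if rng.random() < 0.1 * s]

# Survivorship: hypothetical annual fund returns, where only funds that did
# not lose more than 10% are assumed to remain in the database.
returns = [rng.gauss(0.05, 0.15) for _ in range(10_000)]
survivors = [r for r in returns if r > -0.10]

print(f"mean score, whole population: {mean(population):.2f}")
print(f"mean score, respondents only: {mean(respondents):.2f}")
print(f"mean return, all funds:       {mean(returns):.3f}")
print(f"mean return, survivors only:  {mean(survivors):.3f}")
```

In both cases the selected group looks better than the population it came from, even though no individual value was changed. Only the selection rule differs.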
How to Overcome Bias?