RSU - Statistics - Lecture 1 - Final - myRSU

The document discusses applied statistics including data types, sampling, descriptive statistics, and statistical methods. It covers topics like population and sample data, determining optimal sample size, and research design. Statistical analysis techniques are used to extract information from data and assess research outputs.

APPLIED STATISTICS

Docent, Dr.oec. Irina Mozhaeva


Lecture 1
Data and Sampling
SYLLABUS

1. Types of Data and Data Collection. Population and sample data.
General ideas and types of sampling, optimal sample size
determination. Bias: how it arises and is avoided.
2. Aim of the survey. Designing questionnaires: best practice, tips
and common mistakes. Types of questions. Defining groups and
intervals. Summary representation of data.
3. Distributions of random variables. Exploring data: basic concepts
of descriptive statistics, measures of central tendency and
variation.
4. Statistical inference, confidence intervals. Central Limit Theorem.
5. Bivariate analyses. Correlation and covariance. Lines of best fit.
6. Time series analysis. Moving average and exponential smoothing
techniques. Seasonality. Use of a trend line and seasonal
component in prediction.
Assessment Criteria

• 50% of the final grade - exam in statistics
• 50% of the final grade - independent work, homework, activity:
– Attendance of lectures and seminars: 5%
– Activity and quality of answers in seminars: 15%
– Homework assignments and interim tests: 30%
LECTURE 1 OUTLINE

• Statistics: Basic Definitions and Concepts
• Statistical methods
• Data types
• Data collection, research design
• Population and sample data
• Sampling, optimal sample size determination
• Bias: how it arises and is avoided
Statistics: Basic Definitions and Concepts
Key Terms

Statistics, also known as statistical analysis or statistical inference, is
a field of study concerned with collecting, summarizing, and interpreting
data, and with making decisions based on data.

The related term data science or data analysis stands for a study of
processes and systems that extract knowledge or insights from data
in various forms, either structured or unstructured. Data science is a
continuation of some of the fields such as statistics, data mining, and
predictive analytics.
Key Terms

A population is any specific collection (the whole set!) of objects
(persons, things, etc.) of interest.

To study the population, we usually select a sample. The idea
of sampling is to select a subset or subcollection of the population
and study it to gain information about the population.

A representative sample is a subset of a population that seeks to
accurately reflect the characteristics of the population.
Key Terms

A quantity calculated in a sample to estimate a value in a population
is called a statistic.

A parameter is a numerical characteristic of the whole population
that can be estimated by a statistic.
Key Terms

An observation unit is the unit described by the data that one analyzes.
The unit of observation might be an individual, university, country,
etc.

A variable, usually denoted by letters such as X and Y, is a
characteristic or measurement that can be determined for each
observation unit in the sample or population.

Data are the actual values of the variable. They may be numeric or
text (string) values.

The probability of an event is a measure of the likelihood that the
event will occur.
Statistical Methods

Statistical methods are mathematical formulas, models, and
techniques that are used in statistical analysis of raw research data.

The application of statistical methods extracts information from
research data and provides different ways to assess the robustness
of research outputs.
Statistical Methods

Descriptive statistics is the branch of statistics that involves
organizing, summarizing, displaying, and describing data.

Inferential statistics is the branch of statistics that involves drawing
conclusions that extend beyond the immediate data alone.
Descriptive Statistics. Univariate analysis.

Univariate analysis involves the examination across cases of one
variable at a time.
Since it is a single variable, it doesn’t deal with causes or
relationships. The main purpose of univariate analysis is to describe
the data and find patterns that exist within it.
There are three major characteristics of a single variable that we
tend to look at:
• the distribution
• the central tendency (e.g. mean, mode, median)
• the dispersion (e.g. variance, standard deviation,
interquartile range)

In many situations, we would describe all three of these
characteristics for our variables.
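
The measures of central tendency and dispersion listed above can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the `scores` list is made-up data, not taken from the lecture, and `statistics.quantiles` uses its default ("exclusive") quartile method, which is one of several conventions.

```python
import statistics

# Hypothetical observations of a single variable (made-up data).
scores = [2, 4, 4, 5, 7, 9, 12]

# Central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Dispersion (sample variance and standard deviation use n - 1)
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
iqr = q3 - q1

print("mean:", round(mean, 2), "median:", median, "mode:", mode)
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2), "IQR:", iqr)
```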
Descriptive Statistics. Bivariate Analysis

Bivariate analysis is used to find out if there is a relationship
between two different variables. Something as simple as
creating a scatterplot by plotting one variable against another
on a Cartesian plane (think X and Y axis) can sometimes give
you a picture of what the data is trying to tell you.
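
A scatterplot is often complemented by a single number that summarizes the strength of a linear relationship. The sketch below computes the Pearson correlation coefficient by hand; the function name `pearson_correlation` and the data for `x` and `y` are assumptions made for illustration, not part of the lecture.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations (made-up data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_correlation(x, y)
print("r =", round(r, 4))
```

A value of r close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or no linear relationship.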
Multivariate Analysis

Multivariate analysis is the analysis of three or more
variables. There are many ways to perform multivariate analysis
depending on your goals. Some of these methods include:
• Additive Tree
• Canonical Correlation Analysis
• Cluster Analysis
• Correspondence Analysis
• Factor Analysis
• Generalized Procrustes Analysis
• Multidimensional Scaling
• Multiple Regression Analysis
• Partial Least Squares Regression
• Redundancy Analysis
Data Types
Types of Data

Data: Primary | Secondary

Primary data - quantitative or qualitative data obtained directly
from individuals, objects or processes. Such data is usually
collected specifically for the research problem you plan to study.

Secondary data - data gathered by another researcher or agency
(and made available to you). Examples: census data published by
the Central Statistical Bureau, stock prices data published by CNN,
formal unemployment data provided by the State Employment
Agency.
Research Designs for Primary Data Collection
Study design

In many ways the design of a study is more important than the
analysis. A badly designed study can never be retrieved, whereas
poor analysis can usually be amended or altered.

Hence, it is important at the outset to:
• Make objectives/research questions clear and unambiguous
(hypothesis-driven)
• Identify what data you need
• Plan your statistical analysis and decide on the methodology
applied before you collect any data.
Kinds of research designs

Four broad kinds of research designs are used in the behavioral and
social sciences:
• survey,
• experimental,
• comparative,
• and ethnographic.

Survey designs include the collection and analysis of data from
censuses, sample surveys, and longitudinal studies and the
examination of various relationships among the observed
phenomena. Randomization here is used to select members of a
sample so that the sample is as representative of the whole
population as possible.
Kinds of research designs

Experimental designs, in either the laboratory or field settings,
systematically manipulate a few variables while others that may affect the
outcome are held constant, randomized, or otherwise controlled. The
purpose of randomized experiments is to ensure that only one or a few
variables can systematically affect the results, so that causality can be
analyzed.
Comparative designs involve the retrieval of evidence that is recorded
in the flow of current or past events in different times or places and the
interpretation and analysis of this evidence.
Ethnographic designs involve a qualitative method where researchers
observe and/or interact with a study’s participants in their real-life
environment. The aim of an ethnographic study is to get ‘under the
skin’ of a problem (and all its associated issues). It is hoped that by
achieving this, a designer will be able to truly understand the problem
and therefore design a far better solution.
Population and Sample Data. Sampling, Optimal Sample Size Determination.
Population and Sample

Population:
may be too big and/or expensive to study.

Sample:
We can learn nearly as much by studying a suitably large, correctly
specified sample of a population as we can from studying the entire
population.
Random Selection

Most research study designs require a sample to be randomly
selected from a population.
Research [1] suggests humans cannot generate random numbers and
thus cannot make random selections.
For a simplified random selection using one variable you can:
• Select numbered balls out of a bag (as in a lottery)
• Use an online random number generator, such as
www.random.org/integers
• Use the RAND or RANDBETWEEN functions in Excel
o See a tutorial: https://www.youtube.com/watch?v=fkyzQvjsqz0
• Or use Data Analysis in Excel.
o See a tutorial: https://www.youtube.com/watch?v=5XrJcFmbpWI&t=15s

1. Bains, W. (2008) Random number generation and creativity, Medical Hypotheses, 70(1), pp. 186-190
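
The same kind of random selection can also be done in a few lines of Python. This is a minimal sketch rather than a method prescribed in the lecture: the helper name `draw_simple_random_sample`, the sampling frame of 500 numbered units and the seed value are all assumptions made for illustration.

```python
import random

def draw_simple_random_sample(population_ids, sample_size, seed=None):
    """Draw sample_size distinct units without replacement.

    A fixed seed makes the draw reproducible; leave it as None
    for a fresh random selection each time.
    """
    rng = random.Random(seed)
    return rng.sample(list(population_ids), sample_size)

# Hypothetical sampling frame: units numbered 1..500.
frame = range(1, 501)
chosen = draw_simple_random_sample(frame, 20, seed=42)
print(sorted(chosen))
```

Every unit in the frame has the same chance of being selected, which is the defining property of a simple random sample.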
Determining Sample Size

One crucial aspect of study design is deciding how big your sample
should be. If you increase your sample size you increase the precision
of your estimates, which means that, for any given estimate / size of
effect, the greater the sample size the more “statistically significant”
the result will be. In other words, if your analysis is based on a small
number of observations, it may fail to detect effects that really exist
in the population.
However, an increase in the sample size increases costs; therefore one
should define an optimal sample size.
Three main criteria need to be specified to determine the appropriate
sample size:
1. the level of precision
2. the level of confidence
3. the degree of variability of parameters measured
1. The level of precision

The level of precision, or sampling error, is the range in which the true
value of the population is estimated to be.
This range is often expressed in percentage points (e.g. ±5%). Thus,
if we find that 40% of students in our sample have read “Introduction
to Statistics” from A to Z, and our precision rate is ±5%, then we
may conclude that between 35% and 45% of all students have read
the entire book.
2. The confidence level

The confidence or risk level is based on ideas encompassed under
the Central Limit Theorem. The key idea of this theorem is that when
a population is repeatedly sampled, the average of the values of the
attribute obtained by those samples approaches the true population
value. Furthermore, the values obtained by these samples are
distributed approximately normally about the true value, with some
samples having a higher value and some obtaining a lower value.
In a normal distribution, about 95% of the sample values will lie
within two standard deviations of the true population value. This
means that if a 95% confidence level is selected, 95 out of 100 samples
will have the true population value within the specified range of
precision.

[Figure: Distribution of Means for Repeated Samples]
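
The behaviour of repeated samples can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the population is artificial (20,000 values drawn from a skewed exponential distribution), and the sample size of 100, the 2,000 repetitions and the seeds are arbitrary choices, not values from the lecture.

```python
import random
import statistics

def simulate_sample_means(population, sample_size, repetitions, seed=0):
    """Repeatedly draw random samples and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repetitions)
    ]

# Artificial, clearly non-normal population (hypothetical data).
population_rng = random.Random(1)
population = [population_rng.expovariate(1.0) for _ in range(20000)]
true_mean = statistics.fmean(population)

means = simulate_sample_means(population, sample_size=100, repetitions=2000)
spread = statistics.pstdev(means)  # standard deviation of the sample means

# Share of sample means lying within two standard deviations of the true mean
share_within_two_sd = sum(
    abs(m - true_mean) <= 2 * spread for m in means
) / len(means)

print("true mean:", round(true_mean, 3))
print("average of sample means:", round(statistics.fmean(means), 3))
print("share within two standard deviations:", round(share_within_two_sd, 3))
```

Even though the population itself is skewed, the sample means cluster around the true mean, and roughly 95% of them should fall within two standard deviations of it.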
3. Degree of variability

The degree of variability in the attributes being measured refers to
the distribution of attributes in the population. The more
heterogeneous a population, the larger the sample size required to
obtain a given level of precision. The more homogeneous a
population, the smaller the sample size required.
You should note that a 50/50 split on a specific attribute or response
indicates maximum variability in the population, whereas a 90/10
split means that 90 per cent of the population share an attribute, so
the sample is less variable. If you don’t know what level of variability
to expect, then assume that it is 50 per cent. This may mean that you
use a larger sample size than was really needed, but that is better
than using a sample size that is too small, and then having no
confidence in the results.
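
One way to see why a 50/50 split is the most variable case is to look at the product p × (1 − p), which measures the variability of a yes/no attribute. The short sketch below tabulates it for a few splits; the function name `binary_variability` and the chosen splits are assumptions made for illustration.

```python
def binary_variability(p):
    """Variance of a yes/no attribute held by a share p of the population."""
    return p * (1 - p)

splits = [0.1, 0.3, 0.5, 0.7, 0.9]
for p in splits:
    print(f"split {p:.0%}/{1 - p:.0%}: variability = {binary_variability(p):.2f}")

most_variable_split = max(splits, key=binary_variability)
print("most variable split:", most_variable_split)
```

The value peaks at p = 0.5, which is why assuming 50 per cent is the cautious choice when the true split is unknown.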
Options for Determining Sample Size

There are several approaches to determining the sample size. These
include:
1. Census for small populations
2. Using published tables
3. Applying formulae
1. Using a Census for Small Populations

One option is to undertake a census, that is, to survey every member
of the population.
For small populations this may well be the only way to guarantee a
degree of accuracy. It eliminates all sampling error and provides data
on the whole population. However, cost considerations make this
impractical once populations exceed a few hundred.
2. Using Published Tables

The quickest option is to find a relevant table of sample sizes. These
will give sample sizes for different populations and with different
levels of precision, confidence levels and degrees of variability.
If you are happy with a 95 per cent confidence level, 5 per cent
precision and 50 per cent degree of variability, then you can choose
the sample size from the table shown below.

NB! The sample refers to the number of respondents, and not to the
number of people invited to participate in a survey. These sample
sizes also assume a truly random sample is used. If you need to
reflect differences in gender or age or geographic distribution, then
you have to use a stratified sampling system and a larger sample size.
3. Using Formulae

If you want different confidence levels, or have different degrees
of variability, then you may find it easiest to use a formula to calculate
the sample size. For large populations and cases when the population
size is unknown, this formula by Cochran (1963) will tell you the
sample size required.

$$ n_0 = \frac{Z^2 \, p \, (1 - p)}{e^2} $$

Z depends on the degree of confidence that you want. For a
confidence level of 95 per cent, Z=1.96; for 90 per cent, Z= 1.645; and
for 99 per cent, Z=2.576.
p is the degree of variability, expressed as a decimal; if you don’t
know this, then use 0.5.
e is the level of precision, expressed as a decimal.
• Here e is an absolute precision: it refers to the actual uncertainty in a
quantity. For example, if the prevalence of coronavirus is 20% ± 10%, the
absolute uncertainty is 10 percentage points.
3. Using Formulae. Example 1.

Imagine that you need to survey the total population of SMEs to
discover how many have a loan from a bank. You are happy with a
confidence level of 95 per cent, a precision rate of ±5 per cent and a
degree of variability of 50 per cent.
For a confidence level of 95 per cent, Z=1.96.

$$ n_0 = \frac{1.96^2 \times 0.5 \times (1 - 0.5)}{0.05^2} = 384.16 \approx 385 $$

You obviously cannot interview a fraction of a person, so you need to
round upwards.
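
The calculation in this example can be reproduced with a short script. It is a sketch that assumes the usual form of Cochran's formula for a proportion, n0 = Z² × p × (1 − p) / e²; the function name `cochran_sample_size` and the `Z_SCORES` lookup are names chosen here for illustration.

```python
import math

# Z values for the confidence levels quoted in the lecture.
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def cochran_sample_size(confidence, p, e):
    """Unrounded sample size for estimating a proportion.

    confidence: confidence level, one of the keys of Z_SCORES
    p: expected proportion (degree of variability), as a decimal
    e: absolute precision, as a decimal
    """
    z = Z_SCORES[confidence]
    return z ** 2 * p * (1 - p) / e ** 2

n0 = cochran_sample_size(confidence=0.95, p=0.5, e=0.05)
required_respondents = math.ceil(n0)  # always round upwards
print(round(n0, 2), "->", required_respondents)
```

Changing `p` to a less variable split, or `e` to a looser precision, lowers the required number of respondents.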
3. Using Formulae. Example 1.
Population Correction
If your total population is small, then your sample can be smaller.
If the total population of SMEs is just 1,000 members, then you can
adjust your sample size by using this equation, where n is the new
sample size, n_0 is the sample size calculated for a large population,
and N is the size of the population.

$$ n = \frac{n_0}{1 + \frac{n_0 - 1}{N}} = \frac{384.16}{1 + \frac{384.16 - 1}{1000}} = 277.74 \approx 278 $$

You need to do the earlier calculation (on previous slide) to discover
the sample size for a large population and then you can apply the
‘finite population correction’.
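
The correction step can be sketched in the same way. The code assumes the correction takes the form n = n0 / (1 + (n0 − 1) / N), as in the worked numbers for this example; the function name `finite_population_correction` is chosen here for illustration.

```python
import math

def finite_population_correction(n0, population_size):
    """Adjust a large-population sample size n0 for a population of size N."""
    return n0 / (1 + (n0 - 1) / population_size)

# SME example: n0 = 384.16 from the large-population formula, N = 1,000.
n = finite_population_correction(384.16, 1000)
required_respondents = math.ceil(n)  # round upwards to whole respondents
print(required_respondents)
```

The smaller the population relative to n0, the larger the reduction; for very large populations the correction leaves n0 almost unchanged.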
Determining Sample Size Using Formulae. Task 1.

We want to estimate the true immunization coverage in a community
of children. Research tells us that immunization coverage should be
somewhere around 80% to avoid spread of the disease.
Precision (absolute): we would like the result to be within 2% of the
true value. Confidence level: conventional 95%.
Calculate the appropriate sample size using
the formula proposed by Cochran.

p = guess for the expected proportion in the population = 0.80
e = absolute precision = 0.02
Confidence level = 95% ⇒ Z = 1.96

$$ n_0 = \frac{1.96^2 \times 0.8 \times 0.2}{0.02^2} = 1536.64 \approx 1537 $$
Determining Sample Size Using Formulae. Task 1.

Now imagine we want to estimate the true immunization coverage in
one school only. Our population is 1200 children; the other parameters
are the same as above.
Adjust the sample size accordingly using the formula:

$$ n = \frac{1536.64}{1 + \frac{1536.64 - 1}{1200}} = 674.05 \approx 675 $$


Formula for Sample Size for the Mean

The use of tables and formulas to determine sample size in the above
discussion employed proportions that assume a dichotomous
response for the attributes being measured. There are two methods
to determine sample size for variables that are polytomous or
continuous. One method is to combine responses into two categories
and then use a sample size based on proportion. The second method
is to use the formula for the sample size for the mean. The formula of
the sample size for the mean is similar to that of the proportion,
except for the measure of variability. The formula for the mean
employs σ² instead of p × (1 − p):

$$ n_0 = \frac{Z^2 \sigma^2}{e^2} $$

σ² is the variance of an attribute in the population.
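
A minimal sketch of the sample size for the mean is shown below. It assumes the standard form n0 = Z² × σ² / e², with e expressed in the same units as the variable; the function name `sample_size_for_mean` and the example values (a standard deviation of 15 and a precision of ±2) are hypothetical, not taken from the lecture.

```python
import math

def sample_size_for_mean(z, sigma, e):
    """Unrounded sample size for estimating a population mean.

    z: Z value for the chosen confidence level (1.96 for 95%)
    sigma: estimated population standard deviation
    e: desired precision, in the same units as the variable
    """
    return (z * sigma / e) ** 2

# Hypothetical example: sigma = 15 units, precision = +/- 2 units, 95% confidence.
n0 = sample_size_for_mean(z=1.96, sigma=15, e=2)
required_respondents = math.ceil(n0)  # round upwards
print(round(n0, 2), "->", required_respondents)
```

Because sigma enters the formula squared, an overly optimistic guess of the population variance can lead to a sample that is too small.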


The disadvantage of the sample size based on the mean is that a
"good" estimate of the population variance is necessary.
Bias: How It Arises and Is Avoided
What is Sample Selection Bias?

Sample selection bias is the bias that results from the failure to
ensure the proper randomization of a population sample.
The flaws of the sample selection process lead to situations where
some groups or individuals in the population are less likely to be
included in the sample, while others are more likely to participate.
The presence of sample selection bias may distort the statistical
analysis of a sample and affect the statistical significance of the
chosen statistical tests.
Types of Sample Selection Bias

1. Self-selection
Self-selection happens when the participants of the study exercise
control over the decision to participate in the study to a certain extent.
Since the participants may decide whether to participate in the
research or not, the selected sample does not represent the entire
population.
2. Selection from a specific area
The participants of the study are selected from certain areas only
while other areas are not represented in the sample.
3. Exclusion
Some groups in the population are excluded from the study.
Types of Sample Selection Bias

4. Survivorship bias
Survivorship bias occurs when a sample is concentrated on subjects
that passed the selection process and ignores subjects that did not pass
the selection process. The survivorship bias results in overly optimistic
findings from the study.

5. Pre-screening of participants
The participants of the study are recruited only from particular
groups. Thus, the sample will not represent the entire population of
the study.
How to Overcome Bias?

The most obvious method is the establishment of a random sample
selection process.
Furthermore, one should ensure that the subgroups selected are
equivalent to the population in terms of their key characteristics (if
the key characteristics of population are known).
By analyzing the population of the study and by identifying the
subgroups of the population, a researcher must ensure that the
selected sample represents the total population as much as
possible.
If some of the population subgroups in the resulting sample are
underrepresented while other groups are overrepresented, a
researcher should apply a statistical correction by assigning weights
that will correct the bias.
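
One common way to assign such weights is to divide each subgroup's share of the population by its share of the sample. The sketch below illustrates this idea only; the function name `subgroup_weights`, the two subgroups and all the numbers are hypothetical.

```python
def subgroup_weights(population_shares, sample_counts):
    """Weight per subgroup = population share / sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }

# Hypothetical survey: women are overrepresented among the respondents.
population_shares = {"women": 0.5, "men": 0.5}
sample_counts = {"women": 70, "men": 30}

weights = subgroup_weights(population_shares, sample_counts)
for group, weight in weights.items():
    print(group, round(weight, 3))

# After weighting, each subgroup contributes in proportion to its population share.
weighted_counts = {g: weights[g] * sample_counts[g] for g in sample_counts}
```

Respondents from the underrepresented subgroup receive a weight above 1 and those from the overrepresented subgroup a weight below 1, while the weighted total still equals the number of respondents.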
