Statistics refresher
Part I – Basic statistics
Types of variables
Qualitative variables
• Nominal – No hierarchy; description of case
– Examples: Gender, marital status, household type
• Ordinal – Hierarchically ordered; no meaningful or
set difference between numbers
– Examples: education level, satisfaction rates (low vs.
medium vs. high).
Quantitative variables
• Interval – Hierarchically ordered; set difference
between numbers; arbitrary zero
– Examples: IQ scores, birth year
• Ratio – Hierarchically ordered; set difference
between numbers; logical/real zero
– Examples: Age, distance in km, income in euros
• Discrete – Ordered; set difference; no decimals
– Example: number of children
Analyses for different types of variables
• Nominal – Frequencies; cross-tabulations; Chi2 test
• Ordinal – Frequencies; cross-tabulations; Chi2 test;
logit/probit model
• Interval – Mean; standard deviation; correlations; t-
test; OLS regression
• Ratio – Mean; standard deviation; correlations; t-
test; OLS regression
Life satisfaction / happiness instruments
• Often a single item scale such as: “On a scale from 0 to 10,
how satisfied are you with your life as a whole?”
• Technically, this is an ordinal scale
– Hierarchical order, but no meaningful or set difference
between categories
• However, in research it is usually treated as a ratio scale:
this makes little difference to the results and is easier to interpret.
Descriptive statistics
1. Statistical measures
• Measures of central tendency
– Mean; median; mode
• Measures of variability
– Standard deviation; variance; min-max; range;
quartiles
• Measures of distribution shape
– Skewness; kurtosis
Measures of central tendency
• Mean: sum of all scores divided by the number of cases: $\bar{x} = \frac{\sum x_i}{n}$
• Median: Middle value when all values are ordered from
smallest to largest
• Mode: Most frequent value
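The three measures can be checked on a small set of scores. The sketch below uses Python's standard `statistics` module; the scores are hypothetical and only serve to illustrate the definitions.

```python
from statistics import mean, median, multimode

# Hypothetical life-satisfaction scores (0-10) for nine respondents.
scores = [7, 8, 6, 8, 9, 5, 8, 7, 10]

avg = mean(scores)          # sum of all scores divided by the number of cases
mid = median(scores)        # middle value once the scores are sorted
modes = multimode(scores)   # most frequent value(s); a list, because ties can occur

print(f"mean={avg:.2f} median={mid} mode={modes}")
```

With an even number of cases, `median` returns the average of the two middle values.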
Measures of variability
• Standard deviation: $SD = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
– Subtract the mean from all scores
– Square the results
– Sum the squared results
– Divide by the number of cases
– Take the square root
• Variance: $SD^2$
• Min-max & range: lowest and highest score; range = highest minus lowest
• Quartiles: Q1 – 25% | Q2 – 50% (median) | Q3 – 75%
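The steps for the standard deviation can be followed one by one in code. This is a minimal sketch with hypothetical scores. It divides by the number of cases (the population formula), which is what `pstdev` and `pvariance` do. Note that software packages use slightly different conventions for quartiles, so Q1 and Q3 may differ a little between tools.

```python
from statistics import pstdev, pvariance, quantiles

# Hypothetical scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Standard deviation by hand, following the steps one by one.
mu = sum(scores) / len(scores)                 # the mean
squared = [(x - mu) ** 2 for x in scores]      # subtract the mean, square the result
sd_manual = (sum(squared) / len(scores)) ** 0.5  # sum, divide by N, take the square root

sd = pstdev(scores)                        # same calculation from the standard library
var = pvariance(scores)                    # variance = SD squared
value_range = max(scores) - min(scores)    # range = highest minus lowest
q1, q2, q3 = quantiles(scores, n=4)        # quartiles; q2 is the median

print(sd_manual, sd, var, value_range, (q1, q2, q3))
```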
Table 1: Descriptive statistics - example
                            Mean (SD)
Life satisfaction (1-10)    7.53 (1.37)
Household income in €       2,851 (243)
                            Percent
Gender
  Female                    55%
  Male                      45%
For categorical variables, mean and standard deviation do not make sense!
Descriptive statistics
2. Frequencies and graphs
Frequency table | Circle diagram | Bar chart | Histogram | Line diagram
Nominal X X X
Ordinal X X
Interval / ratio (discrete) X
Interval / ratio (continuous) X X (over time)
• Frequency table
Civil status          Freq.   Percent   Cum.
Married               3,110     51%      51%
Separated                25      0%      51%
Divorced                605     10%      61%
Widow or widower        387      6%      68%
Never been married    1,971     32%     100%
Total                 6,098    100%
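The percent and cumulative columns of a frequency table follow directly from the counts. The sketch below recomputes them. One assumption is made: the married count is taken to be 3,110, because that is the value consistent with the stated total of 6,098 and the 51% share.

```python
# Counts per category. The married count of 3,110 is an assumption: it is the
# value that makes the categories add up to the stated total of 6,098.
counts = {
    "Married": 3110,
    "Separated": 25,
    "Divorced": 605,
    "Widow or widower": 387,
    "Never been married": 1971,
}

total = sum(counts.values())
rows = []
cumulative = 0
for status, freq in counts.items():
    cumulative += freq
    # Percent = share of this category; Cum. = share of all categories so far.
    rows.append((status, freq, 100 * freq / total, 100 * cumulative / total))

for status, freq, pct, cum in rows:
    print(f"{status:<20}{freq:>6}{pct:>8.0f}%{cum:>8.0f}%")
print(f"{'Total':<20}{total:>6}")
```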
Graphs
• Circle diagram (nominal, few categories)
• Bar chart (nominal, ordinal, interval/ratio in categories)
• Histogram (interval/ratio, in intervals)
• Line diagram (interval/ratio over time)
• Boxplot (interval/ratio + nominal/ordinal)
Part II – Relations between variables
Relations between variables
1. Normal distribution
2. Comparing groups
3. Cross-tables
4. Correlations
5. Regression analysis
1. Normal distribution
• Central limit theorem: with a large sample, the distribution of sample means tends to
approximate a ‘normal’ distribution, even when the underlying scores are not normally distributed
• Standard deviation determines width
• Half of the values lie on either side of the centre
• The total area under the curve is 100%
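The properties of the normal curve can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the mean of 7.5 and standard deviation of 1.4 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 7.5, standard deviation 1.4.
dist = NormalDist(mu=7.5, sigma=1.4)

# Half of the area lies below the centre (and half above it).
below_centre = dist.cdf(7.5)

# Share of values within one and within two standard deviations of the mean.
within_1sd = dist.cdf(7.5 + 1.4) - dist.cdf(7.5 - 1.4)
within_2sd = dist.cdf(7.5 + 2.8) - dist.cdf(7.5 - 2.8)

print(f"{below_centre:.2f} {within_1sd:.3f} {within_2sd:.3f}")
```

A larger standard deviation makes the curve wider, but the shares within one and two standard deviations stay the same.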
2. Comparing groups
• T-test: compares means of two groups
– Independent samples (H0: mean group A = mean group B)
– Paired samples (H0: mean at time 1 = mean at time 2)
– One sample (H0: mean = x)
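As an illustration of the independent-samples case, the sketch below computes a t statistic by hand. It uses Welch's version of the test, which does not assume equal variances in the two groups; that choice, the function name `welch_t`, and the data are assumptions made for this example. Turning the statistic into a p-value requires the t distribution, which statistical packages provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return the t statistic for two independent samples (unequal variances)."""
    # Standard error of the difference between the two group means.
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical happiness scores for two groups.
employed = [8, 7, 9, 8, 7, 8]
unemployed = [6, 7, 5, 6, 7, 5]

t = welch_t(employed, unemployed)
print(f"t = {t:.2f}")
```

The further t lies from zero, the stronger the evidence against H0 (equal means).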
• ANOVA: compares means of multiple groups
– H0: mean group A = mean group B = mean group C … etc.
– Example: are all North-Western European countries equally happy?
– Compares the variance between groups to the variance within groups
– H0 is rejected if the variance between groups is large relative to the variance
within groups, i.e. the variation in values is mostly determined by group membership
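The comparison of between-group and within-group variance can be sketched as an F statistic. This is a minimal one-way ANOVA under stated assumptions: the function name `anova_f` and the country scores are hypothetical, and the p-value (which needs the F distribution) is left to a statistical package.

```python
from statistics import mean

def anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of scores."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Between-group variation: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    # Within-group variation: how far each score lies from its own group mean.
    ss_within = 0.0
    for g in groups:
        group_mean = mean(g)
        ss_within += sum((x - group_mean) ** 2 for x in g)

    # F = between-group variance divided by within-group variance.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical happiness scores in three countries.
countries = {"A": [7, 8, 7, 8], "B": [6, 7, 6, 7], "C": [8, 9, 8, 9]}
f_stat = anova_f(list(countries.values()))
print(f"F = {f_stat:.1f}")
```

A large F means that group membership accounts for much of the variation in the scores.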
3. Cross-tables
• Compares percentages of categorical variables
• Can show whether there is a significant difference between
Var A for groups of Var B
                    Low education   Middle education   High education   Total
Highly urban              660             842              1,047        2,549
                          26%             33%               41%         100%
Moderately urban          387             509               445         1,341
                          29%             38%               33%         100%
Not urban                 726             762               659         2,147
                          34%             35%               31%         100%
Total                   1,773           2,113             2,151         6,037
                          29%             35%               36%         100%
Pearson chi2(4) = 69.5156   Pr = 0.000
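The chi-square statistic reported under the cross-table can be recomputed from the observed counts. The sketch below compares each observed count with the count expected if urbanity and education were unrelated; the variable names are chosen for this example.

```python
# Observed counts from the cross-table (rows: highly, moderately, not urban;
# columns: low, middle, high education).
observed = [
    [660, 842, 1047],
    [387, 509, 445],
    [726, 762, 659],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi2({df}) = {chi2:.4f}")
```

The result agrees with the reported value of about 69.5 on 4 degrees of freedom.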
4. Correlations
• Strength of the relation between var A and var B
• Standardized (between -1 and 1), so comparable
• Doesn’t tell you anything about causality
• Significance is judged by the p-value: p < 0.05 means significant at the 95% confidence level
– p < 0.01 = 99% confidence level
– p < 0.10 = 90% confidence level
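A correlation coefficient can be computed by hand to see why it always lies between -1 and 1. The sketch below implements Pearson's r; the function name `pearson_r` and the health and happiness scores are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient between two lists of scores."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    # Co-variation of x and y around their means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Dividing by both spreads standardizes the result to the range -1..1.
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five respondents.
health = [1, 2, 3, 4, 5]
happiness = [5, 6, 6, 8, 9]

r = pearson_r(health, happiness)
print(f"r = {r:.2f}")
```

A value near 1 or -1 indicates a strong linear relation; a value near 0 indicates a weak one.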
Linear regression
• Ordinary Least Squares (OLS)
• Estimates the equation of the line that best describes the
association between two variables
• Minimizes the sum of the squared errors (error = distance between real and
predicted values)
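For a single independent variable, the least-squares line has a closed-form solution. The sketch below fits it and reports the sum of squared errors, which is the quantity OLS minimizes. The function name `ols_fit` and the data are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared distances between real and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical scores for five respondents.
health = [1, 2, 3, 4, 5]
happiness = [5, 6, 6, 8, 9]

b0, b1 = ols_fit(health, happiness)
sse = sum_squared_errors(health, happiness, b0, b1)
print(f"happiness = {b0:.1f} + {b1:.1f} * health (SSE = {sse:.1f})")
```

Any other line through these points gives a larger sum of squared errors.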
Elements of regression model
• y = β0 + β1 x1 + u
• y = dependent variable
• x1 = independent variable
• u = unobserved, error term, disturbance
• β0 = intercept
• β1 = regression coefficient
Multiple regression
• Simple regression
y = β0 + β1 x1 + u
Happiness = Intercept + β1 Health + Error
• Multiple regression
y = β0 + β1 x1 + β2 x2 + u
Happiness = Intercept + β1 Health + β2 Gender + Error
Categorical vs. continuous independent variables
• Independent variables can be included as continuous
or categorical variables in an OLS regression model.
• Qualitative variables are typically included as
categorical variables.
• Quantitative variables are typically included as
continuous variables.
Dummy variable
• You will frequently come across the term “dummy” in
regression models.
• Dummy variable = A dichotomous variable indicating
whether a respondent is in a category or not.
• Example: female (1 = female, 0 = male)
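A categorical variable with more than two categories is usually turned into one dummy per category, leaving out one category as the reference group. The sketch below shows this for employment status; the function name `make_dummies` and the data are hypothetical.

```python
def make_dummies(values, reference):
    """Return one dict of 0/1 dummies per respondent, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return [{c: int(v == c) for c in categories} for v in values]

# Hypothetical employment status of four respondents.
employment = ["paid job", "unemployed", "not on job market", "paid job"]

dummies = make_dummies(employment, reference="paid job")
for status, row in zip(employment, dummies):
    print(f"{status:<20}{row}")
```

Respondents in the reference group score 0 on every dummy, so each dummy's coefficient is read as a difference from that group.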
Type of data collection
• Cross-section: Single measurement on multiple units
– Example: a one-time employee satisfaction survey
• Time series: Multiple measurements on one unit
– Example: A country’s happiness growth over time
• Longitudinal/panel: Multiple measurements on multiple units
– Example: Surveying the same individuals each year about their
happiness
• Pooled cross section: Cross section at multiple times
– Example: Repeating the same survey with different individuals each
year
Interpretation regression coefficients
OLS regression
Dependent variable = happiness (0-10)
Independent variables      B (SE)
Religiosity (0-10)          0.02 (0.00) **
Employment
  Paid job                  Ref.
  Unemployed               -0.61 (0.08) **
  Not on job market        -0.02 (0.03)
** Significant at 1% level; N = 9,714
Continuous IV:
A one-unit increase in x is, on average, associated with a change of βx in y, ceteris paribus.
Example:
A one-unit increase in religiosity is, on average, associated with a 0.02 higher happiness
score, ceteris paribus.
Categorical IV:
βx is the average difference in y between the comparison group and the reference group,
ceteris paribus.
Example:
Unemployed people have, on average, a 0.61 lower happiness score than people with a paid job
(the reference group), ceteris paribus.
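The interpretation rules can be applied by comparing two hypothetical respondents. The sketch below uses the coefficients from the table. The intercept is not reported there, so the code computes only the predicted difference between two people, in which the intercept cancels out. The function name and respondent profiles are assumptions made for this example.

```python
# Coefficients taken from the regression table; the reference group scores 0.
B_RELIGIOSITY = 0.02
B_EMPLOYMENT = {"paid job": 0.0, "unemployed": -0.61, "not on job market": -0.02}

def predicted_difference(person_a, person_b):
    """Predicted happiness of person_a minus person_b (the intercept cancels out)."""
    return (
        B_RELIGIOSITY * (person_a["religiosity"] - person_b["religiosity"])
        + B_EMPLOYMENT[person_a["employment"]]
        - B_EMPLOYMENT[person_b["employment"]]
    )

# Same religiosity, different employment status: the gap equals the dummy coefficient.
gap_employment = predicted_difference(
    {"religiosity": 5, "employment": "unemployed"},
    {"religiosity": 5, "employment": "paid job"},
)

# Same employment status, five points apart on religiosity: 5 * 0.02.
gap_religiosity = predicted_difference(
    {"religiosity": 8, "employment": "paid job"},
    {"religiosity": 3, "employment": "paid job"},
)

print(f"{gap_employment:.2f} {gap_religiosity:.2f}")
```

Holding the other variable constant in each comparison is what "ceteris paribus" means in practice.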