
Statistics Refresher for Data Science

1. The document discusses different types of variables including nominal, ordinal, interval, and ratio variables. It also discusses how to analyze different types of variables. 2. Descriptive statistics are introduced including measures of central tendency, variability, skewness, frequencies, and graphs. 3. Relations between variables are covered including the normal distribution, comparing groups through t-tests and ANOVA, cross-tabulations, correlations, and regression analysis.

Uploaded by

Corrado Bisotto

Statistics refresher

Part I – Basic statistics


Types of variables
Qualitative variables
• Nominal – No hierarchy; description of case
– Examples: Gender, marital status, household-type
• Ordinal – Hierarchically ordered; no meaningful, fixed difference between numbers
– Examples: education level, satisfaction rates (low vs.
medium vs. high).
Types of variables
Quantitative variables
• Interval – Hierarchically ordered; set difference
between numbers; arbitrary zero
– Examples: IQ scores, birth year
• Ratio – Hierarchically ordered; set difference
between numbers; logical/real zero
– Examples: Age, distance in km, income in euros
• Discrete – Ordered; set difference; no decimals
– Example: number of children
Types of variables
Analyses for different types of variables
• Nominal – Frequencies; cross-tabulations; chi-square (χ²) test
• Ordinal – Frequencies; cross-tabulations; chi-square (χ²) test; logit/probit model
• Interval – Mean; standard deviation; correlations; t-test; OLS regression
• Ratio – Mean; standard deviation; correlations; t-test; OLS regression
Types of variables
Life satisfaction / happiness instruments
• Often a single item scale such as: “On a scale from 0 to 10,
how satisfied are you with your life as a whole?”
• Technically, this is an ordinal scale
– Hierarchical order, but no meaningful, fixed difference between categories
• However, in research it is usually treated as a ratio scale: this makes little difference to the results and is easier to interpret.
Descriptive statistics

1. Statistical measures
Descriptive statistics
• Measures of central tendency
– Mean; median; mode
• Measures of variability
– Standard deviation; variance; min-max; range;
quartiles
• Measures of skewness
– Skewness; kurtosis
Descriptive statistics
Measures of central tendency
• Mean: $\frac{\sum_i x_i}{n}$, the sum of all scores divided by the number of cases
• Median: Middle value when all values are ordered from
smallest to largest
• Mode: Most frequent value
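
The three measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the scores are made-up illustration data, not values from the slides.

```python
import statistics

# Hypothetical life-satisfaction scores (illustration only)
scores = [6, 7, 7, 8, 9, 7, 5, 8, 10]

mean = statistics.mean(scores)      # sum of all scores / number of cases
median = statistics.median(scores)  # middle value of the ordered scores
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)
```

With an odd number of cases the median is the single middle score; with an even number, `statistics.median` averages the two middle scores.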
Descriptive statistics
Measures of variability
• Standard deviation: $SD = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
– Subtract the mean from all scores
– Square the results
– Sum the squared results
– Divide by the number of cases
– Take the square root
• Variance: SD²
• Min-max & range: lowest and highest score; range = highest minus lowest
• Quartiles: Q1 – 25% | Q2 – 50% (median) | Q3 – 75%
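
The steps for the standard deviation can be traced in a few lines of Python. This is a sketch under stated assumptions: the scores are hypothetical, and the formula divides by N as on the slide (the population form), whereas many statistical packages divide by N − 1 by default for samples.

```python
import math
import statistics

# Hypothetical scores (illustration only)
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(scores) / len(scores)                   # mean
squared_devs = [(x - mu) ** 2 for x in scores]   # subtract the mean, then square
variance = sum(squared_devs) / len(scores)       # sum, then divide by the number of cases
sd = math.sqrt(variance)                         # take the square root

# statistics.pstdev applies the same population formula (divides by N)
assert math.isclose(sd, statistics.pstdev(scores))

value_range = max(scores) - min(scores)          # range = highest - lowest
q1, q2, q3 = statistics.quantiles(scores, n=4)   # quartile cut points; q2 is the median

print(sd, variance, value_range, q2)
```

Note that quartile values other than the median can differ slightly between software packages, because several interpolation conventions exist.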
Descriptive statistics
Table 1: Descriptive statistics - example
                            Mean (SD)
Life satisfaction (1-10)    7.53 (1.37)
Household income in €       2,851 (243)

                            Percent
Gender
  Female                    55%
  Male                      45%
For categorical variables, mean and standard deviation do not make sense!
Descriptive statistics

2. Frequencies and graphs


Descriptive statistics

Columns: Frequency table | Circle diagram | Bar chart | Histogram | Line diagram
Nominal X X X
Ordinal X X
Interval / ratio (discrete) X
Interval / ratio (continuous) X X (over time)
Descriptive statistics
• Frequency table
Civil status          Freq.   Percent   Cum.
Married                3110       51%    51%
Separated                25        0%    51%
Divorced                605       10%    61%
Widow or widower        387        6%    68%
Never been married     1971       32%   100%
Total                  6098      100%
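
The percent and cumulative columns of a frequency table follow directly from the counts. The sketch below assumes a Married count of 3110, the value that makes the rows add up to the stated total of 6098 and matches the reported 51%.

```python
# Counts from the civil-status table. The Married count of 3110 is an
# assumption: it is the value consistent with the total of 6098 and 51%.
counts = {
    "Married": 3110,
    "Separated": 25,
    "Divorced": 605,
    "Widow or widower": 387,
    "Never been married": 1971,
}

total = sum(counts.values())
cumulative = 0.0
rows = []
for category, freq in counts.items():
    percent = 100 * freq / total   # share of all cases
    cumulative += percent          # running total of the shares
    rows.append((category, freq, round(percent), round(cumulative)))

for category, freq, percent, cum in rows:
    print(f"{category:<20}{freq:>6}{percent:>5}%{cum:>5}%")
```

Because each column is rounded separately, the cumulative column can differ by one point from the sum of the rounded percentages, as with the 68% for widows and widowers.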


Graphs
• Circle diagram (nominal, few categories)
• Bar chart (nominal, ordinal, interval/ratio in categories)
• Histogram (interval/ratio, in intervals)
• Line diagram (interval/ratio over time)
• Boxplot (interval/ratio + nominal/ordinal)
Part II – Relations between variables
Relations between variables
1. Normal distribution
2. Comparing groups
3. Cross-tables
4. Correlations
5. Regression analysis
1. Normal distribution
• Central limit theorem: with a large enough sample, the sampling distribution of the mean approximates a ‘normal’ distribution, whatever the distribution of the individual scores
• Standard deviation determines the width
• Half of the values lie on either side of the centre
• The total area under the curve is 100%
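
These properties of the normal curve can be checked numerically with `statistics.NormalDist` (Python 3.8+). A minimal sketch, assuming a hypothetical distribution with mean 7.5 and standard deviation 1.4:

```python
from statistics import NormalDist

# Hypothetical normal distribution (illustration only)
dist = NormalDist(mu=7.5, sigma=1.4)

below_centre = dist.cdf(7.5)                             # share of values below the centre
within_1sd = dist.cdf(7.5 + 1.4) - dist.cdf(7.5 - 1.4)   # share within one SD of the mean
within_2sd = dist.cdf(7.5 + 2.8) - dist.cdf(7.5 - 2.8)   # share within two SDs of the mean

print(round(below_centre, 3), round(within_1sd, 3), round(within_2sd, 3))
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, whatever the mean and standard deviation are.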
2. Comparing groups
• T-test: compares means of two groups
– Independent samples (H0: mean group A = mean group B)
– Paired samples (H0: mean at time 1 = mean at time 2)
– One sample (H0: mean = x)
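
For the independent-samples case, the test statistic can be computed by hand. The sketch below uses the pooled-variance (equal variances assumed) form of the t-test, which is one common variant; the two groups are made-up data.

```python
import math
import statistics

# Hypothetical happiness scores for two independent groups (illustration only)
group_a = [7, 8, 6, 9, 7]
group_b = [5, 6, 6, 4, 7]

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a = statistics.variance(group_a)   # sample variance (divides by n - 1)
var_b = statistics.variance(group_b)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_stat = (mean_a - mean_b) / std_error  # difference in means, in standard errors
df = n_a + n_b - 2                      # degrees of freedom

print(round(t_stat, 3), df)
```

The statistic is then compared with the critical value of the t distribution for `df` degrees of freedom; for a two-sided test at the 5% level with 8 degrees of freedom that value is about 2.306.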
2. Comparing groups
• ANOVA: compares means of multiple groups
– H0: mean group A = mean group B = mean group C … etc.
– Example: are all North-Western European countries equally happy?
– Compares the variance between groups to the variance within groups
– H0 is rejected if the variation between values is mostly explained by their groups, i.e. the between-group variance is large relative to the within-group variance
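
The comparison of between-group and within-group variance is summarised in the F statistic. A minimal one-way ANOVA sketch, assuming three hypothetical countries with three respondents each:

```python
# Hypothetical happiness scores per country (illustration only)
groups = {
    "Country A": [6, 7, 8],
    "Country B": [7, 8, 9],
    "Country C": [4, 5, 6],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

ss_between = 0.0  # variation of the group means around the grand mean
ss_within = 0.0   # variation of the scores around their own group mean
for scores in groups.values():
    group_mean = sum(scores) / len(scores)
    ss_between += len(scores) * (group_mean - grand_mean) ** 2
    ss_within += sum((x - group_mean) ** 2 for x in scores)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F = between-group mean square / within-group mean square
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(round(f_stat, 3), df_between, df_within)
```

A large F means the group means differ more than the spread within the groups would lead one to expect under H0.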
3. Cross-tables
• Compares percentages of categorical variables
• Can show whether there is a significant difference between
Var A for groups of Var B
                    Low education   Middle education   High education    Total
Highly urban                  660                842            1,047    2,549
                              26%                33%              41%     100%
Moderately urban              387                509              445    1,341
                              29%                38%              33%     100%
Not urban                     726                762              659    2,147
                              34%                35%              31%     100%
Total                       1,773              2,113            2,151    6,037
                              29%                35%              36%     100%
Pearson chi2(4) = 69.5156   Pr = 0.000
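
The Pearson chi-square statistic under the table can be recomputed from the observed counts. The sketch below compares each cell with the count expected if urbanity and education were unrelated; it uses only the counts printed in the cross-table.

```python
# Observed counts from the cross-table (columns: low, middle, high education)
observed = [
    [660, 842, 1047],   # Highly urban
    [387, 509, 445],    # Moderately urban
    [726, 762, 659],    # Not urban
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # count expected under independence
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 2), dof)
```

The result should be close to the reported value of 69.5156 with 4 degrees of freedom.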
4. Correlations
• Strength of the relation between var A and var B
• Standardized (between -1 and 1), so comparable
• Doesn’t tell you anything about causality
• Significant at the 95% confidence level: p-value below 0.05
– P-value below 0.01: 99% confidence level
– P-value below 0.10: 90% confidence level
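
The correlation coefficient itself is the covariance of the two variables divided by the product of their spreads. A minimal sketch of Pearson's r, with made-up health and happiness scores; `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores (illustration only)
health = [1, 2, 3, 4, 5]
happiness = [2, 4, 5, 4, 5]

r = pearson_r(health, happiness)
print(round(r, 3))
```

Dividing by the spreads is what standardizes the measure to the range from -1 to 1.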
Linear regression
• Ordinary Least Squares (OLS)
• Estimates the equation of the line that best describes the
association between two variables

• Minimizes the sum of the squared errors (the distances between the real and predicted values)
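
For one independent variable, the least-squares line has a closed-form solution. The sketch below computes the slope and intercept and the sum of squared errors they minimize; the data are hypothetical.

```python
# Hypothetical data (illustration only)
x = [1, 2, 3, 4, 5]   # e.g. health
y = [2, 4, 5, 4, 5]   # e.g. happiness

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Slope: co-variation of x and y divided by the variation of x
slope = (
    sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    / sum((a - mean_x) ** 2 for a in x)
)
intercept = mean_y - slope * mean_x   # the line passes through the point of means

predicted = [intercept + slope * a for a in x]
sse = sum((b - p) ** 2 for b, p in zip(y, predicted))   # sum of squared errors

print(round(intercept, 2), round(slope, 2), round(sse, 2))
```

Any other line through these points would give a larger sum of squared errors.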
Elements of regression model
• y = β0 + β1 x1 + u

• y = dependent variable
• x1 = independent variable
• u = unobserved error term (disturbance)
• β0 = intercept
• β1 = regression coefficient
Multiple regression
• Simple regression
y = β0 + β1 x1 + u
Happiness = Intercept + β1 * Health + Error

• Multiple regression
y = β0 + β1 x1 + β2 x2 + u
Happiness = Intercept + β1 * Health + β2 * Gender + Error
Categorical vs. continuous independent variables
• Independent variables can be included as continuous
or categorical variables in an OLS regression model.
• Qualitative variables are typically included as
categorical variables.
• Quantitative variables are typically included as
continuous variables.
Dummy variable
• You will frequently come across the term “dummy” in
regression models.
• Dummy variable = A dichotomous variable indicating whether a respondent is in a category (coded 1) or not (coded 0).
• Example: gender (male or female)
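
A variable with more than two categories is usually turned into one dummy per category, leaving out one category as the reference group. A minimal sketch, assuming a hypothetical list of employment statuses with "Paid job" as the reference:

```python
# Hypothetical respondents (illustration only)
employment = ["Paid job", "Unemployed", "Not on job market", "Paid job"]

reference = "Paid job"   # reference category: gets no dummy of its own

# Remaining categories, in order of first appearance
categories = [c for c in dict.fromkeys(employment) if c != reference]

# One row of 0/1 dummies per respondent
dummies = [
    {category: int(status == category) for category in categories}
    for status in employment
]

for status, row in zip(employment, dummies):
    print(status, row)
```

Respondents in the reference group score 0 on every dummy, which is why their coefficients are read as differences from that group.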
Type of data collection
• Cross-section: Single measurement on multiple units
– Example: a one-time employee satisfaction survey
• Time series: Multiple measurements on one unit
– Example: A country’s happiness growth over time
• Longitudinal/panel: Multiple measurements on multiple units
– Example: Surveying the same individuals each year about their
happiness
• Pooled cross section: Cross section at multiple times
– Example: Repeating the same survey with different individuals each
year
Interpretation regression coefficients
OLS regression
Dependent variable = happiness (0-10)
Independent variables       B (SE)
Religiosity (0-10)           0.02 (0.00) **
Employment
  Paid job                   Ref.
  Unemployed                -0.61 (0.08) **
  Not on job market         -0.02 (0.03)
** Significant at 1% level; N = 9,714

Continuous IV:
A one unit increase in x is, on average, associated with a βx change in y, ceteris paribus.
Example:
A one unit increase in religiosity is, on average, associated with a 0.02 higher happiness
score, ceteris paribus.

Categorical IV:
βx is the average difference in y between the comparison group and the reference group,
ceteris paribus.
Example:
Unemployed people have on average a 0.61 lower happiness score than employed people
(the reference group), ceteris paribus.
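
Both readings can be reproduced by plugging the coefficients into the regression equation. In the sketch below the coefficients come from the table, but the intercept of 7.0 is a hypothetical value (the table does not report one) and `predicted_happiness` is a helper written for this example.

```python
INTERCEPT = 7.0        # hypothetical: the table does not report an intercept
B_RELIGIOSITY = 0.02   # coefficient from the table
B_EMPLOYMENT = {       # coefficients from the table; the reference group is 0
    "Paid job": 0.0,
    "Unemployed": -0.61,
    "Not on job market": -0.02,
}

def predicted_happiness(religiosity, employment):
    """Predicted happiness score for one respondent."""
    return INTERCEPT + B_RELIGIOSITY * religiosity + B_EMPLOYMENT[employment]

# Categorical IV: same religiosity, different employment status
gap = predicted_happiness(5, "Unemployed") - predicted_happiness(5, "Paid job")

# Continuous IV: one unit more religiosity, same employment status
step = predicted_happiness(6, "Paid job") - predicted_happiness(5, "Paid job")

print(round(gap, 2), round(step, 2))
```

The two differences do not depend on the assumed intercept, which cancels out when predictions are subtracted.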
