Introduction to Machine Learning: Big Data for Economic Applications
Alessio Farcomeni
University of Rome “Tor Vergata”
[email protected]
Prediction
Most tasks involve understanding the interrelationships among predictors and an outcome
Others focus on predicting the outcome in future settings, when it has not yet been measured and only the predictors are available
A sample of examples (training data) is available, where the outcome has been measured together with (possibly cheap) predictors
Will a firm go into default? Is this transaction a fraud? Etc.
(Logistic) regression can be used, but it is not the only option; a sketch follows.
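A minimal sketch in R (the data.frame names `firms`/`new_firms` and the outcome `default` are illustrative assumptions, not from the slides):

```r
# firms: training data with a binary outcome `default` and (cheap) predictors
fit <- glm(default ~ ., data = firms, family = binomial)

# Predicted default probabilities for new firms, where only predictors are known
p_hat <- predict(fit, newdata = new_firms, type = "response")
y_hat <- as.numeric(p_hat > 0.5)   # classify with a 0.5 threshold
```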
Basic Idea: which class for the diamond?
[Scatter plot: two classes of points, y[1:100] against x[1:100]; a diamond marks a new point to be classified.]
Why?
[The same scatter plot of y[1:100] against x[1:100], shown again.]
Two methods
[Two panels: the same data (y[1:100] against x[1:100]) classified with a linear boundary and with a non-linear boundary.]
A more realistic scenario
[Two panels: overlapping classes (y[z == 2] against x[z == 2]), where no boundary separates the groups perfectly.]
Take home messages
Groups might not be separated, so the classifier will make errors even on the training set
Non-linear classifiers tend to make fewer errors on the available data (they are less biased) than linear/simpler classifiers
Non-linear classifiers tend to make more errors on new data (they are more variable) than linear/simpler classifiers.
A simple and interpretable method: logistic regression.
A complex and less interpretable method: random forests (coming up next)
Performance on new data
The estimated performance on the available data is a bad predictor of performance on new data
A better predictor: performance on available data that was set aside (the test set)
The rest is the training set, used for model estimation.
Split the data at random (e.g., with sample()), and consider stratifying if there are small categories; see the sketch below.
Which performance measure? Goodhart’s law holds. We will use the total number of misclassified objects.
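A minimal sketch of the split in R (the 70% share and the data.frame `dat` with outcome `y` are illustrative assumptions):

```r
set.seed(1)                               # for reproducibility
n   <- nrow(dat)
idx <- sample(n, size = round(0.7 * n))   # 70% of rows, drawn at random
train <- dat[idx, ]
test  <- dat[-idx, ]

# Stratified variant: sample within each outcome class, so that small
# categories are represented in both sets
idx_strat <- unlist(lapply(split(seq_len(n), dat$y),
                           function(i) sample(i, size = round(0.7 * length(i)))))
```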
Goodhart’s law
“When a measure becomes a target, it ceases to be a good measure.”
In practice:
1. Select a few methods, varying the prediction method, the number of predictors, and the tuning parameters.
2. Split the data into a training set (50-80% of your sample) and a test set (the remainder). You know the answer for the test set!
3. Train the selected methods on the training set, then predict the test set (ignoring the target). For each method, estimate the prediction error as the proportion of misclassified examples.
4. The best method is the one with the lowest prediction error on the test set.
Can you improve on the best performance? How? Compare the training and test error rates; a sketch follows.
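A minimal sketch of steps 3-4 in R, comparing two logistic regressions (the formulas and the `train`/`test` data.frames are illustrative assumptions):

```r
# Two candidate methods: few predictors vs. many predictors
m1 <- glm(y ~ x1,           data = train, family = binomial)
m2 <- glm(y ~ x1 + x2 + x3, data = train, family = binomial)

# Proportion of misclassified examples on the test set
err <- function(m) {
  y_hat <- as.numeric(predict(m, newdata = test, type = "response") > 0.5)
  mean(y_hat != test$y)
}
c(err(m1), err(m2))   # the best method has the smallest test error
```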
Comparing performance on test and training
The training error is almost always smaller than the test error
Very small (or zero) error on the training set but a large error on the test set: overfitting. Simplify the method.
A large error on the training set: underfitting. Get new and better variables, or increase the sample size
“Large” or “small” depends on a comparison. Always get a benchmark (no predictors, human performance, etc.).
Balance complexity and fit: the simplest model that fits best is the best
What if I do not have enough predictors?
k-nearest neighbours
Fix k ≥ 1 and select a distance function
For new data with predictors x, select i1, ..., ik such that each d(Xij, x) ranks within the k smallest of the n distances
If the outcome Y is continuous, set ŷ = (yi1 + ... + yik)/k (or, better, use the median of the k values when k is large)
If it is categorical, set ŷ to the modal category of yi1, ..., yik. A sketch follows.
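A from-scratch sketch in R (the predictor matrix `X`, outcome `y`, and Euclidean distance are illustrative assumptions; `class::knn` is a ready-made alternative):

```r
knn_predict <- function(X, y, x_new, k = 5) {
  # Euclidean distance from x_new to each of the n training points
  d  <- sqrt(rowSums(sweep(X, 2, x_new)^2))
  nn <- order(d)[1:k]                  # indices of the k nearest neighbours
  if (is.numeric(y)) {
    mean(y[nn])                        # continuous outcome: average (or median)
  } else {
    names(which.max(table(y[nn])))     # categorical outcome: modal category
  }
}
```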
Technical note: what is a distance
A distance d satisfies: d(x, y) ≥ 0, with equality if and only if x = y; symmetry, d(x, y) = d(y, x); and the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z).
The usual choice is the Euclidean distance, the square root of the sum of squared coordinate differences; the Manhattan distance, the sum of absolute coordinate differences, is a common alternative.
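A toy illustration in R (not from the slides):

```r
x <- c(0, 0); y <- c(3, 4)
sqrt(sum((x - y)^2))    # Euclidean distance: 5
sum(abs(x - y))         # Manhattan distance: 7
dist(rbind(x, y))       # base R computes pairwise distances too
```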
Decision trees
You have a target Y and a set of predictors X1, ..., Xp
Split with respect to the variable that maximally separates the outcome
E.g., mostly presences in one node and absences in the other; or the most different mean outcomes between nodes
Within each subgroup (node), split further.
Iterate until a stopping rule is satisfied (e.g., fewer than 50 units in a leaf, three levels of splits, etc.)
Note: this is a heuristic approach!
For instance
[Figure: an example classification tree.]
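A minimal sketch with the rpart package (the formula and the `train`/`test` data.frames are illustrative assumptions; the control values mirror the stopping rules above):

```r
library(rpart)

# Grow a classification tree, stopping at leaves with fewer than 50 units
# or after three levels of splits
tree <- rpart(y ~ ., data = train, method = "class",
              control = rpart.control(minbucket = 50, maxdepth = 3))

plot(tree); text(tree)                                  # draw the fitted tree
y_hat <- predict(tree, newdata = test, type = "class")  # predict new data
```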
Random forests
A random forest grows many decision trees, each on a bootstrap sample of the data and with a random subset of predictors tried at each split, and combines them by averaging (regression) or majority vote (classification).
They predict well off the shelf (usually no tuning is needed)
They seldom overfit: they adapt to the right amount of complexity automatically.
They scale well to high-dimensional problems
Problem: you need a moderately large training set. With n < 100, do not even attempt it
Problem: interpretation is difficult.
Interpretation: variable importance
Variable effects are highly non-linear and conditional, and thus difficult to describe (even if sensitivity analyses are always possible)
Variable importance is simple to estimate: compare the trees that did not involve a variable with those that used it, then summarize
For classification: the mean difference in accuracy (per class, or overall)
For regression: the mean difference in mean squared error
Random forests in R
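The slide’s code is not preserved; a minimal sketch with the randomForest package (the formula and data names are illustrative assumptions):

```r
library(randomForest)

# y should be a factor for classification, numeric for regression
rf <- randomForest(y ~ ., data = train, importance = TRUE)
y_hat <- predict(rf, newdata = test)   # predictions on new data

importance(rf)    # variable importance (mean decrease in accuracy / MSE)
varImpPlot(rf)    # plot it
```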
Neural networks
Different neurons
ReLU (Rectified Linear Unit): f(x) = max(x, 0)
Hyperbolic tangent: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Softmax (logistic): f(x)j = e^(xj) / Σh e^(xh)
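These are one-liners in R (a toy illustration):

```r
relu    <- function(x) pmax(x, 0)                               # ReLU
tanh_f  <- function(x) (exp(x) - exp(-x)) / (exp(x) + exp(-x))  # same as tanh(x)
softmax <- function(x) exp(x) / sum(exp(x))                     # scores -> probabilities

softmax(c(1, 2, 3))   # sums to 1
```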
Output units
The output unit matches the task: an identity (linear) unit for regression, a logistic or softmax unit for classification.
Transfer learning
Deep NNs have thousands of parameters; getting good estimates is a matter of huge computational effort, data availability, and luck.
A well-trained NN is a pearl in an oyster
Transfer learning: re-use a trained NN for a different task by updating its weights
The same architecture is specified for your (similar) task, with few exceptions (e.g., the output layer)
Only the nodes that differ are learned. A sketch follows.
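A minimal sketch with the keras R package (an assumption: the slides name no framework; the pre-trained network and the binary task are illustrative):

```r
library(keras)

# Re-use a pre-trained network, dropping its original output layer
base <- application_vgg16(weights = "imagenet", include_top = FALSE, pooling = "avg")
freeze_weights(base)   # the pre-trained weights stay fixed

# A new output layer for the (similar) binary task: the only part that is learned
out   <- base$output %>% layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(inputs = base$input, outputs = out)

model %>% compile(optimizer = "adam", loss = "binary_crossentropy")
# model %>% fit(x_train, y_train, epochs = 5)   # trains only the new layer
```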
Learning from scratch