
Introduction to Machine Learning
Big Data for Economic Applications

Alessio Farcomeni
University of Rome “Tor Vergata”
[email protected]
Prediction

Most tasks involve understanding the interrelationships among predictors and the outcome.
Other tasks focus on predicting the outcome in future settings, when only the predictors have been measured.
A sample of examples (training data) is available in which the outcome has been measured together with (possibly cheap) predictors.
Will a firm go into default? Has this transaction been a fraud? Etc.
(Logistic) regression can be used, but it is not the only option.
Basic Idea: which class for the diamond?

[Figure: scatterplot of y[1:100] versus x[1:100], with a diamond marking a new point to be classified.]
Why?

There are a huge number of methods for classification.
They will all agree.
The idea is always to divide the feature space into segments, one for each class.
Main differences: complexity and interpretability.
If your only target is prediction, the only aim is performance (on new data).
A more challenging case

[Figure: scatterplot of y[1:100] versus x[1:100] for a more challenging classification case.]
Two methods

[Figure: two side-by-side scatterplots of y[1:100] versus x[1:100], one for each of two classification methods.]
A more realistic scenario

[Figure: two side-by-side scatterplots of y[z == 2] versus x[z == 2] for a more realistic scenario.]
Take home messages

Groups might not be separated, so the classifier will make errors even on the training set.
Non-linear classifiers tend to make fewer errors on the available data (less bias) than linear/simpler classifiers.
Non-linear classifiers tend to make more errors on new data (more variance) than linear/simpler classifiers.
Simple and interpretable method: logistic regression.
Complex and not interpretable method: random forests (coming up next).
Performance on new data

Performance estimated on the available data is a poor predictor of performance on new data.
A better predictor: performance on a portion of the available data that is set aside (the test set).
The rest is the training set, used for model estimation.
Split the data at random (sample), perhaps stratifying if there are small categories.
What performance measure? Goodhart’s law holds. We will use the total number of misclassified objects.
Goodhart’s law

When a measure becomes a target, it ceases to be a good measure.
Alternative targets

Example: classifying SPAM emails. Possible targets:
Total number of misclassified emails
Proportion of misclassified SPAM + proportion of misclassified ham
Any weighted sum of counts/proportions of misclassifications
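
A minimal sketch of how such targets could be computed in R, assuming factor vectors truth and pred that share the levels "spam" and "ham" (the names and weights are hypothetical):

    tab <- table(truth, pred)                      # confusion table: rows = truth, columns = prediction

    total_err <- sum(tab) - sum(diag(tab))         # total number of misclassified emails

    prop_err <- tab["spam", "ham"] / sum(tab["spam", ]) +   # proportion of misclassified spam
                tab["ham", "spam"] / sum(tab["ham", ])      # + proportion of misclassified ham

    weighted_err <- 5 * tab["ham", "spam"] + 1 * tab["spam", "ham"]   # e.g. ham flagged as spam counted 5x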
Procedure in practice

1. Select a few methods, varying the prediction method, the number of predictors, and the tuning parameters.
2. Split the data into a training set (50-80% of your sample) and a test set (the remainder). You know the answer for the test set!
3. Train the selected methods on the training set and predict the test set (ignoring its target). For each method, estimate the prediction error as the proportion of misclassified examples.
4. The best method is the one with the lowest prediction error on the test set.
Can you improve on the best performance? How? Compare training and test error rates.
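
A minimal sketch of the random split in R, assuming a data frame dat with the outcome in column y and a vector pred of predicted classes for the test set (names are hypothetical):

    set.seed(1)                                   # reproducible split
    n   <- nrow(dat)
    idx <- sample(n, size = round(0.7 * n))       # 70% of the units go to the training set
    train <- dat[idx, ]
    test  <- dat[-idx, ]

    # ... fit each candidate method on train and predict the test set ...
    test_error <- mean(pred != test$y)            # proportion of misclassified examples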
Comparing performance on test and training

The training error is almost always smaller than the test error.
Very small (or zero) error on the training set but large error on the test set: overfitting. Simplify the method.
Large error on the training set: underfitting. Get new and better variables, or increase the sample size.
“Large” or “small” is relative: always get a benchmark (no predictors, human performance, etc.).
Balance complexity and fit: the simplest model that fits best is the best.
What if I do not have enough predictors?

If performance can be improved by removing predictors, life is easy.
What if more are needed? (And no new measurements can be taken, or the experts have no idea what to add.)
Augmentation: do polynomial regression by creating new variables (squares, products, etc.).
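
A minimal sketch of such an augmentation in R, assuming a data frame dat with numeric predictors x1 and x2 (names are hypothetical):

    dat$x1_sq <- dat$x1^2            # square of x1
    dat$x2_sq <- dat$x2^2            # square of x2
    dat$x1_x2 <- dat$x1 * dat$x2     # product (interaction) term
    # equivalently, inside a model formula: y ~ poly(x1, 2) + poly(x2, 2) + x1:x2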
Machine Learning

Statistics and Machine Learning: reinventing the same wheel since 1950.
Machine Learning: Learning a Task from Experience.
Essentially, tuning an algorithm from training data.
Ingredients: training experience/data, a target function, a learning algorithm.
1-nearest neighbor

Lazy learner: no computation until a prediction is needed.
Select a distance function.
For new data with predictors x, select the unit i such that d(X_i, x) is the smallest over i = 1, ..., n.
Set ŷ = y_i.
k-nearest neighbors

Fix k ≥ 1 and select a distance function.
For new data with predictors x, select i_1, ..., i_k such that d(X_{i_j}, x) ranks within the k smallest distances over the n values.
If the outcome Y is continuous, set ŷ = Σ_j y_{i_j} / k (or, better, use the median of the k values when k is large).
If it is categorical, set ŷ to the modal category over y_{i_1}, ..., y_{i_k}.
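
A minimal sketch in R using the knn function from the class package (Euclidean distance), assuming predictor matrices train_x, test_x and a factor of training classes train_y (names are hypothetical):

    library(class)

    pred1 <- knn(train = train_x, test = test_x, cl = train_y, k = 1)   # 1-NN
    pred5 <- knn(train = train_x, test = test_x, cl = train_y, k = 5)   # 5-NN, majority vote

    mean(pred5 != test_y)   # test misclassification rate, if the truth test_y is known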
Technical note: what is a distance

A distance d(x, y) is such that:
Separability: d(x, y) ≥ 0, with d(x, y) = 0 if and only if x = y
Symmetry: d(x, y) = d(y, x)
Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z)
No triangle inequality? You have a dissimilarity measure.
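
For instance, common distances between observations can be computed in R with the base dist function (a minimal sketch on a toy matrix):

    X <- rbind(c(0, 0), c(3, 4), c(1, 1))    # three points in the plane

    dist(X, method = "euclidean")   # e.g. d(point 1, point 2) = sqrt(3^2 + 4^2) = 5
    dist(X, method = "manhattan")   # e.g. d(point 1, point 2) = |3| + |4| = 7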
Classification and Regression Trees

You have a target Y and a set of predictors X_1, ..., X_p.
Split with respect to the variable that maximally separates the outcome.
E.g., the split that best separates presences from absences, or that yields the most different mean outcomes.
Within each subgroup (node), split further.
Iterate until a stopping rule is satisfied (e.g., fewer than 50 units in a leaf, three levels of splits, etc.).
Note: this is a heuristic approach!
For instance

Y: indicator of a deprived household
X: income, number of members, etc.
First level of the tree: for daily income > $2, only 14% are poor, versus 86% for lower incomes.
Further level: for income > $2, deprived households are found only when the number of members is > 2; there are no further splits for low incomes.
Prediction through classification trees

Prediction is done by following the estimated tree down to a leaf. E.g., a family with unknown status, $52 per day and two members will belong to a leaf with 0.5% deprived.
Do I have to predict presence or absence?
Trees are also easy to visualize.
Classification trees in R

These are estimated using the function rpart in library(rpart). Usual syntax and methods (including predict).
Visualize through the prp function in library(rpart.plot).
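
A minimal sketch on the deprivation example, assuming data frames train and test with columns deprived (coded as a factor), income and members (names are hypothetical):

    library(rpart)
    library(rpart.plot)

    fit <- rpart(deprived ~ income + members, data = train, method = "class")  # classification tree

    prp(fit)                                              # visualize the fitted tree
    pred <- predict(fit, newdata = test, type = "class")  # predicted class for each test unit
    mean(pred != test$deprived)                           # test misclassification rate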
Random Forests

Forests are made of trees.
The predictive performance of trees is good, but it becomes remarkably better when you put many trees together.
Random Forests: select at random a large number (usually 500) of subsets of the data (both samples and predictors), and grow 500 trees separately.
Each tree makes a prediction as we just discussed.
Categorical outcomes: the final prediction is the majority vote. If 480 out of 500 trees vote that the household is deprived, you predict deprivation.
Forests work.

They predict well off the shelf (usually no tuning is needed).
They seldom overfit: they adapt to the right amount of complexity automatically.
They scale well to high-dimensional problems.
Problem: you need a moderately large training set. With n < 100, do not even attempt it.
Problem: interpretation is difficult.
Interpretation: variable importance

Variable effects are highly non-linear, conditional, and difficult to describe (even if sensitivity analyses are always possible).
Variable importance is simple to estimate: compare trees that did not involve a variable with those that used it, then summarize.
For classification: mean difference in accuracy (per class, or overall).
For regression: mean difference in mean squared error.
Random forests in R

Use the function randomForest in library(randomForest). Usual syntax and methods. The outcome must be a factor (or you get regression).
Useful options for tuning: ntree, mtry (the number of variables in each random subset), etc. (see the help).
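
A minimal sketch on the same hypothetical data frames train and test, with the outcome coded as a factor:

    library(randomForest)

    train$deprived <- factor(train$deprived)     # a factor outcome gives classification
    rf <- randomForest(deprived ~ ., data = train,
                       ntree = 500,              # number of trees in the forest
                       mtry  = 2)                # predictors tried at each split

    pred <- predict(rf, newdata = test)          # majority vote over the 500 trees
    mean(pred != factor(test$deprived))          # test misclassification rate

    importance(rf)                               # variable importance summary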
RF or logistic regression?

RF: usually better predictions (especially in complex cases).
Logistic: interpretable (mandatory in certain circumstances).
RF has implicit model choice; logistic regression must be gauged carefully.
In many cases this choice is not really so important. The crucial part is the set of predictors used (parsimonious, but not stingy).
Final recommendation: try both (maybe with different tunings for the RF), evaluate on a test set, and pick the minimum error.
Neural Networks

Logistic regression can be seen as a neuron.
Inputs x_{i1}, ..., x_{ip} are weighted to form β′x.
If f(β′x) is above a threshold, there is activation (threshold triggering).
f(·) is the activation function; f(x) = e^x / (1 + e^x) for logistic regression.
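
A minimal sketch of this single "neuron" in R, with the weights beta assumed given (e.g., from a fitted logistic regression):

    sigmoid <- function(z) exp(z) / (1 + exp(z))         # logistic activation, same as plogis(z)

    neuron <- function(x, beta) sigmoid(sum(beta * x))   # weighted sum of inputs, then activation

    neuron(x = c(1, 0.5, -2), beta = c(0.3, 1.2, 0.8))   # activation for one input vector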
Logistic regression as neuron

[Figure: diagram of logistic regression drawn as a single neuron.]
Different neurons

ReLU (Rectified Linear Unit): f(x) = max(x, 0)
Hyperbolic tangent: f(x) = (e^x − e^{−x}) / (e^x + e^{−x})
Softmax (logistic): f(x_j) = e^{x_j} / Σ_h e^{x_h}
Usually the same activation function is specified for all neurons in the network.
Note: thresholds are not set; the output of each neuron is f(x).
Common choice for deep networks: ReLU for the hidden layers.
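
These activation functions are one-liners in R (a minimal sketch; the hyperbolic tangent is also available as the built-in tanh):

    relu    <- function(x) pmax(x, 0)                                # ReLU, elementwise
    tanh_af <- function(x) (exp(x) - exp(-x)) / (exp(x) + exp(-x))   # hyperbolic tangent
    softmax <- function(x) exp(x) / sum(exp(x))                      # softmax over a vector of scores

    softmax(c(1, 2, 3))   # probabilities that sum to 1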
A network of neurons

Regression has only an output layer (one node, the output) and an input layer (p nodes, one per predictor).
A neural network also has one or more hidden layers with m nodes each.
One hidden layer: shallow network; more than one: deep network.
A shallow network

[Figure: diagram of a shallow network with one hidden layer.]
Notes

Each arrow is associated with β_j parameters, summarizing a linear combination that is fed to the child node.
Feed-forward: no cycles.
Dense: each neuron in the current layer is connected to each neuron in the following one.
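
A minimal sketch of a forward pass through a shallow, dense, feed-forward network in R, with hidden-layer weights W1, b1 and output weights W2, b2 assumed given, ReLU in the hidden layer and a logistic output neuron:

    forward <- function(x, W1, b1, W2, b2) {
      h <- pmax(W1 %*% x + b1, 0)     # hidden layer: linear combinations + ReLU
      drop(plogis(W2 %*% h + b2))     # output neuron: linear combination + logistic activation
    }

    # example with 3 inputs and 4 hidden neurons (arbitrary weights)
    W1 <- matrix(rnorm(4 * 3), 4, 3); b1 <- rnorm(4)
    W2 <- matrix(rnorm(4), 1, 4);     b2 <- rnorm(1)
    forward(c(1, 0.5, -2), W1, b1, W2, b2)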
A deep network

[Figure: diagram of a deep network with several hidden layers.]
Output units

Continuous endpoint: one node.
Binary endpoint: one or two nodes.
k-category endpoint: k − 1 or k nodes.
Predictions can be subject-specific: n, 2n, up to kn nodes in the output.
Ingredients

The β weights are parameters that must be learned (estimated) during the training (estimation) phase.
Depth, architecture (including the activation function), neurons per layer: hyperparameters (that must be tuned).
The learning algorithm invariably uses batching and gradient descent. Hyperparameters: learning rate, batch size, epochs, etc.
Gradient descent howto

Whatever flavour you choose, gradient descent needs the partial derivatives of the objective function.
Can you do the math?
Steps

Randomly initialize the weights (no zeros).
Implement forward propagation: compute the output for each training unit.
Compute the cost function.
Backpropagation to compute the partial derivatives of the cost function.
Gradient descent to update the weights.
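
A minimal sketch of these steps for the single logistic neuron introduced earlier, where backpropagation reduces to one explicit derivative; a design matrix X (with an intercept column) and a 0/1 outcome y are assumed:

    sigmoid <- function(z) 1 / (1 + exp(-z))

    beta <- runif(ncol(X), -0.1, 0.1)          # random initialization (no zeros)
    lr   <- 0.1                                # learning rate

    for (epoch in 1:500) {
      p    <- sigmoid(X %*% beta)                          # forward propagation: output per training unit
      cost <- -mean(y * log(p) + (1 - y) * log(1 - p))     # cross-entropy cost function
      grad <- t(X) %*% (p - y) / nrow(X)                   # partial derivatives of the cost
      beta <- beta - lr * grad                             # gradient descent update
    }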
Rules for tuning

Different architectures should be compared, but general rules can guide the range of models.
Less data implies a shallower NN.
Regular tasks: 1-5 hidden layers, with p or log(p) nodes per layer. More nodes per layer, fewer layers; and vice versa.
Transfer learning

Deep NNs have thousands of parameters; getting good estimates is a matter of huge computational effort, data availability, and luck.
A well-trained NN is a pearl in an oyster.
Transfer learning: re-use a trained NN for a different task by updating its weights.
The same architecture is specified for your (similar) task, with few exceptions (e.g., the output layer).
Only the differing nodes are learned.
Learning from scratch

Early stopping: monitor the error on a dev set, stop when it increases.
Weight decay: use ridge-type objective functions to regularize the weights.
Dropout: at each step, update only γ% (e.g., 50%) of your weights.
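
A minimal sketch of how weight decay and a dropout-style update could be added to the gradient step of the earlier single-neuron sketch; lambda and gamma are assumed tuning values, and dropout is implemented as the slide describes it, i.e. updating only a fraction of the weights per step:

    lambda <- 0.01   # ridge-type weight decay strength (assumed)
    gamma  <- 0.5    # fraction of weights updated at each step (assumed)

    # inside the training loop:
    grad <- t(X) %*% (p - y) / nrow(X) + lambda * beta   # cost gradient plus ridge penalty
    mask <- rbinom(length(beta), 1, gamma)               # randomly select ~gamma of the weights
    beta <- beta - lr * grad * mask                      # update only the selected weights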
