CSE 471: MACHINE LEARNING
Learning from Examples (Continued)
Outline
Model Selection and Optimization
Developing Machine Learning Systems
Self-study; just read through.
Model Selection and Optimization
Stationary assumption
P(Ej) = P(Ej+1) = P(Ej+2) = …
P(Ej | Ej-1, Ej-2, Ej-3, …) = P(Ej)
i.i.d. - Independent and Identically Distributed
Optimal Fit
Minimize the error rate on the test set
Suppose a researcher
Generates a hypothesis for one setting of hyperparameters
Measures its error rate on the test set, and then tries different hyperparameters.
No individual hypothesis has peeked at the
test set data, but the overall process did,
through the researcher.
Optimal Fit
We need 3 datasets
Training set
Train the models
Validation set (Development set)
Evaluate candidate models
Choose the best one
Test set
Final unbiased evaluation of the chosen model
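A minimal sketch of this three-way split in plain Python (the 60/20/20 ratio and the `three_way_split` helper name are illustrative assumptions, not from the slides):

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split a dataset into train / validation / test sets.

    The 60/20/20 ratio is an illustrative assumption.
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = data[:n_train]                # fit candidate models here
    val = data[n_train:n_train + n_val]   # compare candidates, pick the best
    test = data[n_train + n_val:]         # final unbiased evaluation, used once
    return train, val, test

train, val, test = three_way_split(range(100))
```

The test set is touched exactly once, after model selection is finished, so no hypothesis (and no researcher) gets to peek at it during tuning.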
Optimal Fit
Alternate approach
k-fold cross validation
k = 5
k = 10
k = n, Leave-one-out Cross Validation (LOOCV)
We can do without a separate validation set
We still need the test set
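A pure-Python sketch of k-fold cross-validation (the mean-predictor toy model and squared-error loss are illustrative assumptions):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_val_score(xs, ys, fit, loss, k=5):
    """Average validation loss of a model over k folds."""
    scores = []
    for tr, va in kfold_indices(len(xs), k):
        model = fit([xs[i] for i in tr], [ys[i] for i in tr])
        scores.append(sum(loss(model, xs[i], ys[i]) for i in va) / len(va))
    return sum(scores) / k

# Toy model: predict the mean of the training labels.
fit = lambda xs, ys: sum(ys) / len(ys)
loss = lambda m, x, y: (m - y) ** 2
score = cross_val_score(list(range(10)), [2.0 * x for x in range(10)], fit, loss, k=5)
```

Setting k = len(xs) in `cross_val_score` gives LOOCV: each example serves once as a one-item validation set.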
Model Selection
Regularization
Another option is Feature Selection
Recursive Feature Elimination (RFE)
Correlation study
Minimum Redundancy Maximum Relevance (mRMR)
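A sketch combining two of the ideas above: RFE-style backward elimination, but using absolute correlation with the target as a stand-in for the model-weight ranking that real RFE uses (the helper names and toy data are illustrative assumptions):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, pure Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def eliminate_features(X_cols, y, n_keep):
    """RFE-style backward elimination: repeatedly drop the feature whose
    |correlation| with the target is lowest (a stand-in for model weights).
    X_cols: dict mapping feature name -> list of values."""
    kept = dict(X_cols)
    while len(kept) > n_keep:
        weakest = min(kept, key=lambda f: abs(pearson(kept[f], y)))
        del kept[weakest]
    return list(kept)

# Toy data: feature 'a' tracks the target, 'noise' does not.
y = [1.0, 2.0, 3.0, 4.0]
X = {"a": [1.1, 2.0, 2.9, 4.2], "noise": [5.0, 1.0, 4.0, 2.0]}
selected = eliminate_features(X, y, n_keep=1)
```

Real RFE refits the model after each elimination and ranks by learned weights; mRMR additionally penalizes features that are redundant with already-selected ones.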
Hyperparameter tuning
Hand-tuning
Grid search
Few parameters
Each parameter has a small number of possible values
Can be parallelized
If two hyperparameters are independent of each other, they can be optimized separately
Random search
Bayesian optimization
Population-based training (PBT)
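A minimal grid-search sketch over a toy two-hyperparameter space (the `lr`/`depth` names and the synthetic validation loss are assumptions for illustration):

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination of hyperparameter values; return the best.
    grid: dict mapping name -> list of candidate values
    evaluate: config dict -> validation loss (lower is better)
    Each evaluation is independent, so this loop parallelizes trivially."""
    names = list(grid)
    best_cfg, best_loss = None, float("inf")
    for values in product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        cfg_loss = evaluate(cfg)
        if cfg_loss < best_loss:
            best_cfg, best_loss = cfg, cfg_loss
    return best_cfg, best_loss

# Synthetic validation loss with a known optimum (illustrative only).
evaluate = lambda c: (c["lr"] - 0.1) ** 2 + (c["depth"] - 3) ** 2
best, loss = grid_search({"lr": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}, evaluate)
```

Random search replaces `product(...)` with independent random draws per trial, which scales better when only a few hyperparameters actually matter.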
Bayesian Optimization
An ML problem in hyperparameter space!
The training data comes from the validation set
Input
The vector of hyperparameter values (X)
Labels
A vector of losses (Y) on the validation set for the model built with those hyperparameters
y is a function of x.
The learning problem
Find the function f(x) that approximates y
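A simplified sketch of this idea: fit a surrogate to (hyperparameter, validation-loss) pairs, then repeatedly evaluate the candidate the surrogate predicts is best. A kernel-weighted average stands in for the Gaussian-process surrogate and acquisition function of real Bayesian optimization (all names and the toy loss are illustrative assumptions):

```python
import math

def surrogate(x, observed, h=0.5):
    """Kernel-weighted average of observed losses: a cheap stand-in for the
    Gaussian-process surrogate used in real Bayesian optimization."""
    weights = [(math.exp(-(x - xi) ** 2 / h), yi) for xi, yi in observed]
    total = sum(w for w, _ in weights)
    return sum(w * yi for w, yi in weights) / total

def model_based_search(true_loss, candidates, n_iter=10):
    """Learn f(x) ~ y from (hyperparameter, validation-loss) pairs and
    repeatedly evaluate the candidate the surrogate predicts is best."""
    init = [candidates[0], candidates[len(candidates) // 3], candidates[-1]]
    observed = [(x, true_loss(x)) for x in init]
    for _ in range(n_iter):
        seen = {x for x, _ in observed}
        pool = [x for x in candidates if x not in seen]
        if not pool:
            break
        x = min(pool, key=lambda c: surrogate(c, observed))
        observed.append((x, true_loss(x)))  # the expensive training run
    return min(observed, key=lambda p: p[1])

# Toy 1-D hyperparameter whose validation loss is minimized at x = 2.0.
loss = lambda x: (x - 2.0) ** 2
candidates = [i / 10 for i in range(41)]  # 0.0, 0.1, ..., 4.0
best_x, best_y = model_based_search(loss, candidates)
```

The point is that each true evaluation is an expensive training run, while surrogate predictions are nearly free, so the surrogate decides where to spend the next evaluation.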
Population-based training (PBT)
First generation of models
Use random search of hyperparameters
Can be done in parallel
Second generation of models
Hyperparameters from successful (good-fit) first-generation models
Mutation
Cross-over etc.
Can be done in parallel
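A toy PBT sketch following the two-step structure above (random first generation, then mutate the survivors); the single `lr` hyperparameter, its range, the Gaussian mutation, and the synthetic loss are all illustrative assumptions:

```python
import random

def pbt(evaluate, n_generations=3, pop_size=8, seed=0):
    """Population-based training sketch: a random first generation, then each
    new generation copies hyperparameters from the best performers and
    mutates them (the 'exploit' and 'explore' steps).
    evaluate: config dict -> validation loss (lower is better)."""
    rng = random.Random(seed)
    # First generation: random search (each evaluation could run in parallel).
    pop = [{"lr": rng.uniform(0.0, 1.0)} for _ in range(pop_size)]
    for _ in range(n_generations):
        scored = sorted(pop, key=evaluate)
        survivors = scored[: pop_size // 2]  # keep the good-fit models
        children = []
        for parent in survivors:
            # Mutation: perturb a survivor's hyperparameters, clipped to range.
            child = {"lr": min(1.0, max(0.0, parent["lr"] + rng.gauss(0, 0.05)))}
            children.append(child)
        pop = survivors + children           # next generation, also parallel
    return min(pop, key=evaluate)

# Synthetic objective: validation loss minimized at lr = 0.3 (illustrative).
best = pbt(lambda c: (c["lr"] - 0.3) ** 2)
```

A cross-over step would instead build a child by mixing hyperparameters from two surviving parents; the loop structure stays the same.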