Machine Learning in Practice
Common pitfalls and debugging tricks

Kilian Weinberger, Associate Professor
(thanks to Rob Schapire and Andrew Ng)
What is Machine Learning?

Traditional CS: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program

Training: the computer is given train data with labels and learns a program (the model).
Testing: the learned program is applied to new data to produce outputs.
Example: Spam Filter
Making Machine Learning Work in Practice - StampedeCon 2014
Soon: Autonomous Cars
Machine Learning Setup

Idea → Goal → Data → Miracle Learning Algorithm → Amazing results!!! Fame, Glory, Rock’n’Roll!
1. Learning Problem

• What is my relevant data?
• What am I trying to learn?
• Can I obtain trustworthy supervision?

QUIZ: What would be some answers for email spam filtering?

Example answers:
• What is my data? Email content / meta data
• What am I trying to learn? The user’s spam/ham labels
• Can I obtain trustworthy supervision? Employees?
2. Train / Test Split

• How much data do I need? (More is more.)
• How do you split into train / test? (Always by time! Otherwise: random.)
• Training data should be just like test data!! (i.i.d.)

[Figure: data ordered by time, split into Train Data | Test Data, with real-world data arriving after the test set.]
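A minimal sketch of the time-based split described above; `time_split` and the toy email records are illustrative names, not part of any library:

```python
from datetime import datetime

def time_split(records, test_fraction=0.2):
    """Split chronologically: oldest examples train, newest test,
    mimicking deployment where the model faces future data."""
    ordered = sorted(records, key=lambda r: r["time"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

# ten toy emails, one per day
emails = [{"time": datetime(2014, 1, d), "spam": d % 2} for d in range(1, 11)]
train, test = time_split(emails)

# every training email predates every test email
assert max(r["time"] for r in train) <= min(r["time"] for r in test)
```

A purely random split would be appropriate only when the data has no temporal structure at all.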
Data Set Overfitting

By evaluating on the same data set over and over, you will overfit.

Overfitting is bounded by roughly O(√(log(#trials) / #examples)).

Kishore’s rule of thumb: subtract 1% accuracy for every time you have tested on a data set.

Ideally: create a second train / test split. Run your many experiments on the first split; use the second split for one run only!
3. Data Representation

Each data item (an email) is mapped to a feature vector. For spam filtering the vector typically mixes three kinds of features:

• Bag-of-word features (sparse): counts of tokens such as “viagra”, “hello”, “cheap”, “$”, “Microsoft”, ... (e.g. 3, 1, 0, 1, 1, ...)
• Meta features (sparse / dense): Sender in address book? IP known? Sent time in s since 1/1/1970, email size, attachment size, ... (e.g. 0, 1, 2342304222342, 12323, 0, ...)
• Aggregate statistics (dense, real-valued): percentile in email length, percentile in token likelihood, ... (e.g. 0.232, 0.1, ...)

Pitfall #1: Aggregate statistics should not be computed over test data!
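As a minimal sketch of Pitfall #1, the aggregate statistic below (percentile in email length) is computed from training lengths only and then merely applied to new emails; the helper name is hypothetical:

```python
import bisect

def length_percentile(train_lengths, length):
    """Fraction of TRAINING emails no longer than `length`.
    The reference list must never include test emails (Pitfall #1)."""
    ranks = sorted(train_lengths)
    return bisect.bisect_right(ranks, length) / len(ranks)

train_lengths = [100, 200, 300, 400]
assert length_percentile(train_lengths, 250) == 0.5   # half the train set is shorter
assert length_percentile(train_lengths, 999) == 1.0   # a test email may exceed the range
```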
Pitfall #2: Feature Scaling

1. With linear classifiers / kernels, features should have similar scale (e.g. range [0,1]).
2. You must use the same scaling constants for the test data!!! (Most likely the test data will not fall in a clean [0,1] interval.)
3. Dense features should be down-weighted when combined with sparse features.

(Scale does not matter for decision trees.)

Each feature is rescaled as f_i → (f_i + a_i) * b_i, with a_i and b_i estimated on the training data.
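A sketch of the f_i → (f_i + a_i) * b_i rescaling: the constants are fitted on training values alone and reused verbatim on test values, which may then fall outside [0, 1]. Function names are illustrative:

```python
def fit_scaling(train_values):
    """Choose a and b so that (f + a) * b maps the training range onto [0, 1]."""
    lo, hi = min(train_values), max(train_values)
    return -lo, 1.0 / (hi - lo)

def scale(values, a, b):
    return [(f + a) * b for f in values]

train_sizes = [0.0, 2.0, 4.0]
a, b = fit_scaling(train_sizes)
assert scale(train_sizes, a, b) == [0.0, 0.5, 1.0]
# test data reuses the SAME constants and may leave the clean interval:
assert scale([6.0], a, b) == [1.5]
```

Re-fitting the constants on the test set would silently leak test information into the representation.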
Pitfall #3: Over-Condensing of Features

Features do not need to be semantically meaningful. Just add them: redundancy is (generally) not a problem. Let the learning algorithm decide what’s useful!

Resist the temptation to condense the raw feature vector (e.g. 3, 1, 0, 1, 1, ..., 0.232, 0.1, ...) into a handful of hand-crafted summary numbers (e.g. 1.2, -23.2, 2.3, 5.3, 12.1).

Example: thought reading from fMRI scans. Nobody knows what the features are, but it works!!! [Mitchell et al 2008]
4. Training Signal

• How reliable is my labeling source? (E.g. in web search, editors agree 33% of the time.)
• Does the signal have high coverage?
• Is the signal derived independently of the features?!
• Could the signal shift after deployment?
Quiz: Spam Filtering

• “The spammer with IP e.v.i.l has sent 10M spam emails over the last 10 days; use all emails with this IP as spam examples.” → the label is potentially in the data (the IP is itself a feature)
• “Use users’ spam / not-spam votes as signal.” → too noisy
• “Use WUSTL students’ spam/not-spam votes.” → not diverse, low coverage
Example: Spam Filtering

Setup 1: Incoming email passes through the spam filter into Inbox or Junk; the user provides feedback (SPAM / NOT-SPAM).

Setup 2: The old spam filter annotates incoming email, and a new ML spam filter is trained on the user’s SPAM / NOT-SPAM feedback.

QUIZ: What is wrong with this setup?

Problem: Users only vote when the classifier is wrong, so the new filter learns to exactly invert the old classifier.

Possible solution: Occasionally let emails through the filter to avoid this bias.
Example: Trusted Votes

Goal: Classify email votes as trusted / untrusted.

Signal conjecture: plotted over time, votes from an evil spammer community form a distinctive pattern of “good” votes standing out against the “bad” votes.

Searching for the signal:
• The good news: We found that exact pattern A LOT!!
• The bad news: We found other patterns (“good” and “bad” vote bursts in every order) just as often.

Moral: Given enough data you’ll find anything! You need to be very, very careful that you learn the right thing!
5. Learning Method

• Classification / regression / ranking?
• Do you want probabilities?
• How sensitive is the model to label noise?
• Do you have skewed classes / weighted examples?
• Best off-the-shelf: Random Forests, Boosted Trees, SVM
• Generally: try out several algorithms

Method Complexity (KISS)

Common pitfall: using too complicated a learning algorithm.

ALWAYS try the simplest algorithm first!!! Move to more complex systems after the simple one works. Rule of diminishing returns!! (Scientific papers exaggerate the benefit of complex theory.)

QUIZ: What would you use for spam?
Ready-Made Packages

Weka 3: http://www.cs.waikato.ac.nz/~ml/index.html
Vowpal Wabbit (very large scale): http://hunch.net/~vw/
Machine Learning Open Source Software Project: http://mloss.org/software
MALLET: Machine Learning for Language Toolkit: http://mallet.cs.umass.edu/index.php/Main_Page
scikit-learn (Python): http://scikit-learn.org/stable/
Large-scale SVM: http://machinelearning.wustl.edu/pmwiki.php/Main/Wusvm
SVMlin (very fast linear SVM): http://people.cs.uchicago.edu/~vikass/svmlin.html
LIBSVM (powerful SVM implementation): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMlight: http://svmlight.joachims.org/svm_struct.html
Model Selection
(parameter setting with cross validation)

• Do not trust default hyper-parameters.
• Split the training data further into Train’ and a validation set Val; pick the parameters that do best on Val.
• Search with grid search or Bayesian optimization (B.O. is usually better than grid search).
• Most importantly: the learning rate!!
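A bare-bones version of the grid-search loop above: fit on Train’ for every parameter combination and keep the one with the lowest Val error. All names are illustrative, and the toy “model” is just a constant:

```python
from itertools import product

def grid_search(train_fn, error_fn, grid):
    """Try every combination in `grid`; return the one with the
    lowest validation error (the test set is never touched here)."""
    best_params, best_err = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        err = error_fn(train_fn(**params))
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# toy problem: the "model" is a constant c, validated against a target of 2.2
best, err = grid_search(
    train_fn=lambda c: c,
    error_fn=lambda model: abs(model - 2.2),
    grid={"c": [0, 1, 2, 3]},
)
assert best == {"c": 2}
```

Bayesian optimization replaces the exhaustive loop with a model of the error surface, but the train/val contract is identical.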
6. Experimental Setup

1. Automate everything (one-button setup)
• pre-processing / training / testing / evaluation
• Lets you reproduce results easily
• Fewer errors!!
2. Parallelize your experiments
Quiz

T/F: Condensing features with domain expertise improves learning. FALSE
T/F: Feature scaling is irrelevant for boosted decision trees. TRUE
T/F: To avoid data set overfitting, benchmark on a second train/test data set. TRUE
T/F: Ideally, derive your signal directly from the features. FALSE
T/F: You cannot create a train/test split when your data changes over time. FALSE
T/F: Always compute aggregate statistics over the entire corpus. FALSE
Debugging ML Algorithms

Debugging: Spam Filtering

You implemented logistic regression with regularization. Problem: your test error is too high (12%)!

QUIZ: What can you do to fix it?

Fixing attempts:
1. Get more training data
2. Get more features
3. Select fewer features
4. Feature engineering (e.g. meta features, header information)
5. Run gradient descent longer
6. Use Newton’s method for optimization
7. Change regularization
8. Use SVMs instead of logistic regression

But: which one should we try out?

Possible problems. Diagnostics:
1. Underfitting: training error almost as high as test error
2. Overfitting: training error much lower than test error
3. Wrong algorithm: other methods do better
4. Optimizer: loss function is not minimized
Underfitting / Overfitting Diagnostics

[Figure: training and testing error vs. training set size, against the desired error.]

Overfitting:
• test error still decreasing with more data
• large gap between train and test error
Remedies: get more data, do bagging, feature selection.

Underfitting:
• even the training error is too high
• small gap between train and test error
Remedies: add features, improve features, use a more powerful ML algorithm, (boosting).

Problem: You are “too good” on your setup ...
[Figure: training and testing error vs. iterations, with the online error far above both.]
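The diagnostics above amount to reading a learning curve. A sketch with a deliberately underfitting constant model (all names illustrative):

```python
def learning_curve(train, test, fit, error, sizes):
    """Fit on growing training subsets and record (n, train error, test error).
    Both errors high and close together -> underfitting;
    a large persistent gap -> overfitting."""
    return [(n, error(fit(train[:n]), train[:n]), error(fit(train[:n]), test))
            for n in sizes]

points = [(x, 2.0 * x) for x in range(20)]
fit = lambda data: sum(y for _, y in data) / len(data)      # constant predictor
err = lambda m, data: sum(abs(y - m) for _, y in data) / len(data)

curve = learning_curve(points[:15], points[15:], fit, err, [5, 10, 15])
# training error rises as the training set grows: the model cannot keep up
assert curve[0][1] < curve[-1][1]
```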
Possible Problems

• Is the label included in the data set?
• Does the training set contain test data?

Famous example in 2007: Caltech 101.
[Figure: published Caltech 101 test accuracy (0 to 90%) climbing steeply over 2005 to 2009.]
Problem: Online Error > Test Error

[Figure: training, testing, and desired error vs. training set size, with the online error well above the test error.]

Analytics. Suspicion: the online data is differently distributed.

Construct a new binary classification problem: online vs. train+test. If you can learn this (error < 50%), you have a distribution problem!! You do not need any labels for this!!
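The online-vs-offline check can be sketched with any classifier; here a trivial nearest-centroid rule on a single feature (names illustrative). Error clearly below 50% means the two samples are distinguishable, i.e. differently distributed:

```python
def distribution_check(offline, online):
    """Label offline points 0 and online points 1; return the error of a
    nearest-centroid classifier that tries to tell them apart."""
    c0 = sum(offline) / len(offline)
    c1 = sum(online) / len(online)
    data = [(x, 0) for x in offline] + [(x, 1) for x in online]
    wrong = sum(1 for x, y in data
                if (abs(x - c1) < abs(x - c0)) != (y == 1))
    return wrong / len(data)

# shifted online traffic is easy to separate -> error far below 50%
assert distribution_check([1.0, 1.2, 0.9], [5.0, 5.3, 4.8]) == 0.0
```

In practice you would use your regular learner and proper held-out evaluation, but the labels are free either way.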
Suspicion: Temporal Distribution Drift

Compare a time-ordered Train | Test split (e.g. 12% error) against a split of the same data after shuffling (e.g. 1% error).

If E(shuffle) < E(train/test), then you have temporal distribution drift.

Cures: retrain frequently / online learning.
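The shuffle experiment in miniature, with labels that drift over time and a majority-class predictor standing in for a real learner (all names illustrative):

```python
import random

def majority_error(train_labels, test_labels):
    """Error of predicting the training set's majority class."""
    pred = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return sum(1 for y in test_labels if y != pred) / len(test_labels)

labels = [0] * 50 + [1] * 50          # labels drift: all 0 early, all 1 late

e_time = majority_error(labels[:80], labels[80:])        # split by time
random.seed(0)
shuffled = labels[:]
random.shuffle(shuffled)
e_shuffle = majority_error(shuffled[:80], shuffled[80:]) # split after shuffling

assert e_time == 1.0        # the time-ordered split looks hopeless
assert e_shuffle < e_time   # the shuffled split is far easier: drift detected
```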
Final Quiz

T/F: Increasing your training set size increases the training error. TRUE
T/F: Temporal drift can be detected by shuffling the training/test sets. TRUE
T/F: Increasing your feature set size decreases the training error. TRUE
T/F: More features always decrease the test error. FALSE
T/F: Very low validation error always indicates you are doing well. FALSE
T/F: When an algorithm overfits there is a big gap between train and test error. TRUE
T/F: Underfitting can be cured with more powerful learners. TRUE
T/F: The test error is (almost) never below the training error. TRUE
Summary
“Machine learning is only sexy when it works.”
ML algorithms deserve a careful setup
Debugging is just like any other code
1. Carefully rule out possible causes
2. Apply appropriate fixes
Resources

• Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Y. LeCun, L. Bottou, G. Orr and K.-R. Müller: Efficient BackProp, in Orr, G. and Müller, K. (Eds), Neural Networks: Tricks of the Trade, Springer, 1998.
• Pattern Recognition and Machine Learning by Christopher M. Bishop
• Andrew Ng’s ML course: http://www.youtube.com/watch?v=UzxYlbK2c7E
