Machine Learning in Practice
Common pitfalls and debugging tricks

Kilian Weinberger, Associate Professor
(thanks to Rob Schapire and Andrew Ng)
What is Machine Learning?

Traditional CS: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program

Training: the computer is given train data with labels and learns a program (the model).
Testing: the learned program is applied to new data to produce outputs.
Example: Spam Filter
Making Machine Learning Work in Practice - StampedeCon 2014
Soon: Autonomous Cars
Machine Learning Setup

Idea → Goal → Data → Miracle Learning Algorithm → Amazing results!!! Fame, Glory, Rock’n’Roll!
1. Learning Problem

• What is my relevant data?
• What am I trying to learn?
• Can I obtain trustworthy supervision?

QUIZ: What would be some answers for email spam filtering?

Example answers:
• What is my data? Email content / meta data
• What am I trying to learn? The user’s spam/ham labels
• Can I obtain trustworthy supervision? Employees?
2. Train / Test Split

• How much data do I need? (More is more.)
• How do you split into train / test? (Always by time! Otherwise: random.)
• Training data should be just like test data!! (i.i.d.)

[Figure: data ordered by time, split into Train Data | Test Data, with real-world data arriving after the test set.]
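A minimal sketch of the time-based split described above; `time_split` and the toy email records are illustrative names, not part of any library:

```python
from datetime import datetime

def time_split(records, test_fraction=0.2):
    """Split chronologically: oldest examples train, newest test,
    mimicking deployment where the model faces future data."""
    ordered = sorted(records, key=lambda r: r["time"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

# ten toy emails, one per day
emails = [{"time": datetime(2014, 1, d), "spam": d % 2} for d in range(1, 11)]
train, test = time_split(emails)

# every training email predates every test email
assert max(r["time"] for r in train) <= min(r["time"] for r in test)
```

A purely random split would be appropriate only when the data has no temporal structure at all.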
Data Set Overfitting

By evaluating on the same data set over and over, you will overfit.

Overfitting is bounded by roughly O(√(log(#trials) / #examples)).

Kishore’s rule of thumb: subtract 1% accuracy for every time you have tested on a data set.

Ideally: create a second train / test split. Run your many experiments on the first split; use the second split for one run only!
3. Data Representation

Each data item (an email) is mapped to a feature vector. For spam filtering the vector typically mixes three kinds of features:

• Bag-of-word features (sparse): counts of tokens such as “viagra”, “hello”, “cheap”, “$”, “Microsoft”, ... (e.g. 3, 1, 0, 1, 1, ...)
• Meta features (sparse / dense): Sender in address book? IP known? Sent time in s since 1/1/1970, email size, attachment size, ... (e.g. 0, 1, 2342304222342, 12323, 0, ...)
• Aggregate statistics (dense, real-valued): percentile in email length, percentile in token likelihood, ... (e.g. 0.232, 0.1, ...)

Pitfall #1: Aggregate statistics should not be computed over test data!
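As a minimal sketch of Pitfall #1, the aggregate statistic below (percentile in email length) is computed from training lengths only and then merely applied to new emails; the helper name is hypothetical:

```python
import bisect

def length_percentile(train_lengths, length):
    """Fraction of TRAINING emails no longer than `length`.
    The reference list must never include test emails (Pitfall #1)."""
    ranks = sorted(train_lengths)
    return bisect.bisect_right(ranks, length) / len(ranks)

train_lengths = [100, 200, 300, 400]
assert length_percentile(train_lengths, 250) == 0.5   # half the train set is shorter
assert length_percentile(train_lengths, 999) == 1.0   # a test email may exceed the range
```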
Pitfall #2: Feature Scaling

1. With linear classifiers / kernels, features should have similar scale (e.g. range [0,1]).
2. You must use the same scaling constants for the test data!!! (Most likely the test data will not fall in a clean [0,1] interval.)
3. Dense features should be down-weighted when combined with sparse features.

(Scale does not matter for decision trees.)

Each feature is rescaled as f_i → (f_i + a_i) * b_i, with a_i and b_i estimated on the training data.
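A sketch of the f_i → (f_i + a_i) * b_i rescaling: the constants are fitted on training values alone and reused verbatim on test values, which may then fall outside [0, 1]. Function names are illustrative:

```python
def fit_scaling(train_values):
    """Choose a and b so that (f + a) * b maps the training range onto [0, 1]."""
    lo, hi = min(train_values), max(train_values)
    return -lo, 1.0 / (hi - lo)

def scale(values, a, b):
    return [(f + a) * b for f in values]

train_sizes = [0.0, 2.0, 4.0]
a, b = fit_scaling(train_sizes)
assert scale(train_sizes, a, b) == [0.0, 0.5, 1.0]
# test data reuses the SAME constants and may leave the clean interval:
assert scale([6.0], a, b) == [1.5]
```

Re-fitting the constants on the test set would silently leak test information into the representation.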
Pitfall #3: Over-Condensing of Features

Features do not need to be semantically meaningful. Just add them: redundancy is (generally) not a problem. Let the learning algorithm decide what’s useful!

Resist the temptation to condense the raw feature vector (e.g. 3, 1, 0, 1, 1, ..., 0.232, 0.1, ...) into a handful of hand-crafted summary numbers (e.g. 1.2, -23.2, 2.3, 5.3, 12.1).

Example: thought reading from fMRI scans. Nobody knows what the features are, but it works!!! [Mitchell et al 2008]
4. Training Signal

• How reliable is my labeling source? (E.g. in web search, editors agree 33% of the time.)
• Does the signal have high coverage?
• Is the signal derived independently of the features?!
• Could the signal shift after deployment?
Quiz: Spam Filtering

• “The spammer with IP e.v.i.l has sent 10M spam emails over the last 10 days; use all emails with this IP as spam examples.” → the label is potentially in the data (the IP is itself a feature)
• “Use users’ spam / not-spam votes as signal.” → too noisy
• “Use WUSTL students’ spam/not-spam votes.” → not diverse, low coverage
Example: Spam Filtering

Setup 1: Incoming email passes through the spam filter into Inbox or Junk; the user provides feedback (SPAM / NOT-SPAM).

Setup 2: The old spam filter annotates incoming email, and a new ML spam filter is trained on the user’s SPAM / NOT-SPAM feedback.

QUIZ: What is wrong with this setup?

Problem: Users only vote when the classifier is wrong, so the new filter learns to exactly invert the old classifier.

Possible solution: Occasionally let emails through the filter to avoid this bias.
Example: Trusted Votes

Goal: Classify email votes as trusted / untrusted.

Signal conjecture: plotted over time, votes from an evil spammer community form a distinctive pattern of “good” votes standing out against the “bad” votes.

Searching for the signal:
• The good news: We found that exact pattern A LOT!!
• The bad news: We found other patterns (“good” and “bad” vote bursts in every order) just as often.

Moral: Given enough data you’ll find anything! You need to be very, very careful that you learn the right thing!
5. Learning Method

• Classification / regression / ranking?
• Do you want probabilities?
• How sensitive is the model to label noise?
• Do you have skewed classes / weighted examples?
• Best off-the-shelf: Random Forests, Boosted Trees, SVM
• Generally: try out several algorithms

Method Complexity (KISS)

Common pitfall: using too complicated a learning algorithm.

ALWAYS try the simplest algorithm first!!! Move to more complex systems after the simple one works. Rule of diminishing returns!! (Scientific papers exaggerate the benefit of complex theory.)

QUIZ: What would you use for spam?
Ready-Made Packages

Weka 3: http://www.cs.waikato.ac.nz/~ml/index.html
Vowpal Wabbit (very large scale): http://hunch.net/~vw/
Machine Learning Open Source Software Project: http://mloss.org/software
MALLET: Machine Learning for Language Toolkit: http://mallet.cs.umass.edu/index.php/Main_Page
scikit-learn (Python): http://scikit-learn.org/stable/
Large-scale SVM: http://machinelearning.wustl.edu/pmwiki.php/Main/Wusvm
SVMlin (very fast linear SVM): http://people.cs.uchicago.edu/~vikass/svmlin.html
LIBSVM (powerful SVM implementation): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMlight: http://svmlight.joachims.org/svm_struct.html
Model Selection
(parameter setting with cross validation)

• Do not trust default hyper-parameters.
• Split the training data further into Train’ and a validation set Val; pick the parameters that do best on Val.
• Search with grid search or Bayesian optimization (B.O. is usually better than grid search).
• Most importantly: the learning rate!!
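A bare-bones version of the grid-search loop above: fit on Train’ for every parameter combination and keep the one with the lowest Val error. All names are illustrative, and the toy “model” is just a constant:

```python
from itertools import product

def grid_search(train_fn, error_fn, grid):
    """Try every combination in `grid`; return the one with the
    lowest validation error (the test set is never touched here)."""
    best_params, best_err = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        err = error_fn(train_fn(**params))
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# toy problem: the "model" is a constant c, validated against a target of 2.2
best, err = grid_search(
    train_fn=lambda c: c,
    error_fn=lambda model: abs(model - 2.2),
    grid={"c": [0, 1, 2, 3]},
)
assert best == {"c": 2}
```

Bayesian optimization replaces the exhaustive loop with a model of the error surface, but the train/val contract is identical.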
6. Experimental Setup

1. Automate everything (one-button setup)
• pre-processing / training / testing / evaluation
• Lets you reproduce results easily
• Fewer errors!!
2. Parallelize your experiments
Quiz

T/F: Condensing features with domain expertise improves learning. FALSE
T/F: Feature scaling is irrelevant for boosted decision trees. TRUE
T/F: To avoid data set overfitting, benchmark on a second train/test data set. TRUE
T/F: Ideally, derive your signal directly from the features. FALSE
T/F: You cannot create a train/test split when your data changes over time. FALSE
T/F: Always compute aggregate statistics over the entire corpus. FALSE
Debugging ML Algorithms

Debugging: Spam Filtering

You implemented logistic regression with regularization. Problem: your test error is too high (12%)!

QUIZ: What can you do to fix it?

Fixing attempts:
1. Get more training data
2. Get more features
3. Select fewer features
4. Feature engineering (e.g. meta features, header information)
5. Run gradient descent longer
6. Use Newton’s method for optimization
7. Change regularization
8. Use SVMs instead of logistic regression

But: which one should we try out?

Possible problems. Diagnostics:
1. Underfitting: training error almost as high as test error
2. Overfitting: training error much lower than test error
3. Wrong algorithm: other methods do better
4. Optimizer: loss function is not minimized
Underfitting / Overfitting Diagnostics

[Figure: training and testing error vs. training set size, against the desired error.]

Overfitting:
• test error still decreasing with more data
• large gap between train and test error
Remedies: get more data, do bagging, feature selection.

Underfitting:
• even the training error is too high
• small gap between train and test error
Remedies: add features, improve features, use a more powerful ML algorithm, (boosting).

Problem: You are “too good” on your setup ...
[Figure: training and testing error vs. iterations, with the online error far above both.]
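The diagnostics above amount to reading a learning curve. A sketch with a deliberately underfitting constant model (all names illustrative):

```python
def learning_curve(train, test, fit, error, sizes):
    """Fit on growing training subsets and record (n, train error, test error).
    Both errors high and close together -> underfitting;
    a large persistent gap -> overfitting."""
    return [(n, error(fit(train[:n]), train[:n]), error(fit(train[:n]), test))
            for n in sizes]

points = [(x, 2.0 * x) for x in range(20)]
fit = lambda data: sum(y for _, y in data) / len(data)      # constant predictor
err = lambda m, data: sum(abs(y - m) for _, y in data) / len(data)

curve = learning_curve(points[:15], points[15:], fit, err, [5, 10, 15])
# training error rises as the training set grows: the model cannot keep up
assert curve[0][1] < curve[-1][1]
```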
Possible Problems

• Is the label included in the data set?
• Does the training set contain test data?

Famous example in 2007: Caltech 101.
[Figure: published Caltech 101 test accuracy (0 to 90%) climbing steeply over 2005 to 2009.]
Problem: Online Error > Test Error

[Figure: training, testing, and desired error vs. training set size, with the online error well above the test error.]

Analytics. Suspicion: the online data is differently distributed.

Construct a new binary classification problem: online vs. train+test. If you can learn this (error < 50%), you have a distribution problem!! You do not need any labels for this!!
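The online-vs-offline check can be sketched with any classifier; here a trivial nearest-centroid rule on a single feature (names illustrative). Error clearly below 50% means the two samples are distinguishable, i.e. differently distributed:

```python
def distribution_check(offline, online):
    """Label offline points 0 and online points 1; return the error of a
    nearest-centroid classifier that tries to tell them apart."""
    c0 = sum(offline) / len(offline)
    c1 = sum(online) / len(online)
    data = [(x, 0) for x in offline] + [(x, 1) for x in online]
    wrong = sum(1 for x, y in data
                if (abs(x - c1) < abs(x - c0)) != (y == 1))
    return wrong / len(data)

# shifted online traffic is easy to separate -> error far below 50%
assert distribution_check([1.0, 1.2, 0.9], [5.0, 5.3, 4.8]) == 0.0
```

In practice you would use your regular learner and proper held-out evaluation, but the labels are free either way.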
Suspicion: Temporal Distribution Drift

Compare a time-ordered Train | Test split (e.g. 12% error) against a split of the same data after shuffling (e.g. 1% error).

If E(shuffle) < E(train/test), then you have temporal distribution drift.

Cures: retrain frequently / online learning.
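The shuffle experiment in miniature, with labels that drift over time and a majority-class predictor standing in for a real learner (all names illustrative):

```python
import random

def majority_error(train_labels, test_labels):
    """Error of predicting the training set's majority class."""
    pred = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return sum(1 for y in test_labels if y != pred) / len(test_labels)

labels = [0] * 50 + [1] * 50          # labels drift: all 0 early, all 1 late

e_time = majority_error(labels[:80], labels[80:])        # split by time
random.seed(0)
shuffled = labels[:]
random.shuffle(shuffled)
e_shuffle = majority_error(shuffled[:80], shuffled[80:]) # split after shuffling

assert e_time == 1.0        # the time-ordered split looks hopeless
assert e_shuffle < e_time   # the shuffled split is far easier: drift detected
```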
Final Quiz

T/F: Increasing your training set size increases the training error. TRUE
T/F: Temporal drift can be detected by shuffling the training/test sets. TRUE
T/F: Increasing your feature set size decreases the training error. TRUE
T/F: More features always decrease the test error. FALSE
T/F: Very low validation error always indicates you are doing well. FALSE
T/F: When an algorithm overfits there is a big gap between train and test error. TRUE
T/F: Underfitting can be cured with more powerful learners. TRUE
T/F: The test error is (almost) never below the training error. TRUE
Summary
“Machine learning is only sexy when it works.”
ML algorithms deserve a careful setup
Debugging is just like any other code
1. Carefully rule out possible causes
2. Apply appropriate fixes
Resources

• Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Y. LeCun, L. Bottou, G. Orr and K.-R. Müller: Efficient BackProp, in Orr, G. and Müller, K. (Eds), Neural Networks: Tricks of the Trade, Springer, 1998.
• Pattern Recognition and Machine Learning by Christopher M. Bishop
• Andrew Ng’s ML course: http://www.youtube.com/watch?v=UzxYlbK2c7E
