Project Proposals
▪ Due next Monday (Feb 13) at 11:59pm ET
▪ 2-page proposal (more details in the “Course Logistics” document on Canvas) + references
▪ Today by 5pm ET, we will post:
▪ Project topics and some concrete problems
▪ Sample proposals and final reports from past iterations
▪ LaTeX and Word templates which you will use to write the proposal
Office Hours and Paper Presentations
▪ Office hours switch this week
▪ Suraj and Jiaqi today
▪ Hima on Thursday
▪ Students signed up for presentations next week should see us in office hours this week and bring:
▪ A full slide deck (ideally!)
▪ An overview of what you plan to present
Rule-Based Approaches
Agenda
▪ Paper 1: Interpretable Rule Lists (Letham et al.)
▪ Paper 2: Interpretable Rule Sets (Lakkaraju et al.)
▪ Discussion
Interpretable Classifiers Using
Rules and Bayesian Analysis
Benjamin Letham, Cynthia Rudin, Tyler McCormick, David Madigan; 2015
Contributions
▪ Introducing a generative model called Bayesian Rule Lists (BRL)
▪ Goal is to output a decision list (if … then … else if …)
▪ Novel prior structure to encourage sparsity
▪ Predictive accuracy on par with top algorithms
Decision List: Example
This is “an” accurate and interpretable decision list, possibly one of many such lists.
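A hypothetical decision list of the kind BRL outputs (the conditions and risk levels here are made up for illustration, not taken from the slide's figure):

    if hemiplegia and age > 60 then stroke risk is high
    else if cerebrovascular disorder then stroke risk is medium
    else if age <= 50 then stroke risk is low
    else stroke risk is medium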
Introduction: BRL
▪ Produces a posterior distribution over permutations of if … then … else-if … rules from a large set of pre-mined rules
▪ Decision lists with high posterior probability tend to be both accurate and interpretable
▪ The prior favors concise lists with a small number of rules and fewer terms on the left-hand side
Introduction: BRL
▪ A new type of balance between accuracy, interpretability, and computation
▪ Why not other similar models?
▪ Decision trees (CART)
▪ They employ greedy construction methods
▪ Greedy construction is not particularly computationally demanding, but it hurts the quality of the solution, both its accuracy and its interpretability
Pre-mined Rules
▪ A major source of practical feasibility: pre-mined rules
▪ Reduces the model space
▪ The complexity of the problem depends on the number of pre-mined rules
▪ As long as the pre-mined set is expressive, an accurate decision list can be found; a smaller model space also means better generalization (Vapnik, 1995)
Pre-mined Rules: Intuition
[Figure: frequent itemset mining example with minimum support = 3]
This is the Apriori algorithm. FP-growth is a more efficient algorithm (it needs only two passes over the data rather than one per itemset size).
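A minimal Python sketch of the pre-mining step, assuming a toy transaction set and the slide's minimum support of 3 (the data and names are illustrative):

    from itertools import combinations

    # Toy transactions; each row is a set of feature conditions (items).
    transactions = [
        {"age>60", "hypertension", "smoker"},
        {"age>60", "hypertension"},
        {"age>60", "smoker"},
        {"hypertension", "smoker"},
    ]
    MIN_SUPPORT = 3  # minimum number of transactions an itemset must appear in

    def support(itemset):
        """Count transactions containing every item in the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]

    # Level 2: candidate pairs built only from frequent singletons (Apriori pruning).
    singles = {next(iter(s)) for s in frequent}
    pairs = [frozenset(p) for p in combinations(sorted(singles), 2)]
    frequent += [p for p in pairs if support(p) >= MIN_SUPPORT]

    for s in frequent:
        print(set(s), support(s))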
Preliminaries: Notation
▪ Training data: (x_i, y_i), i = 1, …, n, where x_i are the feature vectors and y_i the labels
▪ Two labels: stroke or no stroke
Bayesian Decision Lists
A decision list d = (a_1, …, a_m) assigns each observation the consequent distribution of the first antecedent it satisfies:
if x satisfies a_1 then y ~ Multinomial(θ_1)
else if x satisfies a_2 then y ~ Multinomial(θ_2)
…
else y ~ Multinomial(θ_{m+1}) (default rule)
where θ_j ~ Dirichlet(α) for each rule.
Preliminaries: Multinomial
▪ Sampling from a multinomial: draw counts over K categories, y ~ Multinomial(n, θ)
▪ The parameters θ = (θ_1, …, θ_K) are probability values summing to 1
Preliminaries: Dirichlet
▪ Dirichlet: sampling over a probability simplex
▪ E.g., (0.6, 0.4) is a sample from a Dirichlet distribution
▪ A K-dimensional Dirichlet has K parameters, each of which can be any positive number
▪ E.g., Dirichlet(60, 40)
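A quick NumPy sketch of both preliminaries, using the slide's example parameters:

    import numpy as np

    rng = np.random.default_rng(0)

    # Dirichlet(60, 40): a draw lives on the probability simplex, roughly (0.6, 0.4).
    theta = rng.dirichlet([60, 40])
    print(theta, theta.sum())        # two positive numbers summing to 1

    # Multinomial with parameters theta: distribute 100 draws across the 2 categories.
    counts = rng.multinomial(100, theta)
    print(counts)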
Preliminaries: Dirichlet Prior
▪ The Dirichlet is the conjugate prior for the multinomial distribution
▪ Conjugate prior: the posterior is in the same family as the prior
▪ Prior: θ ~ Dirichlet(α_1, …, α_K)
▪ Posterior after observing label counts (N_1, …, N_K): θ | y ~ Dirichlet(α_1 + N_1, …, α_K + N_K)
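A one-line illustration of the conjugate update (the pseudo-counts and observed counts are made up):

    import numpy as np

    alpha = np.array([1.0, 1.0])    # Dirichlet prior parameters (pseudo-counts)
    N = np.array([30, 10])          # observed label counts, e.g. stroke vs. no stroke
    posterior = alpha + N           # posterior is Dirichlet(alpha + N)
    print(posterior)                        # [31. 11.]
    print(posterior / posterior.sum())      # posterior mean of theta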
Bayesian Association Rules
A Bayesian association rule pairs an antecedent a with a Dirichlet-multinomial model over its consequent: θ ~ Dirichlet(α), and y ~ Multinomial(θ) for observations that satisfy a.
Generative Model
Our goal is to sample from the posterior distribution over antecedent lists:
p(d | x, y, A, α, λ, η) ∝ p(y | x, d, α) · p(d | A, λ, η)
where A is the complete collection of pre-mined antecedents.
Prior Probabilities
Truncated Poisson over the list length m:
p(m | A, λ) ∝ λ^m / m!, for 0 ≤ m ≤ |A|
Ensures that sampled values are within bounds!
Also ensures the expected value is close to λ when there is a large number of pre-mined rules.
Prior Probabilities
The cardinality c_j of each antecedent's left-hand side follows another truncated Poisson, with parameter η; a_j is then sampled uniformly from the available antecedents of the appropriate cardinality.
Likelihood
▪ The likelihood is the product of multinomial probability mass functions for the observed label counts at each rule
Marginalize over θ (integrate out the intermediate parameter):
p(y | x, d, α) = ∏_j [ Γ(Σ_k α_k) / Γ(N_j· + Σ_k α_k) ] · ∏_k [ Γ(N_jk + α_k) / Γ(α_k) ]
where N_jk is the number of training points captured by rule j that have label k, and N_j· = Σ_k N_jk.
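Because of conjugacy the integral has a closed form; each rule contributes a Dirichlet-multinomial term. A sketch of the computation (the per-rule label counts are made up):

    import numpy as np
    from scipy.special import gammaln

    def log_marginal(counts, alpha):
        """log p(counts | alpha) with theta integrated out (Dirichlet-multinomial)."""
        counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
        return (gammaln(alpha.sum()) - gammaln(counts.sum() + alpha.sum())
                + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

    # Log likelihood of a decision list = sum of per-rule terms.
    rule_label_counts = [[40, 5], [3, 25], [10, 9]]   # labels captured by each rule
    print(sum(log_marginal(c, [1.0, 1.0]) for c in rule_label_counts))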
Markov Chain Monte Carlo
▪ Generate a chain of random samples until convergence
▪ Each random sample is a stepping stone for the next one (the “chain”)
▪ New samples do not depend on any samples before the previous one (the “Markov” property)
Markov Chain Monte Carlo
▪ How do we move toward (optimal) d* from the current d_t? Three proposal moves:
▪ Move an antecedent to a different position in the list
▪ Add an antecedent that is not currently in the list
▪ Remove an antecedent from the list
Metropolis Hastings
▪ Start with a random decision list
▪ Choose a move based on a “proposal distribution” Q
▪ After choosing the move, compute an acceptance probability A
▪ Generate a random number u ~ Uniform(0, 1)
▪ If u ≤ A, accept the move; otherwise reject it
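A hedged sketch of the loop; score() stands in for the log unnormalized posterior p(y | x, d, α) · p(d | A, λ, η) and propose() for the move/add/remove moves. Both are hypothetical helpers, not the paper's code:

    import math
    import random

    def metropolis_hastings(d0, score, propose, n_steps=10000):
        """Generic Metropolis-Hastings over decision lists."""
        d = best = d0
        for _ in range(n_steps):
            d_new, log_q_ratio = propose(d)   # log Q(d | d_new) - log Q(d_new | d)
            log_A = score(d_new) - score(d) + log_q_ratio
            if math.log(random.random()) <= min(0.0, log_A):  # accept w.p. min(1, A)
                d = d_new
            if score(d) > score(best):        # track the highest-posterior list seen
                best = d
        return best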
Metropolis Hastings
Acceptance probability:
A(d* | d_t) = min{ 1, [ p(d* | x, y, A, α, λ, η) · Q(d_t | d*) ] / [ p(d_t | x, y, A, α, λ, η) · Q(d* | d_t) ] }
Proposal Probabilities
▪ The move type is chosen uniformly
▪ Which antecedent to move/add/remove, and its new position, are also chosen uniformly
Estimating the Label of a New Observation
Match the new observation to the first antecedent it satisfies (based on its feature values) and predict with that rule's label distribution.
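A sketch of prediction over a fitted list (the rules, conditions, and label distributions are illustrative):

    def predict(x, decision_list, default_dist):
        """Return the label distribution of the first rule whose antecedent x satisfies."""
        for conditions, label_dist in decision_list:
            if all(cond(x) for cond in conditions):   # x matches every condition
                return label_dist
        return default_dist                           # no antecedent fired: default rule

    rules = [
        ([lambda x: x["age"] > 60, lambda x: x["hypertension"]],
         {"stroke": 0.6, "no stroke": 0.4}),
    ]
    print(predict({"age": 70, "hypertension": True}, rules,
                  {"stroke": 0.1, "no stroke": 0.9}))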
Tic-Tac-Toe
5-fold cross-validation; accuracy computed across the 5 folds.
Stroke Prediction
▪ N = 12,586; 14% had a stroke
▪ 6,000 times larger than the data used to develop the CHADS2 score
▪ Pre-mining: support 10% and max cardinality 2
▪ 5-fold evaluation
Stroke Prediction
[Figure: decision list learned for stroke prediction]
Stroke Prediction - AUC
[Figure: AUC comparison across methods]
Interpretable Decision Sets
Hima Lakkaraju, Stephen Bach, Jure Leskovec; 2016
Contributions
▪ A framework called Interpretable Decision Sets (IDS) for classification
▪ A novel objective function + a proof of submodularity
▪ An optimization procedure with optimality guarantees
▪ Detailed metrics for evaluating interpretability + user studies
Motivation
▪ Traditional classification models optimize for predictive accuracy
▪ Very little understanding of the model itself and its predictions
▪ A model being “readable” is not enough
▪ Humans should be able to reason about predictions and readily explain the functionality of the model
Decision Sets
[Figure: example of a decision set]
Criteria for Interpretability
▪ Parsimony: fewer rules with fewer conditions
▪ Cognitive limits of human understanding
▪ Distinctness: minimal overlap of rules w.r.t. the data points they cover
▪ No redundant or contradictory explanations of data points
▪ Class Coverage: explain all the classes in the data
▪ Rules explaining minority classes are important
Problem Formulation
Given candidate rules (pattern, class) pre-mined from the data, select a subset of them (a decision set) that is both accurate and interpretable.
Desiderata
▪ We need to optimize for the following criteria:
▪ Recall
▪ Precision
▪ Distinctness
▪ Parsimony
▪ Class Coverage
▪ Recall and Precision → accurate predictions
▪ Distinctness, Parsimony, and Class Coverage → interpretability
Objective Function
[Built up over several slides: one term each for parsimony (number of rules and rule width), distinctness (overlap penalties), class coverage, precision, and recall; the equations were shown as figures]
Submodularity
Diminishing returns characterization: for A ⊆ B and an element d ∉ B,
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
(the gain of adding d to the small set A) ≥ (the gain of adding d to the large set B)
A non-negative linear combination of submodular functions is submodular.
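A quick numeric illustration with a coverage function, which is submodular (the rules and the points they cover are made up):

    # F(S) = number of data points covered by at least one rule in S.
    cover = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {4, 5, 6}}

    def F(S):
        return len(set().union(*(cover[r] for r in S))) if S else 0

    A, B, d = {"r1"}, {"r1", "r2"}, "r3"   # A is a subset of B; d is outside B
    gain_small = F(A | {d}) - F(A)         # gain of adding d to the small set: 3
    gain_large = F(B | {d}) - F(B)         # gain of adding d to the large set: 2
    print(gain_small >= gain_large)        # diminishing returns holds: True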
Objective Function
The complete objective is non-negative, non-normal, non-monotone, and submodular.
Optimizing the Objective
▪ Maximizing a non-monotone submodular function is NP-hard
▪ The Smooth Local Search (SLS) algorithm provides a 2/5 approximation [Feige, Mirrokni, Vondrák; FOCS '07, SIAM J. Comput. '11]
▪ The returned solution is guaranteed to achieve at least 2/5 of the optimal value
Submodular Maximization: Local Search
[Illustrated over several slides: each node corresponds to a candidate rule = (pattern, class) tuple; S and S′ correspond to the intermediate solution sets]
Local Search
▪ Plain local search: ~1/3 approximation
▪ At least 1/3 of the optimal solution
▪ IDS uses a slightly different version of this algorithm:
▪ Smooth local search
▪ 2/5 approximation
Smooth Local Search
Initialization: start from an initial solution set
Estimate each element's marginal gain
If marginal gain > threshold, add the element
If marginal gain < -threshold, remove the element
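A hedged sketch of plain (non-smooth) local search with add/remove moves; smooth local search additionally estimates gains under random perturbations of the current set, which is omitted here. The objective and candidate rules are made up:

    # Toy objective: points covered minus a per-rule cost (non-monotone submodular).
    cover = {"r1": {1, 2, 3}, "r2": {2, 3}, "r3": {4, 5}}

    def F(S):
        covered = set().union(*(cover[r] for r in S)) if S else set()
        return len(covered) - 0.5 * len(S)

    def local_search(candidates, F):
        """Toggle single elements while any move strictly improves the objective."""
        S, improved = set(), True
        while improved:
            improved = False
            for r in candidates:
                T = S ^ {r}               # add r if absent, remove it if present
                if F(T) > F(S):
                    S, improved = T, True
        return S

    print(local_search(set(cover), F))    # e.g. {'r1', 'r3'}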
Evaluation: Datasets
Dataset | # of Datapoints | Features | Classes
Bail Outcomes | 86K | Gender, age, current offense details, past criminal record | No Risk, Failure to Appear, New Criminal Activity
Student Performance | 21K | Gender, age, grades, absence rates & tardiness behavior through grades 6 to 8, suspension/withdrawal history | Graduated on Time, Delayed Graduation, Dropped Out
Medical Diagnosis | 150K | Current ailments, age, BMI, gender, smoking habits, medical history, family history | Asthma, Diabetes, Depression, Lung Cancer, Rare Blood Cancer
Evaluating Predictive Performance
Method | AUC (Bail Data) | AUC (Student Data) | AUC (Medical Data)
Our Approach | 69.78 | 75.12 | 61.19
Bayesian Decision Lists (Letham et al.) | 67.18 | 72.54 | 59.18
Classification Based on Association (Liu et al.) | 70.68 | 76.02 | 63.03
CN2 | 71.02 | 76.36 | 64.78
Decision Trees | 70.08 | 75.31 | 63.28
Gradient Boosted Trees | 71.23 | 77.18 | 64.21
Random Forests | 70.87 | 77.12 | 63.92
Evaluating Goodness of Rules
▪ Results on Medical Diagnosis Data
Method | Fraction of Overlap | Fraction of Data Points Uncovered | Avg. Rule Width | Num. Rules | Fraction of Classes Covered
Our Approach | 0.09 | 0.13 | 3.17 | 12 | 1.00
Bayesian Decision Lists (Letham et al.) | 0.00 | 0.18 | 8.46 | 11 | 0.67
Classification Based on Association (Liu et al.) | 0.00 | 0.14 | 8.60 | 32 | 1.00
CN2 | 0.12 | 0.14 | 9.78 | 38 | 1.00
Ablation Study
▪ Results on Medical Diagnosis Data
Evaluating Interpretability: User Study
▪ Compared our interpretable decision sets to Bayesian Decision Lists (Letham et al.)
▪ Each user is randomly assigned one of the two models
▪ 10 objective and 2 descriptive questions per user
Interface for Objective Questions
Interface for Descriptive Questions
User Study Results
Task | Metric | Our Approach | Bayesian Decision Lists
Descriptive | Human Accuracy | 0.81 | 0.17
Descriptive | Avg. Time Spent (secs.) | 113.4 | 396.86
Descriptive | Avg. # of Words | 31.11 | 120.57
Objective | Human Accuracy | 0.97 | 0.82
Objective | Avg. Time Spent (secs.) | 28.18 | 36.34

Objective questions: 17% more accurate, 22% faster.
Descriptive questions: 74% fewer words, 71% faster.