Inherently Interpretable Models 1 of 2

The document discusses two rule-based machine learning papers: 1. Interpretable Rule Lists (Letham et al.) which introduces Bayesian Rule Lists (BRL) to generate interpretable decision lists from pre-mined rules while maintaining predictive accuracy. 2. Interpretable Rule Sets (Lakkaraju et al.) which proposes Interpretable Decision Sets (IDS) as a framework to optimize for accuracy, interpretability, and class coverage using a novel submodular objective function and optimization algorithm.


Project Proposals

▪ Due next Monday (Feb 13) at 11:59pm ET

▪ 2-page proposal + references (more details in the
  “Course Logistics” document on Canvas)

▪ Today by 5pm ET, we will post:
  ▪ Project topics and some concrete problems
  ▪ Sample proposals and final reports from past iterations
  ▪ LaTeX and Word templates which you will use to write the proposal

1
Office Hours and Paper Presentations

▪ Office hours switch this week
  ▪ Suraj and Jiaqi today
  ▪ Hima on Thursday

▪ Students signed up for presentations next week
  should see us in office hours this week, with:
  ▪ A full slide deck (ideally!)
  ▪ An overview of what you plan to present

2
Rule Based Approaches
Agenda

▪ Paper 1: Interpretable Rule Lists (Letham et al.)

▪ Paper 2: Interpretable Rule Sets (Lakkaraju et al.)

▪ Discussion

4
Interpretable Classifiers Using
Rules and Bayesian Analysis
Benjamin Letham, Cynthia Rudin, Tyler McCormick, David Madigan; 2015
Contributions

▪ Introduces a generative model called Bayesian
  Rule Lists (BRL)
  ▪ Goal is to output a decision list (if / then / else-if)

▪ Novel prior structure to encourage sparsity

▪ Predictive accuracy on par with top algorithms

6
Decision List: Example

This is “an” accurate and interpretable decision list – possibly one of
many such lists

7
Introduction: BRL

▪ Produces a posterior distribution over
  permutations of if / then / else-if rules, drawn from a
  large set of pre-mined rules

▪ Decision lists with high posterior probability tend
  to be both accurate and interpretable
  ▪ The prior favors concise lists with a small number of rules
    and fewer terms on the left-hand side

8
Introduction: BRL

▪ A new type of balance between accuracy,
  interpretability, and computation

▪ What about using other similar models?
  ▪ Decision trees (CART)
    ▪ They employ greedy construction methods
    ▪ Greedy construction is not particularly computationally
      demanding, but it affects the quality of the solution –
      both accuracy and interpretability

9
Pre-mined Rules

▪ A major source of practical feasibility: pre-mined rules
  ▪ Reduces the model space
  ▪ Complexity of the problem depends on the number of
    pre-mined rules

▪ As long as the pre-mined set is expressive, an accurate
  decision list can be found; a smaller model space also
  means better generalization (Vapnik, 1995)

10
Pre-mined Rules: Intuition

Minimum Support = 3

This is the Apriori algorithm. FP-growth is a single-pass algorithm (more efficient).

11
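The pre-mining step above can be sketched with a minimal Apriori-style frequent-itemset miner (a simplified illustration, not the paper's implementation; the transactions below are invented):

```python
def apriori(transactions, min_support):
    """Minimal Apriori: return all itemsets that appear in at least
    `min_support` transactions, growing candidates level by level."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets;
        # any itemset with an infrequent subset can never be frequent
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

With a minimum support of 3 and five transactions over items {a, b, c}, only the frequent itemsets survive as candidate rule antecedents.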
Preliminaries: Notation

▪ Training data: observations and labels (xi, yi), i = 1, …, n

▪ Two labels: stroke or no stroke

12
Bayesian Decision Lists

where

13
Preliminaries: Multinomial

▪ Sampling from a multinomial:

▪ Parameters are probability values

14
Preliminaries: Dirichlet

▪ Dirichlet: sampling over a probability simplex
  ▪ E.g., (0.6, 0.4) is a sample from a 2-dimensional
    Dirichlet distribution

▪ A K-dimensional Dirichlet has K parameters – each can
  be any positive number
  ▪ E.g., Dirichlet(60, 40)

15
Preliminaries: Dirichlet Prior

▪ Conjugate prior for the multinomial distribution

▪ Conjugate prior: the posterior is in the same family as
  the prior

▪ Prior: θ ~ Dirichlet(α)

▪ Posterior: θ | data ~ Dirichlet(α + N), where N is the
  vector of observed label counts

16
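The conjugate update can be illustrated in a few lines of numpy (a minimal sketch; the label counts are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dirichlet prior over a 2-class probability simplex (e.g. stroke / no stroke)
alpha = np.array([1.0, 1.0])      # prior pseudo-counts
theta = rng.dirichlet(alpha)      # one draw: a point on the simplex

# Observed multinomial label counts, e.g. 30 "stroke" vs 70 "no stroke"
counts = np.array([30, 70])

# Conjugacy: the posterior is again Dirichlet, with the counts added on
posterior_alpha = alpha + counts              # Dirichlet(31, 71)
posterior_mean = posterior_alpha / posterior_alpha.sum()
```

The posterior mean (31/102, 71/102) shows how the prior pseudo-counts are simply pooled with the observed counts.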
Bayesian Association Rules

17
Generative Model

Our goal is to sample from the posterior distribution over antecedent lists:

  p(d | x, y, A, α, λ, η) ∝ p(y | x, d, α) · p(d | A, λ, η)

where A is the complete collection of pre-mined antecedents

18
Prior Probabilities

Truncated Poisson prior on the list length m:

  p(m | A, λ) ∝ λ^m / m!,   m ∈ {0, …, |A|}

Truncation ensures that sampled values are within bounds!

It also ensures the expected value is close to λ
when there are a large number of pre-mined rules.

19
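Sampling from the truncated Poisson can be sketched by normalizing the Poisson weights over the truncated support (an illustrative implementation, not the paper's code):

```python
import math
import random

def sample_truncated_poisson(lam, max_value, rng):
    """Draw m in {0, ..., max_value} with p(m) proportional to
    lam**m / m! -- a Poisson truncated to the allowed range."""
    weights = [lam**m / math.factorial(m) for m in range(max_value + 1)]
    u = rng.random() * sum(weights)
    acc = 0.0
    for m, w in enumerate(weights):
        acc += w
        if u <= acc:
            return m
    return max_value  # numerical safety net
```

When the support is large relative to λ, the truncation barely matters, so the sample mean stays close to λ.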
Prior Probabilities

The cardinality of each antecedent follows another truncated Poisson.

Each antecedent is then sampled uniformly from the available
antecedents with the appropriate cardinality.

20
Likelihood

▪ The likelihood is the product of multinomial probability
  mass functions for the observed label counts at
  each rule

▪ Marginalize over the multinomial parameters θ →
  integrate out the intermediate parameter

21
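Integrating θ out gives the standard Dirichlet-multinomial marginal for each rule's label counts; a minimal sketch (written for a fixed ordering of observations, so without the multinomial coefficient):

```python
import math

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log marginal likelihood of label counts with the multinomial
    parameter theta integrated out against a Dirichlet(alpha) prior."""
    n, a0 = sum(counts), sum(alpha)
    out = math.lgamma(a0) - math.lgamma(n + a0)
    for c, a in zip(counts, alpha):
        out += math.lgamma(c + a) - math.lgamma(a)
    return out

def decision_list_loglik(counts_per_rule, alpha):
    """Likelihood of a decision list: a product (sum in log space)
    of the per-rule Dirichlet-multinomial terms."""
    return sum(dirichlet_multinomial_logpmf(c, alpha)
               for c in counts_per_rule)
```

As a sanity check: with a single observation and a uniform Dirichlet(1, 1) prior, either label has marginal probability 1/2.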
Markov Chain Monte Carlo

▪ Generate a chain of random samples until
  convergence

▪ Each random sample is a stepping stone for the
  next one (chain)

▪ New samples do not depend on any samples
  before the previous one (Markov)

22
Markov Chain Monte Carlo

▪ How to get from the current dt toward the (optimal) d*:
  ▪ Move an antecedent to a different position in the list
  ▪ Add an antecedent that is not currently in the list
  ▪ Remove an antecedent from the list

23
Metropolis Hastings

▪ Start with a random decision list

▪ Choose a move based on a “proposal distribution” Q

▪ After choosing the move, compute an
  acceptance probability A

▪ Generate a random number u ~ Uniform(0, 1)

▪ If u ≤ A, accept the move; otherwise reject it

24
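The loop above is ordinary Metropolis-Hastings; a generic sketch in log space (the decision-list-specific moves are abstracted into `propose`, a hypothetical callback):

```python
import math
import random

def metropolis_hastings(init, propose, log_post, n_steps, rng):
    """Generic Metropolis-Hastings. `propose(d)` returns
    (d_new, log_q_fwd, log_q_bwd); `log_post` is the unnormalized
    log posterior. For symmetric proposals the q terms cancel."""
    d = init
    samples = []
    for _ in range(n_steps):
        d_new, log_q_fwd, log_q_bwd = propose(d)
        # A = min(1, [p(d') q(d|d')] / [p(d) q(d'|d)]), computed in logs
        log_accept = (log_post(d_new) - log_post(d)) + (log_q_bwd - log_q_fwd)
        if rng.random() <= math.exp(min(0.0, log_accept)):
            d = d_new            # accept the proposed move
        samples.append(d)        # otherwise keep the current state
    return samples
```

A toy target over {0, …, 9} with p(d) ∝ e^(-d) and a uniform (symmetric) proposal converges to a sample mean near 0.58.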
Metropolis Hastings

25
Proposal Probabilities

▪ The move type is chosen uniformly

▪ Which antecedent to move/add/remove, and its new
  position, are also chosen uniformly

26
Estimating label of a new observation

Match an antecedent against the feature values of the new observation; the first matching rule determines the predicted label distribution

27
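Prediction with a fitted decision list is a simple top-down scan; a hypothetical sketch (the features and probabilities below are invented, loosely echoing the stroke example):

```python
def predict(decision_list, default_probs, x):
    """Apply an if / then / else-if decision list to observation x.
    Each rule is (antecedent, label_probs); an antecedent is a dict of
    feature -> required value. The first rule whose conditions all hold
    determines the predicted label distribution."""
    for antecedent, label_probs in decision_list:
        if all(x.get(feat) == val for feat, val in antecedent.items()):
            return label_probs
    return default_probs  # the final "else" rule

# Invented two-rule list over binary features
d = [
    ({"hemiplegia": 1, "age_over_60": 1}, {"stroke": 0.58, "no_stroke": 0.42}),
    ({"cerebrovascular_disorder": 1},     {"stroke": 0.47, "no_stroke": 0.53}),
]
default = {"stroke": 0.11, "no_stroke": 0.89}
```

An observation that matches no antecedent falls through to the default rule at the bottom of the list.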
Tic-Tac-Toe

5-fold cross-validation; accuracy computed across the 5 folds

29
Stroke Prediction

▪ N = 12,586; 14% had a stroke
  ▪ 6,000 times larger than the data used for the CHADS2 score
  ▪ Pre-mining: minimum support of 10% and max cardinality of 2
  ▪ 5-fold evaluation

30
Stroke Prediction

31
Stroke Prediction - AUC

33
Interpretable Decision Sets
Hima Lakkaraju, Stephen Bach, Jure Leskovec; 2016
Contributions

▪ A framework called Interpretable Decision Sets
  (IDS) for classification

▪ A novel objective function + proof of its submodularity

▪ An optimization procedure with optimality guarantees

▪ Detailed metrics for evaluating interpretability +
  user studies

35
Motivation

▪ Traditional classification models optimize for predictive
  accuracy

▪ This leaves very little understanding of the model itself
  and its predictions

▪ A model being “readable” is not enough

▪ Humans should be able to reason about predictions
  and readily explain the functionality of the model

36
Decision Sets

38
Criteria for Interpretability

▪ Parsimony: fewer rules with fewer conditions
  ▪ Respects the cognitive limits of human understanding

▪ Distinctness: minimal overlap of rules w.r.t. the data
  points they cover
  ▪ No redundant or contradictory explanations of data
    points

▪ Class Coverage: explain all the classes in the data
  ▪ Rules explaining minority classes are important

39
Problem Formulation

40
Desiderata

▪ We need to optimize for the following criteria:
  ▪ Recall
  ▪ Precision
  ▪ Distinctness
  ▪ Parsimony
  ▪ Class Coverage

▪ Recall and Precision → accurate predictions

▪ Distinctness, Parsimony, and Class Coverage
  → interpretability

41
Objective Function

42
Submodularity

Diminishing returns characterization: for a small set A contained
in a large set B,

  F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

The gain of adding d to the small set A is at least the gain of
adding d to the large set B.

A non-negative linear combination of submodular functions is
submodular.
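Coverage-style objectives illustrate the property; a small sketch with invented rules, checking diminishing returns numerically:

```python
def coverage(rules, points_covered):
    """Number of distinct data points covered by a set of rules --
    a classic example of a submodular function."""
    covered = set()
    for r in rules:
        covered |= points_covered[r]
    return len(covered)

# Hypothetical rules mapped to the ids of the points they cover
points_covered = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {4, 5, 6}}

def gain(rules, d):
    """Marginal gain of adding rule d to the set `rules`."""
    return (coverage(rules | {d}, points_covered)
            - coverage(rules, points_covered))
```

Adding `r3` covers three new points given only `r1`, but just two given `r1` and `r2`: the gain shrinks as the set grows.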
Objective Function

The complete objective is non-negative, non-normal,
non-monotone, and submodular.

49
Optimizing the Objective

▪ Maximizing a non-monotone submodular function
  is NP-hard

▪ The Smooth Local Search (SLS) algorithm provides a
  2/5 approximation [Feige, Mirrokni, Vondrák; FOCS
  ’07, SIAM J. Comput. ’11]
  ▪ The solution will be at least 2/5 of the optimal value

50
Submodular Maximization: Local Search

Each node here corresponds to a candidate
rule = (pattern, class) tuple

S and S’ correspond to the intermediate solution sets
51
Local Search

▪ Plain local search gives a ~1/3 approximation
  ▪ At least 1/3 of the optimal solution

▪ IDS uses a slightly different version of this algorithm:
  ▪ Smooth local search
  ▪ 2/5 approximation

58
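A simplified sketch of plain local search for submodular maximization (the analyzed algorithm only moves on improvements above a (1 + ε/n²) factor; this version uses a bare positive-gain test, and the coverage-minus-size-penalty objective is invented for illustration):

```python
def local_search(elements, F, eps=1e-9):
    """Local search: add any element with positive marginal gain,
    then drop any element whose removal increases F; repeat until
    no single add/remove move improves the objective."""
    S = set()
    improved = True
    while improved:
        improved = False
        for d in elements:                      # try additions
            if d not in S and F(S | {d}) - F(S) > eps:
                S.add(d)
                improved = True
        for d in list(S):                       # try deletions
            if F(S - {d}) - F(S) > eps:
                S.discard(d)
                improved = True
    return S

# Invented objective: points covered minus a size penalty (submodular)
points_covered = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {4, 5, 6}}

def F(S):
    covered = set().union(*(points_covered[r] for r in S)) if S else set()
    return len(covered) - 0.5 * len(S)
```

On this toy instance the search adds all three rules, then notices that dropping `r2` loses no coverage and saves its size penalty.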
Smooth Local Search

Initialization

Estimate the marginal gain of each element

If the marginal gain > threshold, add the element

If the marginal gain < threshold, remove the element

59
Evaluation: Datasets

Dataset             | # of datapoints | Features                                                                                                     | Classes
Bail Outcomes       | 86K             | Gender, age, current offense details, past criminal record                                                   | No Risk, Failure to Appear, New Criminal Activity
Student Performance | 21K             | Gender, age, grades, absence rates & tardiness behavior through grades 6 to 8, suspension/withdrawal history | Graduated on Time, Delayed Graduation, Dropped Out
Medical Diagnosis   | 150K            | Current ailments, age, BMI, gender, smoking habits, medical history, family history                          | Asthma, Diabetes, Depression, Lung Cancer, Rare Blood Cancer

60
Evaluating Predictive Performance

Method                                           | AUC (Bail Data) | AUC (Student Data) | AUC (Medical Data)
Our Approach                                     | 69.78           | 75.12              | 61.19
Bayesian Decision Lists (Letham et al.)          | 67.18           | 72.54              | 59.18
Classification Based on Association (Liu et al.) | 70.68           | 76.02              | 63.03
CN2                                              | 71.02           | 76.36              | 64.78
Decision Trees                                   | 70.08           | 75.31              | 63.28
Gradient Boosted Trees                           | 71.23           | 77.18              | 64.21
Random Forests                                   | 70.87           | 77.12              | 63.92

62
Evaluating Goodness of Rules

▪ Results on Medical Diagnosis Data


Method                                           | Fraction of Overlap | Fraction of Data Points Uncovered | Avg. Rule Width | Num. Rules | Fraction of Classes Covered
Our Approach                                     | 0.09                | 0.13                              | 3.17            | 12         | 1.00
Bayesian Decision Lists (Letham et al.)          | 0.00                | 0.18                              | 8.46            | 11         | 0.67
Classification Based on Association (Liu et al.) | 0.00                | 0.14                              | 8.60            | 32         | 1.00
CN2                                              | 0.12                | 0.14                              | 9.78            | 38         | 1.00

63
Ablation Study

▪ Results on Medical Diagnosis Data

64
Evaluating Interpretability:
User Study

▪ Compared our interpretable decision sets to
  Bayesian Decision Lists (Letham et al.)

▪ Each user is randomly assigned one of the two
  models

▪ 10 objective and 2 descriptive questions per user

65
Interface for Objective Questions

66
Interface for Descriptive Questions

67
User Study Results

Task        | Metric                  | Our Approach | Bayesian Decision Lists
Descriptive | Human Accuracy          | 0.81         | 0.17
Descriptive | Avg. Time Spent (secs.) | 113.4        | 396.86
Descriptive | Avg. # of Words         | 31.11        | 120.57
Objective   | Human Accuracy          | 0.97         | 0.82
Objective   | Avg. Time Spent (secs.) | 28.18        | 36.34

Objective questions: 17% more accurate, 22% faster;
Descriptive questions: 74% fewer words, 71% faster.
68
