Objective: Segmentation
Decision Tree
• The goal is to create a model that predicts the value of a target variable
based on several input variables.
• Each interior node corresponds to one of the input variables; there are
edges to children for each of the possible values of that input variable.
• Each leaf represents a value of the target variable given the values of
the input variables represented by the path from the root to the leaf.
• Classification tree analysis – the predicted target is a class label
• Regression tree analysis – the predicted target is a real number
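To make the structure concrete, here is a minimal sketch of fitting a classification tree with scikit-learn (one of the tools listed later); the iris data set is used purely for illustration:

```python
# Minimal sketch: fitting a classification tree with scikit-learn
# (assumes scikit-learn is installed; iris is used purely for illustration).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Each interior node tests one input variable; each leaf predicts a class label.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned root-to-leaf rules as text.
print(export_text(clf, feature_names=load_iris().feature_names))
```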
Decision Tree Construction
Specific decision-tree algorithms:
ID3 (Iterative Dichotomiser 3)
C4.5 (successor of ID3)
CART (Classification And Regression Tree)
CHAID (CHI-squared Automatic Interaction Detector). Performs multi-
level splits
MARS: extends decision trees to handle numerical data better.
Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting
Advantages of Decision Trees
• Simple to understand and interpret
• Requires little data preparation
• Able to handle both numerical and categorical data
• Uses a white box model
• Possible to validate a model using statistical tests
• Robust
• Performs well with large datasets
Tools to construct Decision Trees
Salford Systems CART (which licensed the proprietary code of the original CART
authors)
IBM SPSS Modeler
Rapid Miner
SAS Enterprise Miner
Matlab
R (an open-source software environment for statistical computing which includes
several CART implementations such as the rpart, party and randomForest packages)
Weka (a free and open-source data mining suite, contains many decision tree
algorithms)
Orange (a free data mining software suite, which includes the tree module orngTree)
KNIME
Microsoft SQL Server
Scikit-learn
CHAID: CHI-squared Automatic Interaction
Detector
• Morgan and Sonquist (1963): AID – Automatic Interaction Detection
• Stepwise splitting
• One split of k categories into two groups – 2^(k−1) − 1 possible splits (checked in the short sketch after this list)
• Kass (1980) proposed a modification of AID called CHAID for categorized dependent and independent variables.
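A quick way to sanity-check that split count (a plain-Python sketch, not part of AID/CHAID itself):

```python
# Quick check of the split count used by AID: the number of ways to divide
# k categories into two non-empty groups is 2**(k - 1) - 1.
from itertools import combinations

def binary_splits(categories):
    """Enumerate all splits of a category set into two non-empty groups."""
    cats = list(categories)
    splits = []
    # Fix the first category on the left-hand side to avoid counting mirror images twice.
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            left = [cats[0], *subset]
            right = [c for c in cats if c not in left]
            if right:                      # both groups must be non-empty
                splits.append((left, right))
    return splits

k = 4
print(len(binary_splits(range(k))), 2 ** (k - 1) - 1)   # both print 7
```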
Key features
1. Categorical Variables: designed for categorical (or categorized) targets and predictors
2. Chi-squared Test: used to determine the best split at each node
3. Multiple Branches: a node may be split into more than two children
4. Merging Categories: categories of a variable are merged if no statistically significant difference is found between them
5. Stopping Criteria: no statistically significant splits are found, or the minimum node size is reached
6. Applications: marketing (segmenting customers), risk assessment, predicting response rates, etc.
7. Visualization: the tree structure highlights the hierarchy of significant variables that lead to different segments
Algorithm – Step 1
Dividing the cases that reach a certain node in the
tree
1. Cross-tabulate the response variable (target) with each of the explanatory variables. For example:

            Gender=Male   Gender=Female
   Yes           12              0
   No             1             13
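For this cross-tab, the Pearson chi-squared statistic can be computed directly; the sketch below assumes SciPy is available, though any chi-squared routine would do:

```python
# Sketch of step 1 for the cross-tab shown above, using SciPy's chi-squared
# test of independence.
import numpy as np
from scipy.stats import chi2_contingency

#                 Gender=Male  Gender=Female
table = np.array([[12,          0],           # response = Yes
                  [ 1,         13]])          # response = No

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p-value = {p:.4g}")  # very small p-value: Gender is a strong splitter
```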
Algorithm – Step 2
2. When there are more than two columns, find the "best" subtable formed by combining column categories.
2.1 This is applied to each table with more than 2 columns.
2.2 Compute Pearson χ² tests of independence for each allowable subtable.
2.3 Look for the smallest χ² value. If it is not significant, combine the column categories.
2.4 Repeat step 2 if the new table has more than two columns.
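A rough sketch of this merging loop, assuming the node's data has already been cross-tabulated into a responses-by-categories array and that SciPy is available (merge_columns is an illustrative helper name, not a standard API):

```python
# Rough sketch of step 2: repeatedly merge the pair of columns whose 2-column
# subtable has the least significant Pearson chi-squared statistic.
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

def merge_columns(table, labels, alpha_merge=0.05):
    """table: 2-D array (responses x categories); labels: column category names."""
    table = np.asarray(table, dtype=float)
    labels = [[l] for l in labels]                  # each column starts as its own group
    while table.shape[1] > 2:
        best = None
        for i, j in combinations(range(table.shape[1]), 2):
            sub = table[:, [i, j]]                  # allowable 2-column subtable
            chi2, p, _, _ = chi2_contingency(sub, correction=False)
            if best is None or chi2 < best[0]:
                best = (chi2, p, i, j)
        chi2, p, i, j = best
        if p < alpha_merge:                         # smallest chi2 is still significant: stop merging
            break
        table[:, i] += table[:, j]                  # combine the two column categories
        table = np.delete(table, j, axis=1)
        labels[i] += labels[j]
        del labels[j]
    return table, labels
```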
Algorithm – Step 3
3. Allow categories combined at step 2 to be broken apart.
3.1 For each compound category consisting of at least 3 of the original categories, find the "most significant" binary split.
3.2 If χ² is significant, implement the split and return to step 2.
3.3 Otherwise retain the compound categories for this variable, and move on to the next variable.
Algorithm – Step 4
4. You have now completed the "optimal" combining of categories for each explanatory variable.
4.1 Compute a Bonferroni-adjusted chi-squared test of independence for the reduced table of each explanatory variable.
4.2 Find the most significant of these "optimally" merged explanatory variables.
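One common form of the adjustment (an assumption here, following Kass's multiplier for nominal predictors) multiplies the raw p-value by the number of ways the original categories could have been reduced to the merged groups:

```python
# Sketch of the Bonferroni adjustment in step 4, assuming Kass's multiplier for a
# nominal ("free") predictor: the number of ways to reduce c original categories
# to r merged groups, i.e. the Stirling number of the second kind S(c, r).
from math import comb, factorial

def stirling2(c, r):
    """Number of ways to partition c categories into r non-empty groups."""
    return sum((-1) ** i * comb(r, i) * (r - i) ** c for i in range(r + 1)) // factorial(r)

def bonferroni_adjusted_p(p_value, c_original, r_merged):
    """Multiply the raw p-value by the number of possible category reductions."""
    multiplier = stirling2(c_original, r_merged)
    return min(1.0, multiplier * p_value)

# Example: a predictor with 5 original categories merged down to 3 groups.
print(stirling2(5, 3))                       # 25 possible reductions
print(bonferroni_adjusted_p(0.001, 5, 3))    # 0.025
```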
Algorithm – Step 5
5. Use the "most significant" variable from step 4 to split the node with respect to the merged categories for that variable.
5.1 Repeat steps 1–5 for each of the offspring nodes.
5.2 Stop if
• no variable is significant in step 4, or
• the number of cases reaching a node is below a specified limit.
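Putting steps 1–5 together, a highly simplified skeleton of the recursion might look as follows; adjusted_p and split_by_merged_groups are hypothetical placeholders for the operations described above, not a real library:

```python
# Highly simplified skeleton of the CHAID recursion (sketch only; the helper
# names are hypothetical placeholders for the steps described above).
ALPHA_SPLIT = 0.05       # significance level for splitting
MIN_NODE_SIZE = 30       # minimum number of cases in a node

def grow_chaid(cases, target, predictors):
    if len(cases) < MIN_NODE_SIZE:
        return {"leaf": True, "cases": cases}                 # stopping rule: node too small

    # Steps 1-4: optimally merge the categories of every predictor and keep its
    # Bonferroni-adjusted p-value for the reduced cross-tab with the target.
    candidates = [(adjusted_p(cases, target, x), x) for x in predictors]
    best_p, best_x = min(candidates)

    if best_p > ALPHA_SPLIT:
        return {"leaf": True, "cases": cases}                 # stopping rule: nothing significant

    # Step 5: split on the merged categories of the most significant predictor
    # and grow each offspring node recursively.
    children = {group: grow_chaid(subset, target, predictors)
                for group, subset in split_by_merged_groups(cases, best_x)}
    return {"leaf": False, "split_on": best_x, "children": children}
```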
CART – Classification and Regression Tree
• The CART algorithm was introduced in Breiman et al. (1984).
• A CART tree is a binary decision tree that is constructed by splitting a
node into two child nodes repeatedly, beginning with the root node
that contains the whole learning sample.
• The CART growing method attempts to maximize within-node
homogeneity.
• Gini Index – impurity measure
Gini Index
• Another sensible measure of impurity (i and j are classes):
  Gini = Σ_{i≠j} p(i) · p(j) = 1 − Σ_i p(i)²
• After applying attribute A, the resulting Gini index is the size-weighted sum over the partitions S_v induced by A:
  Gini_A = Σ_v (|S_v| / |S|) · Gini(S_v)
• Gini can be interpreted as the expected error rate
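The two formulas translate directly into code; the class counts below are hypothetical, chosen only to illustrate the calculation:

```python
# Direct implementation of the two formulas above (plain-Python sketch).
def gini(labels):
    """Gini impurity of a node: 1 - sum_i p(i)^2."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def gini_after_split(partitions):
    """Gini index after applying an attribute: the size-weighted sum of the
    Gini impurities of the resulting partitions S_v."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)

node = ["triangle"] * 5 + ["square"] * 9            # hypothetical class counts
by_color = [["triangle"] * 3 + ["square"] * 2,      # hypothetical split, e.g. by color
            ["triangle"] * 2 + ["square"] * 3,
            ["square"] * 4]
print(gini(node))                                   # impurity before the split
print(gini(node) - gini_after_split(by_color))      # gain of the Gini index
```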
Gini Index – Example
Attributes: color, border, dot
Classification: triangle, square
[Figure of the sample objects (triangles and squares) omitted]
Gini Index for Color
[Figure omitted: the sample split by Color into red, green, and yellow branches]
Gain of Gini Index
GiniGain(A) = Gini(S) − Gini_A(S)
[Worked computation for the example omitted]
Regression Tree
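A regression tree predicts a real number at each leaf (with squared-error splitting, the mean of the training cases that reach it). A minimal scikit-learn sketch on illustrative data:

```python
# Minimal regression tree sketch: each leaf predicts the mean target value of
# the training cases that reach it (illustrative synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)   # noisy 1-D target

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[2.5], [7.5]]))                      # piecewise-constant predictions
```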
Overfitting – Pruning
• In order to fit the data (even noisy data), the model
keeps generating new nodes and ultimately the tree
becomes too complex to interpret.
Pre- vs Post-pruning
• Pre-pruning – stop the tree from growing too far by setting hyperparameters such as max_depth, min_samples_leaf, and min_samples_split
• Post-pruning – grow the full tree and then prune it back; this may slightly increase the training error but drastically decrease the testing error
Tree Score = SSR + alpha * T, where alpha is a tuning parameter chosen by cross-validation, SSR is the sum of squared residuals, and T is the number of leaves (so alpha * T acts as the tree complexity penalty)
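Both pruning styles can be sketched with scikit-learn (parameter values and the diabetes data set are illustrative assumptions); ccp_alpha plays the role of alpha in the tree score above and is chosen by cross-validation:

```python
# Sketch of pre- and post-pruning with scikit-learn (illustrative values only).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Pre-pruning: cap the tree while it is being grown.
pre = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, min_samples_split=20).fit(X, y)

# Post-pruning: compute the candidate alphas of minimal cost-complexity pruning
# (the slides' tree score: SSR + alpha * T) and pick alpha by cross-validation.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)        # guard against tiny negative values
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"ccp_alpha": list(alphas)}, cv=5)
search.fit(X, y)
print(search.best_params_)
```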