Feature Impact for Prediction Explanation
Mohammad Bataineh
Humana Inc.
Abstract. Companies across the globe have been adopting complex Machine Learning (ML) techniques to build advanced predictive models to improve their operations and services and to help in decision making. While these ML techniques are extremely powerful and have found success in different industries for helping with decision making, a common piece of feedback heard across many industries worldwide is that too often these techniques are opaque in nature, with no details as to why a particular prediction probability was reached. This work presents an innovative algorithm that addresses this limitation by providing a ranked list of all features according to their contribution to a model's prediction. This new algorithm, Feature Impact for Prediction Explanation (FIPE), incorporates individual feature variations and correlations to calculate feature impact for a prediction. The true power of FIPE lies in its computationally efficient ability to provide feature impact irrespective of the base ML technique used.
1 Introduction
Machine learning (ML) algorithms provide new milestones for a company's success by providing future prediction and forecasting for various metrics and events, enabling better and optimized outcomes. Simple algorithms, like logistic regression and decision trees, have architectures that make it easy to interpret and explain the reasons behind the algorithms' predictions. On the other hand, with the technological advancement in computing power and big-data sources, more complex ML algorithms have become dominant over these simple algorithms in practical applications. Techniques like multilayer perceptron neural networks, deep learning tools, random forests, and gradient-boosted trees usually outperform the simple ML algorithms. The downside is that these more complex algorithms are not interpretable when it comes to explaining individual predictions. This in turn makes it difficult for the users of these models to take personalized actions based on the individual predictions. For example, a model might be used to predict patients who might be non-compliant with their medications. While the model can provide a list of these patients, it cannot provide any indication of why these individuals are non-compliant with their medications. Non-compliance could be due to a variety of reasons, for example financial difficulty paying for expensive medications or lacking transportation to pick up the medications. These different burdens require alternative actions to be taken by the model users.
Through examination of the literature, a clear gap is apparent: there is a critical need for an efficient tool to interpret these complex model predictions in important fields like healthcare and finance, among others [1-2]. The effort in model interpretation, however, has been recent and limited. One of the earliest proposed methods to solve this challenging problem is the permutation-based interpretation method called individual conditional expectation (ICE) [3]. ICE creates variants of an instance by replacing a feature's value with values from a grid and makes new predictions using the original model. Lundberg [4] proposed another approach that calculates the marginal contribution of a feature value across all possible options; all contributions are then averaged using Shapley values. Both ICE and Shapley values require expensive run times when there are many features. In addition, both techniques can produce invalid local interpretations when features are correlated.
Other approaches for black-box model interpretation are local surrogate models like local interpretable model-agnostic explanations (LIME), which builds a logistic regression to represent each observation's predictive value [5]. Another derivation of the surrogate models uses decision trees to represent the predictive values [6]. These surrogate methods suffer from serious limitations: they are computationally expensive, since each observation must be represented in a new model; they oversimplify the model by assuming all features have simple relationships at the local level; and their explanations are unstable due to heuristic settings and the variable sampling process [1, 7-8]. Evaluations of the majority of these and other similar methods, along with their limitations, are detailed in Guidotti [9]. There are other approaches that are designed for global model interpretation [10-13], but those are beyond the scope of this work since they are not capable of providing an individual explanation for each observation.
Besides the fact that use of these model interpretation algorithms burdens the user with severe limitations, the algorithms are neither mature nor widely used in industrial applications. In real-world situations where the model has thousands of features to be evaluated, some algorithms like ICE and LIME run into memory and run-time issues. Moreover, other algorithms are designed for scenarios that are too specific and crafted with simplistic assumptions. Therefore, this work was motivated by the fact that there are still tremendous areas of improvement needed in regard to model interpretation. This is especially the case when it comes to real-world applications with flexible settings, and ML models with large numbers of features.

To address these issues, we introduce in this work a novel algorithm that can be used to interpret ML modeling by providing the top predictors that are driving the model prediction for each observation. The new algorithm is called Feature Impact for Prediction Explanation (FIPE). The FIPE algorithm can be used with any ML model with superior credibility because: (1) it preserves and uses the original built model to rank top predictors and interpret the predictions; (2) it runs relatively fast without running into memory issues or implementation obstacles, which makes it scalable to any application; (3) it produces a local prediction interpretation for each observation; and (4) it is proven to have broad representation of all predictors in the model without any biases or advance assumptions and settings that might limit or change the number of represented predictors.
The hypothesis for the FIPE algorithm is built around the fact that an informative feature in a model is one that has varying values and is correlated to the target (i.e., output), so that the variation in the model predictions is caused by these informative features. The greatest impact on a model prediction is produced by the features with extreme values relative to each other. Essentially, the features with extremely low or high values relative to their median/mode drive the resulting score, along with their importance/weight in the model architecture. The impact of a feature on a model prediction can therefore be estimated by evaluating how the prediction changes when that impact is removed. The most straightforward way to do this is to set the feature value back to its normal value, like the median for numeric features and the mode for categorical features. In most practical ML problems, features are somewhat correlated to each other. They all contribute cohesively to the model predictions. Thus, whenever a feature impact is evaluated, the correlated impact with other features should also be considered.

The FIPE algorithm was developed to calculate the feature impact using two aspects: the impact of the individual feature on the model, and the correlated impact resulting from being associated with other features. Eventually, the FIPE algorithm calculates the total feature impact as the sum of both impacts, based on the changes they cause in the model prediction. We have summarized the FIPE algorithm in the following steps:
Step 1: Each input (i.e., each observation, which is called an ID in this context) is scored using the original ML predictive model. The resulting predicted score is called 𝑆𝑜.
Step 2: Calculate feature statistics. The median value is calculated for each numeric feature. The mode value is calculated for each categorical feature.
Step 3: Within each observation, each input feature x is used to create two new scores using the original model as follows:
𝑆𝑥: Predicted score when only feature x is set to its median/mode value while all other input features are kept at their original values.
𝑆𝑐: Predicted score when all input features except x are set to their median/mode values. Feature x is kept at its original value.
Step 4: Repeat Step 3 for all features within the same observation. Then, repeat the process for all observations.
Step 5: Create one additional score that is the result of setting all features to their median/mode values (called 𝑆𝑚). This single score is a common score that will be used for all observations.
Step 6: Calculate the net impact of feature x (𝑌𝑥) using Equation 1. The impact 𝑌𝑥 consists of the impact sign (𝑈𝑥) (Equation 2) multiplied by the absolute impact value. The impact value comes from the individual impact of feature x and its correlated impact with all other features.
𝑌𝑥 = 𝑈𝑥 × | (|𝑆𝑥 − 𝑆𝑜| + |(𝑆𝑚 − 𝑆𝑜) − (𝑆𝑐 − 𝑆𝑜)|) / 𝑆𝑚 | = 𝑈𝑥 × | (|𝑆𝑥 − 𝑆𝑜| + |𝑆𝑚 − 𝑆𝑐|) / 𝑆𝑚 |    (1)
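To make Steps 1-5 and the magnitude part of Equation 1 concrete, the following is a minimal sketch, assuming a fitted scikit-learn-style binary classifier `model` exposing `predict_proba` and a pandas DataFrame `X` of the observations to explain; the function names (`fipe_scores`, `fipe_magnitude`) and the column-list arguments are illustrative, not part of the original description.

```python
# Minimal sketch of FIPE Steps 1-5 plus the unsigned part of Equation 1.
import numpy as np
import pandas as pd

def fipe_scores(model, X, numeric_cols, categorical_cols):
    """Return S_o, S_m, and for every feature x the score pair (S_x, S_c)."""
    score = lambda df: model.predict_proba(df)[:, 1]     # raw prediction in [0, 1]

    # Step 2: feature statistics (median for numeric, mode for categorical features).
    baseline = {c: X[c].median() for c in numeric_cols}
    baseline.update({c: X[c].mode().iloc[0] for c in categorical_cols})

    # Step 1: original predicted score S_o for each observation.
    s_o = score(X)

    # Step 5: one common score S_m with every feature at its median/mode.
    s_m = score(pd.DataFrame([baseline])[list(X.columns)])[0]

    # Steps 3-4: for each feature x, build S_x and S_c for all observations.
    per_feature = {}
    for col in X.columns:
        X_x = X.copy()
        X_x[col] = baseline[col]                 # only x set to its median/mode
        X_c = X.copy()
        for other in X.columns:
            if other != col:
                X_c[other] = baseline[other]     # everything except x set to median/mode
        per_feature[col] = (score(X_x), score(X_c))
    return s_o, s_m, per_feature

def fipe_magnitude(s_o, s_m, s_x, s_c):
    """Unsigned part of Equation 1: |(|S_x - S_o| + |S_m - S_c|) / S_m|."""
    return np.abs((np.abs(s_x - s_o) + np.abs(s_m - s_c)) / s_m)
```

The signed net impact 𝑌𝑥 is obtained by multiplying this magnitude by the sign 𝑈𝑥, which is discussed below.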
In order to communicate the underlying mathematical logic for the FIPE algorithm presented in the above steps, we demonstrate how the impact of feature x is calculated in Equation 1, as follows. Figure 1 represents the score distribution for a hypothetical ML predictive model. In Figure 1, the individual impact for feature x is obtained in (A), which explains the first part of Equation 1. The correlated impact for feature x is obtained in (B), which explains the second part of Equation 1. Then, the final net impact is normalized by dividing it by 𝑆𝑚, which is the result of setting all features to their median/mode values, allowing all feature impacts to be evaluated relative to each other. Both impacts A and B represent isolating the entire effect of feature x on the original predicted score 𝑆𝑜 by eliminating the features' extreme values through adjusting them to their median/mode.
Fig. 1. Score distribution for hypothetical predictive model along with representation for the
FIPE feature impact portions and their associated scores
With respect to the impact sign 𝑈𝑥 presented in Equation 2, the larger impact portion between A and B determines the final sign of the net impact. If A > B, then feature x has a negative impact only when 𝑆𝑥 ≥ 𝑆𝑜. This means that removing the impact of feature x, which is represented by the score 𝑆𝑥, leads to a higher score compared to the original score 𝑆𝑜. This in turn indicates that having feature x at its current value reduces the score 𝑆𝑜 from what it would be when x has a normal value (i.e., set to its median/mode value). Using the same logic, when B > A, the impact of feature x is negative when 𝑆𝑚 ≥ 𝑆𝑐.
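A minimal sketch of this sign rule follows; the handling of the tie A = B is an assumption on my part, since only the strict cases are described above.

```python
# Sign U_x per the rule described above: the larger of the two impact portions
# A = |S_x - S_o| and B = |S_m - S_c| decides the sign.
import numpy as np

def fipe_sign(s_o, s_m, s_x, s_c):
    a = np.abs(s_x - s_o)                              # individual impact portion (A)
    b = np.abs(s_m - s_c)                              # correlated impact portion (B)
    return np.where(a > b,
                    np.where(s_x >= s_o, -1.0, 1.0),   # A > B: negative when S_x >= S_o
                    np.where(s_m >= s_c, -1.0, 1.0))   # otherwise: negative when S_m >= S_c
```

Combined with the `fipe_magnitude` sketch above, the net impact is simply `fipe_sign(...) * fipe_magnitude(...)`.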
The feature impact calculations are repeated over all features for the same observation. Therefore, each feature has its own normalized impact that can be compared relative to other features for the same observation. Features can then be ranked by positive, negative, or absolute impact. To make the feature impact values more interpretable to end users, the feature impacts are normalized by dividing each feature impact 𝑌𝑥 by the maximum feature impact for the same observation, as in the sketch below. In the next section we provide a few examples of the FIPE implementation and calculate the feature impacts.
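A minimal sketch of this per-observation normalization and ranking, assuming the impacts 𝑌𝑥 have been collected into a pandas DataFrame (rows are observations, columns are features) and assuming "maximum feature impact" means the maximum absolute impact:

```python
# Normalize each observation's impacts by its largest absolute impact and
# rank features by absolute normalized impact (rank 1 = largest).
def normalize_and_rank(impacts):
    normalized = impacts.div(impacts.abs().max(axis=1), axis=0)
    ranks = normalized.abs().rank(axis=1, ascending=False, method="first").astype(int)
    return normalized, ranks
```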
3 Examples
Like many other fields, the healthcare field is full of ML applications, from disease or diagnosis prediction to identifying members with a high risk of an emergency department (ED) visit or inpatient admission. Regardless of the application, it becomes critical to know why the model produced a high prediction for someone, enabling those making use of the model (care manager, nurse, social worker, etc.) to provide impactful interventions and obtain better engagement with patients. In this section, we provide a simplified example using healthcare data to examine the FIPE implementation. In this example, we have a hypothetical model with three inputs and a single classification output. The model inputs are the person's age, ED visits in the past year, and utilization count in the past year (doctors' visits). The model output is simply a raw prediction between 0 and 1 for a classification output (i.e., target).

Table 1 presents sample predicted results (𝑆𝑜) for three patients, which have been anonymized to patient identifications (IDs). For the remainder of this section, we will calculate and rank the impacts of the three features for ID number 1 only.
Table 1. Sample of three IDs with their input values along with their model predictions
Following the FIPE steps detailed in the previous section, the next step is to get the resulting score from setting all inputs to their median/mode values. In this case, the median values for the entire evaluated population are provided in Table 2, which also provides the resulting score (𝑆𝑚).
Table 2. The median values for the model's three inputs and the resulting prediction
Now, each individual feature is used to create two additional scores derived from new settings of the feature values. The new observations along with their scores are provided in Table 3. The table includes two additional columns: (1) the column "Feature Name" flags the feature used to create the new score, and (2) the column "Score Tag" gives the type of resulting score for the evaluated feature. As noted, all evaluated features in Table 3 are for the same single ID.
Table 3. The new observations for the three features and their resulting scores
Based on the different resulting scores, the feature impacts are estimated using Equations 1 and 2. The detailed calculations for each of the three features are provided below:
𝑌𝑥(Age) = −1 × | (|0.91 − 0.95| + |0.2 − 0.12|) / 0.2 | = −1 × | (0.04 + 0.08) / 0.2 | = −0.6
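As a quick check, plugging the four scores read off this calculation (𝑆𝑜 = 0.95, 𝑆𝑥 = 0.91, 𝑆𝑚 = 0.2, 𝑆𝑐 = 0.12) into the illustrative helpers sketched earlier reproduces the Age impact:

```python
# Reproduce the "Age" impact with the illustrative helpers sketched earlier.
s_o, s_x, s_m, s_c = 0.95, 0.91, 0.20, 0.12
y_age = float(fipe_sign(s_o, s_m, s_x, s_c) * fipe_magnitude(s_o, s_m, s_x, s_c))
print(y_age)   # approximately -0.6 (B = 0.08 > A = 0.04 and S_m >= S_c, so the sign is negative)
```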
From the calculated impacts, it is clear that the "ED visits" feature has the largest absolute impact. In order to produce more interpretable impact values, the impacts are all normalized by dividing each feature impact by the maximum impact value (in this case, "ED visits" has the maximum impact, equal to 6.05). The resulting normalized impacts and their absolute ranking are presented in Table 4.
Table 4. Summary of feature impacts, normalized impacts, and their absolute ranks.
ED = Emergency department
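For instance, applying this normalization to the Age impact computed above, and assuming the division uses the maximum absolute impact (6.05), gives:

```python
# Normalized Age impact, dividing by the maximum absolute impact (6.05 for "ED visits").
print(round(-0.6 / 6.05, 3))   # -0.099
```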
The example presented in this section can be repeated for each input ID in the entire dataset. As a result, feature impacts and rankings similar to those produced in Table 4 will be available for each ID. Depending on the user's preferences and needs, the appropriate ranking can be chosen. In this case, we were interested in the absolute ranking of all three features; however, other applications of this tool may be interested only in features with positive or negative impacts.

It is important to emphasize that the FIPE algorithm runs relatively quickly, and the program should not run into memory or implementation issues once expanded to a large real-world model in an industrial setting. That is because, along with one fixed score 𝑆𝑚 that is used for all features, the FIPE algorithm creates only two new scores, 𝑆𝑥 and 𝑆𝑐, for each feature to calculate the impact. As opposed to other algorithms whose expensive calculations and computation time increase exponentially with the number of features, FIPE always creates a number of new observations equal to twice the number of features.
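As a rough illustration of this scaling (the counting below is derived from the steps above rather than a formula given in the text, and in practice the calls would be batched):

```python
# Number of model scoring calls FIPE needs: the original score plus two new
# scores per feature for every observation, plus one shared all-median/mode score.
def fipe_eval_count(n_observations, n_features):
    return n_observations * (1 + 2 * n_features) + 1

print(fipe_eval_count(n_observations=10_000, n_features=500))   # 10010001 single-row scores
```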
4 Discussion
The explanation of model predictions for any ML algorithm has been generating concern due to the use of advanced predictive models in applications across a diverse variety of fields. This work provides a new algorithm, called FIPE, for prediction explanation that can be applied to any ML model. Besides providing an importance ranking of all model features for an individual prediction, the new FIPE algorithm is easy to implement for any ML technique and runs very quickly. The FIPE algorithm turns what is usually known as a black box ML model into an open and transparent tool. This will essentially open new areas of research, useful not just for advancing ML algorithms and their usage, but also for future efforts in trying to understand the behaviors of the underlying complex problems.

The FIPE algorithm provides a new tipping point for practical and realistic implementation of personalized model explanation. Future effort will be focused on identifying a standard methodology to evaluate FIPE and other similar algorithms in applied settings. Moreover, the algorithm should be subjectively evaluated on a full ML model with a large number of features and concise validation.
References
1. Alvarez-Melis D, Jaakkola TS.: On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049. 2018 Jun 21.
2. Interpretable machine learning. A Guide for Making Black Box Models Explainable, https://christophm.github.io/interpretable-ml-book/, last accessed 2019/03/20.
3. Goldstein A, Kapelner A, Bleich J, Pitkin E.: Peeking inside the black box: Visualizing
statistical learning with plots of individual conditional expectation. Journal of Computa-
tional and Graphical Statistics. 2015 Jan 2;24(1):44-65.
4. Lundberg S, Lee SI.: An unexpected unity among methods for interpreting model predic-
tions. arXiv preprint arXiv:1611.07478. 2016 Nov 22.
5. Ribeiro MT, Singh S, Guestrin C.: Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016 Aug 13 (pp. 1135-1144). ACM.
6. Bastani O, Kim C, Bastani H.: Interpreting blackbox models via model extraction. arXiv
preprint arXiv:1705.08504. 2017 May 23.
7. Fong RC, Vedaldi A.: Interpretable explanations of black boxes by meaningful perturbation. In: Proceedings of the IEEE International Conference on Computer Vision 2017 (pp. 3429-3437).
8. Ithapu VK.: Decoding the Deep: Exploring class hierarchies of deep representations using multiresolution matrix factorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2017 (pp. 45-54).
9. Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, Pedreschi D.: A survey of
methods for explaining black box models. ACM computing surveys (CSUR). 2018 Aug
22;51(5):93.
10. Friedman JH, Popescu BE.: Predictive learning via rule ensembles. The Annals of Applied
Statistics. 2008;2(3):916-54.
11. Apley DW.: Visualizing the effects of predictor variables in black box supervised learning
models. arXiv preprint arXiv:1612.08468. 2016 Dec 27.
12. Lakkaraju H, Kamar E, Caruana R, Leskovec J.: Interpretable & explorable approxima-
tions of black box models. arXiv preprint arXiv:1707.01154. 2017 Jul 4.
13. Fisher A, Rudin C, Dominici F.: All Models are Wrong but many are Useful: Variable Im-
portance for Black-Box, Proprietary, or Misspecified Prediction Models, using Model
Class Reliance. arXiv preprint arXiv:1801.01489. 2018 Jan 4.