Introduction to Uplift Modelling
An online gaming application
A few words about me
•  Senior Data Scientist at Dataiku
(worked on churn prediction, fraud detection, bot detection, recommender systems, graph
analytics, smart cities, … )
•  Occasional Kaggle competitor
•  Mostly codes in Python and SQL
•  Twitter @prrgutierrez
Plan
•  Introduction / Client situation
•  Uplift use case examples
•  Uplift modeling
•  Uplift evaluation & results
Client situation
•  Ankama : French Online Gaming Company (RPG)
•  Users are leaving
•  Let’s build a churn prediction model !
•  Target : user does not come back within 14 or 28 days.
(14 days of absence -> 80 % chance of not coming back
28 days of absence -> 90 % chance of not coming back)
•  Features :
•  Connection features :
•  Time played in the last 1, 7, 15, 30, … days
•  Time since last connection
•  Connection frequency
•  Days of the week / hours of the day played
•  Equivalent for payments and subscriptions
•  Age, sex, country
•  Number of accounts, is a bot …
•  No in-game features (no data)
Client situation
•  Model Results :
•  AUC 0.88
•  Very stable model
•  Marketing actions :
•  7 different actions based on customer segmentation
(offers, promotion, … )
•  A/B test
-> -5 % churn for people contacted by email
•  Going further :
•  Feature engineering : guilds, close network, in-game actions, …
•  Study long-term churn …
Client situation
•  But wait !
•  Strong hypothesis : target the people who are the most likely to churn
Client situation
•  What is the gain / person for an action ?
•  $c$ : cost of action
•  $v_i$ : value of customer $i$
•  $X$ : independent variables
•  $T$, $C$ : “treated” population and “control” population
•  $Y = 1$ if the customer churns, $0$ otherwise
•  Value with action : $E_T(V_i) = v_i (1 - P_T(Y = 1|X)) - c$
•  Value without action : $E_C(V_i) = v_i (1 - P_C(Y = 1|X))$
•  Gain (if $v_i$ is independent of treatment) : $E(G_i) = v_i (P_C(Y = 1|X) - P_T(Y = 1|X)) - c$
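As a rough illustration of the gain formula, the sketch below evaluates $E(G_i)$ for three hypothetical customers. The function name `expected_gain`, the customer labels and every numeric value are assumptions made for this example; none of them come from the Ankama study.

```python
def expected_gain(value, p_churn_control, p_churn_treated, cost):
    """E(G_i) = v_i * (P_C(Y=1|X) - P_T(Y=1|X)) - c."""
    return value * (p_churn_control - p_churn_treated) - cost


# Hypothetical customers: (label, v_i, P_C(Y=1|X), P_T(Y=1|X)); made-up numbers.
customers = [
    ("persuadable", 50.0, 0.60, 0.40),   # the action lowers the churn probability
    ("already lost", 50.0, 0.95, 0.95),  # very likely to churn, the action changes nothing
    ("annoyed", 50.0, 0.20, 0.30),       # the action raises the churn probability
]
COST = 2.0  # assumed cost c of one action

gains = {name: expected_gain(v, pc, pt, COST) for name, v, pc, pt in customers}
for name, gain in gains.items():
    print(f"{name}: expected gain = {gain:.2f}")
```

With these made-up numbers only the first customer has a positive expected gain, even though the second one has the highest churn probability.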
Client situation
•  Objective : maximize this gain, $E(G_i) = v_i (P_C(Y = 1|X) - P_T(Y = 1|X)) - c$
•  Targeting highly probable churners -> minimize $P_T(Y = 1|X)$, but not the difference !
•  Intuitive examples :
•  $P_C(Y = 1) < P_T(Y = 1)$ : action is expected to make the situation worse. Spam ?
•  $P_C(Y = 1) \approx P_T(Y = 1)$ : user does not care, is already lost
•  Uplift = model $\Delta P$, the difference between the treated and control probabilities
Uplift
•  Model the effect of the action
•  4 groups of customers / patients
•  1  Responded because of the action
(the people we want)
•  2  Responded, but would have responded anyway
(unnecessary costs)
•  3  Did not respond and the action had no impact
(unnecessary costs)
•  4  Did not respond because the action had a negative impact
(negative impact)
•  Incomplete knowledge : a customer is either treated or not, so only one of the two outcomes is ever observed
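The four groups can be written as a function of the two potential outcomes of a customer. This is only a sketch: the function name `customer_group` is an assumption, and since real data never contains both outcomes for the same customer, it can only be applied to simulated data.

```python
def customer_group(responds_if_treated, responds_if_control):
    """Map the two potential outcomes of one customer to one of the four groups."""
    if responds_if_treated and not responds_if_control:
        return 1  # responded because of the action
    if responds_if_treated and responds_if_control:
        return 2  # responded, but would have responded anyway
    if not responds_if_treated and not responds_if_control:
        return 3  # did not respond, the action had no impact
    return 4      # did not respond because the action had a negative impact


# Simulated customers with both potential outcomes known (impossible in practice).
simulated = [(True, False), (True, True), (False, False), (False, True)]
groups = [customer_group(t, c) for t, c in simulated]
print(groups)
```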
Uplift Examples
•  Healthcare :
•  A typical medical trial:
•  treatment group: gets the treatment
•  control group: gets placebo (or another treatment)
•  do a statistical test to show that the treatment is better than placebo
•  With uplift modeling we can find out for whom the treatment works best
•  Personalized medicine
•  Ex : What is the gain in survival probability ?
-> classification/uplift problem
Uplift Examples
•  Churn :
•  E-gaming
•  Other Ex : Coyote
•  Retail :
•  Compare coupon campaigns
Uplift Examples
•  Mailing : Hillstrom challenge
•  2 campaigns :
•  one men’s email
•  one women’s email
•  Question : who are the people to target, i.e. those with the best response rate ?
Uplift Examples
•  Common pattern
•  Experiment or A/B testing -> Test and control
•  Warning : Control can be biased easily :
•  Target the most probable churners and use the rest as control
•  Call only the people that come to a shop
•  Limited experiment trial -> no bandit algorithm :
(once a medicine experiment is done, you don’t continue the “exploration”)
-> feedback is relatively large and discrete in time.
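One way to avoid a biased control group is to assign customers at random, without looking at any feature or churn score. The sketch below assumes a hypothetical helper `assign_groups` and made-up customer ids; the 50/50 split and the seed are illustrative choices.

```python
import random


def assign_groups(customer_ids, treated_fraction=0.5, seed=0):
    """Randomly split customers into treatment ("T") and control ("C")."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    n_treated = int(round(len(shuffled) * treated_fraction))
    # The first n_treated customers of the shuffled list are treated.
    return {cid: ("T" if rank < n_treated else "C") for rank, cid in enumerate(shuffled)}


groups = assign_groups(range(1000))
n_treated = sum(1 for g in groups.values() if g == "T")
print(n_treated, len(groups) - n_treated)
```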
Uplift modelling
•  Three main methods :
•  Two models approach
•  Class variable modification
•  Modification of existing machine learning models
Uplift modelling : Two model approach
•  Build a model on treatment to get $P_T(Y|X)$
•  Build a model on control to get $P_C(Y|X)$
•  Set : $\Delta P = P_T(Y|X) - P_C(Y|X)$
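A minimal sketch of the two-model approach, assuming a single categorical feature and a frequency table standing in for any real classifier. The helper `fit_rate_model`, the segment names and the data are made up for illustration; here y = 1 is taken to mean a positive response.

```python
from collections import defaultdict


def fit_rate_model(rows):
    """Toy "model": estimate P(Y=1 | X=x) by the response frequency in each segment x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number of responses, number of rows]
    for x, y in rows:
        counts[x][0] += y
        counts[x][1] += 1
    return {x: positives / total for x, (positives, total) in counts.items()}


# Made-up (x, y) rows: x is a customer segment, y = 1 means a positive response.
treatment_rows = [("casual", 1), ("casual", 1), ("casual", 0), ("casual", 0),
                  ("hardcore", 1), ("hardcore", 1), ("hardcore", 1), ("hardcore", 0)]
control_rows = [("casual", 0), ("casual", 0), ("casual", 0), ("casual", 1),
                ("hardcore", 1), ("hardcore", 1), ("hardcore", 1), ("hardcore", 0)]

model_t = fit_rate_model(treatment_rows)  # estimates P_T(Y=1|X)
model_c = fit_rate_model(control_rows)    # estimates P_C(Y=1|X)

# Delta P = P_T(Y=1|X) - P_C(Y=1|X), one value per segment
uplift = {x: model_t[x] - model_c[x] for x in model_t}
print(uplift)
```

In this toy data the "hardcore" segment responds most often in both groups but shows no difference between them, so all of the estimated uplift sits in the "casual" segment.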
Uplift modelling : Two model approach
•  Advantages :
•  Standard ML models can be used
•  In theory, two good estimators -> a good uplift model
•  Works well in practice
•  Generalizes to regression and multi-treatment easily
•  Drawbacks :
•  The difference of two estimators is probably not the best estimator of the difference
•  The two classifiers can ignore the weaker uplift signal (since it’s not their target)
•  An algorithm focusing on estimating the difference should perform better
Uplift modelling : Class variable modification
•  Introduced in Jaskowski, Jaroszewicz 2012
•  Allows any classifier to be updated to uplift modeling
•  Let $G \in \{T, C\}$ denote the group membership (Treatment or Control)
•  Let’s define the new target variable :
$$Z = \begin{cases} 1 & \text{if } G = T \text{ and } Y = 1 \\ 1 & \text{if } G = C \text{ and } Y = 0 \\ 0 & \text{otherwise} \end{cases}$$
•  This corresponds to flipping the target in the control dataset.
Uplift modelling : Class variable modification
•  Why does it work ?
•  By design (A/B test warning !), $G$ should be independent from $X$
•  $P(Z = 1|X) = P_T(Y = 1|X) P(G = T|X) + P_C(Y = 0|X) P(G = C|X)$
thus
$P(Z = 1|X) = P_T(Y = 1|X) P(G = T) + P_C(Y = 0|X) P(G = C)$
•  Possibly with a reweighting of the datasets we should have : $P(G = T) = P(G = C) = 1/2$
thus
$2P(Z = 1|X) = P_T(Y = 1|X) + P_C(Y = 0|X)$
Uplift modelling : Class variable modification
•  Why does it work ?
Thus
$2P(Z = 1|X) = P_T(Y = 1|X) + P_C(Y = 0|X) = P_T(Y = 1|X) + 1 - P_C(Y = 1|X)$
$\Delta P = 2P(Z = 1|X) - 1$
And sorting by $P(Z = 1|X)$ is the same as sorting by $\Delta P$
Uplift modelling : Class variable modification
•  Summary :
•  Flip class for control dataset
•  Concatenate treatment and control datasets
•  Build a classifier
•  Target users with highest probability
•  Advantages :
•  Any classifier can be used
•  Directly predict uplift (and not each class separately)
•  Single model on a larger dataset (instead of two small ones)
•  Drawbacks :
•  Complex decision surface -> model can perform poorly
•  Interpretation : what is AUC in this case ?
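The summary steps can be sketched on toy data as follows. This is an illustration under stated assumptions, not the implementation used in the study: a frequency table stands in for the classifier, the helpers `flip_control_target` and `fit_z_model` and the data are made up, and treatment and control are balanced inside each segment so that $P(G = T) = P(G = C) = 1/2$ holds.

```python
from collections import defaultdict


def flip_control_target(group, y):
    """Z = 1 if (G = T and Y = 1) or (G = C and Y = 0), else 0."""
    return y if group == "T" else 1 - y


def fit_z_model(rows):
    """Toy classifier: estimate P(Z=1 | X=x) by the frequency of Z = 1 in each segment."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number of Z = 1, number of rows]
    for x, group, y in rows:
        counts[x][0] += flip_control_target(group, y)
        counts[x][1] += 1
    return {x: ones / total for x, (ones, total) in counts.items()}


# Made-up concatenated dataset of (x, G, y) rows, balanced between T and C in each segment.
rows = (
    [("casual", "T", y) for y in (1, 1, 0, 0)]
    + [("casual", "C", y) for y in (0, 0, 0, 1)]
    + [("hardcore", "T", y) for y in (1, 1, 1, 0)]
    + [("hardcore", "C", y) for y in (1, 1, 1, 0)]
)

p_z = fit_z_model(rows)
uplift = {x: 2 * p - 1 for x, p in p_z.items()}   # Delta P = 2 P(Z=1|X) - 1
ranking = sorted(p_z, key=p_z.get, reverse=True)  # target the highest P(Z=1|X) first
print(uplift, ranking)
```

Because the groups are balanced in each segment, $2P(Z = 1|X) - 1$ computed this way equals the difference between the treated and control response rates of that segment.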
Uplift modeling : Other methods
•  Based on decision trees :
•  Rzepakowski Jaroszewicz 2012
new decision tree split criterion based on information theory
•  Soltys Rzepakowski Jaroszewicz 2013
Ensemble methods for uplift modeling
(out of today’s scope)
Evaluation
•  We used :
•  2 model approach -> AUC ? Not very informative.
•  1 model approach -> does AUC mean anything ?
•  How can we evaluate / compare them ?
•  Cross Validation :
•  4 datasets : treatment/control x train/test
•  Problem :
•  We don’t have a clear 0/1 target.
•  We would need to know for each customer
•  Response to treatment
•  Response to control
-> not possible
Evaluation
•  Gain for group of customers :
•  Gain for the 10% highest scoring customers =
% of successes for top 10% treated customers − % of successes for top 10% control customers
•  Uplift curve ? :
•  Difference between two lift curves
•  Interpretation : net gain in success rate if a given percentage of the population is treated
•  Problem 1 : no theoretical maximum
•  Problem 2 : weird behaviour for 2 wizard models.
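The gain for the top-scoring customers can be computed along these lines. The function name `uplift_at_top` and the records are hypothetical, and the top fraction is taken separately inside the treated and control groups, which is one reading of the definition above.

```python
def uplift_at_top(records, fraction):
    """Success-rate difference (treated minus control) among the top-scoring records.

    records: list of (score, group, y) with group "T" or "C" and y in {0, 1}.
    The top `fraction` is taken separately within each group.
    """
    def top_success_rate(group):
        ranked = sorted((r for r in records if r[1] == group),
                        key=lambda r: r[0], reverse=True)
        k = max(1, int(len(ranked) * fraction))  # keep at least one record
        return sum(y for _, _, y in ranked[:k]) / k

    return top_success_rate("T") - top_success_rate("C")


# Made-up scored test set: (uplift score, group, observed success).
records = [
    (0.9, "T", 1), (0.8, "T", 1), (0.3, "T", 0), (0.2, "T", 0),
    (0.9, "C", 0), (0.7, "C", 1), (0.4, "C", 0), (0.1, "C", 0),
]

top_half = uplift_at_top(records, 0.5)
everyone = uplift_at_top(records, 1.0)
print(top_half, everyone)
```

Evaluating the function on a grid of fractions gives the points of the uplift curve.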
Evaluation : Qini
•  Qini Measure :
•  Similar to Gini (Area under lift curve). Lift Curve <-> Qini Curve
•  Parametric curve defined by : $f(t) = Y_T(t) - Y_C(t) \cdot N_T(t) / N_C(t)$
•  When taking the first $t$ observations
•  $Y_T$ is the total number of 1 seen in target observations
•  $Y_C$ is the total number of 1 seen in control observations
•  $N_T$ is the total number of target observations
•  $N_C$ is the total number of control observations
•  Balanced setting : $f(t) = Y_T(t) - Y_C(t)$
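A sketch of how the Qini curve could be computed from a scored test set. It assumes that $N_T(t)$ and $N_C(t)$ count the treated and control observations among the first $t$, so that the control successes are rescaled to the size of the treated group; the function name `qini_curve`, the handling of $N_C(t) = 0$ and the data are assumptions made for the example.

```python
def qini_curve(records):
    """Return [f(1), ..., f(n)] with f(t) = Y_T(t) - Y_C(t) * N_T(t) / N_C(t).

    records: list of (score, group, y); group is "T" or "C", y is 0 or 1.
    While no control observation has been seen (N_C(t) = 0), the control term is set to 0.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    y_t = y_c = n_t = n_c = 0
    curve = []
    for _, group, y in ranked:
        if group == "T":
            n_t += 1
            y_t += y
        else:
            n_c += 1
            y_c += y
        control_term = y_c * n_t / n_c if n_c else 0.0
        curve.append(y_t - control_term)
    return curve


# Made-up scored test set: (uplift score, group, observed success).
records = [
    (0.9, "T", 1), (0.8, "C", 0), (0.7, "T", 1),
    (0.6, "C", 1), (0.5, "T", 0), (0.4, "C", 0),
]
curve = qini_curve(records)
print(curve)
```

The last point of the curve uses the whole population, so it does not depend on the ordering and reflects the global effect of the treatment.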
Evaluation : Qini
•  Personal intuition :
•  We can’t know everything :
•  treated that convert, not treated that don’t convert. What would have happened ?
•  But we don’t want to see :
•  Treated not converting
•  Not treated converting (in our top list)
•  In the first $t$ observations we want to minimize : $N_T(t) - Y_T(t) + Y_C(t)$
•  Very similar to lift taking into account only negative examples.
Evaluation : Qini
$f(t) = Y_T(t) - Y_C(t)$
Evaluation : Qini
•  Best model :
•  Take all positives in target first and all positives in control last.
•  No theoretical best model :
•  depends on possibility of negative effect
•  Displayed for no negative effect
•  Random model :
•  Corresponds to global effect of treatment
•  Hillstrom Dataset :
•  For women, models are comparable and useful
•  For men, there are no clear individuals to target
Evaluation : Qini
$f(t) = Y_T(t) - Y_C(t)$
Evaluation : Qini
•  Back to our study :
•  Class modification performs best
•  Two models approach performs poorly
•  A/B test problem :
•  Control dataset is way too small !
•  Class modification model very close to lift
•  Two-model approach slightly better than random
-> would need to redo the A/B test.
Conclusion
•  Uplift :
•  Surprisingly little literature / examples
•  The theory is rather easy to test
•  Two models
•  Class modification
•  The intuition and evaluation are not easy to grasp
•  On the client side :
•  A good lead to select the best offer for a customer
A few references
•  Data :
•  Churn in gaming :
WOWAH dataset (blog post to come)
•  Uplift for healthcare :
Colon Dataset
•  Uplift in mailing :
Hillstrom data challenge
•  Uplift in General :
Simulated data :
(blog post to come)
A few references
•  Application
•  Uplift modeling for clinical trial data (Jaskowski, Jaroszewicz)
•  Uplift Modeling in Direct Marketing (Rzepakowski, Jaroszewicz)
A few references
•  Modeling techniques :
•  Rzepakowski Jaroszewicz 2011 (decision trees)
•  Soltys Rzepakowski Jaroszewicz 2013 (ensemble for uplift)
•  Jaskowski Jaroszewicz 2012 (Class modification model)
A few references
•  Evaluation
•  Using Control Groups to Target on Predicted Lift (Radcliffe)
•  Testing a New Metric for Uplift Models (Mesalles Naranjo)
Thank you for your attention !

Meetup_FGVA_Uplift @ Dataiku
