Algorithmic Digital Attribution Using Spark
Anny (Yunzhu) Chen and William (Zhenyu) Yan
Adobe
Agenda
•  Digital attribution
•  Algorithmic digital attribution
•  Spark implementation
•  Lessons learned
Path of media touches to conversion
A customer may receive various kinds of media touch points before deciding whether to subscribe to a product (conversion) or not. An example path:
•  Display: sees a display ad for Adobe Creative Cloud
•  Paid Search: searches the term “Creative Cloud” and clicks the promoted link
•  Organic Search: searches the term “Adobe Creative Cloud” and clicks an organic link
•  Signup: signs up for a free trial
•  Email: receives an email from Adobe
The path ends in one of two outcomes:
•  Subscribes to Adobe Creative Cloud (positive)
•  Does not subscribe to Adobe Creative Cloud (negative)
What’s digital attribution?
•  A digital attribution model determines how credit for a conversion is assigned to the media touch points that preceded it
•  It is important for performance monitoring and budget planning
Example: credit assigned to each touch point on a converting path (the credits sum to 1)
•  Display (saw a display ad for Adobe Creative Cloud): 0.1
•  Paid Search (searched the term “Creative Cloud” and clicked the promoted link): 0.2
•  Organic Search (searched the term “Adobe Creative Cloud” and clicked an organic link): 0.2
•  Signup (signed up for a free trial): 0.2
•  Email (received an email from Adobe): 0.3
•  Outcome: convert
Digital attribution model at a glance
Model | Considers all media touch points | Time decay | Data driven (vs. rule based)
Last interaction, first interaction, last paid search click | no | no | no
Linear | yes | no | no
Time decay | yes | yes | no
Position based | yes | no | no
Algorithmic | yes | yes | yes
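The rule-based rows of the table can be illustrated with a minimal Python sketch. The function names, the 7-day half-life, and the 40/40/20 position-based split are assumptions chosen for illustration (common conventions), not values taken from the slides.

```python
from typing import List


def last_interaction(n: int) -> List[float]:
    """All credit goes to the final touch of an n-touch path."""
    if n < 1:
        raise ValueError("path must contain at least one touch")
    return [0.0] * (n - 1) + [1.0]


def first_interaction(n: int) -> List[float]:
    """All credit goes to the first touch of an n-touch path."""
    if n < 1:
        raise ValueError("path must contain at least one touch")
    return [1.0] + [0.0] * (n - 1)


def linear(n: int) -> List[float]:
    """Every touch receives the same share."""
    if n < 1:
        raise ValueError("path must contain at least one touch")
    return [1.0 / n] * n


def time_decay(ages_days: List[float], half_life_days: float = 7.0) -> List[float]:
    """Weight halves for every half_life_days between touch and conversion.

    The 7-day half-life is a hypothetical, rule-chosen constant.
    """
    raw = [0.5 ** (age / half_life_days) for age in ages_days]
    total = sum(raw)
    return [w / total for w in raw]


def position_based(n: int, first: float = 0.4, last: float = 0.4) -> List[float]:
    """Fixed shares for the first and last touch; the rest is split evenly.

    The 40/40/20 split is an assumed convention.
    """
    if n < 1:
        raise ValueError("path must contain at least one touch")
    if n == 1:
        return [1.0]
    if n == 2:
        return [first / (first + last), last / (first + last)]
    middle = (1.0 - first - last) / (n - 2)
    return [first] + [middle] * (n - 2) + [last]
```

In every rule-based variant the weights are fixed in advance; the algorithmic model instead estimates them from data.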
Algorithmic Attribution
•  Rule based: predetermined weights based on rules
•  Algorithmic: machine learning or statistical models are used to
determine the weights
Algorithmic Attribution Modeling
The attribution model is based on a combination of
•  Distributed-lag econometric model
•  Discrete-time survival model
Some highlights:
•  The basic idea is to compare the media touches in positive paths with those in negative paths and thereby infer their effects
•  Logistic link function
•  Time decay parameters capture how media effects fade over time
•  Fit media touch effects and decay parameters simultaneously
•  Constraints on coefficients (combining with rules)
•  Stratified sampling
•  Bias reduction:
–  Using control variables, such as duration of exposure
–  Causal modeling
•  Maximum likelihood estimation
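The slides do not give the model equation. As a sketch only, one form consistent with the highlights above is a logistic link applied to a sum of per-channel effects, each discounted by a per-channel decay raised to the age of the touch. The names `beta`, `decay`, and `intercept`, and the exact functional form, are assumptions.

```python
import math
from typing import Dict, List, Tuple

# A touch is (channel, age), where age is the time between the touch and
# the end of the path, in days (an assumed unit).
Touch = Tuple[str, float]


def conversion_probability(
    path: List[Touch],
    beta: Dict[str, float],
    decay: Dict[str, float],
    intercept: float,
) -> float:
    """Logistic link over decayed media effects (assumed functional form)."""
    z = intercept
    for channel, age in path:
        z += beta[channel] * decay[channel] ** age
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(
    paths: List[List[Touch]],
    labels: List[int],
    beta: Dict[str, float],
    decay: Dict[str, float],
    intercept: float,
) -> float:
    """Bernoulli log-likelihood that maximum likelihood estimation would maximize."""
    total = 0.0
    for path, label in zip(paths, labels):
        p = conversion_probability(path, beta, decay, intercept)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total
```

With a positive coefficient, an older touch contributes less to the conversion probability than a recent one, which is the role the decay parameters play.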
Tokenization
•  Tokenization is used to group neighboring events together
•  Why is tokenization needed?
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, Adobe)
Tokenization
DISPLAY DISPLAY DISPLAY DISPLAY DISPLAY
DISPLAY
•  The effect of a media touch is unlikely to double or triple when the same touch reaches a customer two or three times in a short period
•  Tokenization is used to group neighboring events together
•  If you see the same advertisement 5 times in one hour, are you impressed 5 times, or do you just come away with the impression “I saw that ad”?
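A minimal sketch of tokenization as described above, assuming that "neighboring" means consecutive events of the same channel separated by at most a fixed gap. The one-hour window, the function name, and the choice to keep the first timestamp of each group are illustrative assumptions.

```python
from typing import List, Optional, Tuple

# An event is (channel, timestamp in seconds).
Event = Tuple[str, float]


def tokenize(events: List[Event], window_seconds: float = 3600.0) -> List[Event]:
    """Collapse runs of same-channel events that follow each other closely.

    An event joins the current token when it has the same channel as the
    token and arrives within window_seconds of the previous event in the
    run. Each token keeps the timestamp of its first event.
    """
    tokens: List[Event] = []
    last_time: Optional[float] = None  # time of the latest event absorbed so far
    for channel, ts in sorted(events, key=lambda e: e[1]):
        same_run = (
            bool(tokens)
            and tokens[-1][0] == channel
            and last_time is not None
            and ts - last_time <= window_seconds
        )
        if not same_run:
            tokens.append((channel, ts))
        last_time = ts
    return tokens
```

Five display impressions ten minutes apart become a single display token, while a display shown hours later starts a new one.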
Why Spark?
[Diagram: tool landscape from raw data to modeling]
•  Raw data
•  Big data tools: Hive, Pig, Impala, Presto…
•  Big data machine learning tools: Spark, Mahout
•  Python, R, Java, C++ …
•  Diagram labels: Big data, Small data; Rule based, Algorithmic; Iterative update
Data processing and stitching
Event
•  Media touch events, conversion events, and other user behavior events
•  Data fields: customer id, timestamp, event type, campaign id, etc.
Path
•  Stitch the events of the same customer by “id”
•  Generate both positive paths and negative paths
Model
•  Each path is a record
•  The positive/negative label is the outcome
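The stitching step can be sketched in plain Python as follows. In Spark the grouping would presumably be a groupBy on customer id; the event layout, the `subscribe` conversion event name, and the rule that a path ends at the first conversion are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# An event is (customer_id, timestamp, event_type).
Event = Tuple[str, float, str]


def stitch_paths(
    events: List[Event], conversion_type: str = "subscribe"
) -> Dict[str, Tuple[List[str], int]]:
    """Group events by customer and build one labeled path per customer.

    Returns {customer_id: (ordered touch types, label)} where label is 1
    for a positive path (a conversion occurred) and 0 for a negative path.
    """
    by_customer: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for customer_id, timestamp, event_type in events:
        by_customer[customer_id].append((timestamp, event_type))

    records: Dict[str, Tuple[List[str], int]] = {}
    for customer_id, customer_events in by_customer.items():
        customer_events.sort()  # chronological order within each customer
        path: List[str] = []
        label = 0
        for _, event_type in customer_events:
            if event_type == conversion_type:
                label = 1
                break  # touches after the conversion are not part of the path
            path.append(event_type)
        records[customer_id] = (path, label)
    return records
```

Each returned entry corresponds to one model record: the path is the input and the label is the outcome.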
Algorithm evolution
First model
•  Fix the time decay
•  Logistic regression using MLlib logistic regression with SGD
Second model
•  Include both regression coefficients and decay parameters in the model
•  Optimize the two alternately in each iteration
•  Customize and extend the original MLlib logistic regression
Third model
•  Second model + causal modeling (matched sample)
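The alternating optimization of the second model might look like the following single-machine sketch. It is not the authors' MLlib extension: the functional form, the plain gradient-ascent updates, the starting values, and the clipping of decay into (0, 1) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Path = List[Tuple[str, float]]  # (channel, age of the touch in days)


def predict(path: Path, beta: Dict[str, float], decay: Dict[str, float],
            intercept: float) -> float:
    """Conversion probability under an assumed logistic link with decayed effects."""
    z = intercept + sum(beta[c] * decay[c] ** age for c, age in path)
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(paths: List[Path], labels: List[int], beta: Dict[str, float],
                   decay: Dict[str, float], intercept: float) -> float:
    ll = 0.0
    for path, y in zip(paths, labels):
        p = predict(path, beta, decay, intercept)
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll


def fit_alternating(paths: List[Path], labels: List[int], channels: List[str],
                    iterations: int = 200, step: float = 0.1, eps: float = 1e-3):
    """Alternate one coefficient step and one decay step per iteration."""
    beta = {c: 0.0 for c in channels}
    decay = {c: 0.5 for c in channels}  # hypothetical starting value
    intercept = 0.0
    n = len(paths)
    for _ in range(iterations):
        # Phase 1: hold decay fixed, take a gradient step on intercept and beta.
        g_int = 0.0
        g_beta = {c: 0.0 for c in channels}
        for path, y in zip(paths, labels):
            resid = y - predict(path, beta, decay, intercept)
            g_int += resid
            for c, age in path:
                g_beta[c] += resid * decay[c] ** age
        intercept += step * g_int / n
        for c in channels:
            beta[c] += step * g_beta[c] / n
        # Phase 2: hold beta fixed, take a gradient step on decay.
        g_decay = {c: 0.0 for c in channels}
        for path, y in zip(paths, labels):
            resid = y - predict(path, beta, decay, intercept)
            for c, age in path:
                g_decay[c] += resid * beta[c] * age * decay[c] ** (age - 1.0)
        for c in channels:
            # Keep every decay parameter strictly between 0 and 1.
            decay[c] = min(1.0 - eps, max(eps, decay[c] + step * g_decay[c] / n))
    return beta, decay, intercept
```

Gradients are averaged over the sample here, so the effective step does not grow with the number of paths. A distributed version would compute the per-path gradient terms in parallel and sum them before each update.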
Implementation architecture
•  Data storage: S3
•  Computing platform: Spark standalone cluster on AWS (Amazon Web Services)
•  Monitoring: web UI provided by the Spark standalone cluster
•  Server-side encryption
•  Bastion host
Implementation: detailed workflow
Model building
•  Data processing and tokenization
•  Generate paths
•  Parallel algorithm
•  Save model to S3
Attribution
•  Data processing and tokenization
•  Retrieve model from S3
•  Generate positive paths
•  Attribution with model
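The slides do not state how the fitted model is turned into credits in the "Attribution with model" step. One plausible scheme, shown here purely as an assumption, shares the credit for a positive path in proportion to each touch's decayed effect, with negative effects treated as zero.

```python
from typing import Dict, List, Tuple

Touch = Tuple[str, float]  # (channel, age of the touch in days at conversion)


def attribute_path(
    path: List[Touch], beta: Dict[str, float], decay: Dict[str, float]
) -> List[float]:
    """Fractional credit per touch, proportional to beta * decay ** age.

    Hypothetical rule: negative contributions are clipped to zero, and a
    path with no positive contribution falls back to an equal split.
    """
    if not path:
        return []
    raw = [max(0.0, beta[c] * decay[c] ** age) for c, age in path]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(path)] * len(path)
    return [r / total for r in raw]
```

For example, with coefficients 1.0 (display) and 2.0 (email) and a decay of 0.5 for both, a display touch one day before conversion and an email at conversion receive credits 0.2 and 0.8.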
Lessons learned
•  Memory management
–  Each record was transformed from String-based to Byte-based using a hash function
•  Tremendously increased the speed
•  Reduced shuffle size in the next groupBy step
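The slides do not name the hash function or the byte layout. As an illustrative sketch, a long string field can be mapped to a fixed-width byte key like this; the use of BLAKE2b and the 8-byte width are assumptions.

```python
import hashlib


def encode_field(value: str, width: int = 8) -> bytes:
    """Map a string field to a fixed-width byte key.

    The key is deterministic, so equal strings still group together, but
    distinct strings can collide; a wider key lowers that risk.
    """
    return hashlib.blake2b(value.encode("utf-8"), digest_size=width).digest()


customer_key = encode_field("customer-0000012345-with-a-long-identifier")
channel_key = encode_field("display")
```

Because every key has the same small size regardless of the original string length, the records that are shuffled during grouping are smaller.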
Lessons learned
•  Memory management
–  Write separate classes for model training and attribution computation
•  Model training needs complex transformations and intensive iteration
•  Attribution computation needs more information from each object
Lessons Learned (cont’d)
•  Cache before iterative computation
–  Only cache the RDDs right before entering the model
–  Unpersist unused RDDs to save space
•  Clear jobs of stopped apps in workers from time to time
–  Check and clean ~/spark/work folder in workers
–  Check and clean /mnt/spark folder in workers if necessary
•  Be careful when dealing with unserialized Java objects
Lessons Learned (cont’d)
•  An error reported at one line of code does not necessarily originate from that line
–  Spark uses lazy evaluation
–  Errors surface at the action step but may originate in earlier transformation steps
•  Adjustment of step size
–  Time decay must be between 0 and 1
–  After matching, the sample size decreased dramatically, so the step size had to be adjusted accordingly
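The lazy-evaluation point can be illustrated with a plain-Python analogy. This is not Spark code: Python's built-in `map` is lazy in a similar way to Spark transformations, and `sum` plays the role of an action. The function names are hypothetical.

```python
from typing import Iterable, Iterator


def parse_ages(raw_ages: Iterable[str]) -> Iterator[float]:
    """'Transformation': returns a lazy iterator, nothing is parsed yet."""
    return map(float, raw_ages)


def total_age(ages: Iterable[float]) -> float:
    """'Action': consuming the iterator is what actually runs the parsing."""
    return sum(ages)


raw = ["0.5", "2", "oops", "7"]
ages = parse_ages(raw)  # no error here, although "oops" is invalid input

try:
    total_age(ages)  # the error is raised here, at the "action"
    failure = None
except ValueError as exc:
    failure = str(exc)  # the message points back at the bad value
```

The exception is raised while the action runs, but its cause is the earlier parsing step, which mirrors how a Spark stack trace at an action can point back to a faulty transformation.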
Thank you!
