The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Production

KILLER FEATURE STORE
USING SPARKML PIPELINES AND MLFLOW
Nathan Buesgens
Accenture Applied Intelligence

Agenda
Definitions of a Feature Store
A clear need, many approaches.
The Feature Flow Algorithm
ML pipeline orchestration.
The ML Pipeline Mesh
Governance and automation.

DEFINITIONS OF A FEATURE STORE

ML LIFECYCLE SUCCESS CRITERIA
VALIDATE BUSINESS HYPOTHESIS
A positive experimental result creates KPI lift in production.
NEW BUSINESS INSIGHT
Regardless of production results, new business insights are captured and made discoverable (with a feature store). This accelerates future experimentation.

FEATURE STORES
THREE APPROACHES TO AUTOMATION
Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
Feature “Modelling”: Automating ETL/Feature Engineering
Feature “Orchestration”: Automating ML Pipeline Construction

FEATURE STORES
Feature Store Approaches: Feature “Ops”
• The most common approach.
• A data access pattern for ML pipelines.
• Generally applied after feature engineering.
• Supplements data governance with data science (DS) semantics.

TRAIN/TEST Data Science Semantics
Extending the Data Governance Framework: An Example

TRAIN/TEST Data Science Semantics
[Diagram: Customer Segmentation pipeline. “Preprocessed” sales data -> Train/Test Split -> training data and test data -> … ML … -> customer segment features]

TRAIN/TEST Data Semantics
[Diagram: two chained pipelines.
Sales Prospect Segmentation: Train/Test Split -> training data and test data -> … ML … -> prospect segment features
Next Best Action: Assemble Features -> Train/Test Split -> training data and test data]

TRAIN/TEST Data Semantics
[Diagram: the same Sales Prospect Segmentation and Next Best Action pipelines as on the previous slide, each with its own Train/Test Split]
WHAT’S WRONG WITH THIS PICTURE?
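
One plausible reading of the question above (an assumption; the slides do not spell out the answer) is that each pipeline performs its own train/test split, so records used to train the upstream segmentation model can land in the downstream Next Best Action test set. A minimal sketch of one possible mitigation follows. The helper `split_label` and the `prospect-N` keys are hypothetical, not part of the talk; the idea is only that the split is derived from the record key so every pipeline agrees on it.

```python
import hashlib


def split_label(key: str, test_fraction: float = 0.2) -> str:
    """Assign a record to 'train' or 'test' from a hash of its key.

    Hypothetical helper: the assignment depends only on the key, so every
    pipeline that calls it sees the same split for the same record.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Illustrative record keys (made up for this sketch).
prospects = [f"prospect-{i}" for i in range(1000)]

# The upstream pipeline (segmentation) and the downstream pipeline
# (next best action) compute the split independently.
upstream_train = {p for p in prospects if split_label(p) == "train"}
downstream_test = {p for p in prospects if split_label(p) == "test"}

# Records that trained the upstream model and also appear in the
# downstream test set; with a shared, key-based split there are none.
leaked = upstream_train & downstream_test
print(len(leaked))
```

Recording the split as metadata alongside the feature data is one way to read "supplementing data governance with DS semantics"; the hashing scheme here is only an illustration of that idea.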

FEATURE STORES
Feature Store Approaches
Feature “Ops”
AutoML (Key Stakeholder: Citizen Scientist)
“Feature Flow” (Key Stakeholder: ML Engineer)

MANAGE ML PIPELINES
(not just models)

ML Pipeline Review
source: https://spark.apache.org/docs/latest/ml-pipeline.html
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# `spark` is assumed to be an existing SparkSession.
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([…])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([…], ["id", "text"])
# Make predictions on test documents.
prediction = model.transform(test)

What does this line do for me (as an engineer)?

FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE INFERENCE
Feature Flow takes pipeline stages as input, builds a graph, then sorts the stages topologically.
First, we iteratively infer the stages that need to be added to the pipeline to produce the necessary features.
Then, we sort the stages topologically.
THE “MONOLITHIC” PIPELINE (THE OLD WAY)
[Diagram: Tokenize -> TFIDF -> Sentiment Est.]
tokenize = ...
tfidf = ...
sentiment = ...
pipeline = Pipeline(stages=[tokenize, tfidf, sentiment])
[Diagram: Tokenize -> TFIDF -> Toxicity Est.]
tokenize = ...
tfidf = ...
toxicity = ...
pipeline = Pipeline(stages=[tokenize, tfidf, toxicity])
FEATURE STAGE DEPLOYMENTS (THE NEW WAY)
[Diagram, shown over several frames: each stage is deployed with its input and output features.
Tokenize: -> tokens
TFIDF: tokens -> vectors
Sentiment Est.: vectors -> sentiment
Toxicity Est.: vectors -> toxicity
Requested features: toxicity, sentiment. Successive frames connect each stage's inputs to the outputs of the upstream stages.]
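
The inference-then-sort procedure described for Feature Flow can be sketched as follows. This is a minimal illustration, not the talk's implementation: the `STAGES` catalog, the raw `text` input, and the function names are assumptions made for the example, and the stage names simply mirror the diagram. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage catalog: name -> (input features, output features).
# "text" is assumed to be a raw input column that no stage produces.
STAGES = {
    "tokenize": ({"text"}, {"tokens"}),
    "tfidf": ({"tokens"}, {"vectors"}),
    "sentiment": ({"vectors"}, {"sentiment"}),
    "toxicity": ({"vectors"}, {"toxicity"}),
}


def producer_of(feature):
    """Return the name of the stage that outputs `feature`, or None for raw data."""
    for name, (_, outputs) in STAGES.items():
        if feature in outputs:
            return name
    return None


def infer_stages(requested):
    """Iteratively add the stages needed to produce the requested features."""
    needed, frontier = {}, list(requested)
    while frontier:
        feature = frontier.pop()
        stage = producer_of(feature)
        if stage is None or stage in needed:
            continue  # raw input column, or stage already selected
        inputs, _ = STAGES[stage]
        # Record which upstream stages this stage depends on.
        needed[stage] = {producer_of(f) for f in inputs} - {None}
        # The stage's own inputs must be produced as well.
        frontier.extend(inputs)
    return needed


def plan(requested):
    """Infer the required stages, then order them topologically."""
    return list(TopologicalSorter(infer_stages(requested)).static_order())


print(plan(["sentiment", "toxicity"]))
```

In a Spark setting, the ordered names could be mapped back to stage objects and passed to `Pipeline(stages=[...])`, so that both estimators share a single Tokenize and TFIDF stage instead of each living in its own monolithic pipeline.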

THEN, ELIMINATE ALL NODES WITH MULTIPLE INCOMING EDGES PER FEATURE.
And replace them with nodes for the product of all incoming features.
[Diagram label: Feature: vectors]
FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE LINEAGE
Feature Flow gives us the tools to experiment with subsets of our pipeline.
The graph gets more complex when we are evaluating multiple strategies that create the same features.
To manage multiple possible traversals of the graph, we maintain a lineage of each feature.
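
A minimal sketch of lineage tracking, under the assumption that a lineage is simply the ordered tuple of stages that produced a feature. The `STAGES` catalog and the `lineages` function are hypothetical and only mirror the stage names used in the slides. When two strategies produce the same feature, each downstream stage is expanded into one variant per combination (product) of its inputs' lineages.

```python
from itertools import product

# Hypothetical stage catalog: name -> (input features, output feature).
# Two strategies (tfidf and word2vec) produce the same feature, "vectors".
STAGES = {
    "tokenize": (("text",), "tokens"),
    "tfidf": (("tokens",), "vectors"),
    "word2vec": (("tokens",), "vectors"),
    "sentiment": (("vectors",), "sentiment"),
    "toxicity": (("vectors",), "toxicity"),
}


def lineages(feature):
    """Return every lineage (stage names, upstream first) that can produce `feature`."""
    producers = [name for name, (_, out) in STAGES.items() if out == feature]
    if not producers:
        return [()]  # raw input: nothing needs to be computed
    results = []
    for stage in producers:
        inputs, _ = STAGES[stage]
        # One variant of this stage per combination of its inputs' lineages.
        for combo in product(*(lineages(f) for f in inputs)):
            # Merge the upstream lineages, dropping repeated stages but keeping order.
            upstream = tuple(dict.fromkeys(s for lin in combo for s in lin))
            results.append(upstream + (stage,))
    return results


for lineage in lineages("sentiment"):
    print(" -> ".join(lineage))
# prints:
# tokenize -> tfidf -> sentiment
# tokenize -> word2vec -> sentiment
```

Each printed lineage corresponds to one traversal of the graph, so a sentiment estimator trained on TFIDF vectors can be evaluated separately from one trained on Word2Vec vectors.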
AN EXAMPLE STAGE WITH MULTIPLE STRATEGIES
[Diagram: stages with their input and output features.
Tokenize: -> tokens
TFIDF: tokens -> vectors
Word2Vec: tokens -> vectors
Sentiment Est.: vectors -> sentiment
Toxicity Est.: vectors -> toxicity]
FIRST, BUILD THE GRAPH
[Diagram: initial graph with nodes Tokenize, TFIDF, Word2Vec, Sentiment Est., Toxicity Est.]
[Diagram: expanded graph with nodes Tokenize, TFIDF, Word2Vec, Sentiment Est. (TFIDF), Sentiment Est. (Word2Vec), Toxicity Est. (TFIDF), Toxicity Est. (Word2Vec)]

SEPARATE CONCERNS OF ALGORITHMIC DESIGN FROM OPERATIONS
• Deployment Automation and Runtime Management
• Metadata Management and Discovery
• ML Pipeline Governance

