The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Production
KILLER FEATURE STORE: USING SPARKML PIPELINES AND MLFLOW
Nathan Buesgens
Accenture Applied Intelligence
Agenda
Definitions of a Feature Store
A clear need, many approaches.
The Feature Flow Algorithm
ML pipeline orchestration.
The ML Pipeline Mesh
Governance and automation.
DEFINITIONS OF A FEATURE STORE
ML LIFECYCLE SUCCESS CRITERIA
VALIDATE BUSINESS HYPOTHESIS
A positive experimental result creates KPI lift in production.
NEW BUSINESS INSIGHT
Regardless of production results, new business insights are captured and made discoverable (with a feature store). This accelerates future experimentation.
featurestore.org
FEATURE STORES: THREE APPROACHES TO AUTOMATION
• Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
• Feature “Modelling”: Automating ETL/Feature Engineering
• Feature “Orchestration”: Automating ML Pipeline Construction
• Most common approach.
• Data access pattern for ML pipelines.
• Generally, post “feature engineering”.
• Supplement Data Governance with data science semantics.
TRAIN/TEST Data Science Semantics
Extending the Data Governance Framework: An Example
[Diagram: “preprocessed” sales data enters the Customer Segmentation pipeline, where a Train/Test Split yields training data and test data for the ML steps (… ML …), producing customer segment features.]
TRAIN/TEST Data Semantics
Extending the Data Governance Framework: An Example
[Diagram: “preprocessed” sales data enters the Sales Prospect Segmentation pipeline, where a Train/Test Split yields training data and test data for the ML steps (… ML …), producing prospect segment features. A second pipeline, Next Best Action, has its own Train/Test Split (training data, test data) and an Assemble Features step.]
WHAT’S WRONG WITH THIS PICTURE?
FEATURE STORES: THREE APPROACHES TO AUTOMATION
• Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
• Feature “Modelling”: Automating ETL/Feature Engineering. AutoML. Key Stakeholder: Citizen Scientist
• Feature “Orchestration”: Automating ML Pipeline Construction. “Feature Flow”. Key Stakeholder: ML Engineer
THE FEATURE FLOW ALGORITHM
MANAGE ML PIPELINES (not just models)
ML Pipeline Review
source: https://spark.apache.org/docs/latest/ml-pipeline.html
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([…])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([…], ["id", "text"])
# Make predictions on test documents.
prediction = model.transform(test)
What does this line do for me (as an engineer)?
FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE INFERENCE
Feature Flow takes pipeline stages as input, builds a graph, then sorts the stages topologically.
First, we iteratively infer the stages that need to be added to the pipeline to produce the necessary features. Then, we sort the stages topologically.
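The two steps described here can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Feature Flow implementation: the `STAGES` registry, the `infer_stages` and `order_stages` helpers, and the raw input feature `text` are hypothetical names, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever sorter the real system uses.

```python
from graphlib import TopologicalSorter

# Hypothetical registry: stage name -> (input features, output feature).
STAGES = {
    "tokenize": ({"text"}, "tokens"),
    "tfidf": ({"tokens"}, "vectors"),
    "sentiment_est": ({"vectors"}, "sentiment"),
    "toxicity_est": ({"vectors"}, "toxicity"),
}


def infer_stages(targets, available, stages=STAGES):
    """Walk back from the target features, adding every stage needed to produce them."""
    producers = {out: name for name, (_, out) in stages.items()}
    needed, frontier = set(), list(targets)
    while frontier:
        feature = frontier.pop()
        if feature in available:
            continue  # already present in the input data
        if feature not in producers:
            raise KeyError(f"no stage produces feature {feature!r}")
        name = producers[feature]
        if name not in needed:
            needed.add(name)
            frontier.extend(stages[name][0])  # now resolve this stage's inputs
    return needed


def order_stages(needed, stages=STAGES):
    """Sort the inferred stages so each runs after the stages it depends on."""
    producers = {out: name for name, (_, out) in stages.items()}
    # Map each stage to the set of needed stages that produce its inputs.
    graph = {
        name: {producers[f] for f in stages[name][0] if producers.get(f) in needed}
        for name in needed
    }
    return list(TopologicalSorter(graph).static_order())


plan = order_stages(infer_stages({"sentiment", "toxicity"}, available={"text"}))
print(plan)
```

Requesting only `sentiment` would leave `toxicity_est` out of the plan, which is the point of inferring stages from the features that are actually needed.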
THE “MONOLITHIC” PIPELINE (THE OLD WAY)
[Diagram: Tokenize → TFIDF → Sentiment Est.]
tokenize = ...
tfidf = ...
sentiment = ...
pipeline = Pipeline(stages=[tokenize, tfidf, sentiment])
[Diagram: Tokenize → TFIDF → Toxicity Est.]
tokenize = ...
tfidf = ...
toxicity = ...
pipeline = Pipeline(stages=[tokenize, tfidf, toxicity])
FEATURE STAGE DEPLOYMENTS (THE NEW WAY)
[Diagram, shown over several builds: each stage is deployed with its input and output features. Tokenize outputs tokens; TFIDF takes tokens and outputs vectors; Sentiment Est. takes vectors and outputs sentiment; Toxicity Est. takes vectors and outputs toxicity. Starting from the requested features toxicity and sentiment, the builds link the stages through their shared features into one graph: Tokenize, TFIDF, Sentiment Est., Toxicity Est.]
THEN, ELIMINATE ALL NODES WITH MULTIPLE INCOMING EDGES PER FEATURE.
And, replace with nodes for the product of all incoming features.
[Diagram label: Feature: vectors]
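The replacement step can be illustrated with a Cartesian product over the producers of each input feature. The sketch below is an assumption about how such an expansion might look, not the talk's code: `STAGES`, `expand_variants`, and the raw input `text` are hypothetical, and `None` is used to mark an input that no stage produces.

```python
from itertools import product

# Hypothetical stage declarations: stage -> (input features, output feature).
# Two stages (tfidf, word2vec) produce the same feature, "vectors".
STAGES = {
    "tokenize": (("text",), "tokens"),
    "tfidf": (("tokens",), "vectors"),
    "word2vec": (("tokens",), "vectors"),
    "sentiment_est": (("vectors",), "sentiment"),
}


def expand_variants(stage, stages=STAGES):
    """Return one variant of `stage` per combination of producers for its inputs."""
    inputs, _ = stages[stage]
    # For each input feature, list the stages that can produce it.
    # A raw input with no producer is represented by None.
    choices = [
        sorted(name for name, (_, out) in stages.items() if out == feature) or [None]
        for feature in inputs
    ]
    # The Cartesian product enumerates every way to pick one producer per input.
    return [dict(zip(inputs, combo)) for combo in product(*choices)]


variants = expand_variants("sentiment_est")
print(variants)
```

Here the single `sentiment_est` node, which has two incoming edges for `vectors`, is replaced by two variants, one fed by `tfidf` and one by `word2vec`.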
FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE LINEAGE
Feature Flow gives us the tools to experiment with subsets of our pipeline.
The graph gets more complex when we are evaluating multiple strategies that create the same features. To manage multiple possible traversals of the graph, we maintain a lineage of each feature.
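One way to picture a feature's lineage is as the chain of stages that produced it. The following sketch enumerates every such chain for a feature; it is an illustration only, with a hypothetical `STAGES` table and `lineages` helper, and it assumes every stage has exactly one input feature to keep the recursion short.

```python
# Hypothetical stage declarations: stage -> (input features, output feature).
STAGES = {
    "tokenize": (("text",), "tokens"),
    "tfidf": (("tokens",), "vectors"),
    "word2vec": (("tokens",), "vectors"),
    "sentiment_est": (("vectors",), "sentiment"),
}


def lineages(feature, stages=STAGES):
    """List every chain of stages that can produce `feature`, earliest stage first."""
    producers = sorted(name for name, (_, out) in stages.items() if out == feature)
    if not producers:
        return [()]  # raw input: nothing upstream
    chains = []
    for name in producers:
        (upstream,) = stages[name][0]  # sketch assumes single-input stages
        for chain in lineages(upstream, stages):
            chains.append(chain + (name,))
    return chains


for chain in lineages("sentiment"):
    print(" -> ".join(chain))
```

Each chain identifies one traversal of the graph, so two versions of `sentiment` built from different vectorizers stay distinguishable.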
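AN EXAMPLE STAGE WITH MULTIPLE STRATEGIES
[Diagram: Tokenize outputs tokens; TFIDF takes tokens and outputs vectors; Word2Vec also takes tokens and outputs vectors; Sentiment Est. takes vectors and outputs sentiment; Toxicity Est. takes vectors and outputs toxicity.]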
FIRST, BUILD THE GRAPH
[Diagram: graph with nodes Tokenize, TFIDF, Word2Vec, Sentiment Est., and Toxicity Est.]
[Diagram: expanded graph with nodes Tokenize, TFIDF, Word2Vec, Sentiment Est. (TFIDF), Sentiment Est. (Word2Vec), Toxicity Est. (TFIDF), and Toxicity Est. (Word2Vec).]
THE ML PIPELINE MESH
SEPARATE CONCERNS OF ALGORITHMIC DESIGN FROM OPERATIONS
• Deployment Automation and Runtime Management
• Metadata Management and Discovery
• ML Pipeline Governance
Demo
