CRISP-DM
Agile Approach to Data Mining Projects
Michał Łopuszyński
Warsaw Data Science Meetup, 2016.06.07
About me
• I work at ICM UW
  • Supercomputing centre, weather forecast, virtual library, open science platform, visualization solutions, ...
  • Involved in modelling and data analysis projects from cosmology, medicine, bioinformatics, quantum chemistry, biophysics, fluid dynamics, materials science, social network analysis, ...
• Our group = Applied Data Analysis Lab
  • Automatic information extraction from PDFs
  • Text mining in scientific literature
  • Variety of application projects (analysis of court judgments, aviation, deploying solutions on the big data stack Spark/Hadoop, trainings)
adalab.icm.edu.pl
What is CRISP-DM?
• Cross Industry Standard Process for Data Mining
• Developed in 1996 by big players in data analysis
  SPSS, Teradata, Daimler, OHRA, NCR
• I follow "CRISP-DM 1.0 Step-by-step data mining guide"
[Diagram: CRISP-DM process cycle around DATA: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment]
• Most popular methodology for data-centric projects
  • See KDNuggets Polls 2007, 2014
  • Runner-up SEMMA
• I find it agile
  • Introduces almost no overhead
  • Emphasizes adaptive transitions between project phases
Business Understanding
• Determine business objectives
• Assess situation
  Resources (data!), risks, costs & benefits
• Determine data mining goals
  Ideally with quantitative success criteria
• Develop project plan
  Estimate timeline, budget, but also tools and techniques
Business Understanding
• Difficult!
• Often, you have to enter a new field
• You have to explain data science limitations to non-experts
  • No, performance will not be 100%
  • We need much more data to train an accurate model
  • For tomorrow, it is impossible
Source: http://xkcd.com/1425
Business Understanding – my DOs and DON'Ts
• Have a lot of patience for vaguely defined problems
• Do not waste your time on ill-defined, unrealistic projects
• Learn to concretize or even reduce the scope of the initial idea
  • Data sample
  • Real-life use cases
  • Quantitative success metrics
Data Understanding
• Collect initial data
• Describe data
  Persist results
• Explore data
  Persist results
• Verify data quality
  Carefully document problems and issues found!
Data Understanding – Validate Everything
<judgement id="...">
<date>3013-12-04 00:00:00.0 CET</date>
<publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
<courtId>15250000</courtId>
<departmentId>503</departmentId>
<chairman>Małgorzata ...</chairman>
<judges>
<judge>Małgorzata ...</judge>
</judges>
...
</judgement>
<judgement id="...">
<date>2012-10-01 00:00:00.0 CEST</date>
<publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
<courtId>15450500</courtId>
<departmentId>6027</departmentId>
<judges>
<judge>Piotr ...</judge>
<judge>wskazał</judge>
<judge>czego wymaga art. 17a ust. 2 ustawy</judge>
...
</judges>
</judgement>
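The two records above show an implausible judgment date (year 3013) and sentence fragments stored as judge names. A minimal sketch of how such checks might be automated follows. The sample XML, the placeholder names, the accepted year range and the `looks_like_name` heuristic are all assumptions made for illustration, not part of the original data set or tooling.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample mirroring the records above; elided names are
# replaced by made-up placeholders.
SAMPLE = """<judgements>
  <judgement id="a">
    <date>3013-12-04 00:00:00.0 CET</date>
    <judges><judge>Anna Kowalska</judge></judges>
  </judgement>
  <judgement id="b">
    <date>2012-10-01 00:00:00.0 CEST</date>
    <judges>
      <judge>Piotr Nowak</judge>
      <judge>wskazał</judge>
      <judge>czego wymaga art. 17a ust. 2 ustawy</judge>
    </judges>
  </judgement>
</judgements>"""


def looks_like_name(text):
    """Assumed heuristic: 2-4 capitalized, purely alphabetic words."""
    words = text.split()
    return 2 <= len(words) <= 4 and all(
        w[0].isupper() and w.replace("-", "").isalpha() for w in words
    )


def find_problems(xml_text, min_year=1900, max_year=2016):
    """Return (judgement id, field, raw value) for every suspicious value."""
    problems = []
    for judgement in ET.fromstring(xml_text).iter("judgement"):
        jid = judgement.get("id")
        raw_date = judgement.findtext("date", default="")
        try:
            year = datetime.strptime(raw_date[:10], "%Y-%m-%d").year
        except ValueError:
            year = None
        # Assumed plausibility window; adjust to the real collection.
        if year is None or not min_year <= year <= max_year:
            problems.append((jid, "date", raw_date))
        for judge in judgement.iter("judge"):
            name = (judge.text or "").strip()
            if not looks_like_name(name):
                problems.append((jid, "judge", name))
    return problems


problems = find_problems(SAMPLE)
for problem in problems:
    print(problem)
```

Rules this simple will produce false positives on real names, so the output is best treated as a list of records to inspect by hand.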
Data Understanding – Spot Anomalies
Histogram of a certain smooth quantity measured using "precise equipment"
Explanation: the effect of a human interface between the precise equipment and the database
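One way such a human-entry effect might be detected without looking at a plot is to count the last recorded digit of each value: an instrument should yield roughly uniform final digits, while manually typed values tend to pile up on round numbers. The sketch below runs on simulated data; the distribution, the 60% rounding rate and the `factor` threshold are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def last_digit_shares(values, decimals=1):
    """Share of each final recorded digit (0-9) at the given precision."""
    digits = Counter(int(round(abs(v) * 10 ** decimals)) % 10 for v in values)
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(10)}


def heaped_digits(values, decimals=1, factor=2.0):
    """Digits whose share exceeds `factor` times the uniform share of 10%."""
    shares = last_digit_shares(values, decimals)
    return sorted(d for d, share in shares.items() if share > factor * 0.1)


rng = random.Random(0)
# Simulated instrument readings with one decimal place.
instrument = [round(rng.gauss(70, 10), 1) for _ in range(5000)]
# Simulated manual entry: a majority of readings rounded to whole numbers.
typed = [float(round(v)) if rng.random() < 0.6 else v for v in instrument]

print("instrument:", heaped_digits(instrument))
print("typed:     ", heaped_digits(typed))
```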
Data Understanding – Spot Anomalies
Secondary school examination (Matura) score distribution for the Polish language exam
Exploratory data analysis can reveal imperfections of the conducted experiment
Source: CKE Materials, Matura 2012
Data Understanding – my DOs and DON'Ts
• Do not trust data quality estimates provided by your customer
• Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
• Understand anomalies and outliers
• Do not economize on this phase
  • The earlier you discover issues with your data the better (yes, your data will have issues!)
  • Data understanding leads to domain understanding; it will pay off in the modelling phase
• Investigate what sort of processing was applied to the raw data
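Two of the properties listed above, completeness and deduplication, are cheap to measure as soon as the first data sample arrives. The following is a minimal sketch, assuming records are available as a list of dictionaries; the field names and sample rows are hypothetical.

```python
from collections import Counter


def quality_report(records, key_fields):
    """Summarize missing values per field and duplicated keys."""
    fields = sorted({field for record in records for field in record})
    missing_share = {
        field: sum(1 for r in records if r.get(field) in (None, "")) / len(records)
        for field in fields
    }
    keys = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    duplicate_keys = {key: n for key, n in keys.items() if n > 1}
    return {
        "rows": len(records),
        "missing_share": missing_share,
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical rows: one exact duplicate, one empty and one absent chairman.
records = [
    {"court_id": "15250000", "case": "I C 1/12", "chairman": "A"},
    {"court_id": "15250000", "case": "I C 1/12", "chairman": "A"},
    {"court_id": "15450500", "case": "II K 7/13", "chairman": ""},
    {"court_id": "15450500", "case": "II K 9/13"},
]

report = quality_report(records, key_fields=["court_id", "case"])
print(report)
```

Numbers like these are also easy to show to a customer who believes the data is clean.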
Data Preparation
• Select data
• Clean data
• Construct data
  Generate derived attributes
• Integrate data
  Merge information from different sources
• Format data
  Convert to format convenient for modelling
Data Preparation
• Tedious!
• Use workflow tools to document, automate & parallelize data prep.
  Make, Drake, Oozie, Azkaban, Luigi, Airflow, ...
[Figure: example workflow graph with nodes classification-jsonl, data-aux/class-riffle, data-clean/joind-jsonl, data-aux/metad-riffle, data-aux/priis-json, data-aux/prinf-json, stat/basic, stat/basic-fp7, stat/collab, metadata-jsonl, projects-from-iis-jsonl, projects-from-infspace-jsonl, metadata-extracted-jsonl]
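The tools named above all rest on one idea: each output declares what it depends on, and the tool works out the execution order. A toy sketch of that idea using only the Python standard library (`graphlib`, Python 3.9+) is shown below. It is not a substitute for any of those tools, and the step names and functions are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps):
    """Run every step after its dependencies.

    steps maps a target name to (dependencies, function); the function
    receives the results of its dependencies in the declared order.
    """
    graph = {target: deps for target, (deps, _) in steps.items()}
    results, executed = {}, []
    for target in TopologicalSorter(graph).static_order():
        deps, func = steps[target]
        results[target] = func(*(results[d] for d in deps))
        executed.append(target)
    return results, executed


# Hypothetical four-step preparation chain.
steps = {
    "raw": ((), lambda: [" 3", "1 ", "2", "2"]),
    "clean": (("raw",), lambda rows: [int(r.strip()) for r in rows]),
    "dedup": (("clean",), lambda xs: sorted(set(xs))),
    "stats": (
        ("clean", "dedup"),
        lambda xs, uniq: {"n": len(xs), "unique": len(uniq)},
    ),
}

results, executed = run_pipeline(steps)
print(executed)
print(results["stats"])
```

The `steps` dictionary doubles as documentation of the data flow. Real workflow tools add what this sketch leaves out: caching of finished targets, parallel execution and failure recovery.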
Data Preparation
• Data understanding and preparation will usually consume half or more of your project time!
[Charts: share of project time spent on data preparation; axes: percentage of time, percentage of responses]
Poll question: What % of time in your data mining project(s) is spent on data cleaning and preparation?
Source: KDNuggets Poll 2003
Source: M. A. Munson, A Study on the Importance of and Time Spent on Different Modeling Steps, ACM SIGKDD Explorations Newsletter 13, 65-71 (2011)
Data Preparation – my DOs and DON'Ts
• Prepare your customer for the fact that data understanding and preparation take a considerable amount of time
• Automate this phase as far as possible
  • Use workflow tools to help you with the above
• When merging multiple sources, track provenance of your data
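Provenance tracking can be as simple as recording, for every merged field, which source supplied the value. The sketch below assumes records are dictionaries sharing a key field and that the first source to supply a field wins. The source names, the fields and the conflict policy are illustrative assumptions.

```python
def merge_with_provenance(sources, key):
    """Merge records from named sources on `key`, remembering value origins."""
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(
                record[key], {"values": {}, "provenance": {}, "conflicts": []}
            )
            for field, value in record.items():
                if field == key or value is None:
                    continue
                if field not in entry["values"]:
                    # First source to supply the field wins (assumed policy).
                    entry["values"][field] = value
                    entry["provenance"][field] = source_name
                elif entry["values"][field] != value:
                    # Keep disagreeing values for later inspection.
                    entry["conflicts"].append((field, source_name, value))
    return merged


# Hypothetical project records coming from two sources.
sources = {
    "source_a": [{"id": "p1", "title": "Alpha", "budget": 100}],
    "source_b": [
        {"id": "p1", "title": "Alpha", "budget": 120, "country": "PL"},
        {"id": "p2", "title": "Beta"},
    ],
}

merged = merge_with_provenance(sources, key="id")
print(merged["p1"])
```

Keeping the conflicts, instead of silently overwriting, makes it possible to trace a strange value in the modelling phase back to its source.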
Modelling
• Select modelling technique
  Assumptions, measure of accuracy
• Generate test design
• Build model
  Feature eng., optimize model parameters
• Assess model
  Iterate the above
Modelling – Tooling Selection
• Where will your model be deployed?
• Do you need to distribute your computations? (avoid!)
• Should I use a general purpose language?
  Breadth = performance, lots of general purpose libraries and tooling, easy creation of web services
• Should I use a data analysis language?
  Depth = easy data manipulation, latest models and statistical techniques available
• Can I afford a prototype?
[Chart: languages placed by Breadth (quality of general purpose tooling) vs. Depth (quality of data analysis tooling): C++, Java, C#, R, Matlab, Mathematica, Python, Scala, Clojure, F#]
Modelling – my DOs and DON'Ts
• Develop your model with deployment conditions in mind
• Allocate time for hyperparameter optimization
• Whenever possible, peek inside your model and consult it with a domain expert
  • Assess feature importance
  • Run your model on simulated data
• Be creative with your features (feature engineering)
  • Esp. from textual data or time series you can generate a lot of std. features
• Make a conscious decision about missing data (NAs) and outliers (regression!)
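Two of the points above can be illustrated together: running a model on simulated data with known parameters shows whether the fitting code recovers them, and adding a single gross outlier shows why outliers deserve a conscious decision in regression. The sketch uses plain ordinary least squares. The true slope and intercept, the noise level and the outlier are arbitrary assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(42)
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, -1.0  # known ground truth of the simulation

xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, 0.5) for x in xs]

# Check 1: does the fit recover the parameters used to generate the data?
slope, intercept = fit_line(xs, ys)

# Check 2: how far does one gross outlier move the estimate?
slope_out, intercept_out = fit_line(xs + [10.0], ys + [1000.0])

print(f"clean data:   slope={slope:.3f} intercept={intercept:.3f}")
print(f"with outlier: slope={slope_out:.3f} intercept={intercept_out:.3f}")
```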
Evaluation
• Evaluate results
  Business success criteria fulfilled?
• Review process
• Determine next steps
  To deploy or not to deploy?
Evaluation – my DOs and DON'Ts
• Work with the performance criteria dictated by your customer's business model
• Assess not only performance, but also practical aspects related to deployment, for example:
  • Training and prediction speed
  • Robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code)
• Watch out for data leakage, for example:
  • Time series – mixing past and future
  • Meaningful identifiers
  • Other nasty ways of artificially introducing extra information not available in production
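For the time-series case, leakage can often be avoided by splitting on a cutoff date instead of sampling records at random, and a cheap check can confirm that no training record is as recent as the test period. The sketch below uses made-up monthly records; the field name `day`, the cutoff and the interleaved split (a deterministic stand-in for a random split) are assumptions.

```python
from datetime import date


def time_split(records, cutoff, time_key="timestamp"):
    """Train on records strictly before `cutoff`, test on the rest."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test


def leaks_future(train, test, time_key="timestamp"):
    """True if some training record is not older than the earliest test record."""
    if not train or not test:
        return False
    return max(r[time_key] for r in train) >= min(r[time_key] for r in test)


# Hypothetical monthly observations, January to June 2016.
records = [{"day": date(2016, month, 1), "value": month} for month in range(1, 7)]

# Naive split that interleaves past and future records.
naive_train, naive_test = records[::2], records[1::2]

# Time-aware split: everything from May onwards is held out.
train, test = time_split(records, cutoff=date(2016, 5, 1), time_key="day")

print("naive split leaks:", leaks_future(naive_train, naive_test, "day"))
print("time split leaks: ", leaks_future(train, test, "day"))
```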
Deployment
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project
  Collect lessons learned!
Deployment – my DOs and DON'Ts
Read this paper, for excellent insights!
Thank you!
Questions?
@lopusz
