Breeding Data Scientists
• Danielle Dean, PhD Senior Data Scientist Lead, Microsoft
• Amy O’Connor Business Value Enablement, Cloudera
Five changes in the world of the Data Scientist
• More Data, Insights, Results
• Organization & Culture
• Productivity Tools
• Data Engineering
• Cloud Enabled
More Data, More Insights
Data is abundant, diverse & shared freely
As is how we store, process and analyze it
Streaming, Machine Learning, BI, ETL, Modeling
More Results
• Top Cancer Research Institutions: Working to Cure Cancer
• Rocket Science
• Thorn: Destroying Human Trafficking Networks
Organization & Culture: Sobering Statistics
• “Only 27% of the big data projects are regarded as successful”
• “Only 8% of the big data projects are regarded as VERY successful”
• Only 13% of organizations have achieved full-scale production for their Big Data implementations
Source: Capgemini 2014
• “Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years”
Source: Dataversity 2015 Survey
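The Data Scientist is not one person
Curiosity
[Venn diagram of three skill areas: Hacking Skills, Math and Statistical Knowledge, Substantive Expertise]
• Hacking Skills + Math and Statistical Knowledge: Machine Learning
• Math and Statistical Knowledge + Substantive Expertise: Traditional Research
• Hacking Skills + Substantive Expertise: Danger Zone
• All three: Data Science
Source: Drew Conway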
The Data Scientist does not stand alone
Data Engineer/ETL Engineer
Executive Sponsor
Data Steward/SME
Subject Matter Expert
Data Scientist
+ Product Owner, app developer, program manager, DevOps, etc.
The Data Scientist does not sit in a centralized org
Other - 37%
CIO or IT Function - 18%
CMO - 11%
CFO - 9%
Chief Analytics Officer - 7%
CRO / Risk - 7%
VP Strategic Planning - 5%
VP Sales - 3%
Chief Data Officer - 3%
VP Customer Service - 3%
Source: Gartner 2016
“How do I become a Data Scientist?”
“How do I become a Data Scientist?”
Importance of Process
Data Science != Software Engineering
But we can learn a lot, especially on processes.
After all, failing to plan is planning to fail.
Data Acquisition
1. Data Flow Architecture
2. Data Schema Architecture
2. Feature Extraction
3. Data Flow Implementation
4. Data Flow Validation
Data Science
1. Data Problem Formulation
2. Acquire Data Sources
3. Data exploration
4. Create analytics dataset
5. Modeling & Descriptive Analysis
6. Model evaluation and tuning
7. Model Deployment
Four Pillars of the Team Data Science Process
1. Standard Project Lifecycle
2. Standardized Document Templates, Project Structure
3. Shared, Distributed Resources
4. Productivity Tools, Shared Utilities
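As a rough illustration of the "standardized project structure" pillar, the sketch below creates the same folder skeleton for every new project. The folder names in `PROJECT_LAYOUT` and the helper `scaffold_project` are hypothetical and chosen only for illustration; they are not taken from the official TDSP template.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: these folder names are illustrative assumptions,
# not the official TDSP project template.
PROJECT_LAYOUT = [
    "docs/project",
    "docs/data_report",
    "docs/model_report",
    "code/data_acquisition",
    "code/modeling",
    "code/deployment",
    "sample_data",
]


def scaffold_project(root: Path, layout=PROJECT_LAYOUT) -> list[Path]:
    """Create the standard folders under root and return the created paths."""
    created = []
    for relative in layout:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        # A placeholder README keeps otherwise empty folders visible in Git.
        (folder / "README.md").write_text(f"# {relative}\n", encoding="utf-8")
        created.append(folder)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_project(Path(tmp) / "demo_project"):
            print(path.relative_to(tmp))
```

Keeping the layout in one shared list means every team starts from the same structure, which is the point of the pillar; the specific folders would be agreed on by the team.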
Team Data Science Process at Microsoft
• Data science virtual machines (DSVMs) as the fundamental development platform on cloud
• Use Visual Studio Team Services (VSTS)
  • Work item tracking and scrum planning
  • Git repositories
• Shared data science utilities in Git repository
• Use cloud-based Azure resources as needed
Data Engineering – ready for ML?
The better the raw materials, the better the product.
• Question is sharp. E.g. Predict whether component X will fail in the next Y days; clear path of action with answer
• Data measures what they care about. E.g. Identifiers at the level they are predicting
• Data is connected. E.g. Machine information linkable to usage information
• Data is accurate. E.g. Failures are really failures, human labels on root causes; domain knowledge translated into process
• A lot of data. E.g. Will be difficult to predict failure accurately with few examples
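Two of these readiness checks ("a lot of data" and "data is connected") can be approximated with a few lines of code before any modeling starts. The sketch below is a minimal illustration under stated assumptions: the `MachineRecord` type, the `readiness_report` helper, and the `min_failures` threshold of 30 are all hypothetical and not part of the presentation.

```python
from dataclasses import dataclass


@dataclass
class MachineRecord:
    machine_id: str  # identifier at the level being predicted
    failed: bool     # label: did the component fail in the window?


def readiness_report(records, usage_ids, min_failures=30):
    """Return simple readiness checks.

    min_failures is an arbitrary illustrative threshold, not a recommendation.
    """
    failures = sum(r.failed for r in records)
    machine_ids = {r.machine_id for r in records}
    # "Data is connected": machine records that can be joined to usage data.
    linked = machine_ids & set(usage_ids)
    share = len(linked) / len(machine_ids) if machine_ids else 0.0
    return {
        "failure_count": failures,
        "enough_failure_examples": failures >= min_failures,
        "share_linkable_to_usage": share,
    }


if __name__ == "__main__":
    records = [
        MachineRecord("m1", True),
        MachineRecord("m2", False),
        MachineRecord("m3", False),
    ]
    print(readiness_report(records, usage_ids=["m1", "m2"]))
```

A low failure count or a low linkable share is a signal to go back to data engineering before investing in models.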
A Bit more on Data Engineering
Gartner estimates that poor quality of data costs an average organization $13.5 million per year, and yet data governance problems, which all organizations suffer from, are worsening.
How do Data Scientists spend their time?
• Cleaning & organizing data - 60%
• Collecting data sets - 19%
• Mining data for patterns - 9%
• Refining algorithms - 4%
• Building training sets - 3%
• Other - 5%
Source: CrowdFlower
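The breakdown above can be checked with simple arithmetic. The snippet below uses only the percentages quoted on the slide; grouping "cleaning" and "collecting" together as data preparation is an interpretation added here for illustration, not a category from the CrowdFlower survey.

```python
# Percentages as quoted on the slide (Source: CrowdFlower).
time_share = {
    "Cleaning & organizing data": 60,
    "Collecting data sets": 19,
    "Mining data for patterns": 9,
    "Refining algorithms": 4,
    "Building training sets": 3,
    "Other": 5,
}

# Sanity check: the categories should account for all of the time.
total = sum(time_share.values())

# Assumed grouping: treat cleaning and collecting as "data preparation".
data_prep = time_share["Cleaning & organizing data"] + time_share["Collecting data sets"]

print(f"Total: {total}%")                  # Total: 100%
print(f"Data preparation: {data_prep}%")   # Data preparation: 79%
```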
A Bit more on Data Engineering
Data Ingestion (Kafka, Navigator, Search)
Cloudera enables users to build real-time, end-to-end data pipelines to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without gaps in security.
Data Processing (Spark, Hive)
Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.
Data Engineering/Science/Analyst Tools
[Figure: three bar charts of Cloudera Certified Partners, 2015 vs. 2016, by category: Data Engineering; Data Science/Analytics; Data Analyst / BI]
Flexible deployments: Cloud enabled
Cloudera Director
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
• Consumption based billing and metering
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Flexible Deployments
• No cloud vendor lock-in: open plugin framework for IaaS platforms
• Scaling of provisioned clusters
• Spot instance provisioning
Cortana Intelligence Suite on Azure cloud platform
[Diagram: Data, Intelligence, Action]
• Data Sources: Apps, Sensors and devices
• Information Management: Data Factory, Data Catalog, Event Hubs
• Big Data Stores: Data Lake Store, SQL Data Warehouse
• Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards & Visualizations: Power BI
• Action: People, Automated Systems, Apps, Web, Mobile, Bots
• Careful checking and cleaning of data
• Leverage the power of the cloud
• More Data = More results!
• Create a data driven culture & DS processes
• Use the right tool for the job
Resources
• Microsoft’s “Team Data Science Process” GitHub: http://aka.ms/tdsp
• Productive utilities repository: https://github.com/Azure/Azure-TDSP-Utilities
• Sign up for a free VSTS account: http://www.visualstudio.com
• Complete Cloudera resource library: https://www.cloudera.com/resources.html
• Coursera Data Science: http://www.coursera.org

