The Hardest Thing in Data Science - Caffeinated Data Science

The document discusses that asking the right question is the hardest part of data science. It is harder than learning complex math or algorithms. Good questions are specific, avoid broad generalizations, consider the audience's impatience, and acknowledge uncertainty in conclusions from data. Narrowing questions and getting preliminary answers quickly helps manage expectations.

Uploaded by

Francisco Araújo


Caffeinated Data Science

Data + Algorithm + Interpretation = Meaning

The Hardest Thing In Data Science

DECEMBER 30, 2015 / AUGUST 1, 2017 ~ BUCKWOODY

When I started down the path of learning Data Science, I was
nervous. I have to work hard at math – it’s a skill I love but one
that does not come naturally to me
(http://www.theatlantic.com/education/archive/2015/12/math-class-performing/421710/).
I was nervous because I thought the
most daunting task I would face in Data Science was learning all
the algebra, statistics, and other maths
(https://buckwoody.wordpress.com/2015/11/18/learning-statistics/)
I would need to do the job.

But I was wrong.

Math isn’t the hardest thing in Data Science. Actually, since it’s so
mature, and documented, and well-known
(https://www.khanacademy.org/math), it’s quite possibly the
easiest thing to conquer in the skillset. No, the hardest thing about
Data Science is asking the right question.

Wait – what? Surely that’s an easy thing to do – you have something you want to know, and you just
ask that, right? Well, no. Many an aspiring Data Scientist is dashed on the rocks of the following
process:

1. Listen to question
2. Select technology to answer question
3. Find data
4. Use technology over data

But that’s wrong, too. As a Data Scientist, you need to spend time – real time – on that first item (I’ll
cover the proper process for Data Science in another post). So what is so hard about asking a
question?

Nothing is knowable

There is no certainty in any data
(https://courses.washington.edu/phys431/uncertainty_notes.pdf). In
fact, there’s not even any certainty in reality
(https://www.youtube.com/watch?v=xV0Crm6xFsU), but that’s another
matter. But most people don’t realize that. Sure, you can count the
number of customers, or days, or widgets (maybe) to a degree of
accuracy, but making predictions or classifications from those numbers
is simply precision guesswork.

The farther out you try to predict, or the less data you have, the worse the prediction or
classification. Your audience won’t believe that. They are bathed in
news channels with simple graphics, three-line statements from politicians, and deceptive, trigger-based
marketing. They want something that is exact, sure, and confident.

What you’ll need to work on here involves two things – one for your methodology and one for your
audience. For your methodology, your focus is on reducing the margin of error. To do that you need
good quality data, data you fully understand, and lots of it.
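
To put a number on "lots of it", here is a minimal sketch of how the margin of error of an estimated average shrinks as the sample grows. It assumes a simple random sample and the normal approximation, neither of which comes from this post; the function name `margin_of_error` and the figures are made up for illustration.

```python
import math


def margin_of_error(std_dev: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample mean.

    Assumes a simple random sample and the normal approximation;
    z = 1.96 is the usual two-sided 95% critical value.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * std_dev / math.sqrt(n)


# Hypothetical spread in daily order counts (illustrative number only).
std_dev = 20.0

for n in (25, 100, 400):
    # Each step uses four times as many observations as the one before.
    print(n, round(margin_of_error(std_dev, n), 2))
```

Under these assumptions, four times the data only halves the margin of error, which is one reason the uncertainty never quite goes away.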

To help with the audience problem, use analogies and stories. Explain the possibility you’re wrong
more than the possibility you’re right – something that goes against what you might want to tell the
person paying your bills.

You don’t have all the data

“More data beats better algorithms” is kind of true, to a point. For instance, if I had every possible
data point for an object, I could simply describe it from observation, rather than having to extrapolate
with a numerical analysis. But you will never have all the data, because of the limits on time and on
your ability to gather it.

But you do need more data, and you need better quality data. In Machine Learning, the “features” are
the columns of data that predict the “label”, which is the answer you are looking for. Feature
selection, and data grooming, are the parts of the process you should spend the most time on. Once
you define the right features, you want a lot of them. More is better.
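
As a rough illustration of the features-versus-label split, here is a small sketch in plain Python. The customer records, the column names, and the helper `split_features_label` are all hypothetical; a real project would more likely use a data-frame library.

```python
# Hypothetical customer records; column names and values are made up.
rows = [
    {"visits": 12, "avg_order": 40.0, "returned": True},
    {"visits": 2, "avg_order": 15.5, "returned": False},
    {"visits": 7, "avg_order": 62.0, "returned": True},
]

# The label is the answer we want to predict; every other column is a feature.
LABEL = "returned"


def split_features_label(records, label):
    """Separate each record into its feature columns and its label value."""
    features = [
        {name: value for name, value in record.items() if name != label}
        for record in records
    ]
    labels = [record[label] for record in records]
    return features, labels


X, y = split_features_label(rows, LABEL)
print(X[0])  # feature columns of the first record
print(y)     # label values, one per record
```

Choosing which column is the label is really just another way of stating the question you are trying to answer.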

Your question is way too broad

“Why is my system slow?” or “What is our customer base
like?” are questions that are just far too open. Data Science is
more accurate when you start by answering the audience’s question with
“When you say that, tell me what you really want to do with
the answer.” A better question would be “Among our best
customers, what are the social causes they care about the most,
so that we can advertise to them at those locations?” Or “When
our systems slow down, is that due to human or systemic shortcomings?” and so on. Then don’t stop.
Ask why they want to know that. “Because we have a limited set of funds for advertising” is a really
good thing to know – the question might then change to “Where should we spend our limited
resources for advertising for the highest return?” – or even better, “Are we spending enough on
advertising?” See how the question changes when you push back? Push back.

Your audience is impatient

A Data Scientist spends an inordinate amount of time setting and
re-setting expectations. Sure, systems like Azure ML and HDInsight
make getting answers from data faster than ever – but that’s just the
processing part. Question definition, data sourcing, data grooming,
testing and experimentation, and interaction development (reports
or Cortana) take time, and in today’s smart-phone app world,
people just won’t wait
(http://www.criticalthinking.org/pages/defining-critical-thinking/766).

But some things are complicated because they’re, well, complicated.
They take time. But your audience won’t wait… so what do you do?

Break the problem down. Get as many smaller answers as you can so that you buy time to develop
more complete answers. Show results quickly, and qualify that there are better answers coming.

So there you have it. Yes, you need to learn the math. You need to know R
(https://buckwoody.wordpress.com/2015/10/07/statistics-working-with-r-and-revolution-analytics-software/),
and Python (https://buckwoody.wordpress.com/2015/11/04/python-for-the-data-scientist/),
and Azure ML (https://mva.microsoft.com/en-us/training-courses/getting-started-with-microsoft-azure-machine-learning-8425),
and the Data Catalog (https://azure.microsoft.com/en-us/documentation/services/data-catalog/),
and more. But the part that is the hardest has little to do
with technology. It’s knowing how to ask a good question.
POSTED IN LEARNING DATA SCIENCE
CAREER DATA SCIENCE

Published by BuckWoody

Buck Woody works on the Microsoft Cloud and AI Team, and uses data and technology to solve
business and science problems. With over 35 years of professional and practical experience in
computer technology, he is also a popular speaker at conferences around the world, the author of over
700 articles and seven books (databases, machine learning, and R), sits on various Data Science Boards
at two US Universities, and specializes in advanced data analysis techniques. He is passionate about
mentoring and growing the next generation of data professionals. Specialties: Data, Data Science,
Databases, Communication, Teaching, Speaking, Writing, Cloud Computing, Security. Clifton's
Strengths: Individualization, Learner, Connectedness, Positivity, Achiever, Ideation.

11 thoughts on “The Hardest Thing In Data Science”

1. Pingback: Backyard Data Science | Backyard Data Science

2. Pingback: The Hardest Thing In Data Science - Carpe Datum - Site Home - MSDN Blogs
3. Pingback: Asking The Right Question – Curated SQL
1. buckwoody
SAYS:
JANUARY 4, 2016 AT 8:24 AM
Awesome. True, the math ain’t easy either as isn’t selecting the right features, the
application of the proper algorithm, the right visualization, all of that. Data Science is NOT
easy at all – my point here was simply that I see this mistake constantly, and people often
misunderstand how epic-ally important getting the right question is. Thanks for reading, and
for commenting!

4. salisbury_matt
SAYS:
JANUARY 5, 2016 AT 10:13 AM
Good stuff.

It reminds me of something Oracle chap Tom Kyte said in a Performance Tuning presentation a
few years back ‘Sometimes you have to tune the question, not the query’

5. Pingback: (SFTW) SQL Server Links 08/01/16 - John Sansom
6. Pingback: Data Wrangling – Regular Expressions | Backyard Data Science
7. Pingback: Data Science: Start at the very Beginning, It’s a very good place to start | Backyard Data
Science
8. Pingback: The Keys to Effective Data Science Projects – The Question | Backyard Data Science
9. Ju Son
SAYS:
NOVEMBER 13, 2017 AT 6:19 AM
Very nice writing. Thanks.

10. Pingback: 4 tips for developing better data algorithms - News
