The Hardest Thing in Data Science - Caffeinated Data Science

This post argues that asking the right question is the hardest part of data science, harder than learning the math or the algorithms. Good questions are specific rather than broad, account for the audience's impatience, and acknowledge the uncertainty in any conclusion drawn from data. Narrowing the question and delivering preliminary answers quickly helps manage expectations.
Copyright © All Rights Reserved

Caffeinated Data Science

Data + Algorithm + Interpretation = Meaning

The Hardest Thing In Data Science

DECEMBER 30, 2015 / AUGUST 1, 2017 ~ BUCKWOODY


When I started down the path of learning Data Science, I was
nervous. I have to work hard at math – it’s a skill I love but one
that does not come naturally to me
(http://www.theatlantic.com/education/archive/2015/12/math-class-performing/421710/).
I was nervous because I thought the
most daunting task I would face in Data Science was learning all
the algebra, statistics, and other maths
(https://buckwoody.wordpress.com/2015/11/18/learning-statistics/)
I would need to do the job.

But I was wrong.

Math isn’t the hardest thing in Data Science. Actually, since it’s so
mature, and documented, and well-known
(https://www.khanacademy.org/math), it’s quite possibly the
easiest thing to conquer in the skillset. No, the hardest thing about
Data Science is asking the right question.

Wait – what? Surely that’s an easy thing to do – you have something you want to know, and you just
ask that, right? Well, no. Many an aspiring Data Scientist is dashed on the rocks of the following
process:

1. Listen to question
2. Select technology to answer question
3. Find data
4. Use technology over data

But that’s wrong, too. As a Data Scientist, you need to spend time – real time – on that first item (I’ll
cover the proper process for Data Science in another post). So what is so hard about asking a
question?

Nothing is knowable

There is no certainty in any data
(https://courses.washington.edu/phys431/uncertainty_notes.pdf). In
fact, there’s not even any certainty in reality
(https://www.youtube.com/watch?v=xV0Crm6xFsU), but that’s another
matter. But most people don’t realize that. Sure, you can count the
number of customers, or days, or widgets (maybe) to a degree of
accuracy, but making predictions or classifications from those numbers
is simply precision guesswork.

The farther out or the less data you have, the worse the prediction or
classification. Your audience won’t believe that. They are bathed in
news channels with simple graphics, three-line statements from politicians, and deceptive, trigger-
based marketing. They want something that is exact, sure, and confident.

What you’ll need to work on here involves two things – one for your methodology and one for your
audience. For your methodology, your focus is on reducing the margin of error. To do that you need
good quality data, data you fully understand, and lots of it.
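To make the “lots of it” point concrete, here is a minimal sketch (not part of the original post) of how the margin of error on a simple average shrinks as the sample grows. It assumes independent observations and the common normal-approximation formula z * s / sqrt(n), with z = 1.96 for roughly 95% confidence. The function name and the order counts are made up for illustration.

```python
import math
import statistics


def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for the mean of `sample`.

    Assumes independent observations and a normal approximation;
    z=1.96 is the usual two-sided 95% critical value.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    s = statistics.stdev(sample)  # sample standard deviation
    return z * s / math.sqrt(n)


# Hypothetical daily order counts (illustrative numbers only).
week = [102, 98, 110, 95, 105, 99, 101]
# Same values repeated: similar spread, four times as many observations.
month = week * 4

moe_week = margin_of_error(week)
moe_month = margin_of_error(month)

print(f"7 observations:  +/- {moe_week:.2f}")
print(f"28 observations: +/- {moe_month:.2f}")
```

With four times the data, the margin comes out at roughly half its earlier size. It shrinks, but it never reaches zero, which is why the uncertainty still has to be explained to the audience.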

To help with the audience problem, use analogies and stories. Explain the possibility you’re wrong
more than the possibility you’re right – something that goes against what you might want to tell the
person paying your bills.

You don’t have all the data

“More data beats better algorithms” is kind of true, to a point. For instance, if I had every possible
data point for an object, I could simply describe it directly, rather than having to extrapolate
with a numerical analysis. But you will never have all the data, because of the limits on your time and on your ability to
gather it.

But you do need more data, and you need better quality data. In Machine Learning, the “features” are
the columns of data that predict the “label”, which is the answer you are looking for. Feature
selection, and data grooming, are the parts of the process you should spend the most time on. Once
you define the right features, you want a lot of them. More is better.
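As a small illustration of the vocabulary above, the following sketch splits a table into feature columns and a label column. The records, the column names, and the `split_features_label` helper are all hypothetical; they come neither from the original post nor from any particular library.

```python
# Hypothetical customer records: every key except "churned" is a feature,
# and "churned" is the label we would like to predict.
records = [
    {"monthly_spend": 42.0, "support_tickets": 1, "years_active": 3, "churned": False},
    {"monthly_spend": 15.5, "support_tickets": 4, "years_active": 1, "churned": True},
    {"monthly_spend": 60.0, "support_tickets": 0, "years_active": 5, "churned": False},
]


def split_features_label(rows, label):
    """Return (feature_names, X, y) for a list of dict rows.

    X holds one list of feature values per row; y holds the label values.
    """
    feature_names = [name for name in rows[0] if name != label]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label] for row in rows]
    return feature_names, X, y


feature_names, X, y = split_features_label(records, "churned")
print(feature_names)
print(X[0], "->", y[0])
```

Most of the grooming effort described above goes into deciding which columns belong in `X` and making sure their values are clean, well before any algorithm sees them.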

Your question is way too broad

“Why is my system slow?” or “What is our customer base like?” are questions that are just far too open.
Data Science is more accurate when you start by answering the audience’s question with one of your own:
“When you say that, tell me what you really want to do with the answer.” A better question would be
“Among our best customers, what are the social causes they care about the most, so that we can
advertise to them at those locations?” or “When our systems slow down, is that due to human or
systemic shortcomings?” and so on. Then don’t stop.
Ask why they want to know that. “Because we have a limited set of funds for advertising” is a really
good thing to know – the question might then change to “Where should we spend our limited
resources for advertising for the highest return?” – or even better, “Are we spending enough on
advertising?” See how the question changes when you push back? Push back.

Your audience is impatient

A Data Scientist spends an inordinate amount of time setting and
re-setting expectations. Sure, systems like Azure ML and HDInsight
make getting answers from data faster than ever – but that’s just the
processing part. Question definition, data sourcing, data grooming,
testing and experimentation, and interaction development (reports
or Cortana) all take time, and in today’s smart-phone app world,
people just won’t wait
(http://www.criticalthinking.org/pages/defining-critical-thinking/766).

But some things are complicated because they’re, well, complicated.
They take time. But your audience won’t wait…so what do you do?

Break the problem down. Get as many smaller answers as you can so that you buy time to develop
more complete answers. Show results quickly, and qualify that there are better answers coming.

So there you have it. Yes, you need to learn the math. You need to know R
(https://buckwoody.wordpress.com/2015/10/07/statistics-working-with-r-and-revolution-analytics-software/),
and Python (https://buckwoody.wordpress.com/2015/11/04/python-for-the-data-scientist/),
and Azure ML (https://mva.microsoft.com/en-us/training-courses/getting-started-with-microsoft-azure-machine-learning-8425),
and the Data Catalog (https://azure.microsoft.com/en-us/documentation/services/data-catalog/),
and more. But the part that is the hardest has little to do
with technology. It’s knowing how to ask a good question.
POSTED IN LEARNING DATA SCIENCE
CAREER DATA SCIENCE

Published by BuckWoody

Buck Woody works on the Microsoft Cloud and AI Team, and uses data and technology to solve
business and science problems. With over 35 years of professional and practical experience in
computer technology, he is also a popular speaker at conferences around the world and the author of over
700 articles and seven books (databases, machine learning, and R). He sits on various Data Science Boards
at two US Universities and specializes in advanced data analysis techniques. He is passionate about
mentoring and growing the next generation of data professionals.
Specialties: Data, Data Science, Databases, Communication, Teaching, Speaking, Writing, Cloud Computing, Security.
Clifton's Strengths: Individualization, Learner, Connectedness, Positivity, Achiever, Ideation.

11 thoughts on “The Hardest Thing In Data Science”

1. buckwoody
SAYS:
JANUARY 4, 2016 AT 8:24 AM
Awesome. True, the math ain’t easy either, and neither is selecting the right features, applying
the proper algorithm, choosing the right visualization, all of that. Data Science is NOT
easy at all – my point here was simply that I see this mistake constantly, and people often
misunderstand how epically important getting the right question is. Thanks for reading, and
for commenting!

4. salisbury_matt
SAYS:
JANUARY 5, 2016 AT 10:13 AM
Good stuff.

It reminds me of something Oracle chap Tom Kyte said in a Performance Tuning presentation a
few years back: ‘Sometimes you have to tune the question, not the query.’

9. Ju Son
SAYS:
NOVEMBER 13, 2017 AT 6:19 AM
Very nice writing. Thanks.

