The Hardest Thing in Data Science - Caffeinated Data Science
Math isn’t the hardest thing in Data Science. Actually, since it’s so
mature, and documented, and well-known
(https://www.khanacademy.org/math), it’s quite possibly the
easiest thing to conquer in the skillset. No, the hardest thing about
Data Science is asking the right question.
Wait – what? Surely that’s an easy thing to do – you have something you want to know, and you just
ask that, right? Well, no. Many an aspiring Data Scientist is dashed on the rocks of the following
process:
1. Listen to question
2. Select technology to answer question
3. Find data
4. Use technology over data
But that’s wrong, too. As a Data Scientist, you need to spend time – real time – on that first item (I’ll
cover the proper process for Data Science in another post). So what is so hard about asking a
question?
Nothing is knowable
The farther out you try to predict, or the less data you have, the worse the prediction or
classification. Your audience won’t believe that. They are bathed in
news channels with simple graphics, three-line statements from politicians, and deceptive, trigger-
based marketing. They want something that is exact, sure, and confident.
What you’ll need to work on here involves two things – one for your methodology and one for your
audience. For your methodology, your focus is on reducing the margin of error. To do that you need
good quality data, data you fully understand, and lots of it.
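As a rough illustration of why "lots of it" matters, here is a minimal sketch of how the margin of error for an average shrinks as the sample grows. It assumes a simple random sample and a normal approximation, neither of which comes from this post; the function name `margin_of_error` and the numbers are made up for the example.

```python
import math

def margin_of_error(sample_std, n, z=1.96):
    """Approximate margin of error for a sample mean.

    Assumes a simple random sample and a normal approximation;
    z=1.96 is the usual two-sided 95% critical value.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * sample_std / math.sqrt(n)

# Hypothetical spread of 10 units: quadrupling the data halves the margin.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(10.0, n), 2))
```

The margin shrinks only with the square root of the sample size, which is one reason more data helps but never gets you all the way to certainty.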
To help with the audience problem, use analogies and stories. Explain the possibility you’re wrong
more than the possibility you’re right – something that goes against what you might want to tell the
person paying your bills.
“More data beats better algorithms” is kind of true, to a point. For instance, if I had every possible
data point for an object, I could simply describe it directly, rather than having to extrapolate
with a numerical analysis. But you will never have all the data, because you are limited by time and
by your ability to gather it.
But you do need more data, and you need better quality data. In Machine Learning, the “features” are
the columns of data that predict the “label”, which is the answer you are looking for. Feature
selection and data grooming are the parts of the process you should spend the most time on. Once
you define the right features, you want a lot of them. More is better.
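To make the vocabulary concrete, here is a small sketch of separating a table into features and a label. The column names (`age`, `visits`, `churned`) and the rows are invented for illustration and do not come from any real dataset.

```python
# Hypothetical rows: each dict is one observation.
rows = [
    {"age": 34, "visits": 12, "churned": 0},
    {"age": 51, "visits": 2, "churned": 1},
    {"age": 27, "visits": 7, "churned": 0},
]

LABEL = "churned"  # the answer we want to predict (assumed column name)

def split_features_label(rows, label):
    """Return (features, labels): the feature dicts without the label
    column, and the label values in the same row order."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels

X, y = split_features_label(rows, LABEL)
print(X[0])  # {'age': 34, 'visits': 12}
print(y)     # [0, 1, 0]
```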
Break the problem down. Get as many smaller answers as you can so that you buy time to develop
more complete answers. Show results quickly, and qualify that there are better answers coming.
So there you have it. Yes, you need to learn the math. You need to know R
(https://buckwoody.wordpress.com/2015/10/07/statistics-working-with-r-and-revolution-analytics-software/),
and Python (https://buckwoody.wordpress.com/2015/11/04/python-for-the-data-scientist/),
and Azure ML (https://mva.microsoft.com/en-us/training-courses/getting-started-with-microsoft-azure-machine-learning-8425),
and the Data Catalog (https://azure.microsoft.com/en-us/documentation/services/data-catalog/),
and more. But the part that is the hardest has little to do
with technology. It’s knowing how to ask a good question.
Published by BuckWoody
Buck Woody works on the Microsoft Cloud and AI Team, and uses data and technology to solve
business and science problems. With over 35 years of professional and practical experience in
computer technology, he is also a popular speaker at conferences around the world and the author of
over 700 articles and seven books (databases, machine learning, and R). He sits on various Data
Science Boards at two US Universities, and specializes in advanced data analysis techniques. He is
passionate about mentoring and growing the next generation of data professionals. Specialties: Data,
Data Science, Databases, Communication, Teaching, Speaking, Writing, Cloud Computing, Security.
Clifton's Strengths: Individualization, Learner, Connectedness, Positivity, Achiever, Ideation.
4. salisbury_matt
SAYS:
JANUARY 5, 2016 AT 10:13 AM
Good stuff.
It reminds me of something Oracle chap Tom Kyte said in a Performance Tuning presentation a
few years back: ‘Sometimes you have to tune the question, not the query.’
9. Ju Son
SAYS:
NOVEMBER 13, 2017 AT 6:19 AM
Very nice writing. Thanks.