Data Science Essentials in Python PDF
This PDF file contains pages extracted from Python Companion to Data Science,
published by the Pragmatic Bookshelf. For more information or to purchase a
paperback or PDF copy, please visit http://www.pragprog.com.
Note: This extract contains some colored text (particularly in code listings). This
is available only in online versions of the books. The printed versions are black
and white. Pagination might vary between the online and printed versions; the
content is otherwise identical.
Copyright © 2016 The Pragmatic Programmers, LLC.
Dmitry Zinoviev
Preface
This book was inspired by an introductory data science course in Python that
I taught in Summer 2015 to a small group of select undergraduate students
of Suffolk University in Boston. The course was expected to be the first in a
two-course sequence, with an emphasis on obtaining, cleaning, organizing,
and visualizing data, sprinkled with some elements of statistics, machine
learning, and network analysis.
I quickly came to realize that the abundance of systems and Python modules
involved in these operations (databases, natural language processing frameworks,
JSON and HTML parsers, and high-performance numerical data
structures, to name a few) could easily overwhelm not only an undergraduate
student, but also a seasoned professional. In fact, I have to confess that while
working on my own research projects in the fields of data science and network
analysis, I had to spend more time calling the help() function and browsing
scores of online Python discussion boards than I was comfortable with, not
to mention the embarrassing moments in the classroom when the name of
some function or some optional parameter would seem to have been hopelessly
forgotten.
have learned the methods of data science, including statistics, elsewhere. The
subject index at the end of the book refers to the Python implementations of
the key concepts, but in most cases you will already be familiar with the
concepts.
You’ll find a summary of Python data structures; string, file, and Web functions;
regular expressions; and even list comprehensions in Chapter 2, Core
Python for Data Science. This summary is provided to refresh
your knowledge of these topics, not to teach them. There are many excellent
Python texts, and mastery of the language is essential
for a successful data scientist.
The first part of the book looks at working with different types of text data,
including processing structured and unstructured text, processing numeric
data with the NumPy and Pandas modules, and network analysis. Three more
chapters address different analysis aspects: working with relational and
non-relational databases, data visualization, and simple predictive analysis.
This book is partly a story and partly a reference. Depending on how you see
it, you can either read it sequentially or jump right to the index, find the
function or concept of concern, and look up relevant explanations and
examples. In the former case, if you are an experienced Python programmer,
you can safely skip Chapter 2, Core Python for Data Science. If
you do not plan to work with external databases (such as MySQL), you can
ignore Chapter 4, Working with Databases, as well. Lastly, Chapter
9, Probability and Statistics, assumes that you have no background in
statistics. If you do, you have a good excuse to bypass the first two units and
go straight to Unit 47, Doing Stats the Python Way.
The book is intended for graduate and undergraduate students, data science
instructors, entry-level data science professionals—especially those converting
from R to Python—as well as seasoned developers who want a reference to
help them remember all of the Python functions and options.
All Python examples in this book are known to work with the modules mentioned
in the following table. Except for the community module, which must be
installed separately,¹ all of these modules, as well as the Python interpreter
itself, are included in the Anaconda distribution, which is provided by Continuum
Analytics and is available for free.²
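If you want to check which versions of these modules your own installation provides, a short script along the following lines may help. It is only a sketch: it assumes Python 3.8 or later (for the standard importlib.metadata module), the helper name installed_versions is hypothetical, and the distribution names in the list are illustrative choices, not the contents of the book's table.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Illustrative names only; python-louvain is the distribution that
# provides the community module.
for name, version in installed_versions(
    ["numpy", "pandas", "python-louvain"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

A value of None simply means that the distribution could not be found, which is what you would expect for the community module on a fresh Anaconda installation.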
Notes on Quotes
Python allows the user to enclose character strings in 'single', "double",
'''triple''', and even """triple double""" quotes (the latter two can be used for
multiline strings). However, when displaying a string value (as the interactive
interpreter does), it normally uses single-quote notation, regardless of which
quotes you used in the program.
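The following short sketch illustrates these quoting styles. The variable names are made up for the illustration; repr() is used because it produces the same notation that the interactive interpreter shows when it echoes a value, while print() shows no quotes at all.

```python
# All four quoting styles produce ordinary, equal str objects.
single = 'data'
double = "data"
triple_single = '''data'''
triple_double = """data"""
assert single == double == triple_single == triple_double

# Triple-quoted strings may span several lines.
multiline = """first line
second line"""

print(repr(double))     # 'data'
print(double)           # data
# One exception: a string containing an apostrophe is shown in double quotes.
print(repr("it's"))     # "it's"
print(repr(multiline))  # 'first line\nsecond line'
```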
Many other languages (C, C++, Java) use single and double quotes differently:
single for individual characters, double for character strings. To pay tribute
to this differentiation, in this book I, too, use single quotes for single characters
and double quotes for character strings.
1. pypi.python.org/pypi/python-louvain/0.3
2. www.continuum.io
3. www.mysql.com
4. www.mongodb.com
Another great resource for questions and answers (not specific to this book)
is the newly created Data Science Stack Exchange forum.⁶
Your Turn
At the end of each chapter there is a unit called “Your Turn.” This unit has
descriptions of several projects that you may want to accomplish on your own
(or with someone you trust) to strengthen your understanding of the material.
The projects marked with a single star* are the simplest. All you need to work
on them is solid knowledge of the functions mentioned in the preceding
chapters. Expect to complete single-star projects in no more than 30 minutes.
You’ll find solutions to them in Appendix 2, Solutions to Single-Star Projects.
The projects marked with two stars** are hard(er). They may take you an hour
or more, depending on your programming skills and habits. Two-star projects
involve the use of intermediate data structures and well thought-out algorithms.
Finally, the three-star*** projects are the hardest. Some of the three-star
projects may not even have a perfect solution, so don’t despair if you
cannot find one! Just by working on these projects, you will certainly make
yourself a better programmer and a better data scientist. And if you are an
educator, think of the three-star projects as potential mid-semester assignments.
Dmitry Zinoviev
May 2016
5. pragprog.com/book/dzpyds
6. datascience.stackexchange.com