
Data Science Essentials in Python PDF


Extracted from:

Python Companion to Data Science


Collect → Organize → Explore → Predict → Value

This PDF file contains pages extracted from Python Companion to Data Science,
published by the Pragmatic Bookshelf. For more information or to purchase a
paperback or PDF copy, please visit http://www.pragprog.com.
Note: This extract contains some colored text (particularly in code listings). This
is available only in online versions of the books. The printed versions are black
and white. Pagination might vary between the online and printed versions; the
content is otherwise identical.
Copyright © 2016 The Pragmatic Programmers, LLC.

All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise,
without the prior consent of the publisher.

The Pragmatic Bookshelf


Dallas, Texas • Raleigh, North Carolina

Dmitry Zinoviev

Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and The Pragmatic
Programmers, LLC was aware of a trademark claim, the designations have been printed in
initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer,
Pragmatic Programming, Pragmatic Bookshelf, PragProg and the linking g device are
trademarks of The Pragmatic Programmers, LLC.
Every precaution was taken in the preparation of this book. However, the publisher assumes
no responsibility for errors or omissions, or for damages that may result from the use of
information (including program listings) contained herein.
Our Pragmatic courses, workshops, and other products can help you and your team create
better software and have more fun. For more information, as well as the latest Pragmatic
titles, please visit us at https://pragprog.com.

For customer support, please contact [email protected].

For international rights, please contact [email protected].


Printed in the United States of America.


ISBN-13: 978-1-68050-184-1
Encoded using the finest acid-free high-entropy binary digits.
Book version: B1.0—May 11, 2016
I must instruct you in a little science by-and-by, to distract your thoughts.
➤ Marie Corelli, British novelist

Preface
This book was inspired by an introductory data science course in Python that
I taught in Summer 2015 to a small group of select undergraduate students
of Suffolk University in Boston. The course was expected to be the first in a
two-course sequence, with an emphasis on obtaining, cleaning, organizing,
and visualizing data, sprinkled with some elements of statistics, machine
learning, and network analysis.

I quickly came to realize that the abundance of systems and Python modules
involved in these operations (databases, natural language processing frameworks,
JSON and HTML parsers, and high-performance numerical data
structures, to name a few) could easily overwhelm not only an undergraduate
student, but also a seasoned professional. In fact, I have to confess that while
working on my own research projects in the fields of data science and network
analysis, I had to spend more time calling the help() function and browsing
scores of online Python discussion boards than I was comfortable with, not
to mention the embarrassing moments in the classroom when the name of
some function or some optional parameter would seem to have been hopelessly
forgotten.

As a part of teaching the course, I compiled a set of cheat sheets on various
topics that turned out to be quite a useful reference. The cheat sheets eventually
evolved into this book. Hopefully, having it on your desk will make you
think more about data science and data analysis than about function names
and optional parameters.

About This Book


The book covers data acquisition, cleaning, storing, retrieval, transformation,
visualization, elements of advanced data analysis (network analysis), statistics,
and machine learning. It is not an introduction to data science or a general
data science reference, although you’ll find a quick overview of how to do data
science in Chapter 1, What Is Data Science. I assume that you
have learned the methods of data science, including statistics, elsewhere. The
subject index at the end of the book refers to the Python implementations of
the key concepts, but in most cases you will already be familiar with the
concepts.

You’ll find a summary of Python data structures; string, file, and Web functions;
regular expressions; and even list comprehensions in Chapter 2, Core
Python for Data Science. This summary is provided to refresh
your knowledge of these topics, not to teach them. There are many excellent
Python texts, and mastery of the language is essential
for a successful data scientist.

The first part of the book looks at working with different types of data,
including processing structured and unstructured text, processing numeric
data with the NumPy and Pandas modules, and network analysis. Three more
chapters address different analysis aspects: working with relational and
non-relational databases, data visualization, and simple predictive analysis.

This book is partly a story and partly a reference. Depending on how you see
it, you can either read it sequentially or jump right to the index, find the
function or concept of concern, and look up relevant explanations and
examples. In the former case, if you are an experienced Python programmer,
you can safely skip Chapter 2, Core Python for Data Science. If
you do not plan to work with external databases (such as MySQL), you can
ignore Chapter 4, Working with Databases, as well. Lastly, Chapter
9, Probability and Statistics, assumes that you have no prior knowledge of
statistics. If you do, you have a good excuse to bypass the first two units and
jump straight to Unit 47, Doing Stats the Python Way.

About the Audience


At this point you may be asking yourself if you want to have this book on
your bookshelf or, if the book is already on your bookshelf, if you want to
read the rest of it.

The book is intended for graduate and undergraduate students, data science
instructors, entry-level data science professionals—especially those converting
from R to Python—as well as seasoned developers who want a reference to
help them remember all of the Python functions and options.

Is that you? If so, abandon all hesitation and enter.


About the Software


Despite some controversy surrounding the transition from Python 2.7 to
Python 3.3 and above, I firmly stand behind the newer Python dialect. Most
new Python software is developed for 3.3, and most of the legacy software
has been successfully ported to 3.3, too. Considering the trend, it would be
unwise to choose an outdated dialect, no matter how popular it may seem at
the time.

All Python examples in this book are known to work for the modules mentioned
in the following table. With the exception of the community module, which must
be installed separately,[1] all of these modules and the Python interpreter itself
are included in the Anaconda distribution, which is provided by Continuum
Analytics and is available for free.[2]

Package          Used Version    Package      Used Version
---------------  ------------    -----------  ------------
BeautifulSoup4   4.3.2           Community    0.3
JSON             2.0.9           Html5lib     0.999
MatPlotLib       1.4.3           NetworkX     1.10.0
NLTK             3.1.0           NumPy        1.10.1
Pandas           0.17.0          PyMongo      3.0.2
PyMySQL          0.6.2           Python       3.4.3
SciKit-learn     0.16.1          SciPy        0.16.0

Table 1—Software components used in the book

If you plan to experiment (or actually work) with databases, you will also need
to download and install MySQL[3] and MongoDB.[4] Both databases are free and
known to work on Linux, Mac OS, and Windows platforms.
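
To compare your own installation against the versions listed in the table, a short check along the following lines may help. It is only a sketch: the helper name installed_version is hypothetical, the import names (for example, bs4 for BeautifulSoup4 and sklearn for SciKit-learn) are assumptions about how the packages are imported, and not every module is guaranteed to expose a __version__ attribute.

```python
import importlib


def installed_version(module_name):
    """Return the module's version string, "unknown" if the module does not
    expose one, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Many, but not all, packages publish their version as __version__
    return getattr(module, "__version__", "unknown")


# Import names are assumptions; adjust the list to the modules you use
for name in ["bs4", "html5lib", "matplotlib", "networkx", "nltk",
             "numpy", "pandas", "pymongo", "pymysql", "sklearn", "scipy"]:
    print(name, installed_version(name))
```

A result of None means the module is not installed in the current environment.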

Notes on Quotes
Python allows the user to enclose character strings in 'single', "double",
'''triple''', and even """triple double""" quotes (the latter two can be used for
multiline strings). However, when displaying a string (for example, in the
interactive interpreter), it uses single quote notation, regardless of which
quotes you used in the program, unless the string itself contains a single quote.
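
The quoting rules above can be tried out with a few lines of code. This is a minimal sketch; the variable names are illustrative only, and repr() is used here to obtain the same notation that the interactive interpreter displays.

```python
single = 'data'
double = "data"
triple_single = '''data'''
triple_double = """data"""

# All four literals produce the same string
print(single == double == triple_single == triple_double)  # True

# The displayed form uses single quotes...
print(repr(double))  # 'data'

# ...unless the string itself contains a single quote
print(repr("it's"))  # "it's"

# Triple quotes allow a string to span several lines
multiline = """first line
second line"""
print(repr(multiline))  # 'first line\nsecond line'
```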

Many other languages (C, C++, Java) use single and double quotes differently:
single for individual characters, double for character strings. To pay tribute
to this differentiation, in this book I, too, use single quotes for single characters
and double quotes for character strings.

[1] pypi.python.org/pypi/python-louvain/0.3
[2] www.continuum.io
[3] www.mysql.com
[4] www.mongodb.com

The Book Forum


The community forum for this book can be found online at the Pragmatic
Programmers web page for this book.[5] There you can ask questions, post
comments, and submit errata.

Another great resource for questions and answers (not specific to this book)
is the newly created Data Science Stack Exchange forum.[6]

Your Turn
At the end of each chapter there is a unit called “Your Turn.” This unit has
descriptions of several projects that you may want to accomplish on your own
(or with someone you trust) to strengthen your understanding of the material.

The projects marked with a single star* are the simplest. All you need to work
on them is solid knowledge of the functions mentioned in the preceding
chapters. Expect to complete single-star projects in no more than 30 minutes.
You’ll find solutions to them in Appendix 2, Solutions to Single-Star Projects.

The projects marked with two stars** are hard(er). They may take you an hour
or more, depending on your programming skills and habits. Two-star projects
involve the use of intermediate data structures and well-thought-out algorithms.

Finally, the three-star*** projects are the hardest. Some of the three-star
projects may not even have a perfect solution, so don’t despair if you
cannot find one! Just by working on these projects, you will certainly make
yourself a better programmer and a better data scientist. And if you are an
educator, think of the three-star projects as potential mid-semester assignments.

Now, let’s get started!

Dmitry Zinoviev
[email protected]
May 2016

[5] pragprog.com/book/dzpyds
[6] datascience.stackexchange.com
