Data Science Essentials in Python PDF

Extracted from:

Python Companion to Data Science

Collect → Organize → Explore → Predict → Value

This PDF file contains pages extracted from Python Companion to Data Science,
published by the Pragmatic Bookshelf. For more information or to purchase a
paperback or PDF copy, please visit http://www.pragprog.com.
Note: This extract contains some colored text (particularly in code listings). This
is available only in online versions of the books. The printed versions are black
and white. Pagination might vary between the online and printed versions; the
content is otherwise identical.
Copyright © 2016 The Pragmatic Programmers, LLC.

All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise,
without the prior consent of the publisher.

The Pragmatic Bookshelf

Dallas, Texas • Raleigh, North Carolina

Dmitry Zinoviev

Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and The Pragmatic
Programmers, LLC was aware of a trademark claim, the designations have been printed in
initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer,
Pragmatic Programming, Pragmatic Bookshelf, PragProg and the linking g device are
trademarks of The Pragmatic Programmers, LLC.
Every precaution was taken in the preparation of this book. However, the publisher assumes
no responsibility for errors or omissions, or for damages that may result from the use of
information (including program listings) contained herein.
Our Pragmatic courses, workshops, and other products can help you and your team create
better software and have more fun. For more information, as well as the latest Pragmatic
titles, please visit us at https://pragprog.com.

For customer support, please contact [email protected].

For international rights, please contact [email protected].

Printed in the United States of America.

ISBN-13: 978-1-68050-184-1
Encoded using the finest acid-free high-entropy binary digits.
Book version: B1.0—May 11, 2016
I must instruct you in a little science by-and-by, to distract your thoughts.
➤ Marie Corelli, British novelist

Preface
This book was inspired by an introductory data science course in Python that
I taught in Summer 2015 to a small group of select undergraduate students
of Suffolk University in Boston. The course was expected to be the first in a
two-course sequence, with an emphasis on obtaining, cleaning, organizing,
and visualizing data, sprinkled with some elements of statistics, machine
learning, and network analysis.

I quickly came to realize that the abundance of systems and Python modules
involved in these operations (databases, natural language processing frameworks,
JSON and HTML parsers, and high-performance numerical data
structures, to name a few) could easily overwhelm not only an undergraduate
student, but also a seasoned professional. In fact, I have to confess that while
working on my own research projects in the fields of data science and network
analysis, I had to spend more time calling the help() function and browsing
scores of online Python discussion boards than I was comfortable with, not
to mention the embarrassing moments in the classroom when the name of
some function or some optional parameter would seem to have been hopelessly
forgotten.

As a part of teaching the course, I compiled a set of cheat sheets on various
topics that turned out to be quite a useful reference. The cheat sheets eventually
evolved into this book. Hopefully, having it on your desk will make you
think more about data science and data analysis than about function names
and optional parameters.

About This Book

The book covers data acquisition, cleaning, storing, retrieval, transformation,
visualization, elements of advanced data analysis (network analysis), statistics,
and machine learning. It is not an introduction to data science or a general
data science reference, although you’ll find a quick overview of how to do data
science in Chapter 1, What Is Data Science, on page ?. I assume that you
have learned the methods of data science, including statistics, elsewhere. The
subject index at the end of the book refers to the Python implementations of
the key concepts, but in most cases you will already be familiar with the
concepts.

You’ll find a summary of Python data structures; string, file, and Web functions;
regular expressions; and even list comprehension in Chapter 2, Core
Python for Data Science, on page ?. This summary is provided to refresh
your knowledge of these topics, not to teach them. There are a lot of excellent
Python texts, and mastery of the language is essential
for a successful data scientist.

The first part of the book looks at working with different types of text data
including processing structured and unstructured text, processing numeric
data with the NumPy and Pandas modules, and network analysis. Three more
chapters address different analysis aspects: working with relational and
non-relational databases, data visualization, and simple predictive analysis.

This book is partly a story and partly a reference. Depending on how you see
it, you can either read it sequentially or jump right to the index, find the
function or concept of concern, and look up relevant explanations and
examples. In the former case, if you are an experienced Python programmer,
you can safely skip Chapter 2, Core Python for Data Science, on page ?. If
you do not plan to work with external databases (such as MySQL), you can
ignore Chapter 4, Working with Databases, on page ? as well. Lastly, Chapter
9, Probability and Statistics, on page ? assumes that you have no idea about
statistics. If you do, you have a good excuse to bypass the first two units and
find yourself at Unit 47, Doing Stats the Python Way on page ?.

About the Audience

At this point you may be asking yourself if you want to have this book on
your bookshelf or, if the book is already on your bookshelf, if you want to
read the rest of it.

The book is intended for graduate and undergraduate students, data science
instructors, entry-level data science professionals—especially those converting
from R to Python—as well as seasoned developers who want a reference to
help them remember all of the Python functions and options.

Is that you? If so, abandon all hesitation and enter.

About the Software

Despite some controversy surrounding the transition from Python 2.7 to
Python 3.3 and above, I firmly stand behind the newer Python dialect. Most
new Python software is developed for 3.3, and most of the legacy software
has been successfully ported to 3.3, too. Considering the trend, it would be
unwise to choose an outdated dialect, no matter how popular it may seem at
the time.

All Python examples in this book are known to work for the modules mentioned
in the following table. With the exception of the community module, which must
be installed separately,¹ all of these modules, along with the Python interpreter
itself, are included in the Anaconda distribution, which is provided by Continuum
Analytics and is available for free.²

Package         Used Version    Package    Used Version

BeautifulSoup4  4.3.2           Community  0.3
JSON            2.0.9           Html5lib   0.999
MatPlotLib      1.4.3           NetworkX   1.10.0
NLTK            3.1.0           NumPy      1.10.1
Pandas          0.17.0          PyMongo    3.0.2
PyMySQL         0.6.2           Python     3.4.3
SciKit-learn    0.16.1          SciPy      0.16.0
Table 1—Software components used in the book
If you plan to experiment (or actually work) with databases, you will also need
to download and install MySQL³ and MongoDB.⁴ Both databases are free and
known to work on Linux, Mac OS, and Windows platforms.
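
To compare a local environment against Table 1, one option is to query the installed package metadata. The following is only a sketch under stated assumptions: it relies on importlib.metadata, which requires Python 3.8 or later (newer than the 3.4.3 interpreter listed in the table), and the PyPI distribution names in BOOK_VERSIONS are assumptions, since the table gives display names rather than install names. JSON is omitted because it ships with the standard library.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Versions from Table 1, keyed by assumed PyPI distribution names
# (the table itself lists display names such as "SciKit-learn").
BOOK_VERSIONS = {
    "beautifulsoup4": "4.3.2",
    "python-louvain": "0.3",   # provides the community module
    "html5lib": "0.999",
    "matplotlib": "1.4.3",
    "networkx": "1.10.0",
    "nltk": "3.1.0",
    "numpy": "1.10.1",
    "pandas": "0.17.0",
    "pymongo": "3.0.2",
    "PyMySQL": "0.6.2",
    "scikit-learn": "0.16.1",
    "scipy": "0.16.0",
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(BOOK_VERSIONS)
print("Python", sys.version.split()[0])
for name, book_version in BOOK_VERSIONS.items():
    print(f"{name:16} book: {book_version:8} installed: {report[name]}")
```

A value of None in the last column means the package is not installed. A newer installed version is not necessarily a problem, but the book's examples were only checked against the versions in the table.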

Notes on Quotes
Python allows the user to enclose character strings in 'single', "double",
'''triple''', and even """triple double""" quotes (the latter two can be used for
multiline strings). However, when the interpreter displays a string value, it uses
single quote notation by default, regardless of which quotes you used in the
program (it switches to double quotes only when the string itself contains a
single quote and no double quotes).
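
The quoting rules above can be checked with a few lines of Python. This is a minimal illustrative sketch, and the variable names are made up for the example. It uses repr() to obtain the same text that the interactive prompt would show for a string value.

```python
# Four ways to write the same string literal
s1 = 'data'
s2 = "data"
s3 = '''data'''
s4 = """data"""

# The quote style is not part of the value
all_equal = s1 == s2 == s3 == s4

# repr() yields the displayed form, which uses single quotes here
shown = repr(s2)

# Triple-quoted literals may span several lines
multiline = """first line
second line"""
line_count = len(multiline.splitlines())

print(all_equal, shown, line_count)
```

Note that print(s2) itself writes the characters without any quotes. The quotes appear only in the representation of the value.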

Many other languages (C, C++, Java) use single and double quotes differently:
single for individual characters, double for character strings. To pay tribute
to this differentiation, in this book I, too, use single quotes for single characters
and double quotes for character strings.

1. pypi.python.org/pypi/python-louvain/0.3
2. www.continuum.io
3. www.mysql.com
4. www.mongodb.com

The Book Forum

The community forum for this book can be found online at the Pragmatic
Programmers web page for this book.⁵ There you can ask questions, post
comments, and submit errata.

Another great resource for questions and answers (not specific to this book)
is the newly created Data Science Stack Exchange forum.⁶

Your Turn
At the end of each chapter there is a unit called “Your Turn.” This unit has
descriptions of several projects that you may want to accomplish on your own
(or with someone you trust) to strengthen your understanding of the material.

The projects marked with a single star* are the simplest. All you need to work
on them is solid knowledge of the functions mentioned in the preceding
chapters. Expect to complete single-star projects in no more than 30 minutes.
You’ll find solutions to them in Appendix 2, Solutions to Single-Star Projects,
on page ?.

The projects marked with two stars** are harder. They may take you an hour
or more, depending on your programming skills and habits. Two-star projects
involve the use of intermediate data structures and well thought-out
algorithms.

Finally, the three-star*** projects are the hardest. Some of the three-star
projects may not even have a perfect solution, so don’t get desperate if you
cannot find one! Just by working on these projects, you will certainly make
yourself a better programmer and a better data scientist. And if you are an
educator, think of the three-star projects as potential mid-semester
assignments.

Now, let’s get started!

Dmitry Zinoviev
[email protected]
May 2016

5. pragprog.com/book/dzpyds
6. datascience.stackexchange.com
