Complete Tutorial: Learn Data Science with Python from Scratch
Overview
This article is a complete tutorial to learn data science using Python from scratch.
It will also help you to learn basic data analysis methods using Python.
You will also be able to enhance your knowledge of machine learning algorithms.
Introduction
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my
comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn’t take me
long to decide – Python was my appetizer.
I always had an inclination for coding. This was the time to do what I really loved. Code. Turned out, coding
was actually quite easy!
I learned the basics of Python within a week. And, since then, I've not only explored this language in depth, but have also helped many others learn it. Python was originally a general purpose language. But, over the years, with strong community support, this language gained dedicated libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized pieces of information about how to use Python for data analysis, chew on them till we are comfortable, and practice them at our own end.
Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I had learned the basics of Python some time back. It is free, easy to learn and backed by a strong community and a rich ecosystem of libraries, all of which go in favour of learning Python.
Python 2.7 vs 3.x: which version should you learn?
This is one of the most debated topics in Python. You will invariably cross paths with it, specially if you are a beginner. There is no right or wrong choice here; it totally depends on the situation and your need. I will try to give you some pointers to help you make an informed choice.
In favour of Python 2.7:
1. Awesome community support! This is something you’d need in your early days. Python 2 was released
in late 2000 and has been in use for more than 15 years.
2. Plethora of third-party libraries! Though many libraries have added 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with a high reliance on external modules, you might be better off with 2.7.
3. Some of the features of the 3.x versions are backward compatible and can work with the 2.7 version.
In favour of Python 3.x:
1. Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks (the print statement becoming a function and true division for integers are two small but visible examples; see the short sketch after this list) in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
2. It is the future! 2.7 is the last release of the 2.x family, and eventually everyone will have to shift to the 3.x versions. Python 3 has had stable releases for the past 5 years and will continue to be developed.
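For the curious, here is a tiny illustration of those two differences; the Python 2 lines are shown as comments because they only run under a 2.x interpreter:

# Python 2.x behaviour (as comments):
# print "Hello"     ->  print is a statement
# print 7 / 2       ->  3  (integer division)

# Python 3.x behaviour:
print("Hello")       # print is a function
print(7 / 2)         # 3.5 (true division; use 7 // 2 for floor division)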
There is no clear winner but I suppose the bottom line is that you should focus on learning Python as a
language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on
Python 2.x vs 3.x in the near future!
You can download Python directly from its project site and install the individual components and libraries you want.
Alternatively, you can download and install a package which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.
The second method provides a hassle-free installation, and hence I recommend it to beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. This should not matter unless you are doing cutting-edge statistical research.
Once you have installed Python, there are various options for choosing an environment. The three most common options are a terminal or shell, an IDE such as IDLE, and iPython notebooks.
While the right environment depends on your need, I personally prefer iPython Notebooks. They provide a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution).
You can start an iPython notebook by typing "ipython notebook" in your terminal / cmd, depending on the OS you are working on.
You can name an iPython notebook by simply clicking on its name, which is Untitled0 by default.
The interface shows In [*] for inputs and Out[*] for outputs.
You can execute a code cell by pressing "Shift + Enter", or "ALT + Enter" if you want to insert an additional cell after it.
Before we deep dive into problem solving, let's take a step back and understand the basics of Python. As we know, data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loops, while-loops, if-else, etc. Let's take a look at some of these.
The following are some data structures used in Python. You should be familiar with them in order to use them appropriately.
Lists – Lists are one of the most versatile data structures in Python. A list can simply be defined by writing a list of comma-separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable, so individual elements of a list can be changed.
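A minimal sketch of these properties:

squares = [1, 4, 9, 16, 25]      # a list defined with square brackets
mixed = [1, 'two', 3.0]          # items of different types are allowed
squares[0] = 0                   # lists are mutable: change an element in place
squares.append(36)               # and they can grow
print(squares)                   # [0, 4, 9, 16, 25, 36]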
Tuples – A tuple is defined by a number of comma-separated values, typically written inside parentheses. Since tuples are immutable and cannot change, they are faster to process than lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
Dictionary – A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
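A short sketch of tuples and dictionaries in action:

point = (2, 3)                   # a tuple: comma-separated values, immutable
# point[0] = 5                   # would raise a TypeError, tuples cannot be changed

ages = {'Tom': 31, 'Ana': 28}    # a dictionary of key: value pairs
ages['Joe'] = 45                 # add a new key: value pair
print(ages['Ana'])               # look up a value by its key -> 28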
Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop which is the most widely used method for iteration. It
has a simple syntax:
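A minimal runnable sketch of that syntax, looping over a small list:

for item in [2, 4, 6]:           # the iterable can be a list, tuple, string, etc.
    print(item)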
Here “Python Iterable” can be a list, tuple or other advanced data structures which we will explore in later
sections. Let’s take a look at a simple example, determining the factorial of a number.
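One possible version of that example, computing the factorial of N with a for-loop:

N = 5
fact = 1
for i in range(1, N + 1):        # multiply fact by 1, 2, ..., N
    fact = fact * i
print(fact)                      # 120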
Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:
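A minimal runnable sketch, checking whether a number is even or odd:

N = 7
if N % 2 == 0:
    print('Even')
else:
    print('Odd')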
Now that you are familiar with Python fundamentals, let’s take a step further. What if you have to perform
the following tasks:
1. Multiply 2 matrices
2. Find the root of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web-pages
If you try to write code for these from scratch, it's going to be a nightmare and you won't stay on Python for more than two days! But let's not worry about that. Thankfully, there are many libraries with predefined methods which we can directly import into our code to make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:
math.factorial(N)
Of course, we need to import the math library for that. Let's explore the various libraries next.
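For instance, a two-line version of the factorial calculation:

import math
print(math.factorial(5))         # 120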
Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries.
The first step is obviously to learn to import them into our environment. There are several ways of doing so
in Python:
import math as m
from math import *
In the first manner, we have defined an alias m for the math library. We can now use various functions from the math library (e.g. factorial) by referencing them with the alias, as in m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends the first style of importing libraries, because then you know where the functions have come from.
Following is a list of libraries you will need for any scientific computations and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
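As a small illustration of the n-dimensional array and the linear algebra routines:

import numpy as np

a = np.array([[1, 2], [3, 4]])   # a 2-dimensional ndarray
b = np.array([[5, 6], [7, 8]])
print(a.dot(b))                  # matrix multiplication
print(np.linalg.inv(a))          # matrix inverse from the linear algebra module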
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like the discrete Fourier transform, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat maps. You can use the Pylab feature in iPython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, pylab converts the iPython environment into an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of
efficient tools for machine learning and statistical modeling including classification, regression,
clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore
data, estimate statistical models, and perform statistical tests. An extensive list of descriptive
statistics, statistical tests, plotting functions, and result statistics are available for different types of
data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative
statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central
part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It
empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the
capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can
be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache
Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective
visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the
capability to start at a website home url and then dig through web-pages within the website to gather
information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to
calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability
of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2, but is much easier to code. You will find subtle differences from urllib2, but for beginners, Requests might be more convenient.
Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases: data exploration, data munging and predictive modeling.
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!)
– Pandas
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas – Series and DataFrames.
Introduction to Series and Dataframes
Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements
of this series through these labels.
A dataframe is similar to an Excel workbook – you have column names referring to columns, and you have rows which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. Data sets are first read into these dataframes, and then various operations (e.g. group by, aggregation, etc.) can be applied very easily to their columns.
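A minimal illustration with toy data (not the loan dataset yet):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])    # a labelled 1-D array
print(s['b'])                                         # access an element by its label -> 20

toy = pd.DataFrame({'City': ['Delhi', 'Mumbai'], 'Value': [1, 2]})
print(toy)                                            # column names form the column index, 0..n the row index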
You can download the dataset from here. Here is the description of the variables:
VARIABLE DESCRIPTIONS:
Loan_ID – Unique Loan ID
Gender – Male / Female
Married – Applicant married (Y/N)
Dependents – Number of dependents
Education – Applicant Education (Graduate / Under Graduate)
Self_Employed – Self employed (Y/N)
ApplicantIncome – Applicant income
CoapplicantIncome – Coapplicant income
LoanAmount – Loan amount in thousands
Loan_Amount_Term – Term of loan in months
Credit_History – Credit history meets guidelines
Property_Area – Urban / Semi Urban / Rural
Loan_Status – Loan approved (Y/N)
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:
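Based on the Matplotlib note above, the command is along the lines of:

ipython notebook --pylab=inline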
This opens up iPython notebook in pylab environment, which has a few useful libraries already imported.
Also, you will be able to plot your data inline, which makes this a really good environment for interactive
data analysis. You can check whether the environment has loaded correctly by typing the following command (it should render a simple line plot of the values 0 to 4):
plot(arange(5))
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is how the code looks at this stage:
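A minimal version of that step (assuming the imports above have been run, and using the file path mentioned earlier):

df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")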
Once you have read the dataset, you can have a look at the top few rows by using the function head():
df.head(10)
This should print 10 rows. Alternatively, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function:
df.describe()
The describe() function provides the count, mean, standard deviation (std), min, quartiles and max in its output. (Read this article to refresh basic statistics and understand population distributions.)
Looking at the output of describe(), you can already draw a few inferences, for example about missing values (columns whose count is lower than the number of rows) and possible extreme values (a large gap between the 75th percentile and the maximum).
For the non-numerical values (e.g. Property_Area, Credit_History, etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed with the following command:
df['Property_Area'].value_counts()
Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the "10 Minutes to Pandas" resource shared above.
Distribution analysis
Now that we are familiar with basic data characteristics, let us study distribution of various variables. Let
us start with numeric variables – namely ApplicantIncome and LoanAmount
Let's start by plotting the histogram of ApplicantIncome using the following command:
df['ApplicantIncome'].hist(bins=50)
Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
df.boxplot(column='ApplicantIncome')
This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income
disparity in the society. Part of this can be driven by the fact that we are looking at people with different
education levels. Let us segregate them by Education:
df.boxplot(column='ApplicantIncome', by = 'Education')
We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there is a higher number of graduates with very high incomes, which appear to be the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following commands:
df['LoanAmount'].hist(bins=50)
df.boxplot(column='LoanAmount')
Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values which demand deeper understanding. We will take this up in the coming sections.
Now that we understand the distributions of ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table.
Note: here loan status has been coded as 1 for Yes and 0 for No, so the mean represents the probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting the hang of the different data manipulation techniques in Pandas.
Using Pandas, we can create a pivot_table similar to the MS Excel one and plot it as a bar chart with the "matplotlib" library, using code along the following lines:
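A sketch of that code; the temp1/temp2 names and figure layout are illustrative rather than the exact original:

# Frequency of each Credit_History value, and probability of loan approval per value
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print(temp1)
print(temp2)

fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
temp1.plot(kind='bar', ax=ax1)
ax2 = fig.add_subplot(122)
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
temp2.plot(kind='bar', ax=ax2)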
This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You
can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternatively, these two plots can also be visualized by combining them in a stacked chart:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
You can also add gender into the mix (similar to the pivot table in Excel):
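For example (temp4 is an illustrative name):

temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp4.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)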
If you have not realized it already, we have just created two basic classification algorithms here, one based on credit history and the other on two categorical variables (including gender). You can quickly code these to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given the amount of help the library can provide in analyzing datasets.
Next, let's explore the ApplicantIncome and Loan_Status variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.
For those who have been following along, here are the issues we must address before we can start running:
1. There are missing values in some variables. We should estimate those values wisely depending on the
amount of missing values and the expected importance of variables.
2. While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.
In addition to these problems with the numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some useful
techniques of data manipulation.
Let us look at missing values in all the variables, because most models don't work with missing data, and even if they do, imputing the missing values helps more often than not. So, let us check the number of nulls / NaNs in the dataset:
df.apply(lambda x: sum(x.isnull()),axis=0)
This command should tell us the number of missing values in each column, as isnull() returns True for nulls and the sum counts each True as 1.
Though the missing values are not very high in number, many variables have them, and each one should be estimated and filled in. Get a detailed view of different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense, or would you consider it missing? I suppose your answer is "missing", and you're right. So we should also check for values which are impractical.
How to fill missing values in LoanAmount?
There are numerous ways to fill the missing values of loan amount, the simplest being replacement by the mean, which can be done with the following code:
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
The other extreme could be to build a supervised learning model to predict the loan amount on the basis of the other variables.
Since the purpose now is to bring out the steps in data munging, I'll rather take an approach which lies somewhere in between these two extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.
Thus we see some variation in the median loan amount for each group, and this can be used to impute the missing values. But first, we have to ensure that the Self_Employed and Education variables do not have missing values themselves.
As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
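The command to produce it is simply:

df['Self_Employed'].value_counts()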
Since ~86% values are “No”, it is safe to impute the missing values as “No” as there is a high probability of
success. This can be done using the following code:
df['Self_Employed'].fillna('No',inplace=True)
Now, we will create a pivot table, which provides us the median values for all the groups of unique values of the Self_Employed and Education features. Next, we define a function which returns the values of these cells, and apply it to fill the missing values of loan amount:
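A sketch of these two steps (the table and fage names are illustrative; numpy is assumed to be imported as np):

# Median LoanAmount for every Self_Employed x Education group
table = df.pivot_table(values='LoanAmount', index='Self_Employed',
                       columns='Education', aggfunc=np.median)

# Return the group median for the row passed in
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]

# Fill only the rows where LoanAmount is missing
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)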
This should provide you a good way to impute missing values of loan amount.
NOTE : This method will work only if you have not filled the missing values in Loan_Amount variable using
the previous approach, i.e. using mean.
Let's analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high-value loans due to specific needs. So instead of treating them as outliers, let's try a log transformation to nullify their effect:
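A minimal version of that transformation (the LoanAmount_log name matches its use later in the post):

df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)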
Now the distribution looks much closer to normal, and the effect of the extreme values has subsided significantly.
Coming to ApplicantIncome, one intuition can be that some applicants have a lower income but strong support from co-applicants. So it might be a good idea to combine both incomes as a total income and take a log transformation of that.
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term and Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense, as it gives an idea of how well the applicant is placed to pay back the loan.
Now that we have made the data useful for modeling, let's look at the Python code to create a predictive model on our data set. Scikit-Learn (sklearn) is the most commonly used library in Python for this purpose, and we will follow its trail. I encourage you to get a refresher on sklearn through this article.
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. Before that, we will fill the remaining missing values in the dataset. This can be done using the following code:
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0], inplace=True)
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)
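The encoding step itself can be done with LabelEncoder; the column list below is an assumption based on the variable descriptions above:

from sklearn.preprocessing import LabelEncoder

var_mod = ['Gender', 'Married', 'Dependents', 'Education',
           'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for col in var_mod:
    # astype(str) also turns any remaining NaNs into an ordinary category
    df[col] = le.fit_transform(df[col].astype(str))
df.dtypes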
# Import models from the scikit-learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   # For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
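The calls below assume a generic classification_model() helper that fits the model with model.fit(data[predictors], data[outcome]), reports training accuracy and a K-fold cross-validation score. A sketch of such a helper (an approximation, not necessarily the author's exact code) could be:

def classification_model(model, data, predictors, outcome):
    # Fit the model and report accuracy on the training data
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # 5-fold cross-validation
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train, test in kf:
        model.fit(data[predictors].iloc[train, :], data[outcome].iloc[train])
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Fit the model again so it can be referred to outside the function
    model.fit(data[predictors], data[outcome])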
Logistic Regression
Let’s make our first Logistic Regression model. One way would be to take all the variables into the model
but this might result in overfitting (don’t worry if you’re unaware of this terminology yet). In simple words,
taking all variables might result in the model understanding complex relations specific to the data and will
not generalize well. Read more about Logistic Regression.
We can easily make some intuitive hypotheses to set the ball rolling. For example, the chances of getting a loan should be higher for applicants with a valid credit history (remember the insight from the pivot table above), so let's start with Credit_History as our first predictor.
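The variable definitions that the call below relies on would be along these lines:

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']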
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 80.945% Cross-Validation Score : 80.946%
Next, let's include a few more categorical variables:
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df, predictor_var, outcome_var)
Generally, we expect the accuracy to increase on adding variables, but this is a more challenging case: the accuracy and cross-validation score are not being impacted by the less important variables. Credit_History is dominating the model. We have two options now:
1. Feature Engineering: derive new information and try to predict with that. I will leave this to your creativity.
2. Better modeling techniques. Let's explore this next.
Decision Tree
Decision trees are another method for making a predictive model, and they are known to provide higher accuracy than a logistic regression model in many cases. Read more about Decision Trees.
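As a first attempt with categorical variables only (the exact predictor list here is an assumption):

model = DecisionTreeClassifier()
predictor_var = ['Credit_History', 'Gender', 'Married', 'Education']
classification_model(model, df, predictor_var, outcome_var)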
Here the model based on categorical variables is unable to have an impact because Credit History is
dominating over them. Let’s try a few numerical variables:
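The exact variables used originally are not shown, but a plausible numerical set from the munging steps above would be:

predictor_var = ['Credit_History', 'Loan_Amount_Term', 'LoanAmount_log']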
classification_model(model, df, predictor_var, outcome_var)
Here we observe that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let's try an even more sophisticated algorithm and see if it helps:
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about Random
Forest.
An advantage with Random Forest is that we can make it work with all the features and it returns a feature
importance matrix which can be used to select features.
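A sketch of such a model; the choice of predictors (every column except the identifier and the target) is an assumption:

model = RandomForestClassifier(n_estimators=100)
predictor_var = [col for col in df.columns if col not in ('Loan_ID', outcome_var)]
classification_model(model, df, predictor_var, outcome_var)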
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:
1. Reducing the number of predictors
2. Tuning the model parameters
Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important
features.
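For instance (featimp is an illustrative name):

featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)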
Let's use the top 5 variables for creating a model. Also, we will modify the parameters of the random forest model a little bit:
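A possible setup; the parameter values here are illustrative, not tuned results:

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = list(featimp.index[:5])    # the top 5 features from the importance series above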
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 82.899% Cross-Validation Score : 81.461%
Notice that although the accuracy reduced, the cross-validation score improved, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable: different runs will result in slight variations because of randomization, but the output should stay in the ballpark.
You would have noticed that even after some basic parameter tuning on the random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us a very interesting learning: a more sophisticated algorithm and additional features do not, by themselves, guarantee a better result; careful feature engineering and tuning matter just as much.
You can access the dataset and problem statement used in this post at this link: Loan Prediction Challenge
Projects
Now, it's time to take the plunge and actually play with some other real datasets. So are you ready to take
on the challenge? Accelerate your data science journey with the following Practice Problems:
Practice Problem: Food Demand Forecasting Challenge – Predict the demand of meals for a meal delivery company
I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am
sure this not only gave you an idea about basic data analysis methods but it also showed you how to
implement some of the more sophisticated techniques available today.
You should also check out our free Python course and then jump over to learn how to apply it for Data
Science.
Python is really a great tool and is becoming an increasingly popular language among data scientists. The reason is that it is easy to learn and integrates well with databases and tools like Spark and Hadoop. Above all, it offers great computational power and powerful data analytics libraries.
So, learn Python to perform the full life-cycle of any data science project: reading, analyzing, visualizing and finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts /suggestions/feedback
on the post, please feel free to post them through comments below.
Note – The discussions of this article are going on at AV’s Discuss portal. Join
here!
Kunal Jain
Kunal is a post graduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked on various tools like SAS, SPSS, Qlikview, R, Python and Matlab.