Pandas PDF
#pandas
Table of Contents
About
Remarks
Versions
Examples
Descriptive statistics
Installation or Setup
Hello World
Examples
What is a factor
Initialization
Analysis
Plot Returns
Calculate Statistics
Examples
Introduction
Examples
Introduction
Examples
Object Creation
Examples
Introduction
Examples
Examples
Remarks
Examples
Changing dtypes
Summarizing dtypes
Examples
Examples
Select duplicated
Drop duplicated
Examples
Remarks
Examples
Integer and NA
Examples
Examples
Aggregating groups
Basic grouping
Grouping numbers
Using transform to get group-level statistics while preserving the original dataframe
Examples
Examples
Examples
Select by position
generate sample DF
show all columns except those beginning with a (in other words remove / drop all columns starting with a)
Boolean indexing
generate random DF
select rows where values in column A > 2 and values in column B < 5
using .query() method with variables for filtering
Examples
Examples
Read JSON
can either pass string of the json, or a filepath to a file with valid json
Chapter 21: Making Pandas Play Nice With Native Python Datatypes
Examples
Moving Data Out of Pandas Into Native Python and Numpy Data Structures
Remarks
Examples
Syntax
Parameters
Examples
Inner join:
Outer join:
Left join:
Right join:
Merging / concatenating / joining multiple data frames (horizontally and vertically)
Merge
Remarks
Examples
style
print statements
Remarks
Examples
Interpolation
Examples
MultiIndex Columns
Remarks
Examples
Reading financial data (for multiple tickers) into pandas panel - demo
Chapter 28: Pandas IO tools (reading and saving data sets)
Remarks
Examples
Reading csv file into a pandas data frame when there is no header row
save our data frame into h5 (HDFStore) file, indexing [int32, int64, string] columns:
Read & merge multiple CSV files (with the same structure) into one DF
File:
Code:
Output:
Examples
Examples
Examples
Examples
Table file with header, footer, row names, and index column:
Examples
Examples
Split (reshape) CSV strings in columns into multiple rows, having one element per row
Parameters
Examples
Save Pandas DataFrame from list to dicts to csv with no index and with data encoding
Examples
Examples
Examples
Examples
Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
Examples
Examples
Subsetting
Credits
About
You can share this PDF with anyone you feel could benefit from it; download the latest version
from: pandas
It is an unofficial and free pandas ebook created for educational purposes. All the content is
extracted from Stack Overflow Documentation, which is written by many hardworking individuals at
Stack Overflow. It is neither affiliated with Stack Overflow nor official pandas.
The content is released under Creative Commons BY-SA, and the list of contributors to each
chapter are provided in the credits section at the end of this book. Images may be copyright of
their respective owners unless otherwise specified. All trademarks and registered trademarks are
the property of their respective company owners.
Use the content presented in this book at your own risk; it is not guaranteed to be correct or
accurate. Please send your feedback and corrections to [email protected]
Chapter 1: Getting started with pandas
Remarks
Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in Python.
Versions

Version  Release Date
0.19.1   2016-11-03
0.19.0   2016-10-02
0.18.1   2016-05-03
0.18.0   2016-03-13
0.17.1   2015-11-21
0.17.0   2015-10-09
0.16.2   2015-06-12
0.16.1   2015-05-11
0.16.0   2015-03-22
0.15.2   2014-12-12
0.15.1   2014-11-09
0.15.0   2014-10-18
0.14.1   2014-07-11
0.14.0   2014-05-31
0.13.1   2014-02-03
0.13.0   2014-01-03
0.12.0   2013-07-23
Examples
Descriptive statistics
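The construction of the sample frame is missing from this extract; it can be rebuilt exactly from
the output below:

In [1]: df = pd.DataFrame({'A': [1, 2, 1, 4, 3, 5, 2, 3, 4, 1],
   ...:                    'B': [12, 14, 11, 16, 18, 18, 22, 13, 21, 17],
   ...:                    'C': ['a', 'a', 'b', 'a', 'b', 'c', 'b', 'a', 'b', 'a']})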
In [2]: df
Out[2]:
A B C
0 1 12 a
1 2 14 a
2 1 11 b
3 4 16 a
4 3 18 b
5 5 18 c
6 2 22 b
7 3 13 a
8 4 21 b
9 1 17 a
In [3]: df.describe()
Out[3]:
A B
count 10.000000 10.000000
mean 2.600000 16.200000
std 1.429841 3.705851
min 1.000000 11.000000
25% 1.250000 13.250000
50% 2.500000 16.500000
75% 3.750000 18.000000
max 5.000000 22.000000
Note that since C is not a numerical column, it is excluded from the output.
In [4]: df['C'].describe()
Out[4]:
count     10
unique     3
top        a
freq       5
Name: C, dtype: object
In this case the method summarizes categorical data by number of observations, number of
unique elements, mode, and frequency of the mode.
Installation or Setup
Detailed instructions on getting pandas set up or installed can be found here in the official
documentation.
Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for
inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make
up the SciPy stack (IPython, NumPy, Matplotlib, ...) is with Anaconda, a cross-platform (Linux,
Mac OS X, Windows) Python distribution for data analytics and scientific computing.
After running a simple installer, the user will have access to pandas and the rest of the SciPy stack
without needing to install anything else, and without needing to wait for any software to be
compiled.
A full list of the packages available as part of the Anaconda distribution can be found here.
An additional advantage of installing with Anaconda is that you don’t require admin rights to install
it, it will install in the user’s home directory, and this also makes it trivial to delete Anaconda at a
later date (just delete that folder).
The previous section outlined how to get pandas installed as part of the Anaconda distribution.
However this approach means you will install well over one hundred packages and involves
downloading the installer which is a few hundred megabytes in size.
If you want to have more control on which packages, or have a limited internet bandwidth, then
installing pandas with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is a package
manager that is both cross-platform and language agnostic (it can play a similar role to a pip and
virtualenv combination).
Miniconda allows you to create a minimal self contained Python installation, and then use the
Conda command to install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do
this for you. The installer can be found here.
The next step is to create a new conda environment (these are analogous to a virtualenv, but they
also allow you to specify precisely which Python version to install). Run the following
commands from a terminal window:
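The command itself was lost in extraction; following the official pandas installation guide it was
presumably:

conda create -n name_of_my_env python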
This will create a minimal environment with only Python installed in it. To put yourself inside
this environment, run:
activate name_of_my_env
The final step required is to install pandas. This can be done with the following command:
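Presumably (the command was lost in extraction):

conda install pandas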
If you require any packages that are available to pip but not conda, simply install pip, and use pip
to install these packages:
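The commands were lost in extraction; following the official pandas guide, presumably:

conda install pip
pip install <package-name>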
pip example:
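The pip example was also lost; presumably installing pandas itself from PyPI:

pip install pandas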
This will likely require the installation of a number of dependencies, including NumPy, and will
require a compiler to compile required bits of code; it can take a few minutes to complete.
Hello World
Once pandas has been installed, you can check if it is working properly by creating a dataset of
randomly distributed values and plotting its histogram.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# build a Series of 100 normally distributed values and plot its histogram
# (the construction and plotting lines were lost in extraction; this reconstruction is an assumption)
s = pd.Series(np.random.randn(100))
s.plot(kind='hist', title='Normally distributed values')
plt.show()

s.describe()
# Output: count 100.000000
# mean 0.059808
# std 1.012960
# min -2.552990
# 25% -0.643857
# 50% 0.094096
# 75% 0.737077
# max 2.269755
# dtype: float64
First download anaconda from the Continuum site. Either via the graphical installer
(Windows/OSX) or running a shell script (OSX/Linux). This includes pandas!
If you don't want the 150 packages conveniently bundled in anaconda, you can install miniconda.
Either via the graphical installer (Windows) or shell script (OSX/Linux).
Chapter 2: Analysis: Bringing it all together and making decisions
Examples
Quintile Analysis: with random data
Quintile analysis is a common framework for evaluating the efficacy of security factors.
What is a factor
A factor is a method for scoring/ranking sets of securities. For a particular point in time and for a
particular set of securities, a factor can be represented as a pandas series where the index is an
array of the security identifiers and the values are the scores or ranks.
If we take factor scores over time, we can, at each point in time, split the set of securities into 5
equal buckets, or quintiles, based on the order of the factor scores. There is nothing particularly
sacred about the number 5. We could have used 3 or 10. But we use 5 often. Finally, we track the
performance of each of the five buckets to determine if there is a meaningful difference in the
returns. We tend to focus more intently on the difference in returns of the bucket with the highest
rank relative to that of the lowest rank.
To facilitate the experimentation with the mechanics, we provide simple code to create random
data to give us an idea how this works.
• Returns: generate random returns for specified number of securities and periods.
• Signals: generate random signals for a specified number of securities and periods, with a
prescribed level of correlation with Returns. In order for a factor to be useful, there must be
some information or correlation between the scores/ranks and subsequent returns. If there
were no correlation, we would see it in the results. A good exercise for the reader would be to
duplicate this analysis with random data generated with 0 correlation.
Initialization
import pandas as pd
import numpy as np
num_securities = 1000
num_periods = 1000
period_frequency = 'W'
start_date = '2000-12-31'
np.random.seed([3,1415])
means = [0, 0]
covariance = [[ 1., 5e-3],
[5e-3, 1.]]
Let's now generate a time series index and an index representing security ids. Then use them to
create dataframes for returns and signals
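The generation code itself was lost in extraction; a sketch consistent with the surrounding text
(the names ids, tidx, security_returns and security_signals are assumptions):

ids = ['s{:05d}'.format(i) for i in range(num_securities)]
tidx = pd.date_range(start=start_date, periods=num_periods, freq=period_frequency)

# correlated draws: m[0] feeds returns, m[1] feeds signals
m = np.random.multivariate_normal(means, covariance,
                                  (num_periods, num_securities)).T
security_returns = pd.DataFrame(m[0].T / 25 + 1e-7, index=tidx, columns=ids)
security_signals = pd.DataFrame(m[1].T, index=tidx, columns=ids)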
I divide m[0] by 25 to scale down to something that looks like stock returns. I also add 1e-7 to give a
modest positive mean return.
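The helper used below was also lost; presumably a small wrapper around pd.qcut that splits each
period's signals into five labelled buckets:

def qcut(s, q=5):
    labels = ['q{}'.format(i) for i in range(1, q + 1)]
    return pd.qcut(s, q, labels=labels)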
cut = security_signals.stack().groupby(level=0).apply(qcut)
returns_cut = security_returns.stack().rename('returns') \
.to_frame().set_index(cut, append=True) \
.swaplevel(2, 1).sort_index().squeeze() \
.groupby(level=[0, 1]).mean().unstack()
Analysis
Plot Returns
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 5))  # figure creation assumed; fig is used by fig.autofmt_xdate() below
ax1 = plt.subplot2grid((1,3), (0,0))
ax2 = plt.subplot2grid((1,3), (0,1))
ax3 = plt.subplot2grid((1,3), (0,2))
# Cumulative Returns
returns_cut.add(1).cumprod() \
.plot(colormap='jet', ax=ax1, title="Cumulative Returns")
leg1 = ax1.legend(loc='upper left', ncol=2, prop={'size': 10}, fancybox=True)
leg1.get_frame().set_alpha(.8)
# Return Distribution
returns_cut.plot.box(vert=False, ax=ax3, title="Return Distribution")
fig.autofmt_xdate()
plt.show()
Calculate and visualize Maximum Draw Down
def max_dd(returns):
"""returns is a series"""
r = returns.add(1).cumprod()
dd = r.div(r.cummax()).sub(1)
mdd = dd.min()
end = dd.argmin()
start = r.loc[:end].argmax()
return mdd, start, end
def max_dd_df(returns):
"""returns is a dataframe"""
series = lambda x: pd.Series(x, ['Draw Down', 'Start', 'End'])
return returns.apply(max_dd).apply(series)
max_dd_df(returns_cut)
Let's plot it
draw_downs = max_dd_df(returns_cut)
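The plotting code was lost in extraction; a sketch that draws each quintile's cumulative return
and shades its draw-down period (figure layout assumed):

fig, axes = plt.subplots(returns_cut.shape[1], 1, figsize=(10, 8))
for i, col in enumerate(returns_cut):
    ax = axes[i]
    returns_cut[col].add(1).cumprod().plot(ax=ax)
    # recompute the draw-down window per column
    mdd, sd, ed = max_dd(returns_cut[col])
    ax.axvspan(sd, ed, alpha=0.1, color='r')
    ax.set_ylabel(col)

fig.suptitle('Maximum Draw Down', fontsize=18)
fig.tight_layout(rect=[0, 0, 1, .95])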
Calculate Statistics
There are many potential statistics we can include. Below are just a few, but demonstrate how
simply we can incorporate new statistics into our summary.
def frequency_of_time_series(df):
start, end = df.index.min(), df.index.max()
delta = end - start
return round((len(df) - 1.) * 365.25 / delta.days, 2)
def annualized_return(df):
freq = frequency_of_time_series(df)
return df.add(1).prod() ** (1 / freq) - 1
def annualized_volatility(df):
freq = frequency_of_time_series(df)
return df.std().mul(freq ** .5)
def sharpe_ratio(df):
return annualized_return(df) / annualized_volatility(df)
def describe(df):
    r = annualized_return(df).rename('Return')
    v = annualized_volatility(df).rename('Volatility')
    s = sharpe_ratio(df).rename('Sharpe')
    skew = df.skew().rename('Skew')
    kurt = df.kurt().rename('Kurtosis')
    desc = df.describe().T
    # the tail of the function was lost in extraction; combining the
    # statistics into one summary table like this is an assumption
    return pd.concat([r, v, s, skew, kurt, desc], axis=1)
We'll end up using just the describe function as it pulls all the others together.
describe(returns_cut)
This is not meant to be comprehensive. It's meant to bring many of pandas' features together and
demonstrate how you can use it to help answer questions important to you. This is a subset of the
types of metrics I use to evaluate the efficacy of quantitative factors.
Chapter 3: Appending to DataFrame
Examples
Appending a new row to DataFrame
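The input cells were lost in extraction; calls that reproduce the outputs below (the exact
originals are assumptions):

In [1]: import pandas as pd
In [2]: df = pd.DataFrame(columns=['A', 'B', 'C'])   # start from an empty frame
In [4]: df.loc[0, 'A'] = 1                           # new row with only A set
In [6]: df.loc[1] = [2, 3, 4]                        # new row from a list of values
In [8]: df.loc[2] = [3, 9, 9]                        # another full row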
In [3]: df
Out[3]:
Empty DataFrame
Columns: [A, B, C]
Index: []
In [5]: df
Out[5]:
A B C
0 1 NaN NaN
In [7]: df
Out[7]:
A B C
0 1 NaN NaN
1 2 3 4
In [9]: df
Out[9]:
A B C
0 1 NaN NaN
1 2 3 4
2 3 9 9
The first input in .loc[] is the index. If you use an existing index, you will overwrite the values in that
row:
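The inputs for the next two outputs were also lost; presumably df.loc[1] = [5, 6, 7] (overwriting
row 1) followed by df.loc[0, 'B'] = 8 (setting a single cell):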
In [18]: df
Out[18]:
A B C
0 1 NaN NaN
1 5 6 7
2 3 9 9
In [20]: df
Out[20]:
A B C
0 1 8 NaN
1 5 6 7
2 3 9 9
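For the next example, two frames are used; their construction cells were lost and are rebuilt here
from the outputs below:

In [5]: df1 = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2']})
In [6]: df2 = pd.DataFrame({'B': ['b1'], 'C': ['c1']})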
In [7]: df1
Out[7]:
A B
0 a1 b1
1 a2 b2
In [8]: df2
Out[8]:
B C
0 b1 c1
The two DataFrames are not required to have the same set of columns. The append method does
not change either of the original DataFrames. Instead, it returns a new DataFrame by appending
the original two. Appending a DataFrame to another one is quite simple:
In [9]: df1.append(df2)
Out[9]:
A B C
0 a1 b1 NaN
1 a2 b2 NaN
0 NaN b1 c1
As you can see, it is possible to have duplicate indices (0 in this example). To avoid this issue, you
may ask Pandas to reindex the new DataFrame for you:
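The call was lost in extraction; presumably append with ignore_index=True, which renumbers the
result:

In [10]: df1.append(df2, ignore_index=True)
Out[10]:
     A   B    C
0   a1  b1  NaN
1   a2  b2  NaN
2  NaN  b1   c1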
Chapter 4: Boolean indexing of dataframes
Introduction
Accessing rows in a dataframe using the DataFrame indexer objects .ix, .loc and .iloc, and how
this differs from using a boolean mask.
Examples
Accessing a DataFrame with a boolean index
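The frame construction was lost in extraction; it can be rebuilt from the outputs below:

df = pd.DataFrame({'color': ['red', 'blue', 'red']}, index=[True, False, True])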
df.loc[True]
color
True red
True red
df.iloc[True]
>> TypeError
df.iloc[1]
color blue
dtype: object
Important to note is that older pandas versions did not distinguish between boolean
and integer input, thus .iloc[True] would return the same as .iloc[1]
df.ix[True]
color
True red
True red
df.ix[1]
color blue
dtype: object
As you can see, .ix has two behaviors (label-based and position-based). Such ambiguity is very bad
practice in code and it should therefore be avoided. Please use .iloc or .loc to be more explicit.
Using the magic __getitem__ or [] accessor. Giving it a list of True and False of the same length as
the dataframe will give you:
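The snippet itself was lost; with the frame above it would look like:

df[[True, True, False]]

       color
True     red
False   blue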
Accessing a single column from a data frame, we can use a simple comparison == to compare
every element in the column to the given variable, producing a pd.Series of True and False
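The sample frame used in the rest of this example was lost in extraction; it can be rebuilt from
the outputs:

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue'],
                   'name': ['rose', 'violet', 'tulip', 'harebell'],
                   'size': ['big', 'small', 'small', 'small']})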
df['size'] == 'small'
0 False
1 True
2 True
3 True
Name: size, dtype: bool
This pd.Series is an extension of an np.array, which is an extension of a simple list. Thus we can
hand it to the __getitem__ or [] accessor as in the above example.
df[df['size'] == 'small']

  color      name   size
1  blue    violet  small
2   red     tulip  small
3  blue  harebell  small

Setting the name column as the index (the call was lost in extraction; presumably
df.set_index('name')):

          color   size
name
rose        red    big
violet     blue  small
tulip       red  small
harebell   blue  small
We can create a mask based on the index values, just like on a column value.
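For instance (a sketch; the original snippet was lost):

df[df.index == 'rose']

      color size
name
rose    red  big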
df.loc['rose']
color red
size big
Name: rose, dtype: object
The important difference being: when .loc encounters only one row in the index that matches, it
will return a pd.Series; if it encounters more rows that match, it will return a pd.DataFrame.
This makes the method rather unstable.
This behavior can be controlled by giving the .loc a list of a single entry. This will force it to return
a data frame.
df.loc[['rose']]
color size
name
rose red big
Chapter 5: Categorical data
Introduction
Categoricals are a pandas data type, which correspond to categorical variables in statistics: a
variable, which can take on only a limited, and usually fixed, number of possible values
(categories; levels in R). Examples are gender, social class, blood types, country affiliations,
observation time or ratings via Likert scales. Source: Pandas Docs
Examples
Object Creation
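The input cell was lost in extraction; presumably a Series created directly with the category
dtype:

In [188]: s = pd.Series(["a", "b", "c", "a", "c"], dtype="category")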
In [189]: s
Out[189]:
0 a
1 b
2 c
3 a
4 c
dtype: category
Categories (3, object): [a, b, c]
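The frame construction was likewise lost; a reconstruction consistent with the dtypes shown
below (the exact calls are assumptions):

In [190]: df = pd.DataFrame({'A': ['a', 'b', 'c', 'a', 'c']})
In [191]: df['B'] = df['A'].astype('category')
In [192]: df['C'] = pd.Categorical(df['A'])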
In [193]: df
Out[193]:
A B C
0 a a a
1 b b b
2 c c c
3 a a a
4 c c c
In [194]: df.dtypes
Out[194]:
A object
B category
C category
dtype: object
In [2]: df = pd.DataFrame(np.random.choice(['foo','bar','baz'], size=(100000,3)))
df = df.apply(lambda col: col.astype('category'))
In [3]: df.head()
Out[3]:
0 1 2
0 bar foo baz
1 baz bar baz
2 foo foo bar
3 bar baz baz
4 foo bar baz
In [4]: df.dtypes
Out[4]:
0 category
1 category
2 category
dtype: object
In [5]: df.shape
Out[5]: (100000, 3)
Chapter 6: Computational Tools
Examples
Find The Correlation Between Columns
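The frame construction was lost in extraction; presumably a frame of random values, for example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000, 3), columns=list('abc'))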
Then
>>> df.corr()
a b c
a 1.000000 0.018602 0.038098
b 0.018602 1.000000 -0.014245
c 0.038098 -0.014245 1.000000
will find the Pearson correlation between the columns. Note how the diagonal is 1, as each column
is (obviously) fully correlated with itself.
To compute a Spearman rank correlation instead, pass the method parameter:

>>> df.corr(method='spearman')
a b c
a 1.000000 0.007744 0.037209
b 0.007744 1.000000 -0.011823
c 0.037209 -0.011823 1.000000
Chapter 7: Creating DataFrames
Introduction
DataFrame is a data structure provided by the pandas library, apart from Series & Panel. It is a
2-dimensional structure and can be compared to a table of rows and columns.

Each row can be identified by an integer index (0..N) or a label explicitly set when creating a
DataFrame object. Each column can be of a distinct type and is identified by a label.
This topic covers various ways to construct/create a DataFrame object. Ex. from Numpy arrays,
from list of tuples, from dictionary.
Examples
Create a sample DataFrame
import pandas as pd
Create a DataFrame from a dictionary, containing two columns: numbers and colors. Each key
represents a column name and the value is a series of data, the content of the column:
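The construction line was lost in extraction; from the output it was presumably:

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})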
print(df)
# Output:
# colors numbers
# 0 red 1
# 1 white 2
# 2 blue 3
Pandas orders columns alphabetically because dicts are not ordered. To specify the order, use the
columns parameter.
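Presumably (line lost in extraction):

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
                  columns=['numbers', 'colors'])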
print(df)
# Output:
# numbers colors
# 0 1 red
# 1 2 white
# 2 3 blue
Create a DataFrame of random numbers:
import numpy as np
import pandas as pd
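The construction lines were lost; the values printed below match np.random.seed(0), so presumably:

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))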
print(df)
# Output:
# A B C
# 0 1.764052 0.400157 0.978738
# 1 2.240893 1.867558 -0.977278
# 2 0.950088 -0.151357 -0.103219
# 3 0.410599 0.144044 1.454274
# 4 0.761038 0.121675 0.443863
df = pd.DataFrame(np.arange(15).reshape(5,3),columns=list('ABC'))
print(df)
# Output:
# A B C
# 0 0 1 2
# 1 3 4 5
# 2 6 7 8
# 3 9 10 11
# 4 12 13 14
Create a DataFrame and include nans (NaT, NaN, 'nan', None) across columns and rows:
df = pd.DataFrame(np.arange(48).reshape(8,6),columns=list('ABCDEF'))
print(df)
# Output:
# A B C D E F
# 0 0 1 2 3 4 5
# 1 6 7 8 9 10 11
# 2 12 13 14 15 16 17
# 3 18 19 20 21 22 23
# 4 24 25 26 27 28 29
# 5 30 31 32 33 34 35
# 6 36 37 38 39 40 41
# 7 42 43 44 45 46 47
df.ix[::2,0] = np.nan # in column 0, set elements with indices 0,2,4, ... to NaN
df.ix[::4,1] = pd.NaT # in column 1, set elements with indices 0,4, ... to np.NaT
df.ix[:3,2] = 'nan' # in column 2, set elements with index from 0 to 3 to 'nan'
df.ix[:,5] = None # in column 5, set all elements to None
df.ix[5,:] = None # in row 5, set all elements to None
df.ix[7,:] = np.nan # in row 7, set all elements to NaN
print(df)
# Output:
# A B C D E F
# 0 NaN NaT nan 3 4 None
# 1 6 7 nan 9 10 None
# 2 NaN 13 nan 15 16 None
# 3 18 19 nan 21 22 None
# 4 NaN NaT 26 27 28 None
# 5 NaN None None NaN NaN None
# 6 NaN 37 38 39 40 None
# 7 NaN NaN NaN NaN NaN NaN
You can create a DataFrame from a list of simple tuples, and can even choose the specific
elements of the tuples you want to use. Here we will create a DataFrame using all of the data in
each tuple except for the last element.
import pandas as pd
data = [
('p1', 't1', 1, 2),
('p1', 't2', 3, 4),
('p2', 't1', 5, 6),
('p2', 't2', 7, 8),
('p2', 't3', 2, 8)
]
df = pd.DataFrame(data)
print(df)
# 0 1 2 3
# 0 p1 t1 1 2
# 1 p1 t2 3 4
# 2 p2 t1 5 6
# 3 p2 t2 7 8
# 4 p2 t3 2 8
import pandas as pd
import numpy as np
np.random.seed(123)
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})
>>> df
X Y
0 -1.085631 0
1 0.997345 1
2 0.282978 2
3 -1.506295 3
Create a DataFrame from multiple lists by passing a dict whose values are lists. The keys of the
dictionary are used as column labels. The lists can also be ndarrays. The lists/ndarrays must all
be the same length.
import pandas as pd
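The lists example itself was lost in extraction; a minimal sketch:

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})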
Using ndarrays
import pandas as pd
import numpy as np
np.random.seed(123)
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})
df
# Output: X Y
# 0 -1.085631 0
# 1 0.997345 1
# 2 0.282978 2
# 3 -1.506295 3
import pandas as pd
import numpy as np
np.random.seed(0)
# create an array of 5 dates starting at '2015-02-24', one per minute
rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'Val': np.random.randn(len(rng)) })
print (df)
# Output:
# Date Val
# 0 2015-02-24 00:00:00 1.764052
# 1 2015-02-24 00:01:00 0.400157
# 2 2015-02-24 00:02:00 0.978738
# 3 2015-02-24 00:03:00 2.240893
# 4 2015-02-24 00:04:00 1.867558
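The generating lines for the next frame were lost; presumably the same pattern with a daily
frequency:

rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})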
print (df)
# Output:
# Date Val
# 0 2015-02-24 -0.977278
# 1 2015-02-25 0.950088
# 2 2015-02-26 -0.151357
# 3 2015-02-27 -0.103219
# 4 2015-02-28 0.410599
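And for the next frame, the dates below match a frequency of every third year end:

rng = pd.date_range('2015-12-31', periods=5, freq='3A')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})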
print (df)
# Output:
# Date Val
# 0 2015-12-31 0.144044
# 1 2018-12-31 1.454274
# 2 2021-12-31 0.761038
# 3 2024-12-31 0.121675
# 4 2027-12-31 0.443863
import pandas as pd
import numpy as np
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Val' : np.random.randn(len(rng)) }, index=rng)
print (df)
# Output:
# Val
# 2015-02-24 00:00:00 1.764052
# 2015-02-24 00:01:00 0.400157
# 2015-02-24 00:02:00 0.978738
# 2015-02-24 00:03:00 2.240893
# 2015-02-24 00:04:00 1.867558
Frequency aliases accepted by the freq parameter:

Alias Description
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter endfrequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
import pandas as pd
import numpy as np
Using from_tuples:
np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
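The index construction itself was lost; presumably:

idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])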
Using from_product:
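The from_product body was lost as well; an equivalent index can be built with:

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                                 names=['first', 'second'])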
import pandas as pd
# Save dataframe to pickled pandas object
df.to_pickle(file_name)  # where to save it, usually as a .pkl
A DataFrame can be created from a list of dictionaries. Keys are used as column names.
import pandas as pd
L = [{'Name': 'John', 'Last Name': 'Smith'},
{'Name': 'Mary', 'Last Name': 'Wood'}]
pd.DataFrame(L)
# Output: Last Name Name
# 0 Smith John
# 1 Wood Mary
Chapter 8: Cross sections of different axes with MultiIndex
Examples
Selection of cross-sections using .xs
In [1]:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
idx_row = pd.MultiIndex.from_arrays(arrays, names=['Row_First', 'Row_Second'])
idx_col = pd.MultiIndex.from_product([['A','B'], ['i', 'ii']],
names=['Col_First','Col_Second'])
df = pd.DataFrame(np.random.randn(8,4), index=idx_row, columns=idx_col)
Out[1]:
Col_First A B
Col_Second i ii i ii
Row_First Row_Second
bar one -0.452982 -1.872641 0.248450 -0.319433
two -0.460388 -0.136089 -0.408048 0.998774
baz one 0.358206 -0.319344 -2.052081 -0.424957
two -0.823811 -0.302336 1.158968 0.272881
foo one -0.098048 -0.799666 0.969043 -0.595635
two -0.358485 0.412011 -0.667167 1.010457
qux one 1.176911 1.578676 0.350719 0.093351
two 0.241956 1.082138 -0.516898 -0.196605
.xs accepts a level (either the name of said level or an integer), and an axis: 0 for rows, 1 for
columns.
Selection on rows:
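The snippet and its output were lost in extraction; selecting the rows labelled 'two' on the
second row level would be:

In [2]: df.xs('two', level='Row_Second')
Out[2]:
Col_First          A                   B
Col_Second         i        ii         i        ii
Row_First
bar        -0.460388 -0.136089 -0.408048  0.998774
baz        -0.823811 -0.302336  1.158968  0.272881
foo        -0.358485  0.412011 -0.667167  1.010457
qux         0.241956  1.082138 -0.516898 -0.196605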
Selection on columns:
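The call producing the output below was lost; presumably:

In [3]: df.xs('ii', level='Col_Second', axis=1)
Out[3]: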
Col_First A B
Row_First Row_Second
bar one -1.872641 -0.319433
two -0.136089 0.998774
baz one -0.319344 -0.424957
two -0.302336 0.272881
foo one -0.799666 -0.595635
two 0.412011 1.010457
qux one 1.578676 0.093351
two 1.082138 -0.196605
.xs works only for selection; assignment is NOT possible (getting, not setting):
Selection and assignment using .loc with slicers

Unlike the .xs method, this allows you to assign values. Indexing using slicers has been available
since version 0.14.0.
In [1]:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
idx_row = pd.MultiIndex.from_arrays(arrays, names=['Row_First', 'Row_Second'])
idx_col = pd.MultiIndex.from_product([['A','B'], ['i', 'ii']],
names=['Col_First','Col_Second'])
df = pd.DataFrame(np.random.randn(8,4), index=idx_row, columns=idx_col)
Out[1]:
Col_First A B
Col_Second i ii i ii
Row_First Row_Second
bar one -0.452982 -1.872641 0.248450 -0.319433
two -0.460388 -0.136089 -0.408048 0.998774
baz one 0.358206 -0.319344 -2.052081 -0.424957
two -0.823811 -0.302336 1.158968 0.272881
foo one -0.098048 -0.799666 0.969043 -0.595635
two -0.358485 0.412011 -0.667167 1.010457
qux one 1.176911 1.578676 0.350719 0.093351
two 0.241956 1.082138 -0.516898 -0.196605
Selection on rows:
In [2]: df.loc[(slice(None),'two'),:]
Out[2]:
Col_First A B
Col_Second i ii i ii
Row_First Row_Second
bar two -0.460388 -0.136089 -0.408048 0.998774
baz two -0.823811 -0.302336 1.158968 0.272881
foo two -0.358485 0.412011 -0.667167 1.010457
qux two 0.241956 1.082138 -0.516898 -0.196605
Selection on columns:
In [3]: df.loc[:,(slice(None),'ii')]
Out[3]:
Col_First A B
Col_Second ii ii
Row_First Row_Second
bar one -1.872641 -0.319433
two -0.136089 0.998774
baz one -0.319344 -0.424957
two -0.302336 0.272881
foo one -0.799666 -0.595635
two 0.412011 1.010457
qux one 1.578676 0.093351
two 1.082138 -0.196605
In [4]: df.loc[(slice(None),'two'),(slice(None),'ii')]
Out[4]:
Col_First A B
Col_Second ii ii
Row_First Row_Second
bar two -0.136089 0.998774
baz two -0.302336 0.272881
foo two 0.412011 1.010457
qux two 1.082138 -0.196605
In [5]: df.loc[(slice(None),'two'),(slice(None),'ii')]=0
df
Out[5]:
Col_First A B
Col_Second i ii i ii
Row_First Row_Second
bar one -0.452982 -1.872641 0.248450 -0.319433
two -0.460388 0.000000 -0.408048 0.000000
baz one 0.358206 -0.319344 -2.052081 -0.424957
two -0.823811 0.000000 1.158968 0.000000
foo one -0.098048 -0.799666 0.969043 -0.595635
two -0.358485 0.000000 -0.667167 0.000000
qux one 1.176911 1.578676 0.350719 0.093351
two 0.241956 0.000000 -0.516898 0.000000
Chapter 9: Data Types
Remarks
dtypes are not native to pandas. They are a result of pandas' close architectural coupling to
numpy.

The dtype of a column does not in any way have to correlate with the Python type of the objects
contained in the column.
pd.Series([1.,2.,3.,4.,5.]).astype(object)
0 1
1 2
2 3
3 4
4 5
dtype: object
The dtype is now object, but the objects in the list are still float. This is logical if you know
that in Python everything is an object, and anything can be upcast to object.
type(pd.Series([1.,2.,3.,4.,5.]).astype(object)[0])
float
pd.Series([1.,2.,3.,4.,5.]).astype(str)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: object
The dtype is now object, but the type of the entries in the list is str. This is because numpy
does not deal with strings, and thus acts as if they are just objects and of no concern.
type(pd.Series([1.,2.,3.,4.,5.]).astype(str)[0])
str
Do not trust dtypes, they are an artifact of an architectural flaw in pandas. Specify them as you
must, but do not rely on what dtype is set on a column.
Examples
Checking the types of columns
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': [True, False, True]})
In [2]: df
Out[2]:
A B C
0 1 1.0 True
1 2 2.0 False
2 3 3.0 True
In [3]: df.dtypes
Out[3]:
A int64
B float64
C bool
dtype: object
In [4]: df['A'].dtype
Out[4]: dtype('int64')
Changing dtypes
astype() method changes the dtype of a Series and returns a new Series.
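The sample frame was lost in extraction; a reconstruction consistent with the dtypes and
conversions shown below (the exact values are assumptions):

In [1]: df = pd.DataFrame({'A': [1, 2, 3],
   ...:                    'B': [1.0, 2.0, 3.0],
   ...:                    'C': ['1.1.2010', '2.1.2011', '3.1.2011'],
   ...:                    'D': ['1 days', '2 days', '3 days'],
   ...:                    'E': ['1', '2', '3']})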
In [3]: df.dtypes
Out[3]:
A int64
B float64
C object
D object
E object
dtype: object
In [4]: df['A'].astype('float')
Out[4]:
0 1.0
1 2.0
2 3.0
Name: A, dtype: float64
In [5]: df['B'].astype('int')
Out[5]:
0 1
1 2
2 3
Name: B, dtype: int32
astype() method is for specific type conversion (i.e. you can specify .astype('float64'),
.astype('float32'), or .astype('float16')). For general conversion, you can use pd.to_numeric,
pd.to_datetime and pd.to_timedelta.
In [6]: pd.to_numeric(df['E'])
Out[6]:
0 1
1 2
2 3
Name: E, dtype: int64
By default, pd.to_numeric raises an error if an input cannot be converted to a number. You can
change that behavior by using the errors parameter.

If you need to find all rows whose input cannot be converted to numeric, use boolean indexing
with isnull (the frame for this snippet was lost in extraction; the one below is a hypothetical
stand-in consistent with the surviving output):
In [10]: df2 = pd.DataFrame({'A': [1, 'x', 'z']})

In [11]: pd.to_numeric(df2['A'], errors='coerce').isnull()
Out[11]:
0    False
1     True
2     True
Name: A, dtype: bool
In [12]: pd.to_datetime(df['C'])
Out[12]:
0 2010-01-01
1 2011-02-01
2 2011-03-01
Name: C, dtype: datetime64[ns]
Note that 2.1.2011 is converted to February 1, 2011. If you want January 2, 2011 instead, you
need to use the dayfirst parameter.
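The dayfirst call was lost; it would be:

In [13]: pd.to_datetime(df['C'], dayfirst=True)
Out[13]:
0   2010-01-01
1   2011-01-02
2   2011-01-03
Name: C, dtype: datetime64[ns]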
In [14]: pd.to_timedelta(df['D'])
Out[14]:
0 1 days
1 2 days
2 3 days
Name: D, dtype: timedelta64[ns]
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'],
'D': [True, False, True]})
In [2]: df
Out[2]:
A B C D
0 1 1.0 a True
1 2 2.0 b False
2 3 3.0 c True
With include and exclude parameters you can specify which types you want:
# Select numbers
In [3]: df.select_dtypes(include=['number']) # You need to use a list
Out[3]:
A B
0 1 1.0
1 2 2.0
2 3 3.0
Summarizing dtypes
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'],
'D': [True, False, True]})
In [2]: df.get_dtype_counts()
Out[2]:
bool 1
float64 1
int64 1
object 1
dtype: int64
Chapter 10: Dealing with categorical
variables
Examples
One-hot encoding with `get_dummies()`
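The example code was lost in extraction; a minimal sketch of one-hot encoding with get_dummies
(the sample data is an assumption):

import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male']})

pd.get_dummies(df['Sex'])
#    female  male
# 0       0     1
# 1       1     0
# 2       0     1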
Chapter 11: Duplicated data
Examples
Select duplicated
If you need to set the value 0 in column B where column A holds duplicated data, first create a
mask with Series.duplicated and then use DataFrame.ix or Series.mask:
In [224]: df = pd.DataFrame({'A':[1,2,3,3,2],
...: 'B':[1,7,3,0,8]})
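The mask creation was lost in extraction; keep=False marks every duplicated value, which matches
the output below:

In [225]: mask = df['A'].duplicated(keep=False)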
In [226]: mask
Out[226]:
0 False
1 True
2 True
3 True
4 True
Name: A, dtype: bool
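The assignment cells were lost; calls consistent with the next output:

In [227]: df.ix[mask, 'B'] = 0
In [228]: df['C'] = df['A'].mask(mask, 0)   # A where not duplicated, else 0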
In [229]: df
Out[229]:
A B C
0 1 1 1
1 2 0 0
2 3 0 0
3 3 0 0
4 2 0 0
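And for the output below, presumably the complementary where:

In [230]: df['C'] = df['A'].where(mask, 0)   # A where duplicated, else 0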
In [231]: df
Out[231]:
A B C
0 1 1 0
1 2 0 2
2 3 0 3
3 3 0 3
4 2 0 2
Drop duplicated
Use drop_duplicates:
In [216]: df = pd.DataFrame({'A':[1,2,3,3,2],
...: 'B':[1,7,3,0,8]})
In [217]: df
Out[217]:
A B
0 1 1
1 2 7
2 3 3
3 3 0
4 2 8
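The calls themselves were lost; dropping rows that duplicate an earlier value of A would be:

In [218]: df.drop_duplicates(subset=['A'])
Out[218]:
   A  B
0  1  1
1  2  7
2  3  3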
When you don't want to get a copy of a data frame, but to modify the existing one:
In [221]: df = pd.DataFrame({'A':[1,2,3,3,2],
...: 'B':[1,7,3,0,8]})
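Presumably (cell lost in extraction):

In [222]: df.drop_duplicates(subset=['A'], inplace=True)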
In [223]: df
Out[223]:
A B
0 1 1
1 2 7
2 3 3
In [1]: id_numbers = pd.Series([111, 112, 112, 114, 115, 118, 114, 118, 112])
In [2]: id_numbers.nunique()
Out[2]: 5
In [3]: id_numbers.unique()
Out[3]: array([111, 112, 114, 115, 118], dtype=int64)
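The frame construction was lost; it can be rebuilt from the output below:

In [4]: df = pd.DataFrame({'Group': ['A', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'B'],
   ...:                    'ID': [1, 1, 2, 3, 3, 2, 1, 2, 1, 3]})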
In [5]: df
Out[5]:
Group ID
0 A 1
1 B 1
2 A 2
3 A 3
4 B 3
5 A 2
6 B 1
7 A 2
8 A 1
9 B 3
In [6]: df.groupby('Group')['ID'].nunique()
Out[6]:
Group
A 3
B 2
Name: ID, dtype: int64
In [7]: df.groupby('Group')['ID'].unique()
Out[7]:
Group
A [1, 2, 3]
B [1, 3]
Name: ID, dtype: object
In [15]: df = pd.DataFrame({"A":[1,1,2,3,1,1],"B":[5,4,3,4,6,7]})
In [21]: df
Out[21]:
A B
0 1 5
1 1 4
2 2 3
3 3 4
4 1 6
5 1 7
In [22]: df["A"].unique()
Out[22]: array([1, 2, 3])
In [23]: df["B"].unique()
Out[23]: array([5, 4, 3, 6, 7])
To get the unique values in column A as a list (note that unique() can be used in two slightly
different ways)
In [24]: pd.unique(df['A']).tolist()
Out[24]: [1, 2, 3]
Here is a more complex example. Say we want to find the unique values from column 'B' where 'A'
is equal to 1.
First, let's introduce a duplicate so you can see how it works. Let's replace the 6 in row '4', column
'B' with a 4:
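The cells for this walk-through were lost; reconstructed from the description and the frame above:

In [25]: df.loc[4, 'B'] = 4          # introduce the duplicate

In [26]: df[df['A'] == 1]['B'].unique()
Out[26]: array([5, 4, 7])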
df['A'] == 1
This finds values in column A that are equal to 1, and applies True or False to them. We can then
use this to select values from column 'B' of the DataFrame (the outer DataFrame selection)
For comparison, here is the list if we don't use unique. It retrieves every value in column 'B' where
column 'A' is 1
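Presumably (cell lost in extraction):

In [27]: df[df['A'] == 1]['B'].tolist()
Out[27]: [5, 4, 4, 7]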
Chapter 12: Getting information about DataFrames
Examples
List DataFrame column names
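Given a sample frame (the construction was lost in extraction):

>>> df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})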
>>> list(df)
['a', 'b', 'c']
This list comprehension method is especially useful when using the debugger:
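The comprehension itself was lost; presumably:

>>> [c for c in df]
['a', 'b', 'c']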
sampledf.columns.tolist()
You can also print them as an index instead of a list (this won't be very visible for dataframes with
many columns though):
df.columns
To get basic information about a DataFrame including the column names and datatypes:
import pandas as pd
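The frame construction was lost; a reconstruction matching the info() output below:

df = pd.DataFrame({'floats': [1.0, 2.0, 3.0],
                   'integers': [1, 2, 3],
                   'ints with None': [1, None, 3],
                   'text': ['a', 'b', 'c']})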
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
floats 3 non-null float64
integers 3 non-null int64
ints with None 2 non-null float64
text 3 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 120.0+ bytes
To also count the full memory consumption of object columns, pass memory_usage='deep':

>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
floats 3 non-null float64
integers 3 non-null int64
ints with None 2 non-null float64
text 3 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 234.0 bytes
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 5), columns=list('ABCDE'))
To generate various summary statistics. For numeric values, these include the number of
non-NA/null values (count), the mean (mean), the standard deviation (std), and the values known
as the five-number summary:
>>> df.describe()
A B C D E
count 5.000000 5.000000 5.000000 5.000000 5.000000
mean -0.456917 -0.278666 0.334173 0.863089 0.211153
std 0.925617 1.091155 1.024567 1.238668 1.495219
min -1.494346 -2.031457 -0.336471 -0.821447 -2.106488
25% -1.143098 -0.407362 -0.246228 -0.087088 -0.082451
50% -0.536503 -0.163950 -0.004099 1.509749 0.313918
75% 0.092630 0.381407 0.120137 1.822794 1.060268
max 0.796729 0.828034 2.137527 1.891436 1.870520
Chapter 13: Gotchas of pandas
Remarks
A gotcha, in general, is a construct that is documented but not intuitive. Gotchas produce output
that is normally not expected because of their counter-intuitive character.

The pandas package has several gotchas that can confuse someone who is not aware of them, and
some of them are presented on this documentation page.
Examples
Detecting missing values with np.nan
df=pd.DataFrame({'col':[1,np.nan]})
df==np.nan
col
0 False
1 False
This is because comparing a missing value to anything results in False; instead of this you should
use
df=pd.DataFrame({'col':[1,np.nan]})
df.isnull()
col
0 False
1 True
Integer and NA
Pandas don't support missing in attributes of type integer. For example if you have missings in the
grade column:
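A hypothetical illustration (the original snippet was lost; the file name is a placeholder):

df = pd.read_csv('data.csv', dtype={'grade': int})
# ValueError: Integer column has NA values in column 2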
In this case you should just use float instead of integers, or set the object dtype.
Automatic Data Alignment (index-aware behaviour)
If you want to append a series of values [1,2] to the column of dataframe df, you will get NaNs:
import pandas as pd
series=pd.Series([1,2])
df=pd.DataFrame(index=[3,4])
df['col']=series
df
col
3 NaN
4 NaN
because setting a new column automatically aligns the data by the index, and your values 1 and 2
would get the indexes 0 and 1, and not 3 and 4 as in your data frame:
df=pd.DataFrame(index=[1,2])
df['col']=series
df
col
1 2.0
2 NaN
If you want to ignore the index, use .values at the end:
df['col']=series.values
col
3 1
4 2
Chapter 14: Graphs and Visualizations
Examples
Basic Data Graphs
Pandas provides multiple ways to make graphs of the data inside the data frame. It uses
matplotlib for that purpose.
The basic graphs have their wrappers for both DataFrame and Series objects:
Line Plot
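The frame construction and the DataFrame-level call were lost in extraction; presumably along
these lines (the column names are taken from the calls below):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(100).cumsum(),
                   'y': np.random.randn(100).cumsum()})
df.plot()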
You can call the same method for a Series object to plot a subset of the Data Frame:
df['x'].plot()
Bar Chart
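The bar-chart snippet was lost; presumably the same plot method with kind='bar':

df['x'].plot(kind='bar')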
Histogram

If you want to explore the distribution of your data, you can use the hist() method.
df['x'].hist()
All the possible graphs are available through the plot method. The kind of chart is selected by the
kind argument.
df['x'].plot(kind='pie')
Note In many environments, the pie chart will come out an oval. To make it a circle, use the
following:
from matplotlib import pyplot
pyplot.axis('equal')
df['x'].plot(kind='pie')
plot() can take arguments that get passed on to matplotlib to style the plot in different ways.
By default, plot() creates a new figure each time it is called. It is possible to plot on an existing
axis by passing the ax parameter.
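Sketches of both (the original snippets were lost; the style arguments are illustrative):

df['x'].plot(color='red', figsize=(8, 4))   # arguments forwarded to matplotlib

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
df['x'].plot(ax=ax)   # draw on an existing axis
df['y'].plot(ax=ax)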
Chapter 15: Grouping Data
Examples
Aggregating groups
import pandas as pd
import numpy as np

df = pd.DataFrame(
{"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
"Val": [4, 3, 3, np.nan, np.nan, 4]})
df
# Output:
# City Name Val
# 0 Seattle Alice 4.0
# 1 Seattle Bob 3.0
# 2 Portland Mallory 3.0
# 3 Seattle Mallory NaN
# 4 Seattle Bob NaN
# 5 Portland Mallory 4.0
df.groupby(["Name", "City"])['Val'].size().reset_index(name='Size')
# Output:
# Name City Size
# 0 Alice Seattle 1
# 1 Bob Seattle 2
# 2 Mallory Portland 2
# 3 Mallory Seattle 1
df.groupby(["Name", "City"])['Val'].count().reset_index(name='Count')
# Output:
# Name City Count
# 0 Alice Seattle 1
# 1 Bob Seattle 1
# 2 Mallory Portland 2
# 3 Mallory Seattle 0
You can iterate on the object returned by groupby(). The iterator contains (Category, DataFrame)
tuples.
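For example (a sketch):

for name, group in df.groupby('City'):
    print(name)
    print(group)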
Basic grouping
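The frame construction was lost in extraction; it can be rebuilt from the output below:

df = pd.DataFrame({'A': list('abcabb'),
                   'B': [2, 8, 1, 4, 3, 8],
                   'C': [102, 98, 107, 104, 115, 87]})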
df
# Output:
# A B C
# 0 a 2 102
# 1 b 8 98
# 2 c 1 107
# 3 a 4 104
# 4 b 3 115
# 5 b 8 87
df.groupby('A').mean()
# Output:
# B C
# A
# a 3.000000 103
# b 6.333333 100
# c 1.000000 107
df.groupby(['A','B']).mean()
# Output:
# C
# A B
# a 2 102.0
# 4 104.0
# b 3 115.0
# 8 92.5
# c 1 107.0
Note how after grouping each row in the resulting DataFrame is indexed by a tuple or MultiIndex
(in this case a pair of elements from columns A and B).
To apply several aggregation methods at once, for instance to count the number of items in each
group and compute their mean, use the agg function:
df.groupby(['A','B']).agg(['count', 'mean'])
# Output:
# C
# count mean
# A B
# a 2 1 102.0
# 4 1 104.0
# b 3 1 115.0
# 8 2 92.5
# c 1 1 107.0
Grouping numbers
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100),
'Sex': np.random.choice(['Male', 'Female'], 100),
'number_of_foo': np.random.randint(1, 20, 100)})
df.head()
# Output:
#    Age     Sex  number_of_foo
# 4   23  Female             15
Group Age into three categories (or bins). Bins can be given as
• an integer n indicating the number of bins—in this case the dataframe's data is divided into n
intervals of equal size
• a sequence of integers denoting the endpoint of the left-open intervals in which the data is
divided into—for instance bins=[19, 40, 65, np.inf] creates three age groups (19, 40], (40,
65], and (65, np.inf].
Pandas assigns automatically the string versions of the intervals as label. It is also possible to
define own labels by defining a labels parameter as a list of strings.
pd.cut(df['Age'], bins=4)
# this creates four age groups: (19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69]
Name: Age, dtype: category
Categories (4, object): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69]]
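The binning call feeding the crosstab below was lost; matching the bins described above, it was
presumably:

age_groups = pd.cut(df['Age'], bins=[19, 40, 65, np.inf])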
pd.crosstab(age_groups, df['Sex'])
# Output:
# Sex Female Male
# Age
# (19, 40] 22 28
# (40, 65] 18 24
# (65, inf] 3 5
When you do a groupby you can select either a single column or a list of columns:
In [11]: df = pd.DataFrame([[1, 1, 2], [1, 2, 3], [2, 3, 4]], columns=["A", "B", "C"])
In [12]: df
Out[12]:
A B C
0 1 1 2
1 1 2 3
2 2 3 4
In [13]: g = df.groupby("A")
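The selection calls were lost in extraction; for example:

In [14]: g['B'].mean()
Out[14]:
A
1    1.5
2    3.0
Name: B, dtype: float64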
You can also use agg to specify columns and aggregation to perform:
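For instance (a sketch; the original cell was lost):

In [15]: df.groupby("A").agg({'B': 'mean', 'C': 'count'})
Out[15]:
     B  C
A
1  1.5  2
2  3.0  1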
Using transform to get group-level statistics while preserving the original dataframe

Example:
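The frame construction was lost; it can be rebuilt from the output below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'group1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'group2': ['C', 'C', 'C', 'D', 'E', 'E', 'F', 'F'],
                   'B': ['one', np.nan, np.nan, np.nan, np.nan, 'two', np.nan, np.nan],
                   'C': [np.nan, 1, np.nan, np.nan, np.nan, np.nan, np.nan, 4]})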
df
Out[34]:
B C group1 group2
0 one NaN A C
1 NaN 1.0 A C
2 NaN NaN A C
3 NaN NaN A D
4 NaN NaN B E
5 two NaN B E
6 NaN NaN B F
7 NaN 4.0 B F
I want to get the count of non-missing observations of B for each combination of group1 and group2.
groupby.transform is a very powerful function that does exactly that.
df['count_B']=df.groupby(['group1','group2']).B.transform('count')
df
Out[36]:
B C group1 group2 count_B
0 one NaN A C 1
1 NaN 1.0 A C 1
2 NaN NaN A C 1
3 NaN NaN A D 0
4 NaN NaN B E 1
5 two NaN B E 1
6 NaN NaN B F 0
7 NaN 4.0 B F 0
Chapter 16: Grouping Time Series Data
Examples
Generate time series of random numbers then down sample
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
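The series construction was lost in extraction; the describe() output below (count 10080 = 7 days
x 1440 minutes) suggests something like:

rng = pd.date_range('2015-01-01', periods=7 * 24 * 60, freq='T')   # start date assumed
ts = pd.Series(np.random.randn(len(rng)), index=rng, name='HelloTimeSeries')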
ts.describe()
count 10080.000000
mean -0.008853
std 0.995411
min -3.936794
25% -0.683442
50% 0.002640
75% 0.654986
max 3.906053
Name: HelloTimeSeries, dtype: float64
Let's take this 7 days of per minute data and down sample to every 15 minutes. All frequency
codes can be found here.
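The call itself was lost; presumably:

ts.resample('15T').mean()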
We can even aggregate several useful things. Let's plot the min, mean, and max of this
resample('15T') data.
ts.resample('15T').agg(['min', 'mean', 'max']).plot()
Let's resample over '15T' (15 minutes), '30T' (half hour), and '1H' (1 hour) and see how our data
gets smoother.
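The code was lost in extraction; one way to produce the three plots (the layout is assumed):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
for freq, ax in zip(['15T', '30T', '1H'], axes):
    ts.resample(freq).agg(['min', 'mean', 'max']).plot(ax=ax, title=freq)
plt.show()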
Chapter 17: Holiday Calendars
Examples
Create a custom calendar
Here is how to create a custom calendar. The example given is a french calendar -- so it provides
many examples.
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, EasterMonday
from pandas.tseries.offsets import Day, Easter

class FrBusinessCalendar(AbstractHolidayCalendar):
""" Custom Holiday calendar for France based on
https://en.wikipedia.org/wiki/Public_holidays_in_France
- 1 January: New Year's Day
- Moveable: Easter Monday (Monday after Easter Sunday)
- 1 May: Labour Day
- 8 May: Victory in Europe Day
- Moveable Ascension Day (Thursday, 39 days after Easter Sunday)
- 14 July: Bastille Day
- 15 August: Assumption of Mary to Heaven
- 1 November: All Saints' Day
- 11 November: Armistice Day
- 25 December: Christmas Day
"""
rules = [
Holiday('New Years Day', month=1, day=1),
EasterMonday,
Holiday('Labour Day', month=5, day=1),
Holiday('Victory in Europe Day', month=5, day=8),
Holiday('Ascension Day', month=1, day=1, offset=[Easter(), Day(39)]),
Holiday('Bastille Day', month=7, day=14),
Holiday('Assumption of Mary to Heaven', month=8, day=15),
Holiday('All Saints Day', month=11, day=1),
Holiday('Armistice Day', month=11, day=11),
Holiday('Christmas Day', month=12, day=25)
]
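Get the number of working days between two dates. Most of this example was lost in extraction;
a sketch consistent with the surviving lines (variable names are assumptions):

import pandas as pd
from pandas.tseries.offsets import CDay

year = 2016
start = pd.Timestamp(year, 1, 1)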
end = start + pd.offsets.MonthEnd(12)
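# custom business-day offset based on the French calendar above
cal = FrBusinessCalendar()
bday_fr = CDay(calendar=cal)

# all French business days in the year, counted per month
good_days = pd.date_range(start, end, freq=bday_fr)
pd.Series(good_days.month).value_counts(sort=False)
# Output: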
# 1 20
# 2 21
# 3 22
# 4 21
# 5 21
Chapter 18: Indexing and selecting data
Examples
Select column by label
# Create a sample DF
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
# Show DF
df
A B C
0 -0.467542 0.469146 -0.861848
1 -0.823205 -0.167087 -0.759942
2 -1.508202 1.361894 -0.166701
3 0.394143 -0.287349 -0.978102
4 -0.160431 1.054736 -0.785250
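The selection itself was lost in extraction; selecting column A by label:

df['A']

0   -0.467542
1   -0.823205
2   -1.508202
3    0.394143
4   -0.160431
Name: A, dtype: float64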
Select by position
The iloc (short for integer location) method allows to select the rows of a dataframe based on their
position index. This way one can slice dataframes just like one does with Python's list slicing.
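The frame construction was lost; it can be rebuilt from the output below:

df = pd.DataFrame([[11, 22], [33, 44], [55, 66]], index=list('abc'))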
df
# Out:
# 0 1
# a 11 22
# b 33 44
# c 55 66
The call producing the output below was lost; selecting the first row by position:

df.iloc[0]
# Out:
# 0 11
# 1 22
# Name: a, dtype: int64
When using labels, both the start and the stop are included in the results.
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns = list("ABCDE"),
index = ["R" + str(i) for i in range(5)])
# Out:
# A B C D E
# R0 99 78 61 16 73
# R1 8 62 27 30 80
# R2 7 76 15 53 80
# R3 27 44 77 75 65
# R4 47 30 84 86 18
Rows R0 to R2:
df.loc['R0':'R2']
# Out:
# A B C D E
# R0 9 41 62 1 82
# R1 16 78 5 58 0
# R2 80 4 36 51 27
Notice how loc differs from iloc because iloc excludes the end index
Columns C to E:
df.loc[:, 'C':'E']
# Out:
# C D E
# R0 62 1 82
# R1 5 58 0
# R2 36 51 27
# R3 68 38 83
# R4 7 30 62
Mixed position and label based selection

DataFrame:
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns = list("ABCDE"),
index = ["R" + str(i) for i in range(5)])
df
Out[12]:
A B C D E
R0 99 78 61 16 73
R1 8 62 27 30 80
R2 7 76 15 53 80
R3 27 44 77 75 65
R4 47 30 84 86 18
df.ix[1:3, 'C':'E']
Out[19]:
C D E
R1 5 58 0
R2 36 51 27
If the index is integer, .ix will use labels rather than positions:
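The re-indexing line was lost; presumably:

df.index = np.arange(5, 10)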
df
Out[22]:
A B C D E
5 9 41 62 1 82
6 16 78 5 58 0
7 80 4 36 51 27
8 31 2 68 38 83
9 19 18 7 30 62
#same call returns an empty DataFrame because now the index is integer
df.ix[1:3, 'C':'E']
Out[24]:
Empty DataFrame
Columns: [C, D, E]
Index: []
generate sample DF
In [39]: df = pd.DataFrame(np.random.randint(0, 10, size=(5, 6)),
columns=['a10','a20','a25','b','c','d'])
In [40]: df
Out[40]:
a10 a20 a25 b c d
0 2 3 7 5 4 7
1 3 1 5 7 2 6
2 7 4 9 0 8 7
3 5 8 8 9 6 8
4 8 1 0 4 4 9
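show all columns except those beginning with a (in other words remove / drop all columns
starting with a). The original snippet was lost; one way to do it:

In [41]: df.loc[:, ~df.columns.str.startswith('a')]
Out[41]:
   b  c  d
0  5  4  7
1  7  2  6
2  0  8  7
3  9  6  8
4  4  4  9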
show columns using RegEx filter (b|c|d) - b or c or d:
In [42]: df.filter(regex='(b|c|d)')
Out[42]:
b c d
0 5 4 7
1 7 2 6
2 0 8 7
3 9 6 8
4 4 4 9
Boolean indexing
One can select rows and columns of a dataframe using boolean arrays.
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns = list("ABCDE"),
index = ["R" + str(i) for i in range(5)])
print (df)
# A B C D E
# R0 99 78 61 16 73
# R1 8 62 27 30 80
# R2 7 76 15 53 80
# R3 27 44 77 75 65
# R4 47 30 84 86 18
Create a mask (the original line was lost; any threshold between 8 and 27 reproduces the output):

mask = df['A'] > 20
print(mask)
# R0     True
# R1    False
# R2    False
# R3     True
# R4     True
# Name: A, dtype: bool
print (df[mask])
# A B C D E
# R0 99 78 61 16 73
# R3 27 44 77 75 65
# R4 47 30 84 86 18
import pandas as pd
generate random DF
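Presumably (line lost in extraction):

import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))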
In [16]: print(df)
A B C
0 4 1 4
1 0 2 0
2 7 8 8
3 2 1 9
4 7 3 8
5 4 0 7
6 1 5 5
7 6 7 8
8 6 7 3
9 6 4 5
select rows where values in column A > 2 and values in column B < 5:

In [17]: df[(df.A > 2) & (df.B < 5)]
Out[17]:
   A  B  C
0  4  1  4
4  7  3  8
5  4  0  7
9  6  4  5
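using .query() method with variables for filtering (the original snippet was lost; a sketch of
the same selection):

In [18]: a_min, b_max = 2, 5

In [19]: df.query('A > @a_min and B < @b_max')
Out[19]:
   A  B  C
0  4  1  4
4  7  3  8
5  4  0  7
9  6  4  5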
To view the first or last few records of a dataframe, you can use the methods head and tail
df.head(n)
df.tail(n)
It may become necessary to traverse the elements of a series or the rows of a dataframe in a way
that the next element or next row is dependent on the previously selected element or row. This is
called path dependency.
# n is number of observations
n = 5000
day = pd.to_datetime(['2013-02-06'])
# irregular seconds spanning 28800 seconds (8 hours)
seconds = np.random.rand(n) * 28800 * pd.Timedelta(1, 's')
# start at 8 am
start = pd.offsets.Hour(8)
# irregular timeseries
tidx = day + start + seconds
tidx = tidx.sort_values()
Let's assume a path dependent condition. Starting with the first member of the series, I want to
grab each subsequent element such that the absolute difference between that element and the
current element is greater than or equal to x.
Generator function
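The generator bodies and the series construction were lost in extraction; a sketch consistent
with the calls below (the names mover and mover_df are taken from those calls, the
implementations are assumptions):

# the series being walked; its construction was also lost
s = pd.Series(np.random.randn(n).cumsum(), tidx, name='A')

def mover(s, move_size=10):
    """Yield each element whose absolute difference from the last
    yielded element is at least move_size."""
    last = s.iloc[0]
    yield s.iloc[[0]]
    for i in range(1, len(s)):
        if abs(s.iloc[i] - last) >= move_size:
            last = s.iloc[i]
            yield s.iloc[[i]]

def mover_df(df, col, move_size=10):
    """The same walk over one column of a DataFrame, yielding rows."""
    last = df[col].iloc[0]
    yield df.iloc[0]
    for i in range(1, len(df)):
        if abs(df[col].iloc[i] - last) >= move_size:
            last = df[col].iloc[i]
            yield df.iloc[i]

moves = pd.concat(mover(s, move_size=10))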
moves.plot(legend=True)
s.plot(legend=True)
df = s.to_frame()
moves_df = pd.concat(mover_df(df, 'A', 10), axis=1).T
moves_df.A.plot(label='_A_', legend=True)
df.A.plot(legend=True)
If you have a dataframe with missing data (NaN, pd.NaT, None) you can filter out incomplete rows
df = pd.DataFrame([[0,1,2,3],
[None,5,None,pd.NaT],
[8,None,10,None],
[11,12,13,pd.NaT]],columns=list('ABCD'))
df
# Output:
# A B C D
# 0 0 1 2 3
# 1 NaN 5 NaN NaT
# 2 8 NaN 10 None
# 3 11 12 13 NaT
DataFrame.dropna drops all rows containing at least one field with missing data
df.dropna()
# Output:
# A B C D
# 0 0 1 2 3
To just drop the rows that are missing data at specified columns use subset
df.dropna(subset=['C'])
# Output:
# A B C D
# 0 0 1 2 3
# 2 8 NaN 10 None
# 3 11 12 13 NaT
Use the option inplace = True for in-place replacement with the filtered frame.
Let
df = pd.DataFrame({'col_1':['A','B','A','B','C'], 'col_2':[3,4,3,5,6]})
df
# Output:
# col_1 col_2
# 0 A 3
# 1 B 4
# 2 A 3
# 3 B 5
# 4 C 6
df['col_1'].unique()
# Output:
# array(['A', 'B', 'C'], dtype=object)
To simulate the select unique col_1, col_2 of SQL you can use DataFrame.drop_duplicates():
df.drop_duplicates()
# col_1 col_2
# 0 A 3
# 1 B 4
# 3 B 5
# 4 C 6
This will get you all the unique rows in the dataframe. So if
df = pd.DataFrame({'col_1':['A','B','A','B','C'], 'col_2':[3,4,3,5,6],
'col_3':[0,0.1,0.2,0.3,0.4]})
df
# Output:
# col_1 col_2 col_3
# 0 A 3 0.0
# 1 B 4 0.1
# 2 A 3 0.2
# 3 B 5 0.3
# 4 C 6 0.4
df.drop_duplicates()
# col_1 col_2 col_3
# 0 A 3 0.0
# 1 B 4 0.1
# 2 A 3 0.2
# 3 B 5 0.3
# 4 C 6 0.4
To specify the columns to consider when selecting unique records, pass them as arguments
df = pd.DataFrame({'col_1':['A','B','A','B','C'], 'col_2':[3,4,3,5,6],
'col_3':[0,0.1,0.2,0.3,0.4]})
df.drop_duplicates(['col_1','col_2'])
# Output:
# col_1 col_2 col_3
# 0 A 3 0.0
# 1 B 4 0.1
# 3 B 5 0.3
# 4 C 6 0.4
Source: How to “select distinct” across multiple data frame columns in pandas?.
Chapter 19: IO for Google BigQuery
Examples
Reading data from BigQuery with user account credentials
In order to run a query in BigQuery you need to have your own BigQuery project. We can request
some public sample data:
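The query call itself was lost in extraction; a sketch against a public dataset consistent with
the result below (the project id is a placeholder):

In [1]: import pandas as pd

In [2]: data = pd.read_gbq('''SELECT title, id, num_characters
   ...:                       FROM [publicdata:samples.wikipedia]
   ...:                       LIMIT 5''', project_id='<your-project-id>')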
If you are operating from a remote machine without access to a browser, you can pass the flag:

--noauth_local_webserver

If you are operating from a local machine, a browser will pop up. After granting privileges,
pandas will continue with the output:
Authentication successful.
Requesting query... ok.
Query running...
Query done.
Processed: 13.8 Gb
Retrieving results...
Got 5 rows.
Result:
In [3]: data
Out[3]:
title id num_characters
0 Fusidic acid 935328 1112
1 Clark Air Base 426241 8257
2 Watergate scandal 52382 25790
3 2005 35984 75813
4 .BLP 2664340 1659
As a side effect, pandas will create the JSON file bigquery_credentials.dat, which will allow you
to run further queries without needing to grant privileges any more:
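The follow-up query was lost; the count below matches the size of the same public table, so
presumably:

In [9]: pd.read_gbq('SELECT COUNT(1) AS cnt FROM [publicdata:samples.wikipedia]',
   ...:             project_id='<your-project-id>')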
Out[9]:
cnt
0 313797035
If you have created a service account and have a private key JSON file for it, you can use this
file to authenticate with pandas:
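The call was lost in extraction; a sketch consistent with the result below (the file path is a
placeholder):

In [5]: pd.read_gbq('''SELECT corpus, SUM(word_count) AS words
   ...:                FROM [publicdata:samples.shakespeare]
   ...:                GROUP BY corpus ORDER BY words DESC LIMIT 5''',
   ...:             project_id='<your-project-id>',
   ...:             private_key='path/to/private_key.json')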
Out[5]:
corpus words
0 hamlet 32446
1 kingrichardiii 31868
2 coriolanus 29535
3 cymbeline 29231
4 2kinghenryiv 28241
Chapter 20: JSON
Examples
Read JSON
import json
import pandas as pd

with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)
{"A": 1, "B": 2}
{"A": 3, "B": 4}
pd.read_json('file.json', lines=True)
# Output:
# A B
# 0 1 2
# 1 3 4
Chapter 21: Making Pandas Play Nice With Native Python Datatypes
Examples
Moving Data Out of Pandas Into Native Python and Numpy Data Structures
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'],
'D': [True, False, True]})
In [2]: df
Out[2]:
A B C D
0 1 1.0 a True
1 2 2.0 b False
2 3 3.0 c True
In [3]: df['A'].tolist()
Out[3]: [1, 2, 3]
In [4]: df.tolist()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-fc6763af1ff7> in <module>()
----> 1 df.tolist()

AttributeError: 'DataFrame' object has no attribute 'tolist'
In [5]: df['B'].values
Out[5]: array([ 1., 2., 3.])
You can also get a numpy array containing the values of the entire dataframe:
In [6]: df.values
Out[6]:
array([[1, 1.0, 'a', True],
[2, 2.0, 'b', False],
[3, 3.0, 'c', True]], dtype=object)
In [7]: df['C'].to_dict()
Out[7]: {0: 'a', 1: 'b', 2: 'c'}
In [8]: df.to_dict()
Out[8]:
{'A': {0: 1, 1: 2, 2: 3},
'B': {0: 1.0, 1: 2.0, 2: 3.0},
'C': {0: 'a', 1: 'b', 2: 'c'},
'D': {0: True, 1: False, 2: True}}
The to_dict method has a few different parameters to adjust how the dictionaries are formatted.
To get a list of dicts for each row:
In [9]: df.to_dict('records')
Out[9]:
[{'A': 1, 'B': 1.0, 'C': 'a', 'D': True},
{'A': 2, 'B': 2.0, 'C': 'b', 'D': False},
{'A': 3, 'B': 3.0, 'C': 'c', 'D': True}]
See the documentation for the full list of options available to create dictionaries.
Read Making Pandas Play Nice With Native Python Datatypes online:
http://www.riptutorial.com/pandas/topic/8008/making-pandas-play-nice-with-native-python-datatypes
Chapter 22: Map Values
Remarks
It should be mentioned that if the key value does not exist then this will raise a KeyError; in
those situations it may be better to use merge or get, which allow you to specify a default value
if the key doesn't exist.
Examples
Map from Dictionary
Starting from a dataframe df:

U L
111 en
112 en
112 es
113 es
113 ja
113 zh
114 es
Imagine you want to add a new column called S taking values from the following dictionary:
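The dictionary itself was lost in extraction; reconstructed from the result below:

d = {111: 'en', 112: 'en', 113: 'es', 114: 'es'}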
You can use map to perform a lookup on keys returning the corresponding values as a new column:
df['S'] = df['U'].map(d)
that returns:
U L S
111 en en
112 en en
112 es en
113 es es
113 ja es
113 zh es
114 es es
Chapter 23: Merge, join, and concatenate
Syntax
• DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True,
indicator=False)
• If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining
indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters
Parameters    Explanation

right         DataFrame
left_index    boolean, default False. Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index   boolean, default False. Use the index from the right DataFrame as the join key. Same caveats as left_index.
sort          boolean, default False. Sort the join keys lexicographically in the result DataFrame.
suffixes      2-length sequence (tuple, list, ...). Suffix to apply to overlapping column names in the left and right side, respectively.
Examples
Merging two DataFrames
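The two frames (a reconstruction of the creation lines, which were lost in extraction):

In [1]: df1 = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

In [2]: df2 = pd.DataFrame({'y': ['b', 'c', 'd'], 'z': [4, 5, 6]})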
In [3]: df1
Out[3]:
x y
0 1 a
1 2 b
2 3 c
In [4]: df2
Out[4]:
y z
0 b 4
1 c 5
2 d 6
Inner join:
Uses the intersection of keys from two DataFrames.
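A sketch of the call and its result (the original snippet was lost):

In [5]: df1.merge(df2)  # how='inner' is the default
Out[5]:
   x  y  z
0  2  b  4
1  3  c  5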
Outer join:
Uses the union of the keys from two DataFrames.
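In [6]: df1.merge(df2, how='outer')
Out[6]:
     x  y    z
0  1.0  a  NaN
1  2.0  b  4.0
2  3.0  c  5.0
3  NaN  d  6.0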
Left join:
Uses only keys from left DataFrame.
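In [7]: df1.merge(df2, how='left')
Out[7]:
   x  y    z
0  1  a  NaN
1  2  b  4.0
2  3  c  5.0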
Right Join
Uses only keys from right DataFrame.
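In [8]: df1.merge(df2, how='right')
Out[8]:
     x  y  z
0  2.0  b  4
1  3.0  c  5
2  NaN  d  6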
In [61]: df1
Out[61]:
col1 col2
0 11 21
1 12 22
2 13 23
In [62]: df2
Out[62]:
col1 col2
0 111 121
1 112 122
2 113 123
In [63]: df3
Out[63]:
col1 col2
0 211 221
1 212 222
2 213 223
merge / join / concatenate data frames [df1, df2, df3] vertically - add rows
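For example (a reconstruction of the lost snippet):

In [64]: pd.concat([df1, df2, df3], ignore_index=True)
Out[64]:
   col1  col2
0    11    21
1    12    22
2    13    23
3   111   121
4   112   122
5   113   123
6   211   221
7   212   222
8   213   223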
Merge
T1
id x y
8 42 1.9
9 30 1.9
T2
id signal
8 55
8 56
8 59
9 57
9 58
9 60
The goal is to merge T2 onto T1 so that each id's three signal values become columns:

id x y s1 s2 s3
8 42 1.9 55 56 59
9 30 1.9 57 58 60

That is, create columns s1, s2 and s3, each corresponding to a row (the number of rows per id
is always fixed and equal to 3).
join takes an optional on argument (a column or multiple column names), which specifies that the
passed DataFrame is to be aligned on that column. So the solution can be as shown below:
df = df1.merge(df2.groupby('id')['signal'].apply(lambda x:
x.reset_index(drop=True)).unstack().reset_index())
df
Out[63]:
id x y 0 1 2
0 8 42 1.9 55 56 59
1 9 30 1.9 57 58 60
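If you want the s1, s2, s3 names from the desired output, rename the unstacked columns (an optional extra step, not in the original):

df.columns = ['id', 'x', 'y', 's1', 's2', 's3']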
If I separate them:
df2t = df2.groupby('id')['signal'].apply(lambda x:
x.reset_index(drop=True)).unstack().reset_index()
df2t
Out[59]:
id 0 1 2
0 8 55 56 59
1 9 57 58 60
df = df1.merge(df2t)
df
Out[61]:
id x y 0 1 2
0 8 42 1.9 55 56 59
1 9 30 1.9 57 58 60
Merging key names are different
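A hypothetical sketch (the original example was lost): when the join column is named differently on each side, pass left_on and right_on:

left = pd.DataFrame({'id': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [2, 3, 4], 'b': ['q', 'r', 's']})
left.merge(right, left_on='id', right_on='key')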
Concatenate dataframes
Glued vertically
Glued horizontally
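A sketch of the two calls (the original snippets were lost):

pd.concat([df1, df2])          # glued vertically: rows of df2 appended below df1
pd.concat([df1, df2], axis=1)  # glued horizontally: columns placed side by side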
left = pd.DataFrame([['a', 1], ['b', 2]], list('XY'), list('AB'))
left

   A  B
X  a  1
Y  b  2
right = pd.DataFrame([['a', 3], ['b', 4]], list('XY'), list('AC'))
right
A C
X a 3
Y b 4
join
Think of join as wanting to combine two dataframes based on their respective indexes. If there are
overlapping columns, join will want you to add a suffix to the overlapping column name from the left
dataframe. Our two dataframes do have an overlapping column name A.
left.join(right, lsuffix='_')
A_ B A C
X a 1 a 3
Y b 2 b 4
Notice the index is preserved and we have 4 columns. 2 columns from left and 2 from right.
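The call producing the frame below (the snippet was lost to the page break) would be:

left.join(right.reset_index(), lsuffix='_', how='outer')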
A_ B index A C
0 NaN NaN X a 3.0
1 NaN NaN Y b 4.0
X a 1.0 NaN NaN NaN
Y b 2.0 NaN NaN NaN
I used an outer join to better illustrate the point. If the indexes do not align, the result will be the
union of the indexes.
We can tell join to use a specific column in the left dataframe to use as the join key, but it will still
use the index from the right.
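left.reset_index().join(right, on='index', lsuffix='_')  # a reconstruction of the lost snippet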
index A_ B A C
0 X a 1 a 3
1 Y b 2 b 4
merge
Think of merge as aligning on columns. By default merge will look for overlapping columns in which
to merge on. merge gives better control over merge keys by allowing the user to specify a subset of
the overlapping columns to use with parameter on, or to separately allow the specification of which
columns on the left and which columns on the right to merge by.
merge will return a combined dataframe in which the index will be destroyed.
This simple example finds the overlapping column to be 'A' and combines based on it.
left.merge(right)
A B C
0 a 1 3
1 b 2 4
You can explicitly specify that you are merging on the index with the left_index or right_index
parameter:
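left.merge(right, left_index=True, right_index=True, suffixes=('_', ''))  # reconstructed call, consistent with the output below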
A_ B A C
X a 1 a 3
Y b 2 b 4
Chapter 24: Meta: Documentation Guidelines
Remarks
This meta post is similar to the python version
http://stackoverflow.com/documentation/python/394/meta-documentation-
guidelines#t=201607240058406359521.
Please make edit suggestions, and comment on those (in lieu of proper comments), so we can
flesh out/iterate on these suggestions :)
Examples
Showing code snippets and output
ipython notation:
In [12]: df
Out[12]:
0 1
0 1 2
1 3 4
Alternatively (this is popular over in the python documentation) and more concisely:
df[0]
# Out:
# 0 1
# 1 3
# Name: 0, dtype: int64
Note: The distinction between output and printing. ipython makes this clear (the prints occur before
the output is returned):
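In [21]: [print(i) for i in range(2)]  # a likely reconstruction of the lost input line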
0
1
Out[21]: [None, None]
style
Use the pandas library as pd; this can be assumed (the import does not need to be in every
example):
import pandas as pd
PEP8!
• 4 space indentation
• kwargs should use no spaces f(a=1)
• 80 character limit (the entire line fitting in the rendered code snippet should be strongly
preferred)
Most examples will work across multiple versions, if you are using a "new" feature you should
mention when this was introduced.
Example: sort_values.
print statements
Most of the time printing should be avoided as it can be a distraction (Out should be preferred).
That is:
a
# Out: 1
print(a)
# prints: 1
Chapter 25: Missing Data
Remarks
Should we include the non-documented ffill and bfill?
Examples
Interpolation
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,np.nan,3,np.nan],
'B':[1.2,7,3,0,8]})
df['C'] = df.A.interpolate()
df['D'] = df.A.interpolate(method='spline', order=1)
print (df)
A B C D
0 1.0 1.2 1.0 1.000000
1 2.0 7.0 2.0 2.000000
2 NaN 3.0 2.5 2.428571
3 3.0 0.0 3.0 3.000000
4 NaN 8.0 3.0 3.714286
In order to check whether a value is NaN, isnull() or notnull() functions can be used.
Note that np.nan == np.nan returns False so you should avoid comparison against np.nan:
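In [4]: np.nan == np.nan
Out[4]: False

In [5]: pd.isnull(np.nan)
Out[5]: True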
Both functions are also defined as methods on Series and DataFrames.
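In [5]: ser = pd.Series([1.0, 2.0, np.nan, 4.0])  # a reconstruction matching the output below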
In [6]: ser.isnull()
Out[6]:
0 False
1 False
2 True
3 False
dtype: bool
Testing on DataFrames:
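In [9]: df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 2, 3]})  # a reconstruction matching the output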
In [10]: df.notnull() # Opposite of .isnull(). If the value is not NaN, returns True.
Out[10]:
A B
0 True False
1 False True
2 True True
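Missing values can be replaced with fillna. A frame matching the output below (its creation line was lost):

In [11]: df = pd.DataFrame([[1, 2, None, 3], [4, None, 5, 6],
    ...:                    [7, 8, 9, 10], [None, None, None, None]])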
Out[11]:
0 1 2 3
0 1.0 2.0 NaN 3.0
1 4.0 NaN 5.0 6.0
2 7.0 8.0 9.0 10.0
3 NaN NaN NaN NaN
In [12]: df.fillna(0)
Out[12]:
0 1 2 3
0 1.0 2.0 0.0 3.0
1 4.0 0.0 5.0 6.0
2 7.0 8.0 9.0 10.0
3 0.0 0.0 0.0 0.0
This returns a new DataFrame. If you want to change the original DataFrame, either use the
inplace parameter (df.fillna(0, inplace=True)) or assign it back to original DataFrame (df =
df.fillna(0)).
When creating a DataFrame None (python's missing value) is converted to NaN (pandas' missing
value):
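In [11]: df = pd.DataFrame([[1, 2, None, 3], [4, None, 5, 6],
    ...:                    [7, 8, 9, 10], [None, None, None, None]])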
Out[11]:
0 1 2 3
0 1.0 2.0 NaN 3.0
1 4.0 NaN 5.0 6.0
2 7.0 8.0 9.0 10.0
3 NaN NaN NaN NaN
In [12]: df.dropna()
Out[12]:
0 1 2 3
2 7.0 8.0 9.0 10.0
This returns a new DataFrame. If you want to change the original DataFrame, either use the
inplace parameter (df.dropna(inplace=True)) or assign it back to original DataFrame (df =
df.dropna()).
In [13]: df.dropna(how='all')
Out[13]:
0 1 2 3
0 1.0 2.0 NaN 3.0
1 4.0 NaN 5.0 6.0
2 7.0 8.0 9.0 10.0
Chapter 26: MultiIndex
Examples
Select from MultiIndex by Level
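A frame with a two-level float index (a reconstruction; the values are random, so yours will differ):

In [12]: df = pd.DataFrame(np.random.randn(6, 3), columns=list('ABC')).set_index(['A', 'B'])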
In [13]: df
Out[13]:
C
A B
0.902764 -0.259656 -1.864541
-0.695893 0.308893 0.125199
1.696989 -1.221131 -2.975839
-1.132069 -1.086189 -1.945467
2.294835 -1.765507 1.567853
-1.788299 2.579029 0.792919
In [14]: df.index.get_level_values('A')
Out[14]:
Float64Index([0.902764041011, -0.69589264969, 1.69698924476, -1.13206872067,
2.29483481146, -1.788298829],
dtype='float64', name='A')
Or by number of level:
In [15]: df.index.get_level_values(level=0)
Out[15]:
Float64Index([0.902764041011, -0.69589264969, 1.69698924476, -1.13206872067,
2.29483481146, -1.788298829],
dtype='float64', name='A')
In [17]: df.loc[(df.index.get_level_values('A') > 0.5) & (df.index.get_level_values('B') < 0)]
Out[17]:
C
A B
0.902764 -0.259656 -1.864541
1.696989 -1.221131 -2.975839
2.294835 -1.765507 1.567853
In [18]: df.xs(key=0.9027639999999999)
Out[18]:
C
B
-0.259656 -1.864541
In [11]: df = pd.DataFrame({'a':[1,1,1,2,2,3],'b':[4,4,5,5,6,7,],'c':[10,11,12,13,14,15]})
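In [12]: df = df.set_index(['a', 'b'])  # reconstructing the lost step that builds the MultiIndex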
In [13]: df
Out[13]:
c
a b
1 4 10
4 11
5 12
2 5 13
6 14
3 7 15
You can iterate by any level of the MultiIndex. For example, level=0 (you can also select the level
by name e.g. level='a'):
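The loop (reconstructed; the first group's output was lost to the page break):

for level, group in df.groupby(level=0):
    print (group)
    print ('---')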
2 5 13
6 14
---
c
a b
3 7 15
This example shows how to use column data to set a MultiIndex in a pandas.DataFrame.
In [1]: df = pd.DataFrame([['one', 'A', 100], ['two', 'A', 101], ['three', 'A', 102],
...: ['one', 'B', 103], ['two', 'B', 104], ['three', 'B', 105]],
...: columns=['c1', 'c2', 'c3'])
In [2]: df
Out[2]:
c1 c2 c3
0 one A 100
1 two A 101
2 three A 102
3 one B 103
4 two B 104
5 three B 105
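In [3]: df.set_index(['c1', 'c2'])
Out[3]:
            c3
c1    c2
one   A    100
two   A    101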
three A 102
one B 103
two B 104
three B 105
You can sort the index right after you set it:
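In [4]: df_indexed = df.set_index(['c1', 'c2']).sort_index()  # a reconstruction of the lost line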
Having a sorted index will result in slightly more efficient lookups on the first level:
After the index has been set, you can perform lookups for specific records or groups of records:
In [10]: df_indexed.loc['one']
Out[10]:
c3
c2
A 100
B 103
How to change MultiIndex columns to standard columns
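Suppose df was built with MultiIndex columns, e.g. (a reconstruction; values are random, so yours will differ):

midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']], labels=[[1, 1, 0], [1, 0, 1]])
df = pd.DataFrame(np.random.randn(2, 3), columns=midx)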
In [2]: df
Out[2]:
one zero
y x y
0 0.785806 -0.679039 0.513451
1 -0.337862 -0.350690 -1.423253
If you want to change the columns to standard columns (not MultiIndex), just rename the columns.
df.columns = ['A','B','C']
In [3]: df
Out[3]:
A B C
0 0.785806 -0.679039 0.513451
1 -0.337862 -0.350690 -1.423253
df = pd.DataFrame(np.random.randn(2,3), columns=['a','b','c'])
In [91]: df
Out[91]:
a b c
0 -0.911752 -1.405419 -0.978419
1 0.603888 -1.187064 -0.035883
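In [93]: df.columns = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
    ...:                            labels=[[1, 1, 0], [1, 0, 1]])  # reconstructing the lost assignment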
In [94]: df
Out[94]:
one zero
y x y
0 -0.911752 -1.405419 -0.978419
1 0.603888 -1.187064 -0.035883
MultiIndex Columns
MultiIndex can also be used to create DataFrames with multilevel columns. Just use the columns
keyword in the DataFrame command.
midx = pd.MultiIndex(levels=[['zero', 'one'], ['x','y']], labels=[[1,1,0,],[1,0,1,]])
df = pd.DataFrame(np.random.randn(2,3), columns=midx)
In [86]: df
Out[86]:
one zero
y x y
0 0.625695 2.149377 0.006123
1 -1.392909 0.849853 0.005477
To view all elements in the index, change the print option that “sparsifies” the display of the
MultiIndex.
pd.set_option('display.multi_sparse', False)
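A minimal frame consistent with the output below (reconstructed; one row per group):

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'B': [1, 2, 3, 5, 8, 2, 4, 9],
                   'C': [107, 102, 115, 92, 98, 87, 104, 123]})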
df.groupby(['A','B']).mean()
# Output:
# C
# A B
# a 1 107
# a 2 102
# a 3 115
# b 5 92
# b 8 98
# c 2 87
# c 4 104
# c 9 123
Chapter 27: Pandas Datareader
Remarks
The Pandas datareader is a sub package that allows one to create a dataframe from various
internet datasources, currently including:
• Yahoo! Finance
• Google Finance
• St.Louis FED (FRED)
• Kenneth French’s data library
• World Bank
• Google Analytics
Examples
Reading financial data (for multiple tickers) into pandas panel - demo
from datetime import datetime

import pandas_datareader.data as wb  # import alias assumed from the usage below

stocklist = ['AAPL','GOOG','FB','AMZN','COP']
start = datetime(2016,6,8)
end = datetime(2016,6,11)
p = wb.DataReader(stocklist, 'yahoo', start, end)
In [388]: p.axes
Out[388]:
[Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object'),
DatetimeIndex(['2016-06-08', '2016-06-09', '2016-06-10'], dtype='datetime64[ns]',
name='Date', freq='D'),
Index(['AAPL', 'AMZN', 'COP', 'FB', 'GOOG'], dtype='object')]
In [389]: p.keys()
Out[389]: Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')
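In [390]: p['Adj Close']  # a likely reconstruction of the lost input line
Out[390]:
                 AAPL        AMZN        COP          FB        GOOG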
Date
2016-06-08 98.940002 726.640015 47.490002 118.389999 728.280029
2016-06-09 99.650002 727.650024 46.570000 118.559998 728.580017
2016-06-10 98.830002 717.909973 44.509998 116.620003 719.409973
In [391]: p['Volume']
Out[391]:
AAPL AMZN COP FB GOOG
Date
2016-06-08 20812700.0 2200100.0 9596700.0 14368700.0 1582100.0
2016-06-09 26419600.0 2163100.0 5389300.0 13823400.0 985900.0
2016-06-10 31462100.0 3409500.0 8941200.0 18412700.0 1206000.0
In [394]: p[:,:,'AAPL']
Out[394]:
Open High Low Close Volume Adj Close
Date
2016-06-08 99.019997 99.559998 98.680000 98.940002 20812700.0 98.940002
2016-06-09 98.500000 99.989998 98.459999 99.650002 26419600.0 99.650002
2016-06-10 98.529999 99.349998 98.480003 98.830002 31462100.0 98.830002
In [395]: p[:,'2016-06-10']
Out[395]:
Open High Low Close Volume Adj Close
AAPL 98.529999 99.349998 98.480003 98.830002 31462100.0 98.830002
AMZN 722.349976 724.979980 714.210022 717.909973 3409500.0 717.909973
COP 45.900002 46.119999 44.259998 44.509998 8941200.0 44.509998
FB 117.540001 118.110001 116.260002 116.620003 18412700.0 116.620003
GOOG 719.469971 725.890015 716.429993 719.409973 1206000.0 719.409973
# Convert the adjusted closing prices to cumulative returns.
aapl = p['Adj Close']['AAPL']  # an assumed selection; the original line was lost
returns = aapl.pct_change()

>>> ((1 + returns).cumprod() - 1).plot(title='AAPL Cumulative Returns')
Chapter 28: Pandas IO tools (reading and
saving data sets)
Remarks
The pandas official documentation includes a page on IO Tools with a list of relevant functions to
read and write to files, as well as some examples and common parameters.
Examples
Reading CSV file into a pandas data frame when there is no header row
File:
1;str_data;12;1.4
3;str_data;22;42.33
4;str_data;2;3.44
2;str_data;43;43.34
7;str_data;25;23.32
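The read call (lost in extraction) relies on the fact that when names lists one fewer entry than the file has columns, the leftmost column becomes the index:

df = pd.read_csv('data_file.csv', sep=';', names=['a', 'b', 'c'])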
df
Out:
a b c
1 str_data 12 1.40
3 str_data 22 42.33
4 str_data 2 3.44
2 str_data 43 43.34
7 str_data 25 23.32
Using HDFStore
import string
import numpy as np
import pandas as pd
df = pd.DataFrame({
'int32': np.random.randint(0, 10**6, 10),
'int64': np.random.randint(10**7, 10**9, 10).astype(np.int64)*10,
'float': np.random.rand(10),
'string': np.random.choice([c*10 for c in string.ascii_uppercase], 10),
})
In [71]: df
Out[71]:
float int32 int64 string
0 0.649978 848354 5269162190 DDDDDDDDDD
1 0.346963 490266 6897476700 OOOOOOOOOO
2 0.035069 756373 6711566750 ZZZZZZZZZZ
3 0.066692 957474 9085243570 FFFFFFFFFF
4 0.679182 665894 3750794810 MMMMMMMMMM
5 0.861914 630527 6567684430 TTTTTTTTTT
6 0.697691 825704 8005182860 FFFFFFFFFF
7 0.474501 942131 4099797720 QQQQQQQQQQ
8 0.645817 951055 8065980030 VVVVVVVVVV
9 0.083500 349709 7417288920 EEEEEEEEEE
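Appending the frame to an HDFStore with the int32, int64 and string columns indexed (a sketch; the original lines and the head of the table printout were lost, and 'df_key' is an assumed key name):

store = pd.HDFStore('test.h5')
store.append('df_key', df, data_columns=['int32', 'int64', 'string'])
print(store.get_storer('df_key').table)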
"int64": Int64Col(shape=(), dflt=0, pos=3),
"string": StringCol(itemsize=10, shape=(), dflt=b'', pos=4)}
byteorder := 'little'
chunkshape := (1724,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"int32": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"int64": Index(6, medium, shuffle, zlib(1)).is_csi=False}
Read & merge multiple CSV files (with the same structure) into one DF
import os
import glob
import pandas as pd

path = 'C:/Users/csvfiles'
fmask = os.path.join(path, '*mask*.csv')

# read every matching file and stack them vertically
# (the concat step was lost in extraction; this is the natural reconstruction)
df = pd.concat((pd.read_csv(f) for f in glob.glob(fmask)), ignore_index=True)

print(df.head())
If you want to merge CSV files horizontally (adding columns), use axis=1 when calling pd.concat()
function:
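df = pd.concat([pd.read_csv(f) for f in glob.glob(fmask)], axis=1)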
Dates in CSV files often come in different formats; they can be parsed using read_csv's
parse_dates parameter or a custom parsing function (see the date example later in this chapter).
List comprehension
All files are in folder files. First create list of DataFrames and then concat them:
import pandas as pd
import glob
#a.csv
#a,b
#1,2
#5,8
#b.csv
#a,b
#9,6
#6,4
#c.csv
#a,b
#4,3
#7,0
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
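#concat by rows
df = pd.concat(dfs, ignore_index=True)
print (df)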
a b
0 1 2
1 5 8
2 9 6
3 6 4
4 4 3
5 7 0
#concat by columns
df1 = pd.concat(dfs, axis=1)
print (df1)
a b a b a b
0 1 2 9 6 4 3
1 5 8 6 4 7 0
#reset column names
df1 = pd.concat(dfs, axis=1, ignore_index=True)
print (df1)
0 1 2 3 4 5
0 1 2 9 6 4 3
1 5 8 6 4 7 0
Read in chunks
import pandas as pd

chunksize = 10 ** 5  # number of rows per chunk; pick a value that fits in memory
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # process() is a placeholder for your own logic;
                    # each chunk is released before the next one is read
df.to_csv(file_name)                          # default: comma-separated, with header and index
df.to_csv(file_name, sep="|")                 # custom field delimiter
df.to_csv(file_name, header=False)            # do not write column names
df.to_csv(file_name, header=['A', 'B', 'C'])  # write aliases for the column names
df.to_csv(file_name, encoding='utf-8')        # specify the output encoding
File:
index,header1,header2,header3
1,str_data,12,1.4
3,str_data,22,42.33
4,str_data,2,3.44
2,str_data,43,43.34
7,str_data,25,23.32
Code:
pd.read_csv('data_file.csv')
Output:
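   index   header1  header2  header3
0      1  str_data       12     1.40
1      3  str_data       22    42.33
2      4  str_data        2     3.44
3      2  str_data       43    43.34
4      7  str_data       25    23.32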
index_col: with index_col = n (n an integer) you tell pandas to use column n to index the
DataFrame. In the above example:
pd.read_csv('data_file.csv', index_col=0)
Output:
       header1  header2  header3
index
1 str_data 12 1.40
3 str_data 22 42.33
4 str_data 2 3.44
2 str_data 43 43.34
7 str_data 25 23.32
pd.read_csv('data_file.csv', index_col=0,skip_blank_lines=False)
Output:
File:
date_begin;date_end;header3;header4;header5
1/1/2017;1/10/2017;str_data;1001;123,45
2/1/2017;2/10/2017;str_data;1001;67,89
3/1/2017;3/10/2017;str_data;1001;0
Both date columns can be parsed while reading (a reconstruction of the lost snippet):

pd.read_csv('data_file.csv', sep=';', parse_dates=['date_begin', 'date_end'])

By default, the date format is inferred. If you want to specify a date format you can use, for
instance:
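dateparse = lambda d: pd.to_datetime(d, format='%d/%m/%Y')
pd.read_csv('data_file.csv', sep=';',
            parse_dates=['date_begin', 'date_end'], date_parser=dateparse)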
Output:
date_begin date_end header3 header4 header5
0 2017-01-01 2017-10-01 str_data 1001 123,45
1 2017-01-02 2017-10-02 str_data 1001 67,89
2 2017-01-03 2017-10-03 str_data 1001 0
More information on the function's parameters can be found in the official documentation.
Testing read_csv
import pandas as pd
import io
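A minimal sketch: io.StringIO lets you exercise read_csv against an in-memory string instead of a file:

data = io.StringIO('a,b\n1,2\n3,4')
df = pd.read_csv(data)
print(df)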
pd.read_excel('path_to_file.xls', sheetname='Sheet1')
There are many parsing options for read_excel (similar to the options in read_csv).
pd.read_excel('path_to_file.xls',
sheetname='Sheet1', header=[0, 1, 2],
skiprows=3, index_col=0) # etc.
# raw_data is assumed to be a dict mapping column names to value lists
df = pd.DataFrame(raw_data, columns=raw_data.keys())
df.to_csv('data_file.csv')
You can specify a column that contains dates so pandas would automatically parse them when
reading from the csv
pandas.read_csv('data_file.csv', parse_dates=['date_column'])
# parse an Apache-style access log; the sep regex splits on whitespace
# that falls outside quoted strings and square brackets
df = pd.read_csv(log_file,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
engine='python',
usecols=[0, 3, 4, 5, 6, 7, 8],
names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
na_values='-',
header=None
)
Chapter 29: pd.DataFrame.apply
Examples
pandas.DataFrame.apply Basic Usage
The pandas.DataFrame.apply() method is used to apply a given function to an entire DataFrame ---
for example, computing the square root of every entry of a given DataFrame or summing across
each row of a DataFrame to return a Series.
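A reconstruction of the example frame (its creation lines were lost; the values match the display below):

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'fst': [40, 58, 95, 88, 25, 62, 18],
...                    'snd': [94, 93, 95, 40, 27, 64, 92]})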
>>> df
fst snd
0 40 94
1 58 93
2 95 95
3 88 40
4 25 27
5 62 64
6 18 92
>>> df.apply(np.sum)
fst 386
snd 505
dtype: int64
Read pd.DataFrame.apply online: http://www.riptutorial.com/pandas/topic/7024/pd-dataframe-
apply
Chapter 30: Read MySQL to DataFrame
Examples
Using sqlalchemy and PyMySQL
import pandas as pd
from sqlalchemy import create_engine

cnx = create_engine('mysql+pymysql://username:password@server:3306/database').connect()
sql = 'select * from mytable'
df = pd.read_sql(sql, cnx)
To fetch large data we can use generators in pandas and load data in chunks.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
# sqlalchemy engine
engine = create_engine(URL(
    drivername="mysql",
    username="user",
    password="password",
    host="host",
    database="database",
))
conn = engine.connect()
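Then iterate over the result in chunks (a sketch; the chunk size and the process() step are placeholders):

chunk_size = 10000
for df_chunk in pd.read_sql('select * from mytable', conn, chunksize=chunk_size):
    process(df_chunk)  # each iteration yields a DataFrame of up to chunk_size rows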
Chapter 31: Read SQL Server to Dataframe
Examples
Using pyodbc
import time

import pandas.io.sql as pdsql
import pyodbc

# Retry the connection once a minute until the query succeeds.
# dsn, uid, pwd and query are supplied by the caller; the loop follows the
# original snippet, repackaged as a function so the return statement is valid.
def read_query(dsn, uid, pwd, query, params=None):
    connstr = "DSN={};UID={};PWD={}".format(dsn, uid, pwd)
    while True:
        try:
            with pyodbc.connect(connstr, autocommit=True) as con:
                if params is not None:
                    df = pdsql.read_sql(query, con, params=params)
                else:
                    df = pdsql.read_sql(query, con)
            return df
        except pyodbc.OperationalError:
            time.sleep(60)  # one minute; could be changed
Chapter 32: Reading files into pandas
DataFrame
Examples
Read table into DataFrame
Table file with header, footer, row names, and index column:
file: table.txt
This is a footer because your boss does not understand data files
code:
import pandas as pd
# index_col=0 tells pandas that column 0 is the index and not data
pd.read_table('table.txt', delim_whitespace=True, skiprows=3, skipfooter=2, index_col=0)
output:
name occupation
index
1 Alice Salesman
2 Bob Engineer
3 Charlie Janitor
file: table.txt (no header row, no index column):

Alice Salesman
Bob Engineer
Charlie Janitor
code:
import pandas as pd
pd.read_table('table.txt', delim_whitespace=True, names=['name','occupation'])
output:
name occupation
0 Alice Salesman
1 Bob Engineer
2 Charlie Janitor
file: table.csv

index;name;occupation
1;Alice;Saleswoman
2;Bob;Engineer
3;Charlie;Janitor
code:
import pandas as pd
pd.read_csv('table.csv', sep=';', index_col=0)
output:
name occupation
index
1 Alice Saleswoman
2 Bob Engineer
3 Charlie Janitor
file: table.csv (no header row):

Alice,Saleswoman
Bob,Engineer
Charlie,Janitor
code:
import pandas as pd
pd.read_csv('table.csv', names=['name','occupation'])
output:
name occupation
0 Alice Saleswoman
1 Bob Engineer
2 Charlie Janitor
Sometimes we need to collect data from Google spreadsheets. We can use the gspread and
oauth2client libraries to collect the data. Here is an example:
Code:
import gspread
import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('your-authorization-file.json',
                                                               scope)
gc = gspread.authorize(credentials)
work_sheet = gc.open_by_key("spreadsheet-key-here")
sheet = work_sheet.sheet1
data = pd.DataFrame(sheet.get_all_records())
print(data.head())
Chapter 33: Resampling
Examples
Downsampling and upsampling
import pandas as pd
import numpy as np
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=10, freq='T')
df = pd.DataFrame({'Val' : np.random.randn(len(rng))}, index=rng)
print (df)
Val
2015-02-24 00:00:00 1.764052
2015-02-24 00:01:00 0.400157
2015-02-24 00:02:00 0.978738
2015-02-24 00:03:00 2.240893
2015-02-24 00:04:00 1.867558
2015-02-24 00:05:00 -0.977278
2015-02-24 00:06:00 0.950088
2015-02-24 00:07:00 -0.151357
2015-02-24 00:08:00 -0.103219
2015-02-24 00:09:00 0.410599
#5Min is same as 5T
print (df.resample('5T').sum())
Val
2015-02-24 00:00:00 7.251399
2015-02-24 00:05:00 0.128833
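#upsample to 30-second bins, forward-filling the new rows (a reconstruction
#of the snippet lost to the page break)
print (df.resample('30S').pad()[-3:])

                          Val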
2015-02-24 00:08:00 -0.103219
2015-02-24 00:08:30 -0.103219
2015-02-24 00:09:00 0.410599
Chapter 34: Reshaping and pivoting
Examples
Simple pivoting
import pandas as pd
import numpy as np
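A reconstruction of the frame shown below (its creation lines were lost):

df = pd.DataFrame({'Name': ['Mary', 'Josh', 'Jon', 'Lucy', 'Jane', 'Sue'],
                   'Position': ['Manager', 'Programmer', 'Manager', 'Manager', 'Programmer', 'Programmer'],
                   'City': ['Boston', 'New York', 'Chicago', 'Los Angeles', 'Chicago', 'Boston'],
                   'Age': [34, 37, 29, 40, 29, 31]},
                  columns=['Name', 'Position', 'City', 'Age'])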
print (df)
Name Position City Age
0 Mary Manager Boston 34
1 Josh Programmer New York 37
2 Jon Manager Chicago 29
3 Lucy Manager Los Angeles 40
4 Jane Programmer Chicago 29
5 Sue Programmer Boston 31
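The pivot itself (the original call and its output were lost):

print (df.pivot(index='Position', columns='City', values='Age'))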
If need reset index, remove columns names and fill NaN values:
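A sketch of those three steps:

p = df.pivot(index='Position', columns='City', values='Age')
p = p.reset_index()
p.columns.name = None
print (p.fillna(0))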
Pivoting with aggregating
import pandas as pd
import numpy as np
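A reconstruction of the frame shown below:

df = pd.DataFrame({'Name': ['Mary', 'Jon', 'Lucy', 'Jane', 'Sue', 'Mary', 'Lucy'],
                   'Position': ['Manager', 'Manager', 'Manager', 'Programmer', 'Programmer', 'Manager', 'Manager'],
                   'City': ['Boston', 'Chicago', 'Los Angeles', 'Chicago', 'Boston', 'Boston', 'Chicago'],
                   'Age': [35, 37, 40, 29, 31, 26, 28],
                   'Sex': ['Female', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female']},
                  columns=['Name', 'Position', 'City', 'Age', 'Sex'])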
print (df)
Name Position City Age Sex
0 Mary Manager Boston 35 Female
1 Jon Manager Chicago 37 Male
2 Lucy Manager Los Angeles 40 Female
3 Jane Programmer Chicago 29 Female
4 Sue Programmer Boston 31 Female
5 Mary Manager Boston 26 Female
6 Lucy Manager Chicago 28 Female
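The aggregating pivot (reconstructed; aggfunc='max' is the function consistent with the output shown):

print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc='max'))

City        Boston  Chicago  Los Angeles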
Position
Manager 35.0 37.0 40.0
Programmer 31.0 29.0 NaN
The information regarding the Sex has yet not been used. It could be switched by one of the
columns, or it could be added as another level:
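For example (a sketch; the original call and output were lost):

print (df.pivot_table(index='Position', columns=['City', 'Sex'], values='Age', aggfunc='max'))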
Multiple columns can be specified in any of the attributes index, columns and values.
For example, find the mean age, and standard deviation of random by Position:
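A sketch, assuming a column of random floats had been added (the original snippet was lost):

df['random'] = np.random.random(len(df))
print (df.pivot_table(index='Position', aggfunc={'Age': np.mean, 'random': np.std}))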
One can pass a list of functions to apply to the individual columns as well:
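print (df.pivot_table(index='Position', values='Age', aggfunc=[np.mean, np.std]))  # a sketch; the original output was lost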
Stacking and unstacking
import pandas as pd
import numpy as np
np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two','one', 'two']]))
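Reconstructing the lost creation lines (the seeded values match the output below):

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 2), index=index, columns=['A', 'B'])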
print (df.stack())
first second
bar one A 1.764052
B 0.400157
two A 0.978738
B 2.240893
foo one A 1.867558
B -0.977278
two A 0.950088
B -0.151357
qux one A -0.103219
B 0.410599
two A 0.144044
B 1.454274
dtype: float64
print (df.unstack())
A B
second one two one two
first
bar 1.764052 0.978738 0.400157 2.240893
foo 1.867558 0.950088 -0.977278 -0.151357
qux -0.103219 0.144044 0.410599 1.454274
#reset index
df1 = df.unstack().reset_index()
#remove columns names
df1.columns.names = (None, None)
#reset MultiIndex in columns with list comprehension
df1.columns = ['_'.join(col).strip('_') for col in df1.columns]
print (df1)
first A_one A_two B_one B_two
0 bar 1.764052 0.978738 0.400157 2.240893
1 foo 1.867558 0.950088 -0.977278 -0.151357
2 qux -0.103219 0.144044 0.410599 1.454274
Cross Tabulation
import pandas as pd
df = pd.DataFrame({'Sex': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
'Age': [20, 19, 17, 35, 22, 22, 12, 15, 17, 22],
'Heart Disease': ['Y', 'N', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y']})
df
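pd.crosstab(df['Sex'], df['Heart Disease'])  # reconstructed: the call producing the table below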
Heart Disease N Y
Sex
F 2 3
M 3 2
Using dot notation:
pd.crosstab(df.Sex, df.Age)
Age 12 15 17 19 20 22 35
Sex
F 0 0 2 0 0 3 0
M 1 1 0 1 1 0 1
pd.crosstab(df.Sex, df.Age).T
Sex F M
Age
12 0 1
15 0 1
17 2 0
19 0 1
20 0 1
22 3 0
35 0 1
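Row and column totals can be added with margins (reconstructed call):

pd.crosstab(df.Age, df.Sex, margins=True)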
Sex F M All
Age
12 0 1 1
15 0 1 1
17 2 0 2
19 0 1 1
20 0 1 1
22 3 0 3
35 0 1 1
All 5 5 10
Getting percentages:
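pd.crosstab(df.Sex, df['Heart Disease'], normalize=True)  # reconstructed call, consistent with the output below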
Heart Disease N Y
Sex
F 0.2 0.3
M 0.3 0.2
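df2 = pd.crosstab(df.Age, df.Sex, normalize=True, margins=True) * 100  # a plausible reconstruction of df2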
df2
Sex F M All
Age
12 0.0 10.0 10.0
15 0.0 10.0 10.0
17 20.0 0.0 20.0
19 0.0 10.0 10.0
20 0.0 10.0 10.0
22 30.0 0.0 30.0
35 0.0 10.0 10.0
All 50.0 50.0 100.0
df2[["F","M"]]
Sex F M
Age
12 0.0 10.0
15 0.0 10.0
17 20.0 0.0
19 0.0 10.0
20 0.0 10.0
22 30.0 0.0
35 0.0 10.0
All 50.0 50.0
Melting a wide table into long format. A reconstruction of the example frame (its creation lines were lost):

>>> df = pd.DataFrame({'ID': [1, 2, 3], 'Year': [2016, 2016, 2016],
...                    'Jan_salary': [4500, 3800, 5500],
...                    'Feb_salary': [4200, 3600, 5200],
...                    'Mar_salary': [4700, 4400, 5300]},
...                   columns=['ID', 'Year', 'Jan_salary', 'Feb_salary', 'Mar_salary'])
>>> df
ID Year Jan_salary Feb_salary Mar_salary
0 1 2016 4500 4200 4700
1 2 2016 3800 3600 4400
2 3 2016 5500 5200 5300
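The melt call (reconstructed from the output below):

>>> melted_df = pd.melt(df, id_vars=['ID', 'Year'],
...                     value_vars=['Jan_salary', 'Feb_salary', 'Mar_salary'],
...                     var_name='month', value_name='salary')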
>>> melted_df
ID Year month salary
0 1 2016 Jan_salary 4500
1 2 2016 Jan_salary 3800
2 3 2016 Jan_salary 5500
3 1 2016 Feb_salary 4200
4 2 2016 Feb_salary 3600
5 3 2016 Feb_salary 5200
6 1 2016 Mar_salary 4700
7 2 2016 Mar_salary 4400
8 3 2016 Mar_salary 5300
Split (reshape) CSV strings in columns into multiple rows, having one element
per row
import pandas as pd
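A reconstruction of the example frame (only the last rows of the original printout survived, so the values in the first group are assumptions):

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f,x,y'],
                   'var2': [1, 2],
                   'var3': ['XX', 'ZZ']},
                  columns=['var1', 'var2', 'var3'])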
print(df)
reshaped = \
(df.set_index(df.columns.drop('var1').tolist())
.var1.str.split(',', expand=True)
.stack()
.reset_index()
.rename(columns={0:'var1'})
.loc[:, df.columns]
)
print(reshaped)
Output:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6 x 2 ZZ
7 y 2 ZZ
Chapter 35: Save pandas dataframe to a csv
file
Parameters
Parameter        Description

path_or_buf      string or file handle, default None. File path or object; if None is provided the result is returned as a string.
sep              character, default ','. Field delimiter for the output file.
float_format     string, default None. Format string for floating point numbers.
header           boolean or list of string, default True. Write out column names. If a list of strings is given it is assumed to be aliases for the column names.
index_label      string or sequence, or False, default None. Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False, do not print fields for index names. Use index_label=False for easier importing in R.
encoding         string, optional. A string representing the encoding to use in the output file; defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
line_terminator  string, default '\n'. The newline character or character sequence to use in the output file.
quotechar        string (length 1), default '"'. Character used to quote fields.
escapechar       string (length 1), default None. Character used to escape sep and quotechar when appropriate.
tupleize_cols    boolean, default False. Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False).
decimal          string, default '.'. Character recognized as decimal separator. E.g. use ',' for European data.
Examples
Create random DataFrame and write to .csv
import numpy as np
import pandas as pd
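Reconstructing the lost creation lines (seed 0 reproduces the values shown):

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])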
df
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
df.to_csv('example.csv', index=False)
Contents of example.csv:
A,B,C
1.76405234597,0.400157208367,0.978737984106
2.2408931992,1.86755799015,-0.977277879876
0.950088417526,-0.151357208298,-0.103218851794
0.410598501938,0.144043571161,1.45427350696
0.761037725147,0.121675016493,0.443863232745
Note that we specify index=False so that the auto-generated indices (row #s 0,1,2,3,4) are not
included in the CSV file. Include it if you need the index column, like so:
df.to_csv('example.csv', index=True) # Or just leave off the index param; default is True
Contents of example.csv:
,A,B,C
0,1.76405234597,0.400157208367,0.978737984106
1,2.2408931992,1.86755799015,-0.977277879876
2,0.950088417526,-0.151357208298,-0.103218851794
3,0.410598501938,0.144043571161,1.45427350696
4,0.761037725147,0.121675016493,0.443863232745
Also note that you can remove the header if it's not needed with header=False. This is the simplest
output:
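df.to_csv('example.csv', index=False, header=False)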
Contents of example.csv:
1.76405234597,0.400157208367,0.978737984106
2.2408931992,1.86755799015,-0.977277879876
0.950088417526,-0.151357208298,-0.103218851794
0.410598501938,0.144043571161,1.45427350696
0.761037725147,0.121675016493,0.443863232745
The delimiter can be set with the sep argument, although the standard separator for csv files is ','.
Save Pandas DataFrame from list to dicts to csv with no index and with data
encoding
import pandas as pd
data = [
{'name': 'Daniel', 'country': 'Uganda'},
{'name': 'Yao', 'country': 'China'},
{'name': 'James', 'country': 'Colombia'},
]
df = pd.DataFrame(data)
filename = 'people.csv'
df.to_csv(filename, index=False, encoding='utf-8')
Chapter 36: Series
Examples
Simple Series creation examples
A series is a one-dimensional data structure. It's a bit like a supercharged array, or a dictionary.
import pandas as pd
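>>> s = pd.Series([10, 20, 30])  # reconstructing the lost line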
>>> s
0 10
1 20
2 30
dtype: int64
Every value in a series has an index. By default, the indices are integers, running from 0 to the
series length minus 1. In the example above you can see the indices printed to the left of the
values.
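An index and a name can be supplied explicitly (reconstructed from the output below):

>>> s2 = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'], name='my_series')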
>>> s2
a 1.5
b 2.5
c 3.5
Name: my_series, dtype: float64
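A Series can also be built from a dictionary, whose keys become the index (reconstructed):

>>> s3 = pd.Series({'A': 'a', 'B': 'b', 'C': 'c'})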
>>> s3
A a
B b
C c
dtype: object
import pandas as pd
import numpy as np
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=5, freq='T')
s = pd.Series(np.random.randn(len(rng)), index=rng)
print (s)
2015-02-24 00:00:00    1.764052
2015-02-24 00:01:00    0.400157
2015-02-24 00:02:00 0.978738
2015-02-24 00:03:00 2.240893
2015-02-24 00:04:00 1.867558
Freq: T, dtype: float64
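The DatetimeIndex itself can be wrapped in a Series (reconstructing the lost line):

print (pd.Series(rng))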
0 2015-02-24 00:00:00
1 2015-02-24 00:01:00
2 2015-02-24 00:02:00
3 2015-02-24 00:03:00
4 2015-02-24 00:04:00
dtype: datetime64[ns]
Following are a few simple things that come in handy when you are working with Series:
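>>> s = pd.Series([1, 4, 6, 3, 8, 7, 4, 5])  # the example Series, reconstructed from the outputs below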
>>> len(s)
8
To access an element in s:
>>> s[4]
8
>>> s.loc[2]
6
>>> s[1:3]
1 4
2 6
dtype: int64
>>> s.min()
1
>>> s.max()
8
>>> s.mean()
4.75
>>> s.std()
2.2519832529192065
>>> s.astype(float)
0 1.0
1 4.0
2 6.0
3 3.0
4 8.0
5 7.0
6 4.0
7 5.0
dtype: float64
>>> s.values
array([1, 4, 6, 3, 8, 7, 4, 5])
To make a copy of s:
>>> d = s.copy()
>>> d
0 1
1 4
2 6
3 3
4 8
5 7
6 4
7 5
dtype: int64
Applying a function to a Series
Pandas provides an effective way to apply a function to every element of a Series and get a new
Series. Let us assume we have the following Series:
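>>> s = pd.Series([3, 7, 5, 8, 9, 1, 0, 4])   # reconstructed from the squared output below
>>> def square(x):
...     return x ** 2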
We can simply apply square to every element of s and get a new Series:
>>> t = s.apply(square)
>>> t
0 9
1 49
2 25
3 64
4 81
5 1
6 0
7 16
dtype: int64
>>> s.apply(lambda x: x ** 2)
0 9
1 49
2 25
3 64
4 81
5 1
6 0
7 16
dtype: int64
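String methods can be applied the same way, e.g. (a reconstruction of the snippet lost to the page break):

>>> q = pd.Series(['Bob', 'Jack', 'Rose'])
>>> q.apply(lambda x: x.lower())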
0 bob
1 jack
2 rose
dtype: object
If all the elements of the Series are strings, there is an easier way to apply string methods:
>>> q.str.lower()
0 bob
1 jack
2 rose
dtype: object
>>> q.str.len()
0 3
1 4
2 4
Chapter 37: Shifting and Lagging Data
Examples
Shifting or lagging values in a dataframe
import pandas as pd
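df = pd.DataFrame({'chickens': [0, 1, 2, 4], 'eggs': [1, 2, 4, 8]})  # reconstructed from the display below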
df
# chickens eggs
# 0 0 1
# 1 1 2
# 2 2 4
# 3 4 8
df.shift()
# chickens eggs
# 0 NaN NaN
# 1 0.0 1.0
# 2 1.0 2.0
# 3 2.0 4.0
df.shift(-2)
# chickens eggs
# 0 2.0 4.0
# 1 4.0 8.0
# 2 NaN NaN
# 3 NaN NaN
df['eggs'].shift(1) - df['chickens']
# 0 NaN
# 1 0.0
# 2 0.0
# 3 0.0
The first argument to .shift() is periods, the number of spaces to move the data. If not specified,
defaults to 1.
Chapter 38: Simple manipulation of
DataFrames
Examples
Reorder columns
# get a list of the column names (assumes df contains a column named 'listing')
cols = list(df)

# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('listing')))

# use ix to reorder
df2 = df.ix[:, cols]
Given a DataFrame:
s1 = pd.Series([1,2,3])
s2 = pd.Series(['a','b','c'])
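Reconstructing the lost creation line:

df = pd.DataFrame([list(s1), list(s2)], columns=["C1", "C2", "C3"])
print df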
Output:
C1 C2 C3
0 1 2 3
1 a b c
df = pd.DataFrame(np.array([[10,11,12]]), \
columns=["C1", "C2", "C3"]).append(df, ignore_index=True)
print df
Output:
C1 C2 C3
0 10 11 12
1 1 2 3
2 a b c
Let's generate a DataFrame first:
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=list('ab'))
print(df)
# Output:
# a b
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
df.drop([0,4], inplace=True)
print(df)
# Output
# a b
# 1 2 3
# 2 4 5
# 3 6 7
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=list('ab'))
df = df.drop([0,4])
print(df)
# Output:
# a b
# 1 2 3
# 2 4 5
# 3 6 7
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=list('ab'))
df = df[~df.index.isin([0,4])]
print(df)
# Output:
# a b
# 1 2 3
# 2 4 5
# 3 6 7
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})  # reconstructed from the output

print(df)
# Output:
# A B
# 0 1 4
# 1 2 5
# 2 3 6
Directly assign
df['C'] = [7, 8, 9]
print(df)
# Output:
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
df['C'] = 1
print(df)
# Output:
# A B C
# 0 1 4 1
# 1 2 5 1
# 2 3 6 1
df['C'] = df['A'] + df['B']  # reconstructed: the output below shows C = A + B

print(df)
# Output:
# A B C
# 0 1 4 5
# 1 2 5 7
# 2 3 6 9
df['C'] = df['A']**df['B']
print(df)
# Output:
# A B C
# 0 1 4 1
# 1 2 5 32
# 2 3 6 729
a = [1, 2, 3]
b = [4, 5, 6]
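c = [x ** y for x, y in zip(a, b)]  # elementwise power, reconstructing the lost line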
print(c)
# Output:
# [1, 32, 729]
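# a reconstruction consistent with the means printed below (A=[1,2,3],
# B=[4,5,6], C=A+B); the original definition of D was lost, and any column
# with mean 20.0 would match
df_means = df.assign(D=df['A'] * 10).mean()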
print(df_means)
# Output:
# A 2.0
# B 5.0
# C 7.0
# D 20.0 # adds a new column D before taking the mean
# dtype: float64
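# add squared columns (a reconstruction matching the output below; df here
# is assumed to hold only columns A and B)
df = df.assign(A2=df['A'] ** 2, B2=df['B'] ** 2)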
print(df)
# Output:
# A B A2 B2
# 0 1 4 1 16
# 1 2 5 4 25
# 2 3 6 9 36
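# one reconstruction consistent with the output: A3 = A * A2 and B3 = 5 * B
new_df = df.assign(A3=df['A'] * df['A2'], B3=5 * df['B'])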
print(new_df)
# Output:
# A B A2 B2 A3 B3
# 0 1 4 1 16 1 20
# 1 2 5 4 25 8 25
# 2 3 6 9 36 27 30
import numpy as np
import pandas as pd
np.random.seed(0)
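# a reconstruction; the displayed values came from a different random state,
# so yours will differ
df = pd.DataFrame(np.random.randn(5, 6), columns=list('ABCDEF'))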
print(df)
# Output:
# A B C D E F
# 0 -0.895467 0.386902 -0.510805 -1.180632 -0.028182 0.428332
# 1 0.066517 0.302472 -0.634322 -0.362741 -0.672460 -0.359553
# 2 -0.813146 -1.726283 0.177426 -0.401781 -1.630198 0.462782
# 3 -0.907298 0.051945 0.729091 0.128983 1.139401 -1.234826
# 4 0.402342 -0.684810 -0.870797 -0.578850 -0.311553 0.056165
1) Using del
del df['C']
print(df)
# Output:
# A B D E F
# 0 -0.895467 0.386902 -1.180632 -0.028182 0.428332
# 1 0.066517 0.302472 -0.362741 -0.672460 -0.359553
# 2 -0.813146 -1.726283 -0.401781 -1.630198 0.462782
# 3 -0.907298 0.051945 0.128983 1.139401 -1.234826
# 4 0.402342 -0.684810 -0.578850 -0.311553 0.056165
2) Using drop
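df.drop(['B', 'E'], axis=1, inplace=True)  # reconstructed: the output below lacks B and E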
print(df)
# Output:
# A D F
# 0 -0.895467 -1.180632 0.428332
# 1 0.066517 -0.362741 -0.359553
# 2 -0.813146 -0.401781 0.462782
# 3 -0.907298 0.128983 -1.234826
# 4 0.402342 -0.578850 0.056165
To use column integer numbers instead of names (remember column indices start at zero):
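df.drop(df.columns[[0, 2]], axis=1, inplace=True)  # drops 'A' and 'F', leaving 'D'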
print(df)
# Output:
# D
# 0 -1.180632
# 1 -0.362741
# 2 -0.401781
# 3 0.128983
# 4 -0.578850
Rename a column
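df = pd.DataFrame({'old_name_1': [1, 2, 3], 'old_name_2': [5, 6, 7]})  # reconstructed from the output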
print(df)
# Output:
# old_name_1 old_name_2
# 0 1 5
# 1 2 6
# 2 3 7
To rename one or more columns, pass the old names and new names as a dictionary:
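df = df.rename(columns={'old_name_1': 'new_name_1', 'old_name_2': 'new_name_2'})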
Or a function:
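df = df.rename(columns=lambda x: x.replace('old', 'new'))  # a sketch of the function form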
You can also set df.columns as the list of the new names:
df.columns = ['new_name_1','new_name_2']
print(df)
# Output:
# new_name_1 new_name_2
# 0 1 5
# 1 2 6
# 2 3 7
import pandas as pd
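df = pd.DataFrame({'gender': ['male', 'female', 'female'], 'id': [1, 2, 3]})  # reconstructed from the final output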
To encode the male to 0 and female to 1:
df.loc[df["gender"] == "male","gender"] = 0
df.loc[df["gender"] == "female","gender"] = 1
>>> df
gender id
0 0 1
1 1 2
2 1 3
Chapter 39: String manipulation
Examples
Slicing strings
Strings in a Series can be sliced using .str.slice() method, or more conveniently, using brackets
(.str[]).
In [1]: ser = pd.Series(['Lorem ipsum', 'dolor sit amet', 'consectetur adipiscing elit'])
In [2]: ser
Out[2]:
0 Lorem ipsum
1 dolor sit amet
2 consectetur adipiscing elit
dtype: object
In [3]: ser.str[0]
Out[3]:
0 L
1 d
2 c
dtype: object
In [4]: ser.str[:3]
Out[4]:
0 Lor
1 dol
2 con
dtype: object
In [5]: ser.str[-1]
Out[5]:
0 m
1 t
2 t
dtype: object
In [6]: ser.str[-3:]
Out[6]:
0 sum
1 met
2 lit
dtype: object
In [7]: ser.str[:10:2]
Out[7]:
0 Lrmis
1 dlrst
2 cnett
dtype: object
Pandas behaves similarly to Python when handling slices and indices. For example, if an index is
outside the range, Python raises an error:
In [8]:'Lorem ipsum'[12]
# IndexError: string index out of range
In [10]: ser.str[12]
Out[10]:
0 NaN
1 e
2 a
dtype: object
In [11]: ser.str[12:15]
Out[11]:
0
1 et
2 adi
dtype: object
str.contains() method can be used to check if a pattern occurs in each string of a Series.
str.startswith() and str.endswith() methods can also be used as more specialized versions.
In [1]: animals = pd.Series(['cat', 'dog', 'bear', 'cow', 'bird', 'owl', 'rabbit', 'snake'])
In [2]: animals.str.contains('a')
Out[2]:
0 True
1 False
2 True
3 False
4 False
5 False
6 True
7 True
dtype: bool
This can be used as a boolean index to return only the animals containing the letter 'a':
In [3]: animals[animals.str.contains('a')]
Out[3]:
0 cat
2 bear
6 rabbit
7 snake
dtype: object
str.startswith and str.endswith methods work similarly, but they also accept tuples as inputs.
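For example, selecting the animals starting with 'b' or 'c' (a brief added illustration):

In [4]: animals[animals.str.startswith(('b', 'c'))]
Out[4]:
0     cat
2    bear
3     cow
4    bird
dtype: object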
Capitalization of strings
In [1]: ser = pd.Series(['lORem ipSuM', 'Dolor sit amet', 'Consectetur Adipiscing Elit'])
All uppercase:

In [2]: ser.str.upper()
Out[2]:
0 LOREM IPSUM
1 DOLOR SIT AMET
2 CONSECTETUR ADIPISCING ELIT
dtype: object
All lowercase:
In [3]: ser.str.lower()
Out[3]:
0 lorem ipsum
1 dolor sit amet
2 consectetur adipiscing elit
dtype: object
Capitalize the first character and lowercase the remaining:
In [4]: ser.str.capitalize()
Out[4]:
0 Lorem ipsum
1 Dolor sit amet
2 Consectetur adipiscing elit
dtype: object
Convert each string to a titlecase (capitalize the first character of each word in each string,
lowercase the remaining):
In [5]: ser.str.title()
Out[5]:
0 Lorem Ipsum
1 Dolor Sit Amet
2 Consectetur Adipiscing Elit
dtype: object
In [6]: ser.str.swapcase()
Out[6]:
0 LorEM IPsUm
1 dOLOR SIT AMET
2 cONSECTETUR aDIPISCING eLIT
dtype: object
Aside from these methods that change the capitalization, several methods can be used to check
the capitalization of strings.
In [7]: ser = pd.Series(['LOREM IPSUM', 'dolor sit amet', 'Consectetur Adipiscing Elit'])
Is it all lowercase:

In [8]: ser.str.islower()
Out[8]:
0 False
1 True
2 False
dtype: bool
Is it all uppercase:
In [9]: ser.str.isupper()
Out[9]:
0 True
1 False
2 False
dtype: bool
Is it a titlecased string:
In [10]: ser.str.istitle()
Out[10]:
0 False
1 False
2 True
dtype: bool
Regular expressions
For information on how to match strings using regex, see Getting started with Regular Expressions
.
Chapter 40: Using .ix, .iloc, .loc, .at and .iat to
access a DataFrame
Examples
Using .iloc
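A reconstruction of the example frame (its creation lines were lost):

df = pd.DataFrame({'one': [1, 2, 3, 4, 5], 'two': [6, 7, 8, 9, 10]},
                  index=['a', 'b', 'c', 'd', 'e'])
print df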
one two
a 1 6
b 2 7
c 3 8
d 4 9
e 5 10
Now we can use .iloc to read and write values. Let's read the first row, first column:
print df.iloc[0, 0]
We can also set values. Lets set the second column, second row to something new:
df.iloc[1, 1] = '21'
print df
one two
a 1 6
b 2 21
c 3 8
d 4 9
e 5 10
Using .loc
We use the same example DataFrame as in the .iloc section:

print df
one two
a 1 6
b 2 7
c 3 8
d 4 9
e 5 10
We use the column and row labels to access data with .loc. Let's set row 'c', column 'two' to the
value 33:
df.loc['c', 'two'] = 33
one two
a 1 6
b 2 7
c 3 33
d 4 9
e 5 10
Of note, using df['two'].loc['c'] = 33 may not report a warning, and may even work, however,
using df.loc['c', 'two'] is guaranteed to work correctly, while the former is not.
print df.loc['a':'c']
one two
a 1 6
b 2 7
c 3 8
And finally, we can do both together:
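print df.loc['b':'d', 'two']  # reconstructed from the output below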
Will output rows b to d of column 'two'. Notice that the column label is not printed.
b 7
c 8
d 9
If .loc is supplied with an integer argument that is not a label it reverts to integer indexing of axes
(the behaviour of .iloc). This makes mixed label and integer indexing possible:
df.loc['b', 1]
will return the value in the 2nd column (column index starting at 0) in row 'b':

7
Read Using .ix, .iloc, .loc, .at and .iat to access a DataFrame online:
http://www.riptutorial.com/pandas/topic/7074/using--ix---iloc---loc---at-and--iat-to-access-a-
dataframe
Chapter 41: Working with Time Series
Examples
Creating Time Series
import pandas as pd
import numpy as np
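Reconstructing the lost creation lines (seed 0 reproduces the values shown below):

np.random.seed(0)
rng = pd.date_range('2016-09-24', periods=100, freq='D')
se = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
se.head(2)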
# 2016-09-24 44
# 2016-09-25 47
se.tail(2)
# 2016-12-31 85
# 2017-01-01 48
A very handy way to subset Time Series is to use partial string indexing. It permits selecting a
range of dates with a clear syntax.
Getting Data
We are using the dataset in the Creating Time Series example
se.head(2).append(se.tail(2))
# 2016-09-24 44
# 2016-09-25 47
# 2016-12-31 85
# 2017-01-01 48
Subsetting
Now we can subset by year, month, day very intuitively.
By year
se['2017']
# 2017-01-01 48
By month
se['2017-01']
# 2017-01-01 48
By day
se['2017-01-01']
# 48
se['2016-12-31':'2017-01-01']
# 2016-12-31 85
# 2017-01-01 48
pandas also provides a dedicated truncate function for this usage through the after and before
parameters -- but I think it's less clear.
se.truncate(before='2017')
# 2017-01-01 48
se.truncate(before='2016-12-30', after='2016-12-31')
# 2016-12-30 13
# 2016-12-31 85
Credits
No  Chapter                                                  Contributors

1   Getting started with pandas                              Alexander, Andy Hayden, ayhan, Bryce Frank, Community, hashcode55, Nikita Pestrov, user2314737
2   Analysis: Bringing it all together and making decisions  piRSquared
3   Appending to DataFrame                                   shahins
4   Boolean indexing of dataframes                           firelynx
6   Computational Tools                                      Ami Tavory
8   Cross sections of different axes with MultiIndex         Julien Marrec
10  Dealing with categorical variables                       Gorkem Ozkaya
12  Getting information about DataFrames                     Alexander, ayhan, Ayush Kumar Singh, bernie, Romain, ysearka
14  Graphs and Visualizations                                Ami Tavory, Nikita Pestrov, Scimonster
15  Grouping Data                                            Andy Hayden, ayhan, danio, Geeklhem, jezrael, ℕʘʘḆḽḘ, QM.py, Romain, user2314737
16  Grouping Time Series Data                                ayhan, piRSquared
19  IO for Google BigQuery                                   ayhan, tworec
21  Making Pandas Play Nice With Native Python Datatypes     DataSwede
23  Merge, join, and concatenate                             ayhan, Josh Garlitos, MaThMaX, MaxU, piRSquared, SerialDev, varunsinghal
24  Meta: Documentation Guidelines                           Andy Hayden, ayhan, Stephen Leppik
28  Pandas IO tools (reading and saving data sets)           amin, Andy Hayden, bernie, Fabich, Gal Dreiman, jezrael, João Almeida, Julien Spronck, MaxU, Nikita Pestrov, SerialDev, user2314737
30  Read MySQL to DataFrame                                  andyabel, rrawat
32  Reading files into pandas DataFrame                      Arthur Camara, bee-sting, Corey Petty, Sirajus Salayhin
33  Resampling                                               jezrael
34  Reshaping and pivoting                                   Albert Camps, ayhan, bernie, DataSwede, jezrael, MaxU, Merlin
35  Save pandas dataframe to a csv file                      amin, bernie, eraoul, Gal Dreiman, maxliving, Musafir Safwan, Nikita Pestrov, Olel Daniel, Stephan
37  Shifting and Lagging Data                                ASGM
38  Simple manipulation of DataFrames                        Alexander, ayhan, Ayush Kumar Singh, Gal Dreiman, Geeklhem, MaxU, paulo.filip3, R.M., SerialDev, user2314737, ysearka