Pandas 0.17 Release and Other Notes
Posted by Chris Moffitt in articles
Introduction
As many of you know, pandas released version 0.17.0 on October 9th. In typical pandas fashion there are a bunch of updates, bug fixes and new features which I encourage you to read all about here. I do not plan to go through all of the changes but there are a couple of key things that I think will be useful to me in my daily work that I will explore briefly in this article. In addition, I am including a couple of other tips and tricks for pandas that I use on a frequent basis and hope will be useful to you.
Excel MultiIndex
Strangely, one of the most exciting things about 0.17 is that MultiIndex
supports
to_excel
again. I say again because it used to work
just fine for me (prior to 0.16.2). However in that release (and up to 0.17)
the method would raise a
NotImplementedError
. With 0.17, all works as
expected. Yay!
In order to illustrate in a little more detail what this supports, here is a short example:
from __future__ import print_function
import pandas as pd
sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true')
sales_df["prod_group"] = sales_df["sku"].str[0:2]
print(sales_df.head())
Here is our Data:
account number | name | sku | quantity | unit price | ext price | date | prod_group | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | B1 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | S2 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.7 | 2086.1 | 2014-01-01 13:24:58 | B1 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | S1 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | S2 |
We are using our simple sales data again to show purchase history over several months. The prod_group column is just a way to break down the products into different categories based on their SKU.
Now, let’s create a pivot table of the data. The key here is that the
pivot table creates a
MultiIndex
, which will cause problems when we try to save to excel.
sales_summary = pd.pivot_table(sales_df, index=["name"],
columns=["prod_group"],
values=["ext price", "quantity"])
ext price | quantity | |||||
---|---|---|---|---|---|---|
prod_group | B1 | S1 | S2 | B1 | S1 | S2 |
name | ||||||
Barton LLC | 1171.640278 | 1306.237600 | 1647.786190 | 20.777778 | 28.040000 | 28.190476 |
Cronin, Oberbrunner and Spencer | 1387.616842 | 1342.598571 | 1302.787407 | 26.210526 | 25.428571 | 23.740741 |
Frami, Hills and Schmidt | 1475.132143 | 1439.311875 | 1401.317857 | 25.071429 | 31.062500 | 25.142857 |
Fritsch, Russel and Anderson | 1372.360286 | 1188.012857 | 1569.353200 | 23.714286 | 26.047619 | 29.400000 |
Halvorson, Crona and Champlin | 1356.640000 | 1267.756667 | 983.963158 | 25.857143 | 22.333333 | 17.842105 |
All looks good until you try to save this using
to_excel
sales_summary.to_excel('sampleout.xlsx')
In prior versions of pandas you’ll see something like this:
Traceback (most recent call last):
File "pandas_version.py", line 25, in <module>
sales_summary.to_excel('sampleout.xlsx')
File "/home/chris/miniconda3/envs/pbpython2/lib/python2.7/site-packages/pandas/core/frame.py", line 1252, in to_excel
raise NotImplementedError("Writing as Excel with a MultiIndex is "
NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.
Thankfully, pandas 0.17 works again and you get a clean Excel file that you would expect. The other reason I wanted to specifically call this out is that some of the code in my older blog posts stopped working as a result of this change. Now, I can point people here to get the solution - upgrade to pandas 0.17!
Rounding Data
One simple (but useful) new feature is the addition of a
round
method to a DataFrame.
Prior to this release, if you wanted to round data in a column you had to use
np.round
or change the display using
display.float_format
.
In this example, let’s just look at the first 5 rows of data and round two columns:
sales_df = sales_df[0:5]
sales_df.round({'ext price':0, 'unit price': 0})
account number | name | sku | quantity | unit price | ext price | date | prod_group | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 87 | 3381 | 2014-01-01 07:21:51 | B1 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63 | -63 | 2014-01-01 10:00:47 | S2 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 91 | 2086 | 2014-01-01 13:24:58 | B1 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21 | 863 | 2014-01-01 15:05:22 | S1 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83 | 499 | 2014-01-01 23:26:55 | S2 |
I can control the number of decimal places to round as well as specify which columns to round by including in the dictionary of column names. This is pretty useful.
Troubleshooting Merges
In my experience manipulating and combining data, pandas tends to just do what I would
expect. One of the areas where I can find myself getting tripped up is when I do a
merge
of DataFrames. Sometimes I need to play around with whether or not
I really want a left, right, outer or inner join. To help troubleshoot these sometimes
pesky problems, there is a new
indicator
argument that can help you
figure out why your merge may not be doing what you expect.
For this example, let’s add in a description to our data. Here is the
sales_groups
DataFrame
that looks like this:
prod_group | Desc | |
---|---|---|
0 | B1 | Belt-Large |
1 | B2 | Belt-Medium |
2 | S1 | Shirt-Large |
3 | S2 | Shirt-Medium |
4 | H1 | Hat-Large |
If we want to merge the sales_groups data with our sales data, we could do something like this:
pd.merge(sales_df, sales_groups, on='prod_group', how='left', indicator=True)
Notice the _merge column shows that the data is only included if it is in both DataFrames.
account number | name | sku | quantity | unit price | ext price | date | prod_group | Desc | _merge | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | B1 | Belt-Large | both |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | S2 | Shirt-Medium | both |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | B1 | Belt-Large | both |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | S1 | Shirt-Large | both |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | S2 | Shirt-Medium | both |
Let’s see what happens when we do a right join:
pd.merge(sales_df, sales_groups, on='prod_group', how='right', indicator=True)
account number | name | sku | quantity | unit price | ext price | date | prod_group | Desc | _merge | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | B1 | Belt-Large | both |
1 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | B1 | Belt-Large | both |
2 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | S2 | Shirt-Medium | both |
3 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | S2 | Shirt-Medium | both |
4 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | S1 | Shirt-Large | both |
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | B2 | Belt-Medium | right_only |
6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H1 | Hat-Large | right_only |
Given this small dataset, the
indicator
is not required but you could imagine
that if you had thousands of rows, this could be really helpful to make sure you
are getting the results you expect.
Sorting API
One warning you will probably see after upgrading to pandas 0.17 is something like this:
FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
sales_df.sort(columns=["name", "sku"])
To fix it:
sales_df.sort_values(by=["name", "sku"])
The old syntax will continue to work but it’s best to proactively make these changes now so that upgrades will be simpler.
The next couple items are not related to this latest release but include some quick notes on working with pandas DataFrames and Series.
Saving Series to Excel
Pandas has a couple of functions I use quite a bit that return a
Series of data. One that I use is
value_counts
. What is sometimes
frustrating is that there is no
to_excel
function for a series.
Here is an example of using
value_counts
(I included all my data but
truncated the results for brevity):
sales_df["sku"].value_counts()
S2-77896 73
S1-82801 60
S2-10342 59
S1-47412 58
S1-93683 57
B1-38851 56
S2-82423 56
S1-50961 55
S1-30248 55
B1-53636 53
S1-06532 53
S1-27722 53
B1-20000 53
S2-34077 51
S2-83881 51
........
These results are useful and wouldn’t it be nice to dump it to Excel? Unfortunately
when you try, you get a nasty
AttributeError: 'Series' object has no attribute to_excel
:
sales_df["sku"].value_counts().to_excel("values.xlsx")
Traceback (most recent call last):
File "pandas_version.py", line 50, in <module>
sales_df["sku"].value_counts().to_excel("values.xlsx")
File "/home/chris/miniconda3/lib/python3.4/site-packages/pandas/core/generic.py", line 2246, in __getattr__
(type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'to_excel'
A handy shortcut to fix this is to use
to_frame
to force it to a DataFrame
which can be saved:
sales_df["sku"].value_counts().to_frame("frequency").to_excel("values.xlsx")
You can pass one argument to
to_frame
- the label to use for your
data. It is optional but I find it handy to include.
Tabulate
Pandas has robust options to export DataFrames. However, I have had a few instances where I want to display a DataFrame in a more ASCII friendly way. I was pleasantly surprised to find the tabulate program understands pandas DataFrames.
Here is a sample usage:
print(tabulate(sales_df, tablefmt="fancy_grid"))
╒═══╤════════╤═════════════════════════════╤══════════╤════╤═══════╤═════════╤═════════════════════╤════╕
│ 0 │ 740150 │ Barton LLC │ B1-20000 │ 39 │ 86.69 │ 3380.91 │ 2014-01-01 07:21:51 │ B1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 1 │ 714466 │ Trantow-Barrows │ S2-77896 │ -1 │ 63.16 │ -63.16 │ 2014-01-01 10:00:47 │ S2 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 2 │ 218895 │ Kulas Inc │ B1-69924 │ 23 │ 90.7 │ 2086.1 │ 2014-01-01 13:24:58 │ B1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 3 │ 307599 │ Kassulke, Ondricka and Metz │ S1-65481 │ 41 │ 21.05 │ 863.05 │ 2014-01-01 15:05:22 │ S1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 4 │ 412290 │ Jerde-Hilpert │ S2-34077 │ 6 │ 83.21 │ 499.26 │ 2014-01-01 23:26:55 │ S2 │
╘═══╧════════╧═════════════════════════════╧══════════╧════╧═══════╧═════════╧═════════════════════╧════╛
One extra step is required if you want to display headers:
headers = list(sales_df)
print(tabulate(sales_df, headers, tablefmt="fancy_grid"))
╒════╤══════════════════╤═════════════════════════════╤══════════╤════════════╤══════════════╤═════════════╤═════════════════════╤══════════════╕
│ │ account number │ name │ sku │ quantity │ unit price │ ext price │ date │ prod_group │
╞════╪══════════════════╪═════════════════════════════╪══════════╪════════════╪══════════════╪═════════════╪═════════════════════╪══════════════╡
│ 0 │ 740150 │ Barton LLC │ B1-20000 │ 39 │ 86.69 │ 3380.91 │ 2014-01-01 07:21:51 │ B1 │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│ 1 │ 714466 │ Trantow-Barrows │ S2-77896 │ -1 │ 63.16 │ -63.16 │ 2014-01-01 10:00:47 │ S2 │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│ 2 │ 218895 │ Kulas Inc │ B1-69924 │ 23 │ 90.7 │ 2086.1 │ 2014-01-01 13:24:58 │ B1 │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│ 3 │ 307599 │ Kassulke, Ondricka and Metz │ S1-65481 │ 41 │ 21.05 │ 863.05 │ 2014-01-01 15:05:22 │ S1 │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│ 4 │ 412290 │ Jerde-Hilpert │ S2-34077 │ 6 │ 83.21 │ 499.26 │ 2014-01-01 23:26:55 │ S2 │
╘════╧══════════════════╧═════════════════════════════╧══════════╧════════════╧══════════════╧═════════════╧═════════════════════╧══════════════╛
I realize the need for tabulate+pandas may be a little niche but I think it is useful to understand all the options that might be available to you.
Thanks for reading. Enjoy using your shiny new pandas 0.17 install!
Comments