Analyzing Data Using Python - Cleaning and Analyzing Data in Pandas
You'll start by using the pandas cut method to discretize data into bins, using those bins to plot histograms and identify outliers with box-and-whisker plots. You'll parse and work with datetime objects read in from strings, converting string columns to datetime using the dateutil Python library.
Moving on, you'll master different pandas methods for aggregating data - including
the groupby, pivot, and pivot_table methods. Lastly, you'll perform various joins -
inner, left outer, right outer, and full outer - using both the merge and join methods.
Contents
Analyzing Data Using Python: Cleaning & Analyzing Data in Pandas
1. Course Overview
2. Identifying Duplicates in Data
3. Categorizing and Binning Data
4. Computing Aggregations on Data
5. Grouping and Aggregating Data
6. Viewing Data Using Pivot Tables
7. Summarizing Data Using Pivot Tables
8. Combining Data in DataFrames
9. Implementing Inner Joins
10. Implementing Left and Right Joins
11. Performing Joins Using DataFrame Indexes
12. Analyzing Time Series Data
13. Course Summary
1. Course Overview
Topic title:
Course Overview
Hi, and welcome to this course, cleaning and analyzing data in Pandas.
Your host for this session is Vitthal Srinivasan. He is a software engineer and a big
data expert.
My name is Vitthal Srinivasan and I will be your instructor for this course. A little bit about myself first: I did my master's at Stanford University and have worked at various companies, including Google and Credit Suisse. I presently work for
Looneycorn, a studio for high quality video content. Pandas is an extremely popular
and powerful Python library used for working with tabular and time series data. The
key abstraction in Pandas is the dataframe object, which encapsulates data
organized into named columns and uniquely identified rows. This, of course, is exactly how spreadsheets as well as relational databases represent data, and is also how many data analysts and computer scientists are accustomed to modeling data mentally. This universality in design, coupled with a natural syntax that combines the best elements of Pythonic as well as R-style programming, and constantly expanding APIs, all helps explain the meteoric rise in popularity of Pandas over the last decade.
In this course, we start by exploring the use of the .duplicated method, which, as its name would suggest, checks for duplicate values. We then learn how the drop_duplicates method can be used to eliminate those duplicates, and also how the pandas cut method can be used to discretize data into bins.
Those bins can then be used to easily plot a histogram and identify outliers using a
box-and-whisker plot. We then spend time learning an important skill: how to correctly work with date strings in Pandas. We will see how to convert string columns to datetimes using the dateutil Python library, and then move on to other processing techniques, such as the groupby method. After experimenting with groupby, we will move on to pivot tables. As we shall see, this can be done in two ways, using either the .pivot or the .pivot_table method. We will see how the .pivot method differs from an Excel-style pivot table by not performing an aggregation operation. In contrast, the pivot_table method, which does perform an aggregation, is a lot more intuitive.
Finally, we will move on to the use of joins in Pandas. Joins are a crucially important topic in working with data, whether that data lives in relational databases or in spreadsheets. So we will spend time mastering this important topic, learning how to perform inner, left outer, right outer, and full outer joins. Along the way, we will also learn the difference between the pandas join and merge methods. By the end of this course, which also marks the end of this learning path, you will have a solid knowledge of the most important aspects of data analysis using Pandas, and will be well positioned to use various powerful analysis techniques in your specific use cases.
In this demo we are going to change gears. We are going to move away from working
on removing rows and columns, and shift focus to binning and duplicate removal
operations. Because this is a brand new Python workbook,
A Jupyter notebook is open on the screen. He enters the following command in the
first code cell:
let's start by re-importing the pandas module. We'll do this with the familiar alias of
pd. The first order of business is to read in the data file that we wish to process.
superstore_data.head( ).
We'll do this using pd.read_csv. This is another file in the Datasets folder. It's
superstore_dataset.csv. We have encountered this one before. We read the results
into a pandas data frame, and invoke the head method in order to get a sense of the
rows and columns. We can see from the little note in the bottom left that there are
24 columns in this data. As usual, let's invoke the shape property to get a sense of
the number of rows and columns.
superstore_data.shape.
There are 51,290 rows in addition to those 24 columns. When we read this data in,
pandas assigned the default label values which started from 0.
superstore_data.head( ).
Let's go ahead and instead use the Row ID column, which is a part of the data. We'll
do this using the set_index method, which you can see on screen. We are effectively
performing this operation in place because we set index to be Row ID, and then we
store the return data frame back into superstore_data. When we examine the result
of the head operation, we can see that the Row ID field has now become the index
column. And we can also see that the number of columns has decreased from 24 to
23, which is exactly as we wanted it to be. Now, in addition to the Row ID, this data
set also has a column called Order ID.
On screen now is a command which can be used to identify all duplicate order IDs.
This is an invocation of the duplicated function. This is a little complicated, so let's
take a moment to understand it. The return value from the duplicated method is
going to consist of a series which has True and False values. Every value of True
indicates that a duplicate was found. Whether or not a row is a duplicate is
determined by a subset of the columns. Here we've specified just the one column,
which is Order ID. The result of this is going to be to return True every time a
duplicate value is found in the Order ID column. In addition, there is another input
variable called keep. This determines which duplicates, if any, are to be marked as
True.
Here we've specified keep=False. This means that all duplicates are going to be
marked as True. We could also specify keep to be the string, first. This would mark all
duplicates as True except the first occurrence, which would be treated as a non-
duplicate value. We could also have marked keep as last, which would mark all
duplicates as True except for the last occurrence. Here, by specifying keep=False, we are marking all duplicate order IDs as True. Let's zero in on a couple of row IDs in
order to see what exactly this means. In the output on screen now we can see that
the row IDs 22255 as well as 22254 both have been marked as True. Let's examine
each of these records in turn.
superstore_data.loc[22255]['Order ID'].
We first use the loc command. We pass in the row ID that is 22255, as well as the
column name which is Order ID. We see the string which corresponds to the order
ID. This is 'IN-2011-47883'. If we now perform this same operation for the row ID
22254, we will find that it has the same order ID.
superstore_data.loc[22254]['Order ID'].
This is why both of these row IDs were marked with True. That's because both of
them had the same exact order ID.
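For reference, here is a minimal sketch of how the keep parameter changes what duplicated returns (a sketch, not the notebook's exact cells):

# keep=False: every row that shares an Order ID with another row is marked True
dups_all = superstore_data.duplicated(subset=['Order ID'], keep=False)
# keep='first': the first occurrence is treated as a non-duplicate (marked False)
dups_first = superstore_data.duplicated(subset=['Order ID'], keep='first')
# keep='last': the last occurrence is treated as a non-duplicate (marked False)
dups_last = superstore_data.duplicated(subset=['Order ID'], keep='last')
superstore_data[dups_all].head()  # inspect some of the rows that share an Order ID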
We've now made use of the pandas method, duplicated, in order to identify all of the
duplicates based on the Order ID column. Again, the name of this method is,
duplicated, and it's used to identify duplicates. We are now going to use another
method with a very similar name
superstore_data.head( ).
which is, drop_duplicates. You can see this function being invoked on screen now.
We've invoked the drop_duplicates method on superstore_data. drop_duplicates is
going to do exactly what its name suggests. It takes in a subset of the columns. Here
we've just specified the one column, Order ID. It also has a value for the keep
parameter. This time we are going to specify keep='first'. This means that if duplicate
values of order ID are found, all except the first row are going to be dropped. There
is also the inplace parameter which is set to True. We go ahead and perform this
drop_duplicates operation on the Order ID. We keep the first row and then we
invoke the head method on the data frame. And remember that this data frame was
modified in place.
We can't really get a whole bunch of useful information from the head method. But
we can get a hint about how many rows were eliminated by examining the shape.
superstore_data.shape.
We can see that the number of columns is still 23, but the number of rows has
reduced dramatically. It's now 25,035. Before we performed this drop_duplicates
operation, the number of rows was 51,290. So we have reduced the size of our
dataset to less than half of what it was. If we now rerun the duplicated method on
this data frame on the Order ID column, we will find that all of the values in the
series are False.
You can see this on screen now. We have invoked the duplicated method. The subset
is Order ID, keep=False, and in the output down below, every single value is False.
That's because we've just dropped duplicates from this column. So invoking the
drop_duplicates method with keep equal to first or last is going to eliminate all but
one of the duplicate rows.
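As a quick recap, here is a hedged sketch of the de-duplication flow we just walked through (variable and column names follow the demo; the exact cells may differ slightly):

# Drop rows with a repeated Order ID, keeping only the first occurrence, in place
superstore_data.drop_duplicates(subset=['Order ID'], keep='first', inplace=True)
superstore_data.shape  # the row count falls from 51,290 to 25,035
# Confirm that no duplicate Order IDs remain
superstore_data.duplicated(subset=['Order ID'], keep=False).any()  # -> False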
In this demo, we will continue to explore the use of the duplicated and the drop
duplicates methods from pandas, and we'll then learn how to bin values into discrete
intervals using the pandas.cut method. This comes in handy while performing
histogramming operations in data analysis. Next, let's turn our attention to the City
column. We first make use of the duplicated method
in order to find duplicates in the City column. You can see that we specify subset
equal to City and we also specify keep equal to false. As before, the return value is
going to be a series. Every duplicate city is going to be marked with a true, and the
vast majority of values in the series are going to be true because very few cities are
going to have just one order. It's exactly those cities that we'd like to zero in on.
superstore_data.shape.
superstore_data[ 'City'].value_counts( ).
So we select the City column and apply value_counts. And we can see that there are indeed 947 cities, each of which has a value count of 1. Let's examine the value_counts results on a different column, this time on Category.
superstore_data[ 'Category'].value_counts( ).
Remember that this is only going to give us categories from those 947 cities. But
even so, it's a useful bit of analysis. We can see that there are 947 rows in total. 471
of them correspond to the category Office Supplies, 261 to Technology, and 215 to
Furniture.
Note also that 471 plus 261 plus 215 gives us, you guessed it, 947. Let's now turn our
attention from identifying and eliminating duplicates to binning data based on a
particular column. Binning is an important operation and it's often used in
constructing histograms.
On screen now we've begun by sorting our data frame on the Shipping Cost column.
We've done so with ascending=False and inplace =True. When we examine the head,
we are going to find the largest values of shipping cost appear up top. We can see,
for instance, that row ID 47905 had a shipping cost of $678.15. In similar fashion, if
we invoke the tail method on the Shipping Cost, we can see that the shipping costs at the other end are really small.
For instance, row ID 47557 had a shipping cost of just $0.09. By sorting by shipping
cost and then examining the head and the tail, we've got a good sense of the
maximum and minimum values as well as other values which are close to the
maximum or the minimum. Let's now go ahead and organize this data into bins.
And for this we make use of the pandas.cut command. Notice how the first input
argument into pandas.cut is superstore_data and we've selected out the Shipping
Cost.
The second input argument consists of the bins. This is the list with those bin values.
The result of this operation is a series. The labels of that series are going to be the
row IDs from the original data. And the values in that series are going to be the
corresponding opening and closing bin values. We can see that this is a really
powerful function. When we examine the output, you can see that for each row ID,
we now have the bin interval that it belongs in. As an example, row ID 49063 has a
shipping cost of 42.63, which lies in the bin between 0 and 50. Row ID 49442, which
has a shipping cost of $73.51, belongs in the bin from 50 to 150, and so on.
Also note how parentheses and square brackets are used to delimit the intervals.
When the delimiter is a square bracket, this indicates that the interval is closed at
that limit point, which means that it includes its end value. For instance, the interval (50, 150] includes 150, but it does not include 50. That's why it's delimited with a parenthesis at one end and a square bracket at the other. The pandas.cut
method is an extremely powerful one.
You can see on screen now how it can quickly be used to assign a label based on the
shipping cost. Here, for instance, all shipping costs which lie between 0 and $50 are
assigned the label Very Low Shipping Cost.
This is a label which applies to most rows, which we can see on screen now.
However, if we consider, for instance, row ID 47905 where the shipping cost is
$678.15, this lies in the bin interval 300 to 700. And that's why the label reads High
Shipping Cost. This gets us to the end of this little demo in which we saw how easy
pandas makes it for us to compute bins and histogram values, and assign labels to
those bins using the pandas.cut method.
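A compact sketch of the binning steps from this demo follows; the bin edges match the intervals mentioned in the narration, while the label strings for the two middle bins are assumptions:

bins = [0, 50, 150, 300, 700]
labels = ['Very Low Shipping Cost', 'Low Shipping Cost',      # middle two label names are assumed
          'Medium Shipping Cost', 'High Shipping Cost']
shipping_bins = pd.cut(superstore_data['Shipping Cost'], bins)             # intervals such as (0, 50]
shipping_labels = pd.cut(superstore_data['Shipping Cost'], bins, labels=labels)
shipping_labels.value_counts()   # a quick histogram of how many rows fall into each bin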
After exploring de-duplication and binning of data in a Pandas dataframe, let's turn
our attention to another set of interesting operations. And now we're going to see
how grouping and aggregation operations work on Pandas dataframes.
We are starting from scratch in a brand new iPython notebook, so we import Pandas
using the familiar alias of pd. We also import the dateutil module, that's because
we're going to need it for some date processing functionality.
import dateutil.
Next, we read in the data from a CSV file which we've encountered many times
before, into a Pandas dataframe.
superstore_data.head( ).
superstore_data.head( ).
The result of this is visible on screen. Notice how we've made these changes in place: we've used a list comprehension to assign to the superstore_data.columns property.
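The renaming cell itself isn't reproduced in this transcript; one plausible sketch of that list comprehension, assuming the goal is upper-case names with underscores (ORDER_DATE, SHIP_DATE, and so on), is:

# Assumed reconstruction: upper-case every column name and replace spaces with underscores
superstore_data.columns = [col.upper().replace(' ', '_') for col in superstore_data.columns]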
Let's now move on and we begin by dropping a couple of columns.
superstore_data.head( ).
We've seen before how columns are dropped, and here we have a list which contains two column names, discount and postal code. We invoke the drop method, and we specify axis=1; recall that axis=1 tells Pandas that this is a column-wise
operation. After we execute this command, we can see that we have only 22
columns left, so we've eliminated two columns, 24 - 2 is 22. Next, let's use the
.dtypes property, in order to examine the data types of each column in our
dataframe.
superstore_data.dtypes.
ROW_ID is of type int64, and then there are a number of columns whose dtype is
object, these are all the string columns. We do want to change the dtype of the order
date and the ship date. As these currently stand, these are both being interpreted as
strings, but we'd like to change that and interpret them as dates instead. Let's go
ahead and make that happen, you can see the code for this on screen now.
superstore_data['ORDER_DATE' ] =
superstore_data['ORDER_DATE' ].apply(dateutil.parser.parse, dayfirst = True).
superstore_data['SHIP_DATE' ] =
superstore_data['SHIP_DATE'].apply(dateutil.parser.parse, dayfirst = True).
We have two assignment statements, on the left hand side of those assignment
statements, are the date columns from our superstore_data dataframe. On the right
hand side of those assignment statements, are the corresponding columns again. But
this time with a function which has been applied to them, that function has been
applied using the .apply method. And what function is it that we are passing in? Well,
it is dateutil.parser.parse. The right hand side of each of these two assignment
statements, is applying a transformation function to every element inside the
corresponding column. That function is the parse function from dateutil.parser, and we also pass it one keyword argument, dayfirst. We're specifying dayfirst=True, which has the effect of parsing dates in dd/mm/yyyy format rather than in
mm/dd/yyyy format. This becomes important in order to disambiguate dates, where
a particular number could represent either the day of the month or the month of the
year.
Let's hit Shift+Enter, and then let's re-examine the dtypes of our dataframe.
superstore_data.dtypes.
When we do this, we will see that ORDER_DATE and SHIP_DATE no longer have a dtype of object; the dtype has changed, and it now shows up as datetime64[ns]. This is the correct dtype for a datetime column, so our parsing operation was clearly successful. Now, we've also got this ROW_ID column in there, and that's what we should be using as the index labels. So let's make that change as well,
superstore_data.set_index('ROW_ID', inplace=True).
superstore_data.head( ).
we use the set_index method on our dataframe. We specify the ROW_ID column, and we specify the value of inplace=True. If we now re-examine the head of superstore_data, we see that the number of columns has reduced by 1; it's become 21 columns instead of 22. And that's because the index labels are now taken from the ROW_ID column.
We are now ready to start performing the grouping and aggregation operations on
this data.
Well, then, instead of count, we've got to use the nunique() aggregation function; this returns a value of 3. So there are only 3 unique values of categories in our data. What are those three unique values? Let's find out.
We get back an array with the three values: Office Supplies, Furniture, and Technology. So you can see that nunique() returned the number of unique values, unique() returned what those values were, and count simply returned the number of values in
total, without eliminating duplicates. However, with each of these three operations
in cells, 9, 10 and 11, we've specified the function that we'd like to invoke in code.
What if we'd like to do this on the fly? We can use the agg function instead, on
screen now,
superstore_data.agg({'SALES':'min'}).
you can see that we've invoked the agg function on our dataframe. This agg function
has taken in a dictionary, that dictionary has a list of columns, as well as
corresponding aggregation functions.
Here we've currently only passed in the one column which has SALES, and the one
aggregation function which is min. As we'll see in a moment, we can specify many
more choices of column as well as aggregation function, but for now let's just
execute this. We can see that the lowest value in the sales column is 0.444. What if
you'd like to find the largest sales value? Well, it's simple enough, specify the value
to be max instead of min, and this returns the value of $22,638.48.
superstore_data.agg({'SALES':'max'}).
If we can compute the min and the max, surely we can compute the sum as well; let's compute the sum of the shipping cost column.
superstore_data.agg({'SHIPPING_COST':'sum'}).
That aggregation function works just fine as well. In similar fashion, we can compute the mean shipping cost; that's a much more tractable number, and it works out to about $26.375915.
superstore_data.agg({'SHIPPING_COST':'mean'}).
Let's now try another variation of the agg function. Here we are going to specify
multiple choices of column, as well as of aggregation function.
We do this in a manner that's simple enough, the keys continue to be the column
names.
So here, the columns that we would like to aggregate over are quantity and sales.
The corresponding values now are lists; each list has more than one aggregation function: min, max, and sum. Let's go ahead and execute this, and we can
see that the result is now a dataframe. The columns of this dataframe correspond to
the columns we specified, that's quantity and sales, and the rows correspond to the
different aggregation functions. A minimal sketch of this kind of multi-column call appears just below; after that, let's see how we can apply a lambda function to every element of a particular column.
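Here is that sketch, assuming the renamed QUANTITY and SALES columns exist exactly as named:

superstore_data.agg({'QUANTITY': ['min', 'max', 'sum'],
                     'SALES': ['min', 'max', 'sum']})
# The result is a dataframe: one column per input column, one row per aggregation function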
superstore_data['SHIPPING_COST_UPDATED'] = superstore_data['SHIPPING_COST'].transform(lambda x: x + .5).
On screen now, we have two lines of code, let's just focus on the first line of code,
which is an assignment operation. On the left hand side of the = sign, is the shipping
cost updated column in our superstore_data dataframe. So that's a new column, it's
not one which presently exists in our dataframe. We are adding a new column into
our dataframe, and how is that new column going to be populated?
Well, to answer that question, let's turn our attention to the right hand side of the =
sign, there we invoke the transform method on the shipping cost column. The
transform method takes in a lambda function, if you're not familiar with lambda
functions, these are really lightweight, anonymous functions, which can be defined
and used on the fly. Here, our lambda function is a really simple one, it takes in a
value x, and it returns a value x + 0.5. This operation, this lambda function operation,
is going to be applied to every element in the shipping cost column. And all of the
resulting values are going to be packed together and placed in the new column of
our dataframe; that column is going to be called shipping cost updated. And that does it for the first line on screen. The second line simply selects two columns, shipping
cost and shipping cost updated, and displays the first five values. By examining these
first five values, we can see, that the updated shipping cost is always 50 cents higher
than the original shipping cost. That is exactly what we would expect, because that is
what our lambda function did. Let's now perform the same operation but in place.
superstore_data['SHIPPING_COST'] = superstore_data['SHIPPING_COST'].transform(lambda x: x + .5).
superstore_data.columns.
The code for this is on screen, we first perform the same operation where we apply
the lambda function via the .transform method. We do this on the shipping cost
column, the difference is that we now reassign it into the same column, shipping
cost. This time, no new shipping cost updated column is created at all. But what about the pre-existing shipping cost updated column? Well, we simply drop that; that's the second line, where we invoke the .drop method.
We pass in the column name, shipping cost updated, and we specify axis = 1,
because this is a column wise operation. When we execute this code, we get to see
the columns which remain in our dataframe. And we have shipping cost in there, but
we do not have shipping cost updated. And in this way, we have updated the data in
the shipping cost column in place. That gets us to the end of this little demo, in which we explored how the transform and agg methods of Pandas dataframes can be used. In the demo coming up ahead, we will turn our attention to another useful category of functions: groupby operations.
superstore_data.groupby( 'COUNTRY' ).
In its simplest form, groupby can be invoked with just the name of a single column.
That's how we've invoked it here. The return value from groupby is of a type called DataFrameGroupBy. This is an object which encapsulates information about all of the groups. So the return value from groupby is not a dataframe, it's a DataFrameGroupBy. Let's go ahead and repeat this groupby operation and store the
result in a variable. On screen now, we've grouped by country and then stored the
result in a variable called country_group.
country_group.first( ).
Then we've invoked the first method on that country_group object. Let's understand
what exactly this is doing. When we groupby a specific column, effectively we are
creating groups of rows such that within each group the value of that column is the
same. Here we are grouping by country. And so we are going to create groups of
rows where the country is the same for each group. Now clearly, the values of all of
the other columns are not going to be the same within each group. Here we are
asking for the first row within each group, and that's why we've invoked the .first
method. When we view the results of this command, we can see that we do indeed
have one row per country. Over on the extreme left, we have the name of the
country. And we can also see that we have 147 countries in total in our dataframe.
As we scroll to the right, we can see the values of the different columns for that
particular first row. But it's important to keep in mind that what we see on screen
now just represents one row from within each group. It does not represent any kind
of aggregation operation over each group. So groupby does not perform aggregation
by default in pandas. Just as we can view the first row within each group, we can also
choose to view the last row.
country_group.last( ).
And to do that, we simply invoke the .last method on our groupby dataframe object.
That's what we've done on screen now. You can see that we once again have 147
rows, one corresponding to each country. These rows are now the last rows
sequentially ordered within each group. You've probably encountered groupby
operations in SQL. And if that's the case, you'll be thinking that this looks pretty
different. In SQL, when we perform groupby operations, we absolutely have to
perform aggregation operations on each group. We cannot access individual rows,
but that's different from pandas. In pandas, when we perform a groupby, we can
choose to access the first or last row within each group using the methods we've just
seen. Now let's move on and see how we can perform aggregations on each group.
And this is where the topics of groupby and aggregation converge in pandas as well.
On screen now, we've invoked the .count method on our groupby object.
country_group.count( ).
As you might expect, this is going to return the count of rows within each group.
How is each group defined? Well, each group is simply the set of rows where the
country column has the same value. Here we have individual counts for each column
within each group. And this again is somewhat different from a SQL groupby. For
instance, we can see that the count of Order_IDs for Albania is 16. The count of
Order_Dates for Albania is also 16, and so is every other value in that row. Of course,
it's easy enough for us to only request the count in a specific column. And, this is
what you see on screen now.
country_group['ORDER_ID'].count( ).
country_group.get_group('India' ).
On screen now, we've invoked the get_group method and we've passed in the name
of one country, that's India. That return value is going to be, you guessed it, a
dataframe. And that's because the group of rows where the country is equal to India
can be treated as a dataframe, and it is. You can see here that we have 1,555 rows
where the country is equal to India, and we have 20 columns. The labels of all of
these rows are the same row IDs that we had in the original dataframe. So there is an
important lesson here. When we perform a groupby operation on a dataframe, we
get a special object. That object encapsulates, or contains within itself, multiple dataframes, one per value of the groupby column. Here we performed a groupby on
the Country column, and so the object has multiple groups.
Each group corresponds to one dataframe. On screen now, we are viewing the
dataframe corresponding to the country India. Once we understand this fact, the
other operations that we can perform with groupby aggregates are simple enough.
On screen now, you can see that we've gone back to our original dataframe, that's
superstore_data. We've invoked groupby on this. We are grouping by the Country
column, we then choose the Quantity column. This is going to be the Quantity
column on each of the contained dataframes. And on that we invoke the sum
aggregate function. And finally, we sample just 10 values. And when we run
this, we have a series. That series has labels corresponding to each country. And the
values correspond to the sum of quantities for that country. And finally, because we
only did a sampling of 10 values, that's the number of rows that are displayed. Let's
try another operation. Once again, we groupby Country, and then we select the
Profit.
We can see down below that this gives us the sum of profits for each country in our
dataset. Once again, the result is in the form of a series, the labels correspond to
country names. And the values correspond to the total profits for that particular
country. We've now understood how to perform groupby and aggregation
operations. Let's turn our attention to segmentation.
superstore_data [consumer_seg].head( ).
On screen now, you can see that we've created a variable called consumer_seg. This
is a subset of the rows inside our superstore_data where the Segment=Consumer.
This gives us true or false values for each label. We then apply those as a filter on
superstore_data and we invoke the head method on this. The result is we get 5 rows.
These are the first 5 rows in our data for Segment=Consumer. We can see that the
value of the Segment column is indeed equal to consumer for all of these rows. So
now we have this variable called consumer_seg. And in similar fashion, we can also
create variables for Segment=Corporate, that's stored in a variable called
corporate_seg.
superstore_data [corporate_seg].head( ).
And a third segment, which has Segment=Home Office, and that's stored in
home_office_seg.
superstore_data [home_office_seg].head( ).
Once we have these three segments, it's easy enough for us to compute the average
of the sales column within each segment.
On screen now is the code to do exactly this. You can see that we have three
assignment statements, each one of them computes the mean of the sales column of
a particular segment. And how do we get that segment? Well, we filter our original
dataframe using the segment variable we just created. If we print the variables on the left-hand side of those assignment statements, we can see that they are just numbers.
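The cells that build these segment variables and means aren't reproduced in the transcript; a hedged sketch, assuming the renamed SEGMENT and SALES columns, looks like this:

# Boolean masks: one True/False value per row
consumer_seg = superstore_data['SEGMENT'] == 'Consumer'
corporate_seg = superstore_data['SEGMENT'] == 'Corporate'
home_office_seg = superstore_data['SEGMENT'] == 'Home Office'
# Average sales within each segment
consumer_sales = superstore_data[consumer_seg]['SALES'].mean()
corporate_sales = superstore_data[corporate_seg]['SALES'].mean()
home_office_sales = superstore_data[home_office_seg]['SALES'].mean()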
So we've computed the average sales by segment for our data. Let's wrap those raw
numbers up into a pandas dataframe.
pd.DataFrame({'SEGMENT': ..., 'TOTAL_SALES': ...}).
On screen now, you can see that we've made use of the groupby method on our superstore_data dataframe and we've grouped by the segment column. We've then selected the Sales column and computed the mean. And the result is identical to the dataframe which we had to painstakingly compute up above. This example shows just how powerful the groupby method on a pandas dataframe is. Notice that in cell 35, we referenced the Sales column using double square brackets. That's why the output takes the form of a dataframe rather than a series. Let's make this point a little clearer.
On screen now, you can see that we group by Country, then we select the SALES column using single square brackets. Then we apply the sum function and take the head.
The result which is down below is not a dataframe, it takes the form of a series. Let's
now tweak this code. The only change we make is that we access SALES using double
square brackets instead of single square brackets.
And the result is now a dataframe. You can see that we have named columns. And
we can also see that each of the country names has become a label on the index of
this dataframe. Let's now go back to a method which we worked with in the previous
demo.
This is the agg method. On screen now, we've combined the use of groupby and agg. We first group by country, and we specify as_index=False because we'd like to retain the country as a column in our data. On the result of this, we invoke the agg method.
And the aggregation we'd like to perform is on the SALES column, and it's going to be
a sum aggregation. When we run this code, we get back the total sales by country for
ten countries from our data. So this combination of groupby and agg is doing pretty
much the same thing as a SQL groupby operation. Let's modify this so that we are
computing not one, but two aggregate operations.
{'SALES': 'sum', 'QUANTITY': 'mean'}).head(10).
So we are still grouping by country. But we are now computing the sum of the SALES
column and the mean of the QUANTITY column. The result of this which is on screen
now gives us not only the total sales, but also the average quantity by country. And if
we wish to drill down into a particular segment of business in each country, well,
that's easy enough to do as well.
On screen now, you can see that we first only select a subset. That's the subset
where Category is equal to Office Supplies. We then get that subset of our original
dataframe where this condition was true before performing our groupby on the
Country column, and then computing aggregates. And in this way, we now have the max_sales, min_sales, and total_sales by country, but this is only computed over those rows where Category is equal to Office Supplies. This gets us to the end of our exploration of groupby and aggregate operations. As you can see, this is a really powerful bit of functionality in pandas.
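A sketch of that filtered groupby, using named aggregation to produce the max_sales, min_sales, and total_sales columns (the column names and the named-aggregation style are assumptions, not the notebook's exact cell):

office_supplies = superstore_data[superstore_data['CATEGORY'] == 'Office Supplies']
office_supplies.groupby('COUNTRY', as_index=False).agg(
    max_sales=('SALES', 'max'),
    min_sales=('SALES', 'min'),
    total_sales=('SALES', 'sum')
).head(10)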
If you are a business analyst or a data analyst who's worked a lot with SQL, the Pandas groupby and aggregation operations are likely to be of interest to you because they are similar to operations that you already perform. On the other hand, let's say that
your current technology of choice is Microsoft Excel, you might be comfortable
thinking in terms of pivot tables and pivoting operations. Well, fear not, Pandas has
extensive support for pivoting as well. That's what we're going to explore in this
demo.
A Jupyter notebook is open on the screen. He enters a set of commands in code cell
1.
This is a brand new workbook, so let's begin by importing Pandas with the alias pd
and numpy with the alias np. Let's start simple. We begin by creating a Pandas
dataframe with data that we have handcrafted.
emp_details = pd.DataFrame({'Employee_name': ..., 'Company': ..., 'Salary': ..., 'Age': ...}).
emp_details.
This is a dataframe called emp_details; you can see it on screen now. There are
columns corresponding to Employee Name, Company, Salary, and Age. And the index
labels are the default index labels. We have three rows with labels 0, 1, and 2. Let's
jump right into our first pivot operation. On screen now we've invoked the pivot
method on this emp_details dataframe.
Notice how we've got to specify three input arguments. The first of these is the index column, which is Employee_name; the second is the columns argument; and the third is the values argument. Let's hit Shift + Enter and examine what we get.
We can see that whichever column was specified as the index has now become the
index column of the resulting dataframe. So Employee_Name is now the index field.
So we have rows for each employee.
We've specified that the columns should be taken from the company field, and that's why we have one column corresponding to each of the companies Apple, Facebook, and Google. We've now defined the rows as well as the columns in our pivot table.
What's left? Well, the actual values and those values are taken from the salary
column of the original dataframe. There are three important points worth noting
here.
First off, notice that the result of a pivot operation is a dataframe. So we invoke the pivot method on a dataframe and we get back a reshaped dataframe. The second point worth noting is that the pivot method does not take in an aggregate function. This makes the pivot method in Pandas very different from a pivot table in Excel. It's exactly for this reason that Pandas also has another method called pivot_table, which does take in an aggregate function.
We'll get to that in a bit. And the third point worth noting is the presence of all of
these NaN values. Clearly not every combination of employee and company is going
to have any salary values and where there are no salary values, Pandas places a NaN.
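For reference, here is a hedged sketch of that first pivot call, on a hypothetical re-creation of the handcrafted frame (the employee names, salaries, and ages below are placeholders, not the demo's actual values):

emp_details = pd.DataFrame({
    'Employee_name': ['Asha', 'Bilal', 'Carla'],      # placeholder names
    'Company': ['Apple', 'Facebook', 'Google'],
    'Salary': [100000, 110000, 120000],               # placeholder salaries
    'Age': [30, 35, 40]                               # placeholder ages
})
emp_details.pivot(index='Employee_name', columns='Company', values='Salary')
# One row per employee, one column per company, salary where a match exists, NaN elsewhere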
Let's try another pivot operation just so we get the hang of this. We are now re-
invoking the pivot method on emp_details, the index is still employee name, columns
are still taken from the company column, but this time the values are taken from the
age column.
We hit Shift + Enter, and once again we get back a reshaped dataframe. Once again,
we have one row per employee, and once again we have one column per company.
The values inside this dataframe, however, are now taken from the age column of
the original dataframe. At this point you might be wondering what happens if you
have multiple values which correspond to one combination of employee and
company. Well, that would give rise to something known as a multi-index dataframe.
Multi-index dataframes are actually quite complex and hard to work with.
And for that reason, you will typically use an aggregation function and the pivot table
operation instead. We'll get to pivot tables but first let's try out another variant of
the pivot method.
We've again invoked the pivot method on emp_details. The difference this time is that we have not one but two columns specified for the values input argument. We still have index=Employee_name and columns=Company, and the difference is that we specify a list with age and salary for values. And when we run this, we get back a reshaped dataframe.
This reshaped dataframe again has one row per employee, but the number of columns has now doubled. And that's because we have by-company splits for two
different input columns, age and salary. We've now got a good sense of how the
pivot method works. Let's try it out with a larger data set.
superstore_data.head( ).
We are reading in the same superstore data set, which we've worked with in so
many previous demos. We do this using pd.read_csv.
We need to set up the date columns correctly. And for this we make use of
pd.to_datetime and transform the Order Date and Ship Date columns.
So we've changed these columns from being of type string that is object to being of
type date. Next, we go ahead and sort values based on the Order Date. So the rows
are going to be in order of Order Date. Next, let's go ahead and eliminate duplicates. So we invoke the drop_duplicates method on Order Date; as a result of this, we are only going to have one row per Order Date.
filter_africa_data.head ( ).
And finally, we filter our original data set, so that we only select rows where the
region is Africa and where the Order Date is greater than the first of January 2011.
We store this in a new dataframe called filter_africa_data. We invoke the head
command on this dataframe and you can see that we have 24 columns. That means
we have all of the columns from the original dataframe and we have rows, which are
restricted to orders from Africa, from after 2011. With that pre-processing done, we
are ready to invoke the pivot method on this dataframe.
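The preprocessing cells themselves aren't reproduced in the transcript; a sketch of the likely sequence, with the file already read into superstore_data and the column names assumed from the narration, is:

superstore_data['Order Date'] = pd.to_datetime(superstore_data['Order Date'])
superstore_data['Ship Date'] = pd.to_datetime(superstore_data['Ship Date'])
superstore_data.sort_values(by='Order Date', inplace=True)
deduped = superstore_data.drop_duplicates(subset=['Order Date'])   # one row per order date
filter_africa_data = deduped[(deduped['Region'] == 'Africa') &
                             (deduped['Order Date'] > '2011-01-01')]
filter_africa_pivot = filter_africa_data.pivot(index='Order Date',
                                               columns='Category', values='Sales')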
filter_africa_pivot.head (10).
In our data we have three categories Furniture, Office Supplies, and Technology.
We've defined the rows and columns of our reshaped dataframe. Now the individual
values are going to be taken from the sales column. As before, not every
combination of order date and category actually had any sales. And that's why there
are various cells which contain the NaN value.
A lesson that we learn from this little example is that if we would like to make use of the pivot method, we've often got to perform a lot of pre-processing on our dataframe. And that's because the pivot method does not take an aggregation function. Let's check out yet another example of this. On screen now, we've filtered
our africa_data again.
filter_africa_data.head (10).
This time the only filter condition is that the region should be equal to Africa. So this
time we are not eliminating duplicates on Order Date. Next we've got to perform a
groupby and an aggregation.
.agg({'Sales': 'sum', 'Quantity': 'mean'}).
grouped_africa_data.sample (12).
So we are going to group by country and city. We specify as_index=False because we want the country and city to show up as columns in our grouped dataframe. We then invoke the .agg method and specify two aggregations: the first is
on the Sales column, where we'd like to compute the sum and the second is on the
Quantity column, where we'd like to compute the mean.
And the result of this, which we can see down below has one row corresponding to
each country and each city in Africa. We also have values for the total sales and the
average quantity. Now that we've set up this data in this format in a variable called grouped_africa_data, we are ready to invoke pivot on it.
So we are going to pivot with index equal to Country, columns equal to City and the
values equal to Sales. When we do this, unfortunately, we find that the result of our
pivot is a very sparse dataframe. It's sparse because the vast majority of values here
are not a number. Let's hone in on one of the few values which is actually a
legitimate one. Remember that we have one row for Country because we specified
index=Country. The first row corresponds to Namibia. And then we have one column
for every City. Not for every city in Namibia alone, but for every city in Africa.
That's why there are so many NaN values. Of all these cities, Windhoek is indeed a
city in Namibia. And that's why the intersection of row equal to Namibia and column
equal to Windhoek is a number. It's not a NaN, it's an actual number. But that's not
true for most combinations of City and Country. And that's why this pivot method
has returned so many NaNs and such a sparse result. In the demo coming up ahead, we will change focus from the pivot method to its close cousin, the pivot_table method. And we'll see how pivot_table gives much more natural results because it
takes in an aggregate function.
In the previous demo we saw how the pivot method of a Pandas dataframe can be used to dice up data and reshape it, such that the rows, columns, and values are all redefined. We also saw how the pivot method does not take an aggregate function.
And as a result, we ended up having to do a whole lot of preprocessing, that
preprocessing in turn led to a lot of NaN values. This is called a sparse result. In this
demo, we will turn our attention to a close cousin of the pivot method, this is the
pivot table method.
This does take in an aggregate function, and that's why it's a lot closer to an Excel
pivot table, which so many business and data analysts are familiar with. In this demo,
we are going to go back to working with the original data, which we had read into a dataframe called superstore_data. Let's begin by dropping a couple of columns
which we are not going to need in this demo.
superstore_data.head ( ).
These are the Row ID and the Postal Code. Now, you might be wondering why we're bothering to drop these columns at all. The answer to that question will be revealed when we invoke pd.pivot_table. As usual, we use the head method to check what our data looks like, and we have 22 columns. Let's further use the shape method to get the exact number of rows and columns.
superstore_data.shape.
We can see that, as usual, we get a tuple which has two elements. The first element gives us the number of rows, 51,290, and the second gives us the number of columns, which is 22. Next, let's make use of the unique method in order to find the values of
the unique regions.
superstore_data['Region' ].unique( ).
And let's do the same on the category column. So we use the unique method in
order to get a list of the unique categories.
superstore_data['Category' ].unique( ).
As expected, there are just the three categories Office Supplies, Furniture, and
Technology. This small number of unique values makes the category column perfect
for our first pivot table.
category_metrics.
On screen now you can see that we've invoked pandas.pivot_table. The first input
argument is a dataframe, which is our superstore data. And the second is a list of
index columns. Here we have just one index column, and that's the category. Now
you might be wondering where we've specified the aggregate function. After all, we
were making a big deal about how the difference between the pivot table method
and the pivot method is that pivot table takes an aggregate function.
Where then is the aggregate function? The answer is that the default aggregate
function used by pd.pivot_table is the mean. And when we execute this code, we are going to get a dataframe, more specifically an Excel-style pivot table in the form of a dataframe. The pivot_table method did something quite interesting. It went through the data passed in and found every numeric column. This was done on the basis of the dtype.
And then it went ahead and computed the mean for each value of the category. So
we have one row for each value of the category column, that's why we have three
rows for Furniture, Office Supplies, and Technology. So that explains the structure of
the rows. The structure of the columns is more interesting. We have five columns,
one for each of the numeric columns in the original dataframe. And this also explains
why we chose to drop the row ID and postal code columns.
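For reference, a minimal sketch of this first pivot table (the variable name follows the demo; aggfunc defaults to the mean):

category_metrics = pd.pivot_table(superstore_data, index=['Category'])
category_metrics   # one row per category, one column per numeric column, cell values are means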
category_metrics = pd.pivot_table(superstore_data, index=['Category'], aggfunc=np.mean).
category_metrics.
By comparing the output of cell 18 with the output of cell 17, we can see that the
two outputs are identical. They have the same rows, the same columns, and the
exact same values contained within. The only difference between the code in these
two cells is that in cell 18, we've specified aggfunc=np.mean. Remember that np is an
alias for NumPy.
We don't know, for instance, what the minimum discount for the Furniture category is. Well, this is very easy to remedy, because Pandas supports a really
powerful function, and that's the plot function.
On screen now, we've made use of the plot function. We've specified the kind of
visualization, and that's box, and we've also specified the figure size. When we run this we get a boxplot, also known as a box-and-whisker plot or a Tukey box plot, named after its inventor, which gives us valuable statistical information about each series.
The box itself is defined by the 25th and the 75th percentiles, and the horizontal line inside the box corresponds to the median. The difference between the 75th and
25th percentiles is called the interquartile range or the IQR, and the whiskers extend
out to 1.5 times the IQR in each direction. If there are outliers which lie beyond 1.5
times the interquartile range, those will appear as dots outside the whiskers. There
are no outliers in this particular plot. But in just a bit, we'll see an example which
does have some outliers.
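The plotting cell is a one-liner along these lines (the exact figure size is an assumption):

category_metrics.plot(kind='box', figsize=(12, 6))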
All we had to do to get this powerful visualization was invoke the plot method on our
dataframe. Next, let's try out an example where we end up with a multi-index
dataframe. This is our first encounter with such a dataframe.
index=['Category' , 'Sub-Category' ] ).
category_and_subcategory_metrics.
As the name would suggest, this is a dataframe where we have more than one index
attribute. We get this by invoking pd.pivot_table. We specify our superstore data as
the first input argument, but it's the second input argument that's really interesting.
This is called index and it specifies a list. That list has two fields, Category and Sub-
Category. When we run this, we see that our resulting dataframe has two index
columns, Category and Sub-Category. And we can also see that there is a hierarchical
relationship between the two. So for instance, we now have all of the subcategories,
corresponding to the furniture category, appearing grouped together up top. This is a common feature of multi-index dataframes: the different index attributes often have a hierarchical relationship with each other.
You can also see that we have a drill down of each of the numeric columns for each
combination of category and subcategory. We can also specify exactly which
columns we'd like in the set of values.
values=['Profit' ]).
avg_profit_metrics.head(12).
You can see on screen now that we've invoked pd.pivot_table with the same
dataframe, we again have two index columns. This time, they are Region and
Category. The values field is set to Profit.
In the output dataframe, we first have a Region column. Then we have a Category
column and then we have the Profit for each combination of Region and Category.
Here, we do not have a clear hierarchical relationship between Region and Category,
but this is still a meaningful split. So if you do have a multi-index dataframe, even if
the two index attributes are not hierarchically related, they should represent some
meaningful combination.
Otherwise, we will have the same problem with sparse matrices and NaN values which we had at the end of the previous demo, when we performed a pivot operation on country and city. Let's now see how to implement a different aggregate operation. All we have to do is specify a different value for the aggfunc named parameter.
values=['Profit' ],aggfunc=np.sum).
total_profit_metrics.head(12).
Here we've chosen np.sum. This is a method from the NumPy module. We've still
gone with Region and Category as the index attributes and Profit as the values
attribute. So we now are going to get the total profit for each combination of Region
and Category. Let's tweak this just a little bit so that we have multiple aggregate
operations. On screen now, we still have two index attributes Region and Category.
But we've now invoked aggfunc with a dictionary instead of a single function.
aggfunc={'Sales': np.sum, 'Profit': np.sum}).
sales_and_profit_metrics.head(12).
This time we have omitted the values input argument because the keys in the
aggfunc dictionary correspond to the values columns. There are two such columns
Sales and Profit. We can individually specify aggregate functions for each of these
columns. We've gone with np.sum for both sales and profits. When we run this code,
we get a dataframe where for every combination of Region and Category, we have
the sum of the profits as well as the sum of the sales. The beauty of the aggfunc
input argument is that we can specify any function object that we like.
median_sales.
On screen now, we've made use of np.median as the aggfunc. This is going to give us
the median value of sales for every combination of segment and category. We've
saved the result of this operation in a variable called median_sales. And median_
sales is itself a dataframe. Specifically, it's a multi-index dataframe. We can now
invoke the plot method on median_sales. This time we are going with kind=bar.
As before, we specify the figure size, and now we get a nice bar chart. We can see
that this is a matplotlib object where every row from the underlying dataframe is
represented by a bar. So for instance, we can see that the median sales value for
consumer and office supplies is relatively low, it's below $50. The plot method is a
great way of visualizing data contained within a dataframe. Let's round out this demo
with a few more types of pivot table operations.
shipping_cost.head(15).
On screen now, we have yet another multi-index pivot. But this time, we have not
two but three index attributes. The index columns are Region, Ship Mode and
Category. The values are taken from the shipping cost column and the aggregate
function is np.mean. So there's absolutely no restriction on the number of index
attributes for a multi-index dataframe. And finally, it is also possible to specify a
columns input argument while invoking pd.pivot_table.
values=['Sales' ],aggfunc=np.mean).
region_sales.
On screen now we've invoked pd.pivot_table with one index attribute that's the
Region. The values are taken from the Sales column. But what's new is that the
columns are going to be taken from the Category column. So every unique value of
category is now a column in our output dataframe. The difference between this
particular version of pd.pivot_table and the pivot method is that here we do specify
an aggregate function. That aggregate function is np.mean. Let's build out and
visualize one last pivot table. On screen now, we've constructed a pivot table which
has three index attributes, Region, Category and Subcategory.
order_priority.head(10)
Columns are taken from the order priority column and these include Critical, High,
Low, and Medium. The actual cells contained within this matrix hold the np.mean of
the sales for each combination of the index attributes and the order priority.
We've saved the result of this pivot table into a variable called order_priority. We
can invoke the plot method on it with kind='box'.
And this time when we visualize our dataframe, we can see that there are a large
number of outliers in our data. These outliers show up as the individual circles. Those
circles appear outside the whiskers. Those whiskers in turn were defined in terms of
1.5 times the interquartile range. You can see from this just how powerful the pivot
table method is. Not only does it help us drill down into different cuts of our data
but, with just one line of code, it also lets us identify outliers in the underlying
data.
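Here is a hedged sketch of that last pivot table and its box plot. The sales_data frame is the same placeholder as before, and the exact column labels (for example 'Subcategory' and 'Order Priority') are assumptions taken from the narration.

import numpy as np
import pandas as pd

sales_data = pd.read_csv('sales.csv')  # hypothetical file name

order_priority = pd.pivot_table(
    sales_data,
    index=['Region', 'Category', 'Subcategory'],  # three index attributes
    columns=['Order Priority'],                   # Critical, High, Low, Medium
    values=['Sales'],
    aggfunc=np.mean
)

order_priority.head(10)

# Points beyond 1.5 times the interquartile range from the quartiles
# show up as the outlier circles outside the whiskers.
order_priority.plot(kind='box', figsize=(12, 8))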
We are now done with our coverage of grouping, aggregation, and pivot operations
on Pandas DataFrames. We are going to turn our focus to another important topic,
combining data which is present in different data frames. Data in different data
frames could be combined either horizontally or vertically. Horizontal data
combinations, which are basically database joins, are more important, and we'll get
to those in the next demo. In this demo, we'll work with the relatively simpler forms
of combining data vertically. This is a brand new IPython Notebook. So let's begin by
importing pandas as pd. For the purposes of this particular demo, we are going to
read in data, which is divvied up or sliced by country.
australia_customers.head()
australia_customers.shape
This tells us we have 8 rows and 12 columns. Now let's perform a similar operation,
this time for customers in Canada. This data set has the same columns as the
previous one, and
canada_customers.head()
that's because all of these can be thought of as shards or horizontal cuts of data from
the same underlying table. Let's use the shape command, and we can see that we
have 15 rows and 12 columns. Note that the number of columns is the same for each
of these files that we read in.
canada_customers.shape
In similar fashion, we'll go ahead and read in data for various other countries.
germany_customers.head()
On screen now is pd.read_csv for customers in Germany. Once again, we run the
shape command. And, once again, we see that our data has 12 columns.
germany_customers.shape
Each of these individual data sets is quite small. We have 10 rows corresponding to
customers in Germany. In exactly the same fashion, we read in data for customers in
Israel. We have 5 customers and the same 12 columns.
israel_customers.shape
We move on to South Africa where we have 4 customers and the same 12 columns.
southafrica_customers.shape
turkey_customers.shape
usa_customers.shape
On screen now, you can see that we've invoked the pd.concat command and passed
in a list. That list has individual data frames as its elements, and the result is
also a data frame. This resulting data frame has all of the customers for Australia and
the USA. Now one important note about pd.concat. Take a close look at the row
labels. You can see that the index values of the row labels from the original data
frames have been preserved. For instance, the first row on screen now has a row
label of 8, and that is because it had the value 8 in the underlying data frame of US
customers. So this particular way of invoking pd.concat is really simple and really
convenient, but it does lead to misaligned, duplicated row labels.
We'll have to come back to this in a moment. Let's examine the shape of this
resulting concatenated data frame. We can see that this has 36 rows and 12
columns.
aus_and_usa.shape
This is no surprise. You might recall that we had data for 8 customers in Australia and
28 customers in the USA. 8 plus 28 is equal to 36, and that's why our resulting data
frame has 36 rows. We have 12 columns in this data frame because that's the same
number of columns that we had in the Australia customers as well as in the USA
customers data frames. So when we concatenate data frames this way, we are splicing
the rows together vertically.
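Here is a minimal sketch of this row-wise concatenation. The CSV file names and the Country column label are assumptions; the row and column counts are the ones quoted in the demo.

import pandas as pd

australia_customers = pd.read_csv('australia_customers.csv')  # hypothetical paths
usa_customers = pd.read_csv('usa_customers.csv')

# Stack the rows; the original row labels are preserved, so duplicates can appear.
aus_and_usa = pd.concat([australia_customers, usa_customers])

aus_and_usa.shape                       # (36, 12): 8 + 28 rows, same 12 columns
aus_and_usa['Country'].value_counts()   # 28 US rows, 8 Australia rows

# Passing ignore_index=True to pd.concat would instead generate a fresh 0..n-1 index.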
Let's now go ahead and compute the value_counts on this combined data frame. We
do this on the country column. And we can confirm that we have 28 rows where the
country is US and 8 where it's equal to Australia. Now let's try a different way of
invoking pd.concat.
This time we're going to try and make it very clear which rows come from which
underlying data frame. We've invoked pd.concat passing in a list. That list has two
data frames for Canada customers and Germany customers. But the difference now
is that we have an input argument called keys. These keys are going to be applied to
the corresponding elements of the list and they will be used in order to create a
multi-index data frame.
Let's see how this works in action. Let's execute this code. And we see that the result
has not one but two index attributes. The first index attribute has the value of either
Canada or Germany. The second index attribute has the original row label from the
underlying data frame. And now we see why those keys are so useful. One advantage
is that we can now clearly trace which of the underlying data frames contributed a
particular row to the combined data frame. The other advantage of using keys is that
we now have no ambiguity caused by duplicate row labels. pd.concat has some input
arguments which don't work quite as intuitively as you might hope. For instance, on
screen now we've invoked pd.concat with three data frames
customer_details.sample(10)
corresponding to the customers from Australia, the USA, and Canada. The difference
is that this time we've specified sort=True. As before, we've also specified keys
corresponding to these three countries. When we run this code, we see that there
doesn't seem to be any discernible sorting that has been carried out on the rows.
And the reason for that is that the sorting here has been applied to the columns. If
you look really closely, you'll see that the columns now appear sorted in
lexicographical order. You might quibble that it would have been a lot more
convenient for these rows to have been sorted rather than the columns, but that's
just what you get with such a powerful and flexible library.
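To recap the keys and sort behaviour just described, here is a sketch that reuses the per-country frames from the earlier sketch (canada_customers is read the same way); the key labels are my own choice.

import pandas as pd

canada_customers = pd.read_csv('canada_customers.csv')  # hypothetical path

# keys= labels each input frame, producing a two-level row index:
# (country key, original row label).
customers_keyed = pd.concat(
    [australia_customers, usa_customers, canada_customers],
    keys=['Australia', 'USA', 'Canada'],
    sort=True   # note: sorts the column names lexicographically, not the rows
)

customers_keyed.sample(10)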
We made the point earlier that concat is used for splicing rows together vertically.
canada_and_germany.head()
But it's also possible to use concat in order to stack data frames horizontally next to
each other. On screen now we've invoked pd.concat. We've passed in a couple of
data frames and we've specified axis=1. Now, as you're aware, in all of Pandas, axis
equal to 0 refers to row-wise operations, axis equal to 1 refers to column-wise
operations. We can see when we run this code that Pandas has stacked the two data
frames next to each other horizontally, and this means that the result has not 12 but
24 columns. This stacking has been performed using row labels. So the first row,
which has label 0, has customer name equal to Bill Cuddy. But then if we scroll way
over to the right, we see that the customer name appears again, and this time it's
Cornelia Krahl.
That's because Bill Cuddy was a customer with label 0 from the Canada customers
and Cornelia Krahl was the customer with label 0 from the Germany customers. No
doubt there are going to be some use cases for this particular functionality. But,
most generally, when you want to combine data frames horizontally, you will want
to make use of the DataFrame join method or pd.merge. We'll get to those in the next demo. And in case
you're wondering what happens when you try to horizontally concat two data
frames with a differing number of rows, well, the shorter data frame is padded with
NaNs. For completeness, let's also explore another way of concatenating data
frames, this is by using the .append method.
za_and_il = southafrica_customers.append(israel_customers)
za_and_il.sample(5)
On screen now, we've invoked the .append method on South Africa customers and
we've appended Israel customers. The result of this is going to be to take all of the
rows from the Israel customers data frame and splice those into the South Africa
customers data frame. You can see from the result that we have duplicate row
labels. So the five rows that we see on screen now have labels corresponding to 0, 1,
2, and then 1 again, followed by 3.
In this instance, we are able to infer the source of the corresponding row from the
country column that either has the value ZA or IL. We can fix these duplicate row
labels, and we'll do so in a moment. But first let's go ahead and append yet another
set of rows. We are going to append rows corresponding to Turkey and Canada
customers into this data frame.
customer_details.head()
And the combined data frame is going to be called customer details. Note that we
now have rows corresponding to four different countries. Let's fix the overlapping
row labels. We do this by using the reset index method, which you see on screen
now.
customer_details = customer_details.reset_index(drop=True)
customer_details.head()
Note that we've specified drop=True, which means that the old overlapping labels are
going to be dropped. An entirely fresh set of row labels is going to be generated,
starting from 0. And we can see from the result of the head command that the row
labels do indeed start from 0 and increase contiguously. Let's also confirm the shape
of this customer_details data frame. We check the .shape property
customer_details.shape
and we can see that our data frame has 31 rows and 12 columns.
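Here is a sketch of the append-and-relabel pattern just described. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the equivalent pd.concat call is shown alongside it; the file names are placeholders.

import pandas as pd

southafrica_customers = pd.read_csv('southafrica_customers.csv')  # hypothetical paths
israel_customers = pd.read_csv('israel_customers.csv')
turkey_customers = pd.read_csv('turkey_customers.csv')
canada_customers = pd.read_csv('canada_customers.csv')

# Older pandas (as in this demo): chained .append calls.
# customer_details = southafrica_customers.append(israel_customers) \
#     .append(turkey_customers).append(canada_customers)

# Current equivalent: a single pd.concat over the list of frames.
customer_details = pd.concat(
    [southafrica_customers, israel_customers, turkey_customers, canada_customers]
)

# Replace the overlapping row labels with a fresh 0..n-1 range.
customer_details = customer_details.reset_index(drop=True)
customer_details.shape   # 31 rows, 12 columns in the demo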
We now know how to combine rows from different data frames together so that we
preserve the number of columns and increase the number of rows. This is a vertical
combination of data because the table or data frame grows vertically in the number
of rows, it does not grow horizontally in the number of columns. In the demo coming
up ahead, we will turn our attention to horizontal data combination using joins.
In this demo, we will turn our attention from concatenating data row wise to joining
data on columns. Let's begin by reading in data from a file called customer.csv.
customer_details.head(5)
You can see that we have various columns here, which represent information about
one customer. Let's filter out those columns so that we've got a simpler set of
columns to deal with.
customer_details.head()
So we use the double square brackets operator and have our customer_details data
frame hold only four columns, Customer_ID, Country, Customer_Name, and
Street_ID. Next, let's read in another data table, this one is going to have product
details rather than customer details.
product_details.sample(5)
We read this in from the file PRODUCT_DIM.csv. As before, this has many more
columns than we are interested in. So we again make use of the double square
brackets operator to restrict this data frame to only hold Product_ID, Product_Name,
Supplier_Country, and Supplier_ID.
product_details.head()
You might be thinking that many database join examples use exactly this kind of
schema: customers and products, combined using orders. And you'd be exactly right.
Now that we have customer details and product
details, the next item on the agenda is to read in order details.
order_details.sample(10)
The orders table is going to correspond to a linking table, a relationship table, which
has attributes of a customer, as well as of a product. On screen now, is a sampling of
10 rows from order details. You can see that the first column is Customer ID. Then, if
we scroll way over to the right, we can see that there is a Product ID column in there
as well. Joins are a fairly involved topic, so let's keep our dataset simple, and let's
restrict order details to a smaller number of columns.
order_details.head()
On screen now you can see that we've again made use of the double square brackets
operator, and restricted the number of columns in the order details table. We've
been very careful to leave in there the Customer ID.
So this is the column that order details and customer details have in common. The
Product_ID which is again the linking column between order details and product
details. And finally, the Order_ID itself, which can be thought of as a key attribute for
the orders table. We've also retained columns for a Quantity, Total_Retail_Price, and
the Street_ID. We've now set the stage for us to make use of pd.merge. pd.merge is
a very powerful Pandas function, and this is the function that
merge_product_order.head()
you are going to use when you wish to perform database style joins. We have
invoked pd.merge with two data frames, product_details and order_details. Notice
that we have not specified a join column. What do you think is going to happen?
Well, there are any number of reasonable choices here. One choice could be for
pd.merge to join or to connect these two data frames on the basis of their index
labels. So that's a very reasonable guess, but that's not what pd.merge does by
default. By default, if we do not specify a join column, pd.merge is going to look for
columns which have the same name in the two data frames. Here, product details
and order details both do have a column with the same name, and that column is
Product_ID. This kind of a join is called a natural join, because it's based on a natural
relationship between two columns which have the same name.
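Here is a minimal sketch of that natural join on Product_ID. The order file name is a placeholder (the transcript does not name it), and the column selections follow the narration.

import pandas as pd

product_details = pd.read_csv('PRODUCT_DIM.csv')[
    ['Product_ID', 'Product_Name', 'Supplier_Country', 'Supplier_ID']]
order_details = pd.read_csv('ORDER_FACT.csv')[    # hypothetical file name
    ['Customer_ID', 'Order_ID', 'Product_ID', 'Quantity',
     'Total_Retail_Price', 'Street_ID']]

# With no join column specified, pd.merge joins on the column the two frames
# share by name, here Product_ID; the default join type is inner.
merge_product_order = pd.merge(product_details, order_details)

# Equivalent, with the join column spelled out explicitly.
merge_product_order = pd.merge(product_details, order_details, on='Product_ID')

merge_product_order.shape   # (617, 9) in the demo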
So here we've effectively performed a natural join on the Product_ID column. You
can see that the Product_ID column only appears once in the resulting data frame,
and it's way over on the left right next to the row IDs. These row IDs have been
automatically generated starting from 0. So it's important to note that, by default,
pd.merge simply ignores the row labels of the underlying data frames being joined.
Let's check the shape property, and we can see that our merged data frame has 617
rows and 9 columns.
merge_product_order.shape
The nine columns make sense. Product details had four columns, order details had
six columns, and when the two were merged there was a de-duplication that got rid
of one copy of the linking or join column, and that's why the result has nine
columns. Let's now rerun pd.merge, but this time with one additional argument: the
on input argument.
merge_product_order.head()
Please note that this will only work if that Product_ID column is present in both of
the data frames that are being joined. This has the exact same effect as the default
operation which we had a moment ago. We can see that the output is identical. We
can also run the shape command, we can see that this output also has 617 rows and
9 columns.
merge_product_order.shape
Next, let's move on to other types of joins other than natural joins. Before we do so,
let's rename the column Product_ID, let's call it PID.
product_details_renamed = product_details.rename(columns={'Product_ID': 'PID'})
product_details_renamed.sample(5)
merge_product_order.head()
No worries, we can still perform the join; all we have to do is explicitly specify
the names of the join columns in both of the join tables. This is exactly what's being
done on screen now. We invoke pd.merge with two data frames. There's the left
data frame which is product details renamed. And there's the right data frame which
is order details. Then we have the left_on column which is PID, and the right_on
column which is Product_ID.
This is an example of a join which is not a natural join, and that's because the join
columns have different names in the two tables. Now, when we run this join
operation, we're going to find that the resulting data frame has a column named PID,
and then way over towards the right, it's also got a column named Product_ID. And
we can see that the values in these two columns are always identical. And of course,
that's not a coincidence. These two columns have identical values because we've
performed the join on equality of these two columns. Let's now check out the shape
of the result. We find that it has 10 columns instead of 9. The reason for this
is that there were no longer two identically named versions
merge_product_order.shape
of the join column, and that's why both of those columns were preserved.
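Here is a sketch of the same join expressed with left_on and right_on, reusing the renamed frame and order_details from above.

# When the join columns have different names, name them on each side.
merge_product_order = pd.merge(
    product_details_renamed,   # left frame: join column is PID
    order_details,             # right frame: join column is Product_ID
    left_on='PID',
    right_on='Product_ID'
)

# Both PID and Product_ID survive in the output, hence one extra column.
merge_product_order.shape   # (617, 10) in the demo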
merge_order_customer.head()
This time we are using pd.merge to join order_details and customer_details. And
that join is going to happen on two columns, Customer_ID and Street_ID. Note again,
that whenever we use the named input argument called on, this is only going to
work if the join attributes are present in both of the join tables. In this particular
instance, that's not a problem, because Customer_ID and Street_ID were columns in
both of the input data frames. Let's also examine the shape property of the
joined relation. We can see from the output that we now have only eight columns,
and that's because there were two join attributes. And what's more, those two join
attributes had the same names in both of the underlying data frames, and so they
were both de-duplicated.
merge_order_customer.shape
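And a sketch of the two-column join just described, again reusing the frames from above; on= accepts a list when the match should be on several columns at once.

# Join orders to customers on the combination of Customer_ID and Street_ID.
merge_order_customer = pd.merge(
    order_details,
    customer_details,
    on=['Customer_ID', 'Street_ID']
)

# Both join columns are de-duplicated, so the result has 8 columns.
merge_order_customer.shape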
We now have a good basic understanding of how pd.merge can be used to perform
database style joins in Pandas. In the demo coming up ahead, we will continue with
our exploration of pd.merge, we'll see how left, right, inner, and outer joins can be
implemented. And we will also see how pd.merge and its close cousin, the DataFrame join method, are
both similar and somewhat different.
In this demo we will continue with our exploration of join functionality in Pandas.
We've already worked with pd.merge. We will continue working with pd.merge and
explore different types of joins, and then we'll explore the DataFrame join method so that we can
understand exactly what's going on. Let's create even simpler stripped down
versions of dataframes to work with.
product_details_subset.shape
order_details_subset.shape
Once again, we are only choosing order IDs where the product ID is in a specific list.
Notice how the shape property of order details subset tells us that this has 6 rows
and 6 columns. We've only specified five product IDs, but we have 6 rows. How is
that possible? Well, it's because one of the product IDs is referenced in more than
one of the orders. This list is again filtered using the .isin method, but the difference
is that some but not all of the product IDs are the same as in the previous command.
Let's take a minute to analyze them. The first product ID, which ends with 0017, is
present in both product details subset and order details subset.
This is also true for the second product ID, which ends with 0006, as well as the third
product ID, which ends with 0023. Then, things get different. The fourth product ID
in product details subset ends with 0009. That is not present in order details subset,
and neither is the fifth and last product ID which ends with 0046. So those two
product IDs are present in product detail subset, but they are not present in order
details subset. We have been very careful, very deliberate about how we construct
these dataframes, and you'll see why in a moment. Coming now to the last two
product IDs in order details subset, these end with 0121 and 0116. Neither of these
two product IDs is present in product details subset. Now that we've carefully set up
our little dataframes, let's go ahead and perform various join operations on them.
product_order_inner_join = pd.merge(product_details_subset, order_details_subset,
    on='Product_ID')
product_order_inner_join
The first join is one which we have encountered before. This is an inner join. This
inner join is on the column named product ID.
Remember that we can specify the on parameter, when a particular join column
appears in both of the join relations. The name inner join is used here because the
two columns product ID in the two tables are going to be compared for equality. And
whenever there is a match, and that match is going to be based on equality, a row
will appear in the output. The inner join is conceptually very similar to the
intersection set between two dataframes with the important difference that the
intersection operation removes duplicates. However, in SQL, we deal with bags
rather than sets, and that's why the inner join maintains duplicates. Now you might
be wondering why this behaves as an inner join when we never specified a join type.
And that's because the default type of join performed by pd.merge is an inner join.
To confirm this, let's reinvoke pd.merge, this time with a new named argument. This
one is called how, and we've specified how is equal to inner.
product_order_inner_join = pd.merge(product_details_subset, order_details_subset,
    on='Product_ID', how='inner')
product_order_inner_join
When we run this code, we get the exact same result as we got when we left out the
how argument. And this proves that by default pd.merge is going to perform an
inner join. Notice again that the only rows included in the output are those rows for
product IDs in common between the two tables. So we have none of the rows in
there where the product ID ends with 0009, 0046, 0121 or 0116. These are the four
product IDs which only appear in one of the two relations being joined. Next, let's
change the value of that how parameter and use that to perform a left outer join.
product_order_leftouter_join = pd.merge(product_details_subset, order_details_subset,
    on='Product_ID', how='left')
product_order_leftouter_join
On screen now you can see that we've invoked pd.merge, once again, we've passed
in the same two relations. Once again, we have the same value of the on attribute.
That's product ID. All that's changed now is that how is equal to left. When we run
this code, we can see the output of the left outer join. In this join operation, the left
relation is product details subset. And so the left outer join has preserved every one
of the product IDs from the relation on the left, whether or not it had a match in the
relation on the right. And that's why we now see that a couple of new product IDs
appear in the result.
Specifically, these are the ones ending in 0009 and 0046. These are the product IDs
which were present in product details subset, but not present in order details subset.
What about all of the columns from order details subset? Well for these two rows,
those columns have all been populated with NaNs. So the left outer join preserves all
of the rows from the relation on the left, whether or not they match with a relation
on the right. What do you think the right outer join is going to do? Well, you guessed
it, it's going to flip this so that all of the rows from the relation on the right are
preserved instead. Let's go ahead and run this right outer join. All we did is change
the value of the how parameter to be right rather than left.
product_order_rightouter_join = pd.merge(product_details_subset, order_details_subset,
    on='Product_ID', how='right')
product_order_rightouter_join
We can now see that there are rows in there for product IDs ending with 0121, and
0116. These were the product IDs which appeared in order details subset but were
missing from product details subset. You can see that the columns from product
details subset have been filled with NaNs for these two rows. Specifically, you can
see that Product_Name, Supplier_Country, and Supplier_ID are set to NaN for the
product IDs ending in 0121 and 0116. Notice that we no longer have any rows in the
result for the product IDs ending with 0009 and 0046.
That's because those product IDs appear in the relation on the left, but they are not
in the relation on the right. We've now successfully demonstrated how to perform
both left and right outer joins using pd.merge. Let's move on to full outer joins in the
demo coming up next.
Let's pick up right from where we left off at the end of the last demo. We had seen
how a left outer join preserves all rows from the data frame on the left. A right outer
join preserves all rows from the data frame on the right. What if we'd like to have all
rows from both relations, padded with NaNs as necessary? Well, all we
need to do is to change the type of join, we'll do a full outer join instead of a right
outer join. All we need to do to accomplish this is tweak the value of how
product_order_fullouter_join = pd.merge(product_details_subset, order_details_subset,
    on='Product_ID', how='outer')
product_order_fullouter_join
to be outer instead of either left or right. You can now see the results of the full
outer join. We now have rows in there for every product ID from either one of the
two relations. If it did not appear in the relation on the left, then the columns from
that relation are padded with NaNs.
You can see this for the product IDs ending with 0121 and 0116. And similarly, if it
did not appear in the relation on the right, those columns are going to be padded
with NaNs. And this is the case for product IDs ending with 0009 and 0046. We've
now successfully learned how to implement every one of the important types of
joins using pd.merge. Now, let's turn our attention to another very powerful but less
well known feature of pd.merge, and this is validation. On screen now we've invoked
pd.merge with a new
product_order_inner_join = pd.merge(product_details_subset, order_details_subset,
    on='Product_ID', validate='one_to_many')
product_order_inner_join
input argument named validate. We've set validate to one_to_many. Notice that
we've also dropped the how parameter, so this is going to be an inner join. This
validate input argument is going to check while the join is being calculated whether
the relationship is indeed one to many. You might recall that while constructing
these two data frames, we had carefully chosen a Product_ID so that it matched
with two rows in order details subset. So that relationship is indeed one to many. And so,
when we run this, it goes through successfully. We can see from the results of the
inner join, that the Product_ID which satisfies the one to many relationship ends
with 0017. That appears in two of the rows in the inner join result. Similarly, let's
perform a join between customer_details and order_details.
We specify validate to be one_to_many, and this also works just fine.
Next, let's try the same join, this time with validate set to many_to_one. Now we get
an error saying that merge keys are not unique in the right data set. And so,
this is not a many to one merge. This, of course, makes sense, because
the relationship between customers and orders is not many to one. We are not going
to have many customers who pool money to place one order. That's just not how e-
commerce works. That's why the validation directly flagged this as an error. We've
now worked through several variations of pd.merge, it's time to introduce the join
method of the data frame object. On-screen now, you can see that we've invoked
the join method on one data
order_customer_join = order_details.join(customer_details, lsuffix='_left', rsuffix='_right')
order_customer_join.sample(5)
frame, that's the order details data frame. And we've passed in another data frame,
that's customer details.
The join method is going to perform the join based on the row labels of the two
relations or the two data frames. This is not what we usually have in mind if we
are coming from a database or SQL background. And that's why the join method is
used far less often than the pd.merge method. It's just much more common to carry
out join operations based on columns in the data, rather than on the index or the
labels. In any case, it's good to know how the join method works. There's one other
interesting bit to note here. Notice how we have two input arguments called lsuffix
and rsuffix. As their names would suggest, those suffixes are going to be tacked on
to the names of any columns which are present in both of the joined data frames.
In the result, we can see, for instance, that we have a column called
Customer_ID_left and that's because the lsuffix was _left. If we look over to the right
we can see that there is also a Customer_ID_right. And that's because of the rsuffix
being _right. If we compare the values of these two columns, we can see that there
is no relationship at all. The first row which is labelled 229, has Customer_ID_left of
70210, and Customer_ID_right of NaN. So pretty clearly, it's not like the join is being
performed on any of the columns in the data set. Again, remember that the join
method performs the join based on the row labels of the two data frames. Even so, it
is possible to perform an inner join using the join method. You can see how this is
done on screen now.
order_customer_join = order_details.set_index('Customer_ID') \
    .join(customer_details.set_index('Customer_ID'),
    lsuffix='_left', rsuffix='_right')
order_customer_join.sample(5)
What we've done is first set the index of the order details data frame to be the
Customer_ID column. Then, we've set the index of the customer details dataset to be
the Customer_ID column. And then we've gone ahead and joined these two. We can
now see from the result that the Customer_ID column has become the row label,
so it no longer needs to be disambiguated with suffixes.
We can also compare the Street_ID_left and Street_ID_right columns to confirm that
this join is now indeed a natural join on the Customer_ID. This was still a rather
roundabout way of achieving a natural join. And that's why the join method is used
much less frequently than the pd.merge method. What if we'd like to perform a join
using the join method, but on multiple attributes? Well, that's possible as well.
order_customer_join.sample(5)
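The on-screen code for this multi-attribute version is only partially captured above, so here is a sketch of what the invocation looks like, pieced together from the narration; treat the exact form as an assumption.

# Make Customer_ID and Street_ID the index of both frames, then join on that
# two-level index. The suffixes are kept in case any other column names overlap.
order_customer_join = (
    order_details.set_index(['Customer_ID', 'Street_ID'])
    .join(customer_details.set_index(['Customer_ID', 'Street_ID']),
          lsuffix='_left', rsuffix='_right')
)

order_customer_join.sample(5)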
You can see how this is done on screen now. We've used set index to turn both order
details and customer details into multi-index data frames. We've specified the same
index attributes for both of these data frames, they are Customer_ID, and Street_ID.
And we've then gone ahead and used the join method to combine them. The results
appear down below, and you can now see that both Customer_ID and Street_ID
show up as index labels. In this way, we have jerry-rigged an inner join on multiple
join attributes using the join method. This brings us to the end of our exploration of
data combining techniques in Pandas. We saw how to combine data both
horizontally and vertically. Most recently, we explored different types of joins using
pd.merge. And we also saw how the DataFrame join method and pd.merge are closely related but
somewhat different.
We are getting close to the end of our exploration of pandas. Before we are done
though, it's worth our while to spend some time understanding how pandas can be
used to work with time series data. Because this is a brand new Jupyter Notebook,
let's go ahead and carry out a number of important import statements.
import dateutil
import random
Specifically, we import the dateutil and random modules, then from datetime, we
import datetime, timedelta, date, and time. In addition from dateutil, we import
parser, and then we come to the more familiar import statements of pandas as pd
and numpy as np. We begin by making use of the pd.date_range function.
hourly_periods
This is going to take in a start and an end date as well as a frequency, and generate a
large number of periods of that frequency. Here, we've specified a start date of
2018-06-05, the end date is the 2019-06-05. And the frequency is denoted with the
H, which means that each of those periods is going to last one hour. We can see that
there are 8,761 hourly periods. The first of these starts at midnight on the 2018-06-
05 and ends at 1 AM on that same night, and so on from there. Notice also how each
one of these values has the dtype datetime64[ns]. Now that we have these hourly
intervals, let's package them up nicely in a pandas data frame.
toy_stock_data = pd.DataFrame({'hourly_ticks': hourly_periods})
toy_stock_data.head()
We've made use of the pd.DataFrame method in order to create a data frame called
toy_stock_data. This data frame has just the one column called hourly_ticks and the
values within that column are taken from the hourly periods, which we just
generated. We invoke the head method as usual and we can see that our data frame
starts at midnight on the 2018-06-05. We then go ahead and examine the other end
of the data frame using the tail method.
toy_stock_data.tail(1)
And we can see that the hourly_ticks go on all the way until the end of the 2019-06-
04. The last interval precisely coincides with midnight on the 2019-06-05. Next, let's
go ahead and generate some random numbers drawn from a uniform random
distribution.
toy_stock_data['stock_price'] = [random.uniform(100., 150.)
                                 for i in range(toy_stock_data.shape[0])]
toy_stock_data.head()
This is done using random.uniform and we specify the start and the end values. The
lowest value is 100, the highest value is 150. random.uniform is going to pick
numbers from this range, such that any given value has the same probability of being
drawn. There's more to this, however. We are making use of a list comprehension in
order to generate one random number for each row in our toy_stock_data.
then once we are done generating these, we assign them into a new column of the
data frame, which is called stock_price. When we run this code, we can see that our
data frame now has two columns rather than one, we still have the hourly_ticks in
there. But in addition, we also have the stock_price. Of course, this stock_price is
entirely made up. The values there were just generated from random.uniform. Let's
use the info method to get metadata about our data frame.
toy_stock_data.info()
This gives us information about the memory usage, which is 137 KB. We can also see the
dtypes. We have one datetime64[ns] and one float64, that's the one containing the
stock_price. The next operations we're going to perform are really interesting.
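Before moving on, here is a consolidated sketch of the toy data frame built so far. The variable names follow the transcript, but treat the block as an illustrative recap rather than the exact on-screen code.

import random
import pandas as pd

# One year of hourly timestamps: 8,761 periods from midnight on 2018-06-05
# through midnight on 2019-06-05, inclusive.
hourly_periods = pd.date_range(start='2018-06-05', end='2019-06-05', freq='H')

toy_stock_data = pd.DataFrame({'hourly_ticks': hourly_periods})

# A made-up stock price: one uniform random value in [100, 150) per row.
toy_stock_data['stock_price'] = [random.uniform(100., 150.)
                                 for _ in range(toy_stock_data.shape[0])]

toy_stock_data.info()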
toy_stock_data.sample(10)
We are now going to add two new columns to our data frame: one of these is called
date and the other is called time. Each of these is obtained by extracting the
corresponding component of the datetime in the hourly_ticks column. Remember that
each of those hourly_ticks values is of dtype datetime64[ns], and that means that we
can access the date alone by using the .dt.date property. In similar
fashion, we can access the time alone by using the .dt.time property. We run this
code and we see that our data frame now has four columns, the new additional
columns are date and time. In exactly the same fashion, let's go ahead and add
columns corresponding to the year, month, and day of month of each date.
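Here is a compact sketch of these .dt extractions, including the hour, minute, and second columns that get added shortly afterwards; the new column names are my own choice and may not match the on-screen names exactly.

ticks = toy_stock_data['hourly_ticks']

toy_stock_data['date'] = ticks.dt.date
toy_stock_data['time'] = ticks.dt.time
toy_stock_data['year'] = ticks.dt.year
toy_stock_data['month'] = ticks.dt.month
toy_stock_data['day'] = ticks.dt.day
toy_stock_data['hour'] = ticks.dt.hour
toy_stock_data['minute'] = ticks.dt.minute
toy_stock_data['second'] = ticks.dt.second

toy_stock_data.sample(10)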
toy_stock_data.sample(10)
We once again make use of the .dt property of the datetime64[ns] object. And from
that, we parse out the corresponding component of the datetime. Extracting all of
these datetime components can come in handy when looking for seasonality or for
other clues about whether there are recurring patterns in our stock price data. We
can continue and also extract the hour, minute, and second, and
toy_stock_data.sample(10)
we've done that, and added three additional columns to our data frame. Now that
we have a good handle over how datetime objects can be parsed in Python, let's
switch to some more real data. We are now going to read in a CSV file, which
contains Tesla stock data. This data was downloaded from Yahoo Finance, we read it
in using pd.read_csv
tesla_data.head()
into a data frame called tesla_data, and then we examine the head and tail.
tesla_data.tail()
If you've worked with Yahoo Finance a fair bit, you will recognize this set of columns.
Date, Open, High, Low, Close, Adjusted Close, and Volume. We have one row of this
information for each date. Let's check the shape property of this data frame. We can
see that we have 2,416 rows and 7 columns.
tesla_data.shape
Let's also examine the index: this is a RangeIndex, which starts at 0 and stops at
2416, with a step size of 1.
tesla_data.index
Now, we'd like to replicate all of the datetime analysis we had just performed with
the toy_stock_data with this data. And for that, we've got to make sure that our date
column is of the right dtype.
tesla_data.dtypes
When we examine the dtypes of our data frame, however, we see that the date
column is still of type object. In other words, it's still being interpreted as a string,
we've got to change that. In order to transform the type of this column, we are going
to use the apply function. That apply function is going to take in a function object,
and apply that
tesla_data.dtypes
function object to every value in the date column of our tesla_data. The function we
pass in is dateutil.parser.parse, so it's going to take in a string and output a datetime.
But there's more to this.
After transforming all of these strings into datetime objects, we go ahead and
reassign this column into the date column of our data frame. And the result of this is
that we've performed a change in place on our data frame. So when we re-examine
the dtypes, we can now see that the date column has dtype datetime64[ns]. This is
exactly what we wanted to see. The next step is a really important one and it's a step
which often gets neglected by folks who are new to working with dates and times in
pandas.
tesla_data.tail()
We've got to sort our data on the date column. We do this by setting ascending=True.
This is an in-place sort, so inplace=True as well. Performing sorts when we're working
with datetime data is really important. This is because we usually need to perform
various operations which relate to different rows. Most commonly, computing log
returns or percent changes. If we've not sorted our data correctly before we
compute those percent returns, or log returns, we are going to end up with serious
bugs in our program. Let's hit Shift+Enter and examine the tail of this data frame, and
we can see that it does indeed end with the last dates in our sample. The last five
dates range from the 2020-01-28 to the 2020-02-03. We are now ready to go ahead
and recreate all of the analysis we performed with our toy_stock_data.
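To recap the conversion and sort steps just described, here is a sketch; the CSV file name is a placeholder and the column is assumed to be called Date, as listed above.

import dateutil.parser
import pandas as pd

tesla_data = pd.read_csv('TSLA.csv')   # hypothetical file name for the Yahoo Finance export

# Convert the Date column from strings to datetime64[ns] values.
tesla_data['Date'] = tesla_data['Date'].apply(dateutil.parser.parse)

# Sort chronologically, in place, before computing returns or any other
# row-to-row quantities.
tesla_data.sort_values(by='Date', ascending=True, inplace=True)

tesla_data.tail()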
tesla_data.head()
So on screen now, you can see that we've added columns for the day, month, and
year. In each case, we've just used the .dt property of the datetime64[ns] column.
tesla_data.head()
In the past, we've spoken about how it's dangerous to perform inplace sorts of a
data frame because that causes system assigned row labels to get out of sync. So
let's take care of that problem.
Let's go ahead and get rid of the system generated labels and replace them with the
date column. On screen now, we've invoked the set_index method. We are
specifying the date column as what should be used for the labels of each row, and
we have inplace=True. When we run this code, we get rid of the system generated
index values, and we now have the index values taken from the date. This works fine
because we only have the one reading, only the one row for each date. Let's confirm
that this is the case by verifying the .index property and we now see that this has a
DatetimeIndex. So we still have 2416 index values, but they no longer represent the
integers 0 through 2415. We can now make use of the loc method in order to look up
tesla_data.index
tesla_data.loc['2010-07-1']
we've specified the date of the 2010-07-1, we then passed this into the loc property,
and we get back all of the details for this particular date. We can see that the
adjusted close for Tesla on this date was $21.96. It really puts things in perspective to
see how much things have changed in the space of a decade. Now that our index is a
date column, we can also make use of a really powerful method called resample. On
screen now, you can see that we've changed the frequency of our data from
monthly_mean
daily to monthly.
To do this, all we had to do was to invoke the .resample method and pass in a new
frequency. That frequency was specified by M, which is interpreted as monthly
sampling. After we change from daily to monthly sampling, we select the close
column. We do so using the double pair of square brackets and then we invoke the
mean method. This gives us the average monthly close. In other words, the mean is
going to be applied as an aggregate function groups of rows. Each row is going to
correspond to the closing values for one month at a time. This resample method is a
really powerful one in pandas. We can also perform all of the other group by and
aggregate operations, which we have become quite familiar with by now.
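Here is a sketch of the date-indexing and resampling steps just covered; monthly_mean is the name suggested by the on-screen fragment, and the Date and Close column names are assumed from the earlier column listing.

# Use the Date column as the row labels, then look rows up by date.
tesla_data.set_index('Date', inplace=True)
tesla_data.loc['2010-07-1']            # all readings for that trading day

# Downsample from daily to monthly ('M' is month-end frequency) and
# average the Close column.
monthly_mean = tesla_data.resample('M')[['Close']].mean()
monthly_mean.head()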
On screen now, you can see that we are grouping by year and then calling the .first
method in order to get all of the readings for the first row in each year. Let's run this
and we can see from the adjusted close how in 2010 the first trading day had Tesla
close at $23.8899. But by 2020, the first day had Tesla close at $430.26.
In similar fashion, we can apply the mean aggregate function after grouping by year.
And when we do this, we will get the average for each numeric column. We can see
from this, for instance, that for the 2020 dates in our sample, Tesla had an
average close of $540.03. We can also perform group by operations on multiple
columns.
We've grouped by year and month before computing the mean and
we now have drilled down or double-clicked into our data. All of these calculations
have been performed on price. And as you might be aware, it's not a good practice
to perform statistical time series analysis on price because price is not a stationary
process. And that's why it's always considered good practice to convert prices into
returns using either log returns or percent returns.
tesla_percentage_diff
On screen now, you can see the pct_change method, which every pandas data frame
supports. This is a one step way of converting prices into returns. pandas has a lot
more functionality for working with time series.
For instance, you can see how easy it is to perform a plot operation of the returns we
just computed. It's also really easy to compute various types of windows.
rolling_window = tesla_percentage_diff.rolling(window=4).mean()
rolling_window.head(8)
rolling_window_std = tesla_percentage_diff.rolling(window=4).std()
rolling_window_std.head(8)
This is going to calculate the rolling mean and the rolling standard deviation of the Tesla returns we just computed.
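Here is a sketch of the returns and rolling-window calculations. Whether pct_change was applied to the whole frame or to a single price column is not visible in the transcript, so the column selection below is an assumption.

# Percent returns from the adjusted close prices (column choice assumed).
tesla_percentage_diff = tesla_data[['Adj Close']].pct_change()

# Quick visual check of the return series.
tesla_percentage_diff.plot(figsize=(12, 6))

# Four-period rolling statistics over the returns.
rolling_window = tesla_percentage_diff.rolling(window=4).mean()
rolling_window_std = tesla_percentage_diff.rolling(window=4).std()
rolling_window.head(8)
rolling_window_std.head(8)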
We could go on and on with this topic, but we've got to end somewhere. And this
gets us to the end of our exploration of time series analysis and of pandas'
formidable capabilities.
We have now come to the end of this course Cleaning and Analyzing Data in Pandas.
We started by exploring the use of the dot duplicated method, which, as its name
would suggest, checks for duplicate values in a Pandas dataframe. We then learned
how the drop duplicates method can be used to eliminate those duplicates. And also
how the Pandas cut method can be used to discretize data into bins. Those bins can
then be used to easily plot a histogram, and identify outliers using a box and whisker
plot. We then spent some time learning an important skill while using Pandas, how
to correctly work with date strings. We saw how to convert string columns to
datetimes using the dateutil Python library. And then moved on to the use of the group
by method to compute aggregates on groups of rows in our data.
We learned how to use the Pandas get_group method to access a group of rows
grouped together by a condition. We then configured the group by to perform an
aggregation operation. Having experimented with group-by, we then saw how pivot
tables could be created in Pandas. As we learned, this can be done using either the
dot pivot method or the dot pivot table method. We saw that the dot pivot method
differs from Excel-style pivots because it does not perform an aggregation operation.
In contrast, the dot pivot table method does perform an aggregation and so is a lot
more intuitive.
Finally, we moved on to the use of data joins in Pandas. Joins are a crucially important
topic in working with relational data and are also ubiquitous in spreadsheets, where
they are implemented via functions such as VLOOKUP in Excel. So this was an
important topic for us to master. We used the Pandas merge method to perform
inner, left outer, right outer, and full outer joins. We also briefly discussed the use
of the Pandas join method and saw that it's a lot less powerful and popular than the
merge method.
As we learned, the join method joins on labels in the dataframe index, while the merge
method can join on any subset of columns. We also saw how the data in two
dataframes can be combined vertically using the concat method. This brings us to
the end of this learning path. You now have a solid knowledge of the most important
aspects of data analysis using Pandas.
You are well positioned to use various powerful analysis techniques to implement
robust yet simple solutions for your specific use cases. That's it from me here today.
Thank you for watching.