Combining Datasets: Concat
and Append
Prepared by R. Akila, AP (SG), BSACIST
Reference: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
Concatenation
Pandas has a function, pd.concat(), which has a
syntax similar to np.concatenate but offers a
number of additional options.
pd.concat() can be used for a simple concatenation
of Series or DataFrame objects, just
as np.concatenate() can be used for simple
concatenations of arrays.
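For example, a minimal sketch with two made-up Series (the names ser1 and ser2 are illustrative):

    import pandas as pd

    ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
    ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

    # The two Series are stacked end to end, keeping their indices
    pd.concat([ser1, ser2])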
It also works to concatenate higher-dimensional
objects, such as DataFrames:
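For instance, a sketch with two hypothetical DataFrames that share the same columns:

    import pandas as pd

    df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
    df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

    # Rows of df2 are appended below the rows of df1 (axis=0 is the default)
    pd.concat([df1, df2])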
Duplicate indices
One important difference
between np.concatenate and pd.concat is that
Pandas concatenation preserves indices, even if the
result will have duplicate indices.
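A short sketch illustrating this, with hypothetical inputs x and y that share index values:

    import pandas as pd

    x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
    y = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

    # The repeated index labels 0 and 1 are preserved in the result
    pd.concat([x, y])

If duplicate indices should be treated as an error, passing verify_integrity=True makes pd.concat raise a ValueError instead.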
Ignoring the index
Sometimes the index itself does not matter, and you
would prefer it to simply be ignored. This option can
be specified using the ignore_index flag. With this
set to True, the concatenation discards the original
indices and creates a new integer index for the result:
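Continuing the same kind of example (a sketch with hypothetical inputs x and y):

    import pandas as pd

    x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
    y = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

    # ignore_index=True drops the original labels and builds a fresh 0..n-1 index
    pd.concat([x, y], ignore_index=True)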
Adding MultiIndex keys
Another option is to use the keys option to specify a
label for the data sources; the result will be a
hierarchically indexed Series containing the data:
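A sketch of the same idea using the keys option (the labels 'x' and 'y' are illustrative):

    import pandas as pd

    x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
    y = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

    # Each input is labeled, producing a MultiIndex on the rows of the result
    pd.concat([x, y], keys=['x', 'y'])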
Concatenation with joins
In practice, data from different sources might have
different sets of column names, and pd.concat offers
several options in this case. Consider the
concatenation of the following two DataFrames,
which have some (but not all!) columns in common:
By default, the entries for which no data is available
are filled with NA values.
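A sketch along these lines, with hypothetical DataFrames df5 and df6 that share only columns 'B' and 'C':

    import pandas as pd

    df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']})
    df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']})

    # Column 'D' is missing from df5 and column 'A' from df6,
    # so those entries are filled with NaN in the result
    pd.concat([df5, df6])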
To change this, we can specify one of several options
for the join and join_axes parameters of pd.concat().
By default, the join is a union of the input columns
(join='outer'),
but we can change this to an intersection of the
columns using join='inner':
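Continuing with df5 and df6 from the sketch above:

    # join='inner' keeps only the columns common to both inputs ('B' and 'C')
    pd.concat([df5, df6], join='inner')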
Another option is to directly specify the index of the
remaining columns using the join_axes argument,
which takes a list of index objects.
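Note that join_axes was deprecated and then removed in pandas 1.0; in current pandas the same effect can be obtained by reindexing the result, roughly as in this sketch (again using df5 and df6 from above):

    # Old form (pandas < 1.0):  pd.concat([df5, df6], join_axes=[df5.columns])
    # Current equivalent: keep exactly the columns of df5
    pd.concat([df5, df6]).reindex(columns=df5.columns)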
The append() method
Because direct array concatenation is so
common, Series and DataFrame objects have
an append method that can accomplish the same
thing in fewer keystrokes.
For example, rather than calling pd.concat([df1,
df2]), you can simply call df1.append(df2):
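For example (a sketch with hypothetical df1 and df2; note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the recommended spelling in current code):

    import pandas as pd

    df1 = pd.DataFrame({'A': ['A1', 'A2']})
    df2 = pd.DataFrame({'A': ['A3', 'A4']})

    # On older pandas versions these two lines give the same result
    pd.concat([df1, df2])
    # df1.append(df2)   # works only on pandas < 2.0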
Relational Algebra
The behavior implemented in pd.merge() is a subset of
what is known as relational algebra, which is a formal
set of rules for manipulating relational data, and forms
the conceptual foundation of operations available in
most databases.
Categories of Joins
The pd.merge() function implements a number of types
of joins: the one-to-one, many-to-one, and many-to-
many joins. All three types of joins are accessed via an
identical call to the pd.merge() interface; the type of
join performed depends on the form of the input data.
One-to-one joins
The simplest type of merge expression is the one-to-one join, which is
in many ways very similar to column-wise concatenation.
To combine the information from two DataFrames that
share a key column into a single DataFrame, we can use
the pd.merge() function:
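A minimal sketch along the lines of the handbook's example, with made-up employee data (the column names 'employee', 'group', and 'hire_date' are illustrative):

    import pandas as pd

    df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
    df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                        'hire_date': [2004, 2008, 2012, 2014]})

    # pd.merge recognizes the shared "employee" column and joins on it
    df3 = pd.merge(df1, df2)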
The pd.merge() function recognizes that
each DataFrame has an "employee" column, and
automatically joins using this column as a key.
The result of the merge is a new DataFrame that
combines the information from the two inputs.
Notice that the order of entries in each column is not
necessarily maintained: in this case, the order of the
"employee" column differs between df1 and df2, and
the pd.merge() function correctly accounts for this.
Many-to-one joins
Many-to-one joins are joins in which one of the two
key columns contains duplicate entries.
For the many-to-one case, the
resulting DataFrame will preserve those duplicate
entries as appropriate.
Consider the following example of a many-to-one join:
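A sketch along the lines of the handbook's example, where the 'group' key repeats in the left DataFrame but not in the right one:

    import pandas as pd

    df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                        'hire_date': [2008, 2012, 2004, 2014]})
    df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                        'supervisor': ['Carly', 'Guido', 'Steve']})

    # "Engineering" appears twice in df3, so its supervisor is repeated in the result
    pd.merge(df3, df4)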
The resulting DataFrame has an additional column with
the "supervisor" information, where the information is
repeated in one or more locations as required by the
inputs.
Many-to-many joins
If the key column in both the left and right DataFrame
contains duplicates, then the result is a many-to-many
merge.
This is perhaps most clear with a concrete example.
Consider the following, where we have a DataFrame
listing one or more skills associated with each group:
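A sketch along the lines of the handbook's example, where the 'group' key repeats on both sides:

    import pandas as pd

    df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
    df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                                  'Engineering', 'HR', 'HR'],
                        'skills': ['math', 'spreadsheets', 'software', 'math',
                                   'spreadsheets', 'organization']})

    # Every matching (employee, skill) pair appears, so each employee
    # gets one row per skill listed for their group
    pd.merge(df1, df5)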
Specification of the Merge Key
The on keyword
Most simply, you can explicitly specify the name of
the key column using the on keyword, which takes a
column name or a list of column names:
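For example, making the key explicit in the one-to-one sketch from above (hypothetical df1 and df2 with an 'employee' column):

    import pandas as pd

    df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
    df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                        'hire_date': [2004, 2008, 2012, 2014]})

    # Explicitly name the key column to join on
    pd.merge(df1, df2, on='employee')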
The on option works only if both the left and
right DataFrames have the specified column name.
The left_on and right_on keywords
At times you may wish to merge two datasets with
different column names;
for example, we may have a dataset in which the
employee name is labeled as "name" rather than
"employee".
In this case, we can use
the left_on and right_on keywords to specify the two
column names:
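A sketch along the lines of the handbook's example, where the right-hand DataFrame labels the employee name as "name":

    import pandas as pd

    df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
    df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'salary': [70000, 80000, 120000, 90000]})

    # "employee" on the left is matched against "name" on the right
    merged = pd.merge(df1, df3, left_on='employee', right_on='name')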
The result has a redundant column that we can drop
if desired; for example, by using the drop() method
of DataFrames:
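Continuing the sketch above:

    # "name" duplicates the information in "employee", so drop it
    merged.drop('name', axis=1)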
The left_index and right_index keywords
Sometimes, rather than merging on a column, you
would instead like to merge on an index. For
example, your data might look like this:
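A sketch where the employee name has been moved into the index of each DataFrame (continuing with the hypothetical df1 and df2 from above):

    import pandas as pd

    df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
    df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                        'hire_date': [2004, 2008, 2012, 2014]})

    df1a = df1.set_index('employee')
    df2a = df2.set_index('employee')

    # Merge on the row indices rather than on a column
    pd.merge(df1a, df2a, left_index=True, right_index=True)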
Specifying Set Arithmetic for Joins
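Consider a sketch along the lines of the handbook's example, with hypothetical DataFrames df6 and df7 that both carry a "name" column:

    import pandas as pd

    df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                        'food': ['fish', 'beans', 'bread']})
    df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                        'drink': ['wine', 'beer']})

    # Only "Mary" appears in both inputs
    pd.merge(df6, df7)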
Here we have merged two datasets that have only a
single "name" entry in common: Mary.
By default, the result contains the intersection of the
two sets of inputs; this is what is known as an inner
join.
We can specify this explicitly using the how keyword,
which defaults to "inner":
Other options for the how keyword are 'outer', 'left',
and 'right'. An outer join returns a join over the
union of the input columns, and fills in all missing
values with NAs:
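Again with df6 and df7 from above:

    # Union of the names; rows with no match get NaN in the missing column
    pd.merge(df6, df7, how='outer')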
The left join and right join return joins over the left
entries and right entries, respectively. For example:
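With df6 and df7 from above:

    # Keep every row of the left input (Peter, Paul, Mary)
    pd.merge(df6, df7, how='left')

    # Keep every row of the right input (Mary, Joseph)
    pd.merge(df6, df7, how='right')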