Module 2 Pandas 1

The document provides an overview of data manipulation using Pandas, focusing on its core data structures: Series, DataFrame, and Index. It discusses handling missing data, including methods for detecting and managing null values, and highlights the differences between using None and NaN for missing data representation. The content is structured into modules that cover various aspects of Pandas functionality, including data construction and manipulation techniques.

Uploaded by

Sonia N.S
Exploratory Data Analysis- BDS613B

Prepared By,
Dr. Anitha DB
Associate Professor & Head
Department of CSE-Data Science
ATME College of Engineering, Mysuru

ATME College of 1
Department of CSE-DS, ATMECE
Engineering, Mysuru
Module 2: Data Manipulation with Pandas

• Introducing Pandas Objects
• Handling Missing Data
• Hierarchical Indexing
• Pivot Tables


Introducing Pandas Objects


• At the very basic level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
rather than simple integer indices.
• Pandas provides a host of useful tools, methods, and functionality on top of the basic
data structures, but nearly everything that follows will require an understanding of
what these structures are.
• Three fundamental Pandas data structures: The Series, DataFrame, and Index.
• The standard NumPy and Pandas imports:
import numpy as np
import pandas as pd


The Pandas Series Object


A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

The Series wraps both a sequence of values and a sequence of indices, which we can access with the values
and index attributes.
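A minimal sketch of creating a Series from a list and reading its two parts; the sample values are illustrative assumptions:

```python
import pandas as pd

# Illustrative values; any list or NumPy array works here.
data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values  # the underlying NumPy array
index = data.index    # the default integer index 0, 1, 2, 3
```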


Data can be accessed by the associated index via square-bracket notation:
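As a small sketch (same illustrative values as an assumption), square brackets accept either a single index value or a slice:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

second = data[1]    # single element at index 1
middle = data[1:3]  # a shorter Series starting at index 1
```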

The Pandas Series is much more general and flexible than the one-dimensional NumPy array

Series as generalized NumPy array
The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer
index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not
be an integer, but can consist of values of any desired type; if we wish, we can use strings as an
index:
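A minimal sketch of a string-labelled Series; the labels and values are assumptions chosen for illustration:

```python
import pandas as pd

# The index is given explicitly as strings instead of default integers.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

item = data['b']  # item access by string label
```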


Series as specialized dictionary


• A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that
maps typed keys to a set of typed values. The type information of a Pandas Series makes it much more
efficient than Python dictionaries for certain operations.
• When a Series is constructed from a Python dictionary, the index is
drawn from the dictionary keys (in insertion order in current Pandas versions; older versions sorted the keys). From here, typical
dictionary-style item access can be performed:

• The Series also supports array-style operations such as slicing:
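The dictionary view can be sketched as follows; the state names and population figures (in millions) are illustrative assumptions:

```python
import pandas as pd

population_dict = {'California': 38.3, 'Texas': 26.4,
                   'New York': 19.7, 'Florida': 19.6}
population = pd.Series(population_dict)

ca = population['California']                   # dictionary-style access
subset = population['California':'New York']    # label slice, end label included
```

Unlike a plain dictionary, the Series accepts a slice of labels, and the slice includes its end label.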


Constructing Series objects

The general form of the constructor is pd.Series(data, index=index), where index is an optional argument, and data can be one of many entities. For example, data can be a list or
NumPy array, in which case index defaults to an integer sequence.
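A short sketch of a few data/index combinations accepted by the constructor; all values are assumptions for illustration:

```python
import pandas as pd

a = pd.Series([2, 4, 6])                         # list: index defaults to 0, 1, 2
b = pd.Series(5, index=[100, 200, 300])          # scalar: repeated to fill the index
c = pd.Series({2: 'a', 1: 'b', 3: 'c'})          # dict: index taken from the keys
d = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])  # keep only the listed keys
```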


The Pandas DataFrame Object:


If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-
dimensional array with both flexible row indices and flexible column names.


The Pandas DataFrame Object: DataFrame as a generalized NumPy array


Additionally, the DataFrame has a columns attribute, which is an Index object holding the column
labels:

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both
the rows and columns have a generalized index for accessing the data.
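A minimal sketch of a DataFrame built from two aligned Series, showing its row index and column labels; the states and figures are illustrative assumptions:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
population = pd.Series({'California': 38.3, 'Texas': 26.4, 'New York': 19.7})

# Each Series becomes one column; rows are aligned on the shared index.
states = pd.DataFrame({'population': population, 'area': area})

rows = states.index    # row labels
cols = states.columns  # Index object holding the column labels
```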


The Pandas DataFrame Object: DataFrame as a specialized dictionary

We can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value,
a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' attribute
returns the Series object containing the areas

In a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return
the first column.
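The contrast between the two kinds of indexing can be sketched as follows, with assumed sample data:

```python
import numpy as np
import pandas as pd

states = pd.DataFrame({'area': [423967, 695662], 'population': [38.3, 26.4]},
                      index=['California', 'Texas'])
area_col = states['area']   # DataFrame: the key selects a column (a Series)

arr = np.array([[1, 2], [3, 4]])
first_row = arr[0]          # NumPy array: the key selects a row
```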


Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways.

1. From a single Series object: A DataFrame is a collection of Series objects, and a single-column DataFrame
can be constructed from a single Series:
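One way to sketch this is with Series.to_frame, which is only one of several possible routes; the data are assumed for illustration:

```python
import pandas as pd

population = pd.Series({'California': 38.3, 'Texas': 26.4})

# Turn the Series into a DataFrame with a single named column.
df = population.to_frame(name='population')
```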



2. From a list of dicts: Any list of dictionaries can be made into a DataFrame. We’ll use a simple list
comprehension to create some data:

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., “not a number”) values.
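A minimal sketch of both cases, with assumed keys and values:

```python
import pandas as pd

# Every dict has the same keys: one row per dict, one column per key.
data = [{'a': i, 'b': 2 * i} for i in range(3)]
df = pd.DataFrame(data)

# Keys missing from a dict show up as NaN in that row.
partial = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
```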



3. From a dictionary of Series objects: A DataFrame can be constructed from a dictionary of Series objects as
well:

4. From a two-dimensional NumPy array: Given a two-dimensional array of data, we can create a DataFrame with
any specified column and index names. If omitted, an integer index will be used for each:
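Constructions 3 and 4 can be sketched together; the names and numbers below are assumptions:

```python
import numpy as np
import pandas as pd

# 3. Dictionary of Series: keys become column names.
area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'California': 38.3, 'Texas': 26.4})
from_series = pd.DataFrame({'population': population, 'area': area})

# 4. Two-dimensional array with explicit column and index names.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=['foo', 'bar'], index=['a', 'b', 'c'])
```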

5. From a NumPy structured array: A Pandas DataFrame operates much like a structured array, and can be created
directly from one:
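A small sketch, assuming a structured array with two fields named A and B:

```python
import numpy as np
import pandas as pd

# Structured array: each field has its own name and dtype.
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(A)  # field names become column names
```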

The Pandas Index Object
The Index object can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index
objects may contain repeated values).
As a simple example, let’s construct an Index from a list of integers:

Index as immutable array


The Index object in many ways operates like an array. For example, we can use standard Python indexing
notation to retrieve values or slices:
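A minimal sketch of constructing an Index and reading from it like an array; the integers are illustrative:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

second = ind[1]         # single value
every_other = ind[::2]  # slice returns another Index
size = ind.size         # array-like attributes are available too
```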


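One difference between Index objects and NumPy arrays is that indices are immutable: they cannot be modified via
the normal means. This immutability makes it safer to share indices between multiple DataFrames and arrays, without the
potential for side effects from inadvertent index modification.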
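The immutability can be checked with a short sketch; the `modified` flag is a hypothetical helper name used only here:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0          # item assignment is not supported on an Index
    modified = True
except TypeError:
    modified = False    # the Index is left unchanged
```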
Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects
of set arithmetic.
The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions,
intersections, differences, and other combinations can be computed in a familiar way. In current Pandas versions these
are available as the union(), intersection(), difference(), and symmetric_difference() methods:
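A minimal sketch of set-style combinations using the Index methods (the method form is chosen here on the assumption that it behaves the same across Pandas versions); the integers are illustrative:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

both = indA.intersection(indB)              # values present in both
either = indA.union(indB)                   # values present in at least one
only_a = indA.difference(indB)              # values only in indA
only_one = indA.symmetric_difference(indB)  # values in exactly one
```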

Handling Missing Data
The difference between data found in many tutorials and data in the real world is that real-world data is rarely
clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make
matters even more complicated, different data sources may indicate missing data in different ways. In general,
we refer to missing data as null, NaN, or NA values.

Trade-Offs in Missing Data Conventions


A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame.
Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or
choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve
appropriation of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing
integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a
missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point
specification.


None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional
Boolean array, which adds overhead in both storage and computation. A sentinel value reduces the range of
valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU
arithmetic. Common special values like NaN are not available for all data types. As in most cases where no
universally optimal choice exists, different languages and systems use different conventions. For example, the
R language uses reserved bit patterns within each data type as sentinel values indicating missing data, while the
SciDB system uses an extra byte attached to every cell to indicate an NA state.

Missing Data in Pandas

Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null
values: the special floating point NaN value, and the Python None object.
None: Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in
Python code. Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only
in arrays with data type 'object' (i.e., arrays of Python objects):

This dtype=object means that the best common type representation NumPy could infer for the contents of the
array is that they are Python objects.
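A minimal sketch of an array containing None; the values are illustrative:

```python
import numpy as np

# Mixing integers with None forces NumPy to fall back to Python objects.
vals1 = np.array([1, None, 3, 4])

inferred = vals1.dtype  # dtype('O'), i.e. object
```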


The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an
array with a None value, you will generally get an error:

This reflects the fact that addition between an integer and None is undefined.
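The failure can be reproduced with a short sketch; `total` and `message` are hypothetical helper names:

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])

try:
    total = vals1.sum()   # ends up evaluating int + None
except TypeError as err:
    total = None
    message = str(err)    # mentions the unsupported 'NoneType' operand
```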
NaN: Missing numerical data
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-
point value recognized by all systems that use the standard IEEE floating-point representation:

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from
before, this array supports fast operations pushed into compiled code. You should be aware that NaN is a bit like
a data virus—it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN
will be another NaN:
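A minimal sketch of NaN propagation; np.nansum is added here as a NumPy aggregation that skips NaN values:

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])  # stored as a native float64 array

added = 1 + np.nan              # NaN
total = vals2.sum()             # NaN: one missing value spoils the aggregate
safe_total = np.nansum(vals2)   # ignores the NaN entry
```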

NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two of them nearly
interchangeably, converting between them where appropriate:

For types that don’t have an available sentinel value, Pandas automatically type-casts when NA values are
present.
For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point
type to accommodate the NA:
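A sketch of both behaviours. The NA is introduced here through reindex rather than item assignment; this is an assumption made because recent Pandas releases may warn about assignments that change a Series dtype:

```python
import numpy as np
import pandas as pd

# None and NaN are both stored as NaN in a float Series.
mixed = pd.Series([1, np.nan, 2, None])

# An integer Series is upcast to float once a missing value appears.
ints = pd.Series([0, 1], index=['a', 'b'])
with_na = ints.reindex(['a', 'b', 'c'])  # 'c' has no value, so it becomes NaN
```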


Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to
a NaN value. Table 3-2 lists the upcasting conventions in Pandas when NA values are introduced.

Operating on Null Values
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this
convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data
structures:
• isnull(): generate a Boolean mask indicating missing values
• notnull(): opposite of isnull()
• dropna(): return a filtered version of the data
• fillna(): return a copy of the data with missing values filled or imputed

Detecting null values


Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
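A minimal sketch of the two detection methods on an assumed sample Series:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

mask = data.isnull()             # True where a value is missing
present = data[data.notnull()]   # Boolean mask used directly as an index
```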


Dropping null values


In addition to the masking used before, there are the convenience methods, dropna() (which removes NA values)
and fillna() (which fills in NA values). For a Series, the result is straightforward:
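For a Series, dropping nulls can be sketched as follows (sample values assumed):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

cleaned = data.dropna()  # missing entries removed, original labels kept
```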

For a DataFrame, there are more options. Consider the following DataFrame:

We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on
the application, you might want one or the other, so dropna() gives a number of options for a DataFrame. By
default, dropna() will drop all rows in which any null value is present:
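A minimal sketch of the default behaviour on an assumed 3×3 DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Default: drop every row that contains at least one null value.
rows_kept = df.dropna()
```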

Alternatively, you can drop NA values along a different axis; axis=1 drops all columns containing a null value:

But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA
values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine
control of the number of nulls to allow through.
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value
will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values.
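The axis and how options can be sketched on an assumed DataFrame; column 3 is added only to have an all-null column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

cols_kept = df.dropna(axis=1)  # drop columns containing any null

df[3] = np.nan                 # a column made up entirely of nulls
all_null_dropped = df.dropna(axis=1, how='all')  # drop only all-null columns
```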

For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the
row/column to be kept:

Here the first and last rows have been dropped, because they contain only two non-null values.
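A sketch of thresh on an assumed DataFrame in which only the middle row has three non-null values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])

# Keep only rows that have at least 3 non-null values.
kept = df.dropna(axis=0, thresh=3)
```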

Filling null values
Sometimes, rather than dropping NA values, you would rather replace them with a valid value. Pandas provides the
fillna() method, which returns a copy of the data with the null values replaced.

Consider the following Series:

We can fill NA entries with a single value, such as zero:
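A minimal sketch on an assumed Series; ffill() and bfill() are included as related fill strategies that propagate the previous or next valid value:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

zero_filled = data.fillna(0)  # replace every null with a single value
forward = data.ffill()        # propagate the previous valid value forward
backward = data.bfill()       # propagate the next valid value backward
```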

For DataFrames, the options are similar, but we can also specify an axis along which the fills take place:
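A sketch of filling along an axis, using forward-fill across columns on an assumed DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])

# axis=1 fills along each row, taking the value from the column to the left.
filled_across = df.ffill(axis=1)
```

If no previous value is available during a forward fill, the NA value remains, as in the first column of the last row.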

Hierarchical Indexing
So far we have discussed one-dimensional and two-dimensional data, stored in Pandas Series and
DataFrame objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or
two keys. Older Pandas versions provided Panel and Panel4D objects that natively handled three-dimensional and
four-dimensional data, but these have since been removed from the library.
A far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to
incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series
and two-dimensional DataFrame objects.

A Multiply Indexed Series
Let’s start by considering how we might represent two-dimensional data within a one-dimensional Series. For
concreteness, consider a series of data where each point has a character and a numerical key.
The bad way
Suppose you would like to track data about states from two different years. Using the Pandas tools we’ve already
covered, you might be tempted to simply use Python tuples as keys:


With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

But the convenience ends there. For example, if you need to select all values from 2010, you’ll need to do
some messy (and potentially slow) munging to make it happen:

This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax
we’ve grown to love in Pandas.
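The tuple-keyed approach can be sketched end to end. Two assumptions are made: the populations are rounded illustrative figures in millions, and tupleize_cols=False is passed so that the keys stay plain tuples even in Pandas versions that would otherwise convert a list of tuples into a MultiIndex. The Boolean mask is one possible way to do the munging:

```python
import numpy as np
import pandas as pd

keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
populations = [33.9, 37.3, 19.0, 19.4, 20.9, 25.1]

# Keep the keys as plain Python tuples in an ordinary Index.
pop = pd.Series(populations, index=pd.Index(keys, tupleize_cols=False))

one_value = pop[('California', 2010)]  # indexing by a full tuple

# Selecting all 2010 values means inspecting every key by hand.
mask = np.array([year == 2010 for _state, year in pop.index])
pop_2010 = pop[mask]
```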
The better way: Pandas MultiIndex
Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and
the Pandas MultiIndex type gives us the type of operations we wish to have. We can create a multi-index from
the tuples as follows:

Notice that the MultiIndex contains multiple levels of indexing (in this case, the state names and the years), as
well as multiple codes (called labels in older Pandas versions) for each data point, which encode these levels.

If we reindex our series with this MultiIndex, we see the hierarchical representation of the data:
Here the first two columns of the Series representation show the multiple index values, while the third column
shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any
blank entry indicates the same value as the line above it.
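A minimal sketch of building the MultiIndex from tuples. For simplicity it constructs the Series directly on the MultiIndex rather than reindexing an existing one, and the populations are rounded illustrative figures in millions:

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(tuples)

pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

states = index.levels[0]  # unique values of the first level
years = index.levels[1]   # unique values of the second level
```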
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
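A sketch of selecting on the second level, written with the loc indexer; the data are rounded illustrative figures in millions:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

# Empty slice on the first level, a fixed value on the second level.
pop_2010 = pop.loc[:, 2010]
```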

The result is a singly indexed Series with just the keys we’re interested in.
This syntax is much more convenient (and the operation is much more efficient) than the homespun tuple-based
multi-indexing solution that we started with.

MultiIndex as extra dimension
The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

Naturally, the stack() method provides the opposite operation:
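The round trip can be sketched on the same kind of illustrative population Series (values assumed, in millions):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

pop_df = pop.unstack()       # inner level (year) becomes the columns
round_trip = pop_df.stack()  # columns go back into the index
```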

We can also use multi-indexing to represent data of three or more dimensions in a Series or DataFrame. Each
extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us
much more flexibility in the types of data we can represent.
Concretely, we might want to add another column of demographic data for each state at each year (say,
population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:


Here we compute the fraction of people under 18 by year, given the above data:
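Adding the extra column and computing the fraction can be sketched as follows; both the totals and the under-18 figures are illustrative assumptions in millions:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

# A second measurement per (state, year) is just another column.
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9.3, 9.3, 4.7, 4.3, 5.9, 6.9]})

f_u18 = pop_df['under18'] / pop_df['total']  # element-wise on the MultiIndex
by_year = f_u18.unstack()                    # states as rows, years as columns
```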

This allows us to easily and quickly manipulate and explore even high-dimensional data.

Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of
two or more index arrays to the constructor. For example
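A minimal sketch with assumed labels and random data:

```python
import numpy as np
import pandas as pd

# Two parallel arrays: one per index level.
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
```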

The work of creating the MultiIndex is done in the background. Similarly, if you pass a dictionary with
appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:
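A sketch of the dictionary route, with rounded illustrative populations in millions:

```python
import pandas as pd

data = {('California', 2000): 33.9, ('California', 2010): 37.3,
        ('New York', 2000): 19.0, ('New York', 2010): 19.4,
        ('Texas', 2000): 20.9, ('Texas', 2010): 25.1}

pop = pd.Series(data)  # tuple keys are turned into a two-level MultiIndex
```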

Sometimes it is useful to explicitly create a MultiIndex


Explicit MultiIndex constructors
For more flexibility in how the index is constructed, we can use the class-method constructors available in
pd.MultiIndex. For example, we can construct the MultiIndex from a simple list of arrays, giving the index values
within each level:

We can construct it from a list of tuples, giving the multiple index values of each point:

We can even construct it from a Cartesian product of single indices:
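The three constructors can be sketched side by side; the labels are assumptions, and all three calls describe the same index:

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
```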

Similarly, we can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists
containing available index values for each level) and codes (a list of lists of integer positions that reference
these levels; this argument was named labels in older Pandas versions):
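A minimal sketch of the internal encoding, assuming a Pandas version that accepts the codes keyword:

```python
import pandas as pd

explicit = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                         codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
# Each code is a position into the matching level:
# (0, 0) -> ('a', 1), (0, 1) -> ('a', 2), (1, 0) -> ('b', 1), (1, 1) -> ('b', 2)
```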

We can pass any of these objects as the index argument when creating a Series or DataFrame, or to the reindex
method of an existing Series or DataFrame.

MultiIndex level names
Sometimes it is convenient to name the levels of the MultiIndex. We can accomplish this by passing the names
argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the
fact:
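Both ways of naming levels can be sketched as follows; the level names and data are illustrative assumptions:

```python
import pandas as pd

# Option 1: pass names to the constructor.
named = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])

# Option 2: set the names attribute after the fact.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)
pop.index.names = ['state', 'year']
```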

With more involved datasets, this can be a useful way to keep track of the meaning of various index values.

MultiIndex for columns
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels
of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some
(somewhat realistic) medical data:

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Jhon', 'Peter', 'Thomas'], ['HR', 'Temp']], names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Here we see where the multi-indexing for both rows and columns can come in very handy. This is
fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year,
and the visit number. With this in place we can, for example, index the top-level column by the person’s name
and get a full DataFrame containing just that person’s information:
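A sketch of selecting one subject. The mock data are generated more simply than in the listing above, which is an assumption made to keep the example short:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Jhon', 'Peter', 'Thomas'], ['HR', 'Temp']], names=['subject', 'type'])
health_data = pd.DataFrame(np.round(37 + np.random.randn(4, 6), 1),
                           index=index, columns=columns)

# Indexing the top column level returns that subject's sub-table.
jhon = health_data['Jhon']
```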

Indexing and Slicing a MultiIndex
Multiply indexed Series
Consider the multiply indexed Series of state populations

We can access single elements by indexing with multiple terms:

The MultiIndex also supports partial indexing, or indexing just one of the levels in the index. The result is another
Series, with the lower-level indices maintained:

With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the first index:

Partial slicing is also possible.
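Single-element access, partial indexing, and partial slicing can be sketched on an illustrative population Series (rounded figures in millions, assumed):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

single = pop['California', 2000]           # one term per level
california = pop['California']             # partial indexing: Series indexed by year
year_2000 = pop.loc[:, 2000]               # empty slice on the first level
first_two = pop.loc['California':'New York']  # partial slice on a sorted index
```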

Other types of indexing and selection are also possible; for example, selection based on Boolean masks:

Selection based on fancy indexing also works:
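Both selections can be sketched as follows; the threshold of 22 (million) and the data are illustrative assumptions:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

big = pop[pop > 22]                          # Boolean mask
picked = pop.loc[['California', 'Texas']]    # fancy indexing on the first level
```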

Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame
health_data.

Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the
columns. For example, we can recover Jhon’s heart rate data with a simple operation:
Also, as with the single-index case, we can use the loc and iloc indexers (the older ix indexer has been removed
from current Pandas versions). The loc indexer is label-based: we can select rows and columns by their labels
(the actual names of the rows and columns).
The iloc indexer is integer-based: it allows you to select rows and columns by their integer positions (the
numerical index of the rows and columns).
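A sketch of column selection and positional selection, with mock data generated in a simplified way (an assumption for brevity):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Jhon', 'Peter', 'Thomas'], ['HR', 'Temp']], names=['subject', 'type'])
health_data = pd.DataFrame(np.round(37 + np.random.randn(4, 6), 1),
                           index=index, columns=columns)

jhon_hr = health_data['Jhon', 'HR']     # one term per column level -> Series
top_left = health_data.iloc[:2, :2]     # first two rows and columns by position
```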
These indexers provide an array-like view of the underlying two-dimensional data, but each individual
index in loc or iloc can be passed a tuple of multiple indices. For example:
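A minimal sketch of passing tuples to loc, again with simplified mock data as an assumption:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Jhon', 'Peter', 'Thomas'], ['HR', 'Temp']], names=['subject', 'type'])
health_data = pd.DataFrame(np.round(37 + np.random.randn(4, 6), 1),
                           index=index, columns=columns)

hr_col = health_data.loc[:, ('Jhon', 'HR')]          # all rows, one column
single = health_data.loc[(2013, 1), ('Jhon', 'HR')]  # one row, one column
```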

Working with slices within these index tuples is not especially convenient; trying to create a slice within a
tuple will lead to a syntax error:

You could get around this by building the desired slice explicitly using Python’s built-in slice() function, but
a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.
For example:
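A sketch of IndexSlice selecting the first visit of each year and the heart-rate column of each subject; the mock data are simplified assumptions:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Jhon', 'Peter', 'Thomas'], ['HR', 'Temp']], names=['subject', 'type'])
health_data = pd.DataFrame(np.round(37 + np.random.randn(4, 6), 1),
                           index=index, columns=columns)

idx = pd.IndexSlice
# Writing (:, 1) directly inside loc[...] would be a syntax error;
# idx[:, 1] builds the equivalent tuple of slices.
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
```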
Rearranging Multi-Indices
There are many ways to control the rearrangement of data between hierarchical indices and columns.
• Sorted and unsorted indices
• Stacking and unstacking indices
• Index setting and resetting
Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted. Let’s take a look at this here.
We’ll start by creating some simple multiply indexed data where the indices are not lexicographically sorted.
If we try to take a partial slice of this index, it will result in an error.

Although it is not entirely clear from the error message, this is the result of the MultiIndex not
being sorted. For various reasons, partial slices and other similar operations require the levels in the
MultiIndex to be in sorted (i.e., lexicographical) order.
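The failure can be sketched as follows; the labels are assumptions, and `slice_failed` is a hypothetical flag. The exact exception text may differ between Pandas versions, so only the exception type is checked:

```python
import numpy as np
import pandas as pd

# First level is deliberately out of order: 'a', 'c', 'b'.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)

try:
    data['a':'b']          # partial slice on an unsorted MultiIndex
    slice_failed = False
except KeyError:           # raised as an unsorted-index error, a KeyError subclass
    slice_failed = True
```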

Pandas provides convenience routines to perform this type of sorting, such as the sort_index() method of the
DataFrame (the sortlevel() method found in older Pandas versions has since been removed). We’ll use sort_index() here:
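A minimal sketch of sorting and then slicing, with assumed labels and random data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)

data = data.sort_index()      # levels are now in lexicographical order
sliced = data.loc['a':'b']    # partial slicing works on the sorted index
```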

Stacking and unstacking indices
It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation,
optionally specifying the level to use:

The opposite of unstack() is stack(), which here can be used to recover the original series:
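The level argument can be sketched as follows, on illustrative population data (rounded figures in millions, assumed):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

states_as_columns = pop.unstack(level=0)  # rows: year, columns: state
years_as_columns = pop.unstack(level=1)   # rows: state, columns: year

recovered = years_as_columns.stack()      # back to the multiply indexed Series
```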

Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with
the reset_index method. Calling this on the population Series will result in a DataFrame with a state and year
column holding the information that was formerly in the index. For clarity, we can optionally specify the name of
the data for the column representation:
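A minimal sketch, assuming the index levels are named state and year and using illustrative figures:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33.9, 37.3, 19.0, 19.4, 20.9, 25.1], index=index)

# Index levels become ordinary columns; name labels the data column.
pop_flat = pop.reset_index(name='population')
```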

Often when you are working with data in the real world, the raw input data arrives in this flat, column-based
form, and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method
of the DataFrame, which returns a multiply indexed DataFrame:
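The reverse direction can be sketched as follows. The flat table `pop_flat` and its column names are hypothetical examples of raw input data.

```python
import pandas as pd

# Hypothetical flat input, as it might arrive from a CSV file.
pop_flat = pd.DataFrame({
    "state": ["California", "California", "New York",
              "New York", "Texas", "Texas"],
    "year": [2000, 2010, 2000, 2010, 2000, 2010],
    "population": [33871648, 37253956, 18976457,
                   19378102, 20851820, 25145561],
})

# Promote the state and year columns to a two-level index.
pop_indexed = pop_flat.set_index(["state", "year"])
print(pop_indexed)

# Label-based lookup now takes a (state, year) tuple.
print(pop_indexed.loc[("Texas", 2010), "population"])
```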

Data Aggregations on Multi-Indices
Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data,
these can be passed a level parameter that controls which subset of the data the aggregate is computed on.
(In pandas 2.0 and later the level parameter has been removed from these methods; groupby(level=...) followed by
the aggregate gives the same result.)
For example, let’s return to our health data:
Perhaps we’d like to average out the measurements in the two visits each year. We can do this by naming the index
level we’d like to explore, in this case the year:
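The health data itself is not reproduced in this text, so the sketch below generates a hypothetical table with the same shape (years and visits on the rows, subjects and measurement types on the columns). It uses `groupby(level=...)`, which works across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical health data: two visits per year, three subjects, two measurements.
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
values = rng.normal(size=(4, 6)).round(1)
values[:, ::2] *= 10   # make the HR columns vary more than the Temp columns
values += 37
health_data = pd.DataFrame(values, index=index, columns=columns)

# Average the two visits within each year.
data_mean = health_data.groupby(level="year").mean()
print(data_mean)
```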

By further making use of the axis keyword, we can take the mean among levels on the columns as well:
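A hedged sketch of the column-wise aggregate follows. Because grouping along axis=1 is deprecated in recent pandas releases, this version transposes, groups on the column level, and transposes back, which is one equivalent way to get the same table. The data is a hypothetical stand-in with deterministic values.

```python
import numpy as np
import pandas as pd

# Hypothetical health-style table with deterministic values for easy checking.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
health_data = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                           index=index, columns=columns)

# Step 1: mean over visits within each year (row level).
data_mean = health_data.groupby(level="year").mean()

# Step 2: mean over subjects for each measurement type (column level).
by_type = data_mean.T.groupby(level="type").mean().T
print(by_type)
```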

Pivot Tables
A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that
provides a multidimensional summarization of the data.
Pivot tables allow you to perform common aggregate statistical calculations such as sums, counts, averages, and so
on.
Motivating Pivot Tables

This dataset contains a wealth of information on each passenger of that ill-fated voyage, including gender, age,
class, fare paid, and much more.

Pivot Tables by Hand
To start learning more about this data, we might begin by grouping it according to gender, age, or some
combination thereof.
Example
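The slide's example is not reproduced in this text. As a minimal sketch, the code below uses a small hand-made table, `passengers`, as a hypothetical stand-in for the passenger data; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the passenger data (not the real dataset).
passengers = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "class": ["First", "First", "Second", "Third",
              "First", "Second", "Third", "Third"],
    "age": [30.0, 16.0, 28.0, 20.0, 50.0, 30.0, 12.0, 26.0],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Group by a single key and aggregate one column.
mean_age_by_sex = passengers.groupby("sex")["age"].mean()
print(mean_age_by_sex)
```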


We group by class and gender, select age, apply a mean aggregate, combine the resulting groups, and then
unstack the hierarchical index to reveal the hidden multidimensionality.
Example:
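The steps named above can be sketched as a single chained expression. The table `passengers` is a hypothetical stand-in for the passenger data, with assumed column names.

```python
import pandas as pd

# Hypothetical stand-in for the passenger data (not the real dataset).
passengers = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "class": ["First", "First", "Second", "Third",
              "First", "Second", "Third", "Third"],
    "age": [30.0, 16.0, 28.0, 20.0, 50.0, 30.0, 12.0, 26.0],
})

# group by two keys -> select age -> mean -> unstack the inner level
by_hand = passengers.groupby(["sex", "class"])["age"].mean().unstack()
print(by_hand)
```

The result has one row per gender and one column per class, which is the two-dimensional layout a pivot table produces.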

Pivot Table Syntax
Here is the equivalent of the preceding operation using the pivot_table method of DataFrames:

This is eminently more readable than the GroupBy approach, and produces the same result.
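To make the comparison concrete, here is a minimal sketch that computes the same table both ways on a hypothetical stand-in table `passengers` (column names and values are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the passenger data (not the real dataset).
passengers = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "class": ["First", "First", "Second", "Third",
              "First", "Second", "Third", "Third"],
    "age": [30.0, 16.0, 28.0, 20.0, 50.0, 30.0, 12.0, 26.0],
})

# GroupBy version.
by_hand = passengers.groupby(["sex", "class"])["age"].mean().unstack()

# pivot_table version: values, then index=, columns=; the mean is the default.
pivot = passengers.pivot_table("age", index="sex", columns="class")
print(pivot)
```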

Multilevel pivot tables
Just as in the GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a
number of options. For example, we might be interested in looking at age as a third dimension.
We’ll bin the age using the pd.cut function:

We can apply this same strategy when working with the columns as well; let’s add info on the fare paid using
pd.qcut to automatically compute quantiles:
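Both ideas can be sketched together on a hypothetical stand-in table. The column names, the bin edges, and the choice of `survived` as the aggregated value are assumptions made for illustration; `observed=True` is passed explicitly so that only bin combinations present in the data appear.

```python
import pandas as pd

# Hypothetical stand-in for the passenger data (not the real dataset).
passengers = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "class": ["First", "First", "Second", "Third",
              "First", "Second", "Third", "Third"],
    "age": [30.0, 16.0, 28.0, 20.0, 50.0, 30.0, 12.0, 26.0],
    "fare": [80.0, 90.0, 20.0, 8.0, 100.0, 15.0, 7.0, 9.0],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Bin age at fixed edges; split fare into two equal-sized quantile bins.
age_bin = pd.cut(passengers["age"], [0, 18, 80])
fare_bin = pd.qcut(passengers["fare"], 2)

# Multiple levels on both the rows and the columns.
table = passengers.pivot_table(
    "survived",
    index=["sex", age_bin],
    columns=[fare_bin, "class"],
    observed=True,
)
print(table)
```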

Additional pivot table options
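The call signature of the pivot_table method of DataFrames is as follows (recent pandas releases add further
keywords, such as observed and sort; the leading data argument belongs only to the function form pd.pivot_table):

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None,
margins=False, dropna=True, margins_name='All')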
The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. As in the GroupBy,
the aggregation specification can be a string representing one of several common choices ('sum', 'mean', 'count',
'min', 'max', etc.) or a function that implements an aggregation (np.sum, min, sum, etc.), passed without calling
it. Additionally, it can be specified as a dictionary mapping a column to any of the above desired options:
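A minimal sketch of the dictionary form, on a hypothetical stand-in table with assumed column names. When aggfunc is a dictionary, its keys determine which columns are aggregated, so the values argument can be omitted.

```python
import pandas as pd

# Hypothetical stand-in for the passenger data (not the real dataset).
passengers = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "class": ["First", "First", "Second", "Third",
              "First", "Second", "Third", "Third"],
    "fare": [80.0, 90.0, 20.0, 8.0, 100.0, 15.0, 7.0, 9.0],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Count survivors but average the fare, in one call.
table = passengers.pivot_table(
    index="sex",
    columns="class",
    aggfunc={"survived": "sum", "fare": "mean"},
)
print(table)
```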

At times it’s useful to compute totals along each grouping. This can be done via the margins keyword:

Here this automatically gives us information about the class-agnostic survival rate by gender, the
gender-agnostic survival rate by class, and the overall survival rate of 36%. The margin label can be specified
with the margins_name keyword, which defaults to "All".
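The margins option can be sketched on a hypothetical stand-in table. Its values are made up, so the rates below differ from those on the slide.

```python
import pandas as pd

# Hypothetical stand-in for the passenger data (not the real dataset).
passengers = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "class": ["First", "First", "Second", "Third",
              "First", "Second", "Third", "Third"],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# margins=True appends an "All" row and column with the overall aggregates.
table = passengers.pivot_table("survived", index="sex", columns="class",
                               margins=True)
print(table)

# margins_name changes the label of the totals row and column.
renamed = passengers.pivot_table("survived", index="sex", columns="class",
                                 margins=True, margins_name="Total")
print(renamed)
```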

Example: Birthrate Data

This example uses data on births in the United States, provided by the Centers for Disease Control (CDC).

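%matplotlib inline
# The line above is a Jupyter magic; omit it in a plain Python script
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

births = pd.read_csv('births.csv')
births.head()
# Add a decade column, then total the births by decade and gender
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
sns.set() # use Seaborn styles
# Plot the total number of births per year for each gender
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');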

Further data exploration
There are a few more interesting features we can pull out of this dataset using the Pandas tools. We must start
by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g.,
June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust
sigma-clipping operation:
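One way to sketch sigma clipping is shown below, on a small hypothetical table rather than the real births file. The 5-sigma cutoff is an assumption; the factor 0.74 converts the interquartile range into an estimate of the standard deviation for roughly Gaussian data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts; the last two rows mimic corrupted entries.
births = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6, 31, 99],
    "births": [4000, 4100, 3900, 4050, 3950, 4020, 10, 90000],
})

# Robust centre and spread: median and scaled interquartile range.
q25, q50, q75 = np.percentile(births["births"], [25, 50, 75])
mu = q50
sig = 0.74 * (q75 - q25)

# Keep only rows within 5 robust sigmas of the median.
keep = (births["births"] > mu - 5 * sig) & (births["births"] < mu + 5 * sig)
births_clean = births[keep]
print(births_clean)
```

Using the median and interquartile range keeps the extreme values themselves from inflating the estimate of the spread, which is what makes the clipping robust.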
