
LECTURE 5

Pandas, Part III


Advanced Pandas (More on Grouping, Aggregation, Pivot Tables, and Merging)

Data Science, Fall 2023 @ Knowledge Stream


Sana Jabbar

Today's Roadmap

• Pandas, Part III
  • Groupby Review
  • More on Groupby
  • Pivot Tables
  • Joining Tables
• EDA, Part I
  • Structure: Tabular Data
  • Granularity
  • Structure: Variable Types
Groupby Review
Revisiting groupby.agg

dataframe.groupby(column_name).agg(aggregation_function)

babynames.groupby("Year")[["Count"]].agg(sum) computes the total number of babies born in each year.
Revisiting groupby.agg
A groupby operation involves some combination of splitting the object, applying a function, and
combining the results.
● So far, we've seen that df.groupby("year").agg(sum):
○ Split df into sub-DataFrames based on year.
○ Apply the sum function to each column of each sub-DataFrame.
○ Combine the results of sum into a single DataFrame, indexed by year.

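Here is a minimal, runnable sketch of that split-apply-combine pattern on a toy DataFrame (the toy data is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "year":  [2020, 2020, 2021, 2021],
    "count": [10, 20, 30, 40],
})

# Split df into sub-DataFrames by year, apply sum to each column of each
# sub-DataFrame, and combine the results into one DataFrame indexed by year.
totals = df.groupby("year").agg("sum")
print(totals)
#       count
# year
# 2020     30
# 2021     70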
Groupby Review Question

Start with the table below, group by the first column, and apply .agg(f) with f = max.

Original table:

A  3  ak
B  1  tx
C  4  fl
A  1  hi
B  5  mi
C  9  ak
A  2  ca
C  5  sd
B  6  nc

Grouping by the first column gives three sub-DataFrames:

A: (3, ak), (1, hi), (2, ca)
B: (1, tx), (5, mi), (6, nc)
C: (4, fl), (9, ak), (5, sd)

Applying max to each column of each sub-DataFrame yields:

A  3  ??
B  6  ??
C  9  ??

What will go in the ???
Answer

Applying f = max independently to each column of each group gives:

A  3  hi
B  6  tx
C  9  sd

Note that max is applied to each column separately: "hi" is the alphabetically largest state in group A, even though it comes from a different row than num = 3.
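A runnable version of this puzzle (the column names letter/num/state are made up; the values match the example above):

import pandas as pd

df = pd.DataFrame({
    "letter": ["A", "B", "C", "A", "B", "C", "A", "C", "B"],
    "num":    [3, 1, 4, 1, 5, 9, 2, 5, 6],
    "state":  ["ak", "tx", "fl", "hi", "mi", "ak", "ca", "sd", "nc"],
})

# max is computed for each column independently within each group
print(df.groupby("letter").agg("max"))
#         num state
# letter
# A         3    hi
# B         6    tx
# C         9    sd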
Aggregation Functions

What goes inside of .agg( )?
● Any function that aggregates several values into one summary value.
● Common examples:

Built-in Python functions:   .agg(sum)      .agg(max)      .agg(min)
NumPy functions:             .agg(np.sum)   .agg(np.max)   .agg(np.min)   .agg(np.mean)
Built-in pandas functions:   .agg("sum")    .agg("max")    .agg("min")    .agg("mean")   .agg("first")   .agg("last")

Some commonly used aggregation functions can even be called directly, without the explicit use of .agg( ):

babynames.groupby("Year").mean()
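As a quick sketch of the equivalences in the table above, all of the following compute the same per-year totals on the babynames table used throughout this lecture:

babynames.groupby("Year")[["Count"]].agg(sum)     # built-in Python function
babynames.groupby("Year")[["Count"]].agg("sum")   # pandas string alias
babynames.groupby("Year")[["Count"]].sum()        # direct method call, no .agg needed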
Putting Things Into Practice
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.

f_babynames = babynames[babynames["Sex"] == "F"]
f_babynames = f_babynames.sort_values(["Year"])
jenn_counts_series = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]

[Plot: number of Jennifers born in California per year.]
What Is "Popularity"?
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.

How do we define "fallen in popularity?"


● Let’s create a metric: "Ratio to Peak" (RTP).
● The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies
born with that name in any year.

Example for "Jennifer":


● In 1972, we hit peak Jennifer. 6,065 Jennifers were born.
● In 2022, there were only 114 Jennifers.
● RTP is 114 / 6065 = 0.018796372629843364.

Calculating RTP

max_jenn = max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
# 6065

curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
# 114

Remember: f_babynames is sorted by year, so .iloc[-1] means "grab the latest year."

rtp = curr_jenn / max_jenn
# 0.018796372629843364

We can package this logic into a function that works for any name:

def ratio_to_peak(series):
    return series.iloc[-1] / max(series)

jenn_counts_ser = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)
# 0.018796372629843364
Renaming Columns After Grouping

By default, .groupby will not rename any aggregated columns (the column is still named "Count", even though it now represents the RTP).

For better readability, we may wish to rename "Count" to "Count RTP":

rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.

rtp_table.sort_values("Count RTP")

px.line(f_babynames[f_babynames["Name"] == "Debra"],
        x = "Year", y = "Count")

We'll learn about plotting in week 4.


Some Data Science Payoff

We can get the list of the top 10 names and then plot popularity with:

top10 = rtp_table.sort_values("Count RTP").head(10).index

px.line(f_babynames[f_babynames["Name"].isin(top10)],
        x = "Year", y = "Count", color = "Name")
Answer

Before, we saw that the code below generates the Count RTP for all female names:

f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

We use similar logic to compute the summed counts of all baby names:

babynames.groupby("Name")[["Count"]].agg(sum)
or
babynames.groupby("Name")[["Count"]].sum()
Answer
Now, we create groups for each year.

babynames.groupby("Year")[["Count"]].agg(sum)
or
babynames.groupby("Year")[["Count"]].sum()
or
babynames.groupby("Year").sum(numeric_only=True)

Plotting Birth Counts
Plotting the DataFrame we just generated tells an interesting story.

puzzle2 = babynames.groupby("Year")[["Count"]].agg(sum)
px.line(puzzle2, y = "Count")

More on Groupby
Raw GroupBy Objects and Other Methods

The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.
● It is not a DataFrame!

grouped_by_year = elections.groupby("Year")
type(grouped_by_year)
# pandas.core.groupby.generic.DataFrameGroupBy

Given a DataFrameGroupBy object, we can use various functions to generate DataFrames (or Series). agg is only one choice:

df.groupby(col).mean()   df.groupby(col).first()   df.groupby(col).filter()
df.groupby(col).sum()    df.groupby(col).last()
df.groupby(col).min()    df.groupby(col).size()    🤨 What's the difference?
df.groupby(col).max()    df.groupby(col).count()

See https://pandas.pydata.org/docs/reference/groupby.html for a list of DataFrameGroupBy methods.
groupby.size() and groupby.count()

groupby("year").size() returns a Series counting the number of rows in each group, including rows with missing values:

Original table:

1992  3    ak
1996  1    tx
2000  4    fl
1992  NaN  mi
2000  9    NaN
2000  2    ca
1996  1    hi
2000  6    sd

Result of .size():

1992  2
1996  2
2000  4

Similar to value_counts(), except that size() does not sort the index based on the frequency of entries.
groupby("year").count() returns a DataFrame with the counts of non-missing values in each column. On the same table:

1992  1  2
1996  2  2
2000  4  3
Filtering by Group
Another common use for groups is to filter data.
● groupby.filter takes an argument func.
● func is a function that:
○ Takes a DataFrame as input.
○ Returns either True or False.
● filter applies func to each group/sub-DataFrame:
○ If func returns True for a group, then all rows belonging to the group are preserved.
○ If func returns False for a group, then all rows belonging to that group are filtered out.
● Notes:
○ Filtering is done per group, not per row. Different from boolean filtering.
○ Unlike agg(), the column we grouped on does NOT become the index!

groupby.filter()

groupby followed by .filter(f), where f = lambda sf: sf["num"].sum() > 10

Original table:      Group sums:            Result (rows kept, original order):

A  3                 A: 3 + 1 + 2 = 6       B  1
B  1                 B: 1 + 5 + 6 = 12      C  4
C  4                 C: 4 + 9 = 13          B  5
A  1                 D: 5                   C  9
B  5                                        B  6
C  9
A  2
D  5
B  6

Only groups B and C have sums greater than 10, so only their rows survive the filter.
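The same filter as runnable code (the toy column names letter/num match the example above):

import pandas as pd

df = pd.DataFrame({
    "letter": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "num":    [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Keep only rows whose group's "num" values sum to more than 10.
kept = df.groupby("letter").filter(lambda sf: sf["num"].sum() > 10)
print(kept)
#   letter  num
# 1      B    1
# 2      C    4
# 4      B    5
# 5      C    9
# 8      B    6

Note that the original row index is preserved: unlike agg, filter does not move the grouping column into the index.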
Filtering Elections Dataset
Going back to the elections dataset.
Let's keep only election year results where the max '%' is less than 45%.

elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)

groupby Puzzle

Puzzle: We want to know the best election for each party.

● Best election: the election with the highest % of votes.
● For example, the Democrats' best election was in 1964, with candidate Lyndon Johnson winning 61.3% of the vote.
Attempt #1
Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

elections.groupby("Party").max().head(10)

Problem with Attempt #1

Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

elections.groupby("Party").max().head(10)

Every column is calculated independently! Among Democrats:
● Last year they ran: 2020.
● Alphabetically latest candidate name: Woodrow Wilson.
● Highest % of the vote: 61.34%.
Attempt #2: Motivation
● We want to preserve entire rows, so we need an aggregate function that does that.

Attempt #2: Solution

.sort_values("%", ascending=False), then .groupby("Party"), then .first().

Order is preserved in sub-DataFrames! After sorting by % in descending order, the first row of each party's sub-DataFrame is that party's best election:

Sorted by % (descending):      .groupby("Party").first():

Dem    1964  61%               Dem    1964  61%
Dem    1936  60%               Rep    1972  60%
Rep    1972  60%               Green  2000  2.7%
Rep    1920  60%               …
Rep    1984  59%
…
Green  2020  0.2%
Green  2004  0.01%
Attempt #2: Solution
● First sort the DataFrame so that rows are in descending order of %.
● Then group by Party and take the first item of each sub-DataFrame.
● Note: Lab will give you a chance to try this out if you didn't quite follow during lecture.

elections_sorted_by_percent = elections.sort_values("%", ascending=False)


elections_sorted_by_percent.groupby("Party").first()

groupby Puzzle - Alternate Approaches

Using a lambda function

elections_sorted_by_percent = elections.sort_values("%", ascending=False)


elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])

Using idxmax function

best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]

Using drop_duplicates function

best_per_party2 = elections.sort_values("%").drop_duplicates(["Party"], keep="last")


There's More Than One Way to Find the Best Result by Party

In pandas, there's more than one way to get to the same answer.

● Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc.
● It takes a very long time to understand these tradeoffs!
● If you find your current solution to be particularly convoluted or hard to read, try finding another way!
More on the DataFrameGroupBy Object

We can look inside DataFrameGroupBy objects in the following ways:

grouped_by_party = elections.groupby("Party")
grouped_by_party.groups

grouped_by_party.get_group("Socialist")
Pivot Tables
Grouping by Multiple Columns

Suppose we want to build a table showing the total number of babies of each sex born in each year. One way is to group by both columns of interest:

babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)

Note: the resulting DataFrame is multi-indexed; that is, its index has multiple dimensions. We will explore this in a later lecture.
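A short sketch of working with the resulting multi-index (reset_index is standard pandas, shown here as one way to flatten it back to ordinary columns):

counts = babynames.groupby(["Year", "Sex"])[["Count"]].agg("sum")

# counts.index is a MultiIndex with levels (Year, Sex)
flat = counts.reset_index()   # turns the index levels back into columns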
Pivot Tables
A more natural approach is to create a pivot table.

babynames_pivot = babynames.pivot_table(
index = "Year", # rows (turned into index)
columns = "Sex", # column values
values = ["Count"], # field(s) to process in each group
aggfunc = np.sum, # group operation
)
babynames_pivot.head(6)

groupby(["Year", "Sex"]) vs. pivot_table
The pivot table more naturally represents our data.

groupby output pivot_table output

39
Pivot Table Mechanics

With f = sum: rows are grouped by each (R, C) pair, f aggregates the values within each group, and the aggregates are laid out in a grid with R values as the index and C values as the columns.

Original table:    Groups and sums:         Pivot table:

A  F  3            A, F: 3 + 2 = 5               F    M
B  M  1            A, M: 1                  A    5    1
C  F  4            B, F: 5                  B    5    7
A  M  1            B, M: 1 + 6 = 7          C    4    9
B  F  5            C, F: 4                  D    5    NaN
C  M  9            C, M: 9
A  F  2            D, F: 5
D  F  5            D, M: (no rows, so NaN)
B  M  6
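A runnable sketch of these mechanics (the toy column names R/C/val match the diagram above):

import pandas as pd

df = pd.DataFrame({
    "R":   ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "C":   ["F", "M", "F", "M", "F", "M", "F", "F", "M"],
    "val": [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Group by each (R, C) pair, sum the values, and lay the sums out in a grid.
# The (D, M) combination has no rows, so it shows up as NaN.
print(df.pivot_table(index="R", columns="C", values="val", aggfunc="sum"))
# C      F    M
# R
# A    5.0  1.0
# B    5.0  7.0
# C    4.0  9.0
# D    5.0  NaN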
Pivot Tables with Multiple Values
We can include multiple values in our pivot tables.

babynames_pivot = babynames.pivot_table(
index = "Year", # rows (turned into index)
columns = "Sex", # column values
values = ["Count", "Name"],
aggfunc = np.max, # group operation
)
babynames_pivot.head(6)

Joining Tables
Joining Tables

Suppose we want to know the popularity of presidential candidates' first names in 2022.
● Example: Dwight Eisenhower's first name, Dwight, is not popular today; only 5 babies born in California in 2022 were given this name.

To solve this problem, we'll have to join tables.
Creating Table 1: Babynames in 2022

Let's first set aside the California names from 2022:

babynames_2022 = babynames[babynames["Year"] == 2022]
babynames_2022
Creating Table 2: Presidents with First Names

To join our tables, we'll also need to extract the first name of each candidate:

elections["First Name"] = elections["Candidate"].str.split().str[0]
Joining Our Tables: Two Options

merged = pd.merge(left = elections, right = babynames_2022,
                  left_on = "First Name", right_on = "Name")

merged = elections.merge(right = babynames_2022,
                         left_on = "First Name", right_on = "Name")
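A toy sketch of how the merge matches rows (the miniature tables are made up, reusing the Dwight example from earlier; Emma's count is invented):

import pandas as pd

elections_mini = pd.DataFrame({
    "Candidate":  ["Dwight Eisenhower"],
    "First Name": ["Dwight"],
})
babynames_mini = pd.DataFrame({
    "Name":  ["Dwight", "Emma"],
    "Count": [5, 2000],
})

# Rows pair up wherever "First Name" (left) equals "Name" (right);
# unmatched rows like Emma are dropped by the default inner join.
merged = pd.merge(left=elections_mini, right=babynames_mini,
                  left_on="First Name", right_on="Name")
print(merged)
#            Candidate First Name    Name  Count
# 0  Dwight Eisenhower     Dwight  Dwight      5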
Data Wrangling and EDA, Part I
Exploratory Data Analysis and its role in the data science lifecycle

[Slide graphics: a "Box of Data"; EDA guiding principles; the next step.]
Plan for First Few Weeks

[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis -> Prediction and Inference -> Reports, Decisions, and Solutions.]

Weeks 1 and 2: Exploring and Cleaning Tabular Data -- from datascience to pandas.
Weeks 2 and 3: Data Science in Practice -- EDA, data cleaning, text processing (regular expressions).
Structure: Tabular Data
Rectangular and Non-rectangular Data

Data come in many different shapes.

[Images: examples of non-rectangular and rectangular data.]
Rectangular Data

We often prefer rectangular data for data analysis. Why?
● Regular structures are easy to manipulate and analyze.
● A big part of data cleaning is about transforming data to be more rectangular.

Rectangular data is organized into records (rows) and fields/attributes/features (columns).

Two kinds of rectangular data: tables and matrices.

Tables (a.k.a. DataFrames in R/Python and relations in SQL):
● Named columns with different types.
● Manipulated using data transformation languages (map, filter, group by, join, …).

Matrices:
● Numeric data of the same type (float, int, etc.).
● Manipulated using linear algebra.

What are the differences? Why would you use one over the other?
Tuberculosis – United States, 2021

CDC Morbidity and Mortality Weekly Report (MMWR), 03/25/2022.

● What is incidence? Why use it here?
● How was the "9.4% increase" computed?

Question: Can we reproduce these rates using government data?

CDC: source, 2021
CSV: Comma-Separated Values

Tuberculosis in the US [CDC source].

CSV is a very common tabular file format.
● Records (rows) are delimited by a newline: '\n' or "\r\n".
● Fields (columns) are delimited by commas: ','.

Pandas: pd.read_csv(header=...)

Example records (rows) and fields/attributes/features (columns):

   U.S. jurisdiction   TB cases 2019   …
0  Total               8,900           …
1  Alabama             87              …
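A minimal sketch of loading such a file (the filename and the header row position are assumptions for illustration):

import pandas as pd

# If the export has a banner line above the column names, we can skip it
# by telling pandas which row holds the header.
tb_df = pd.read_csv("cdc_tuberculosis.csv", header=1)
print(tb_df.head())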
Granularity
Key Data Properties to Consider in EDA

(we'll come back to this)

● Structure -- the "shape" of a data file
● Granularity -- how fine/coarse is each datum
● Scope -- how (in)complete is the data
● Temporality -- how is the data situated in time
● Faithfulness -- how well does the data capture "reality"
Granularity: How Fine/Coarse Is Each Datum?

What does each record represent?
● Examples: a purchase, a person, a group of users.

Do all records capture granularity at the same level?
● Some data will include summaries (aka rollups) as records.

If the data are coarse, how were the records aggregated?
● Sampling, averaging, maybe some of both…

[Diagram: fine-grained data (one record per observation) vs. coarse-grained data (one record summarizing several observations).]

To the demo!!
Structure: Variable Types
(we're back to this)

Variable type is one aspect of Structure -- the "shape" of a data file (the other key EDA properties are as listed above).
Variables Are Columns

Let's look at records with the same granularity. What does each column represent?

   U.S. jurisdiction   TB cases 2019   …
1  Alabama             87              …
2  Alaska              58              …
…  …                   …               …

A variable is a measurement of a particular concept (e.g., the U.S. jurisdiction variable above). It has two common properties:

● Datatype/storage type: how each variable value is stored in memory (df[colname].dtype).
  ○ integer, floating point, boolean, object (string-like), etc.
  ○ Affects which pandas functions you use.
● Variable type/feature type: the conceptualized measurement of information (and therefore what values it can take on). To determine it:
  ○ use expert knowledge,
  ○ explore the data itself,
  ○ consult the data codebook (if it exists).
  ○ Affects how you visualize and interpret the data.

In this class, "variable types" are conceptual!
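A quick sketch of checking storage types (tb_df and its column name carry over from the CSV example above, as an assumption):

# Storage type of one column
print(tb_df["U.S. jurisdiction"].dtype)   # object (string-like)

# Storage types of all columns at once
print(tb_df.dtypes)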
Variable Feature Types

Variables split into quantitative and qualitative (categorical) types; intervals have meaning only for quantitative variables. Many variables do not sit neatly in one of these categories!

● Quantitative
  ○ Continuous: could be measured to arbitrary precision. Examples: price, temperature.
  ○ Discrete: finite possible values. Examples: number of siblings, years of education.
● Qualitative (categorical)
  ○ Ordinal: categories with ordered levels; no consistent meaning to differences. Examples: preferences, level of education.
  ○ Nominal: categories with no specific ordering. Examples: political affiliation, Cal ID number.

Note that qualitative variables could have numeric levels; conversely, quantitative variables could be stored as strings!
Variable Types Question

What is the feature type (i.e., variable type) of each variable?

1. CO2 level (ppm)
2. Number of siblings
3. GPA
4. Income bracket (low, med, high)
5. Race/Ethnicity
6. Number of years of education
7. Yelp Rating

Answer

1. CO2 level (ppm): quantitative continuous
2. Number of siblings: quantitative discrete
3. GPA: quantitative continuous
4. Income bracket (low, med, high): qualitative ordinal
5. Race/Ethnicity: qualitative nominal
6. Number of years of education: quantitative discrete
7. Yelp Rating: qualitative ordinal

Many of these examples show how "shaggy" these categories are! We will revisit variable types when we learn how to visualize variables.
LECTURE 4

Data Wrangling and EDA, Part I


Content credit: Acknowledgments

