Lec 05-DSFa23
Today's Roadmap
Lecture 5
• Pandas, Part III
○ Groupby Review
○ More on Groupby
○ Pivot Tables
○ Joining Tables
• EDA, Part I
○ Structure: Tabular Data
○ Granularity
○ Structure: Variable Types
Groupby Review
Revisiting groupby.agg
dataframe.groupby(column_name).agg(aggregation_function)
Revisiting groupby.agg
A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
● So far, we've seen that df.groupby("year").agg(sum):
○ Split df into sub-DataFrames based on year.
○ Apply the sum function to each column of each sub-DataFrame.
○ Combine the results of sum into a single DataFrame, indexed by year.
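A minimal runnable sketch of this split-apply-combine pattern, using a small hypothetical DataFrame:

import pandas as pd

# Hypothetical toy data: one row per (year, value) pair
df = pd.DataFrame({
    "year": [1992, 1992, 1996, 2000, 2000],
    "data": [3, 1, 4, 1, 5],
})

# Split by year, apply sum to each column of each sub-DataFrame,
# then combine into a single DataFrame indexed by year.
df.groupby("year").agg("sum")   # equivalent to .agg(sum)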
Groupby Review Question
Consider grouping the table below by its first column and aggregating each group with max.

Original table:
A 3 ak
B 1 tx
C 4 fl
A 1 hi
B 5 mi
C 9 ak
A 2 ca
B 6 nc
C 5 sd

Split into groups by the first column:
A: (3, ak), (1, hi), (2, ca)
B: (1, tx), (5, mi), (6, nc)
C: (4, fl), (9, ak), (5, sd)

Apply max to each column of each group, then combine:
A 3 ??
B 6 ??
C 9 ??

What will go in the ???
Answer
A 3 hi
B 6 tx
C 9 sd

max is applied to each column of each group independently, so the number and the string in a result row need not come from the same original row: "hi" is the largest string in group A even though it originally appeared alongside 1, not 3.
Aggregation Functions
What goes inside of .agg( )?
● Any function that aggregates several values into one summary value.
● Common examples: sum, mean, median, max, min, first, last.
Some commonly used aggregation functions can even be called directly on the GroupBy object, without the explicit use of .agg( ):
babynames.groupby("Year").mean()
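A few more of these direct shortcuts, sketched under the assumption that babynames is the DataFrame loaded earlier in the demo (selecting the Count column keeps the computation numeric):

babynames.groupby("Year")[["Count"]].sum()    # same as .agg("sum")
babynames.groupby("Year")[["Count"]].max()    # same as .agg("max")
babynames.groupby("Year")[["Count"]].mean()   # same as .agg("mean")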
Putting Things Into Practice
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.
What Is "Popularity"?
One natural measure of a drop in popularity is the ratio to peak (RTP): a name's most recent count divided by its highest count ever. A small RTP means the name is now far below its peak.
Calculating RTP
def ratio_to_peak(series):
    # Most recent count (assumes the series is ordered by Year) divided by the peak count
    return series.iloc[-1] / max(series)

rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.
rtp_table.sort_values("Count RTP")
We can then plot the counts over time for a single name, e.g. Debra, or for several names at once (here top10 is assumed to hold the names with the lowest Count RTP):
px.line(f_babynames[f_babynames["Name"] == "Debra"],
        x = "Year", y = "Count")
px.line(f_babynames[f_babynames["Name"].isin(top10)],
        x = "Year", y = "Count", color = "Name")
Answer
Before, we saw that the code below generates the Count RTP for all female names.
f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
We use similar logic to compute the summed counts of all baby names.
babynames.groupby("Name")[["Count"]].agg(sum)
or
babynames.groupby("Name")[["Count"]].sum()
Answer
Now, we create groups for each year.
babynames.groupby("Year")[["Count"]].agg(sum)
or
babynames.groupby("Year")[["Count"]].sum()
or
babynames.groupby("Year").sum(numeric_only=True)
Plotting Birth Counts
Plotting the DataFrame we just generated tells an interesting story.
puzzle2 = babynames.groupby("Year")[["Count"]].agg(sum)
px.line(puzzle2, y = "Count")
More on Groupby
Raw GroupBy Objects and Other Methods
The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.
● It is not a DataFrame!
grouped_by_year = elections.groupby("Year")
type(grouped_by_year)
pandas.core.groupby.generic.DataFrameGroupBy
Given a DataFrameGroupBy object, we can use various functions to generate DataFrames (or Series). agg is only one choice:
groupby("year") .size()
1992 3 ak
1992 3 ak
1992 NaN mi Returns a Series object
1996 1 tx counting the number of
rows in each group.
2000 4 fl
1996 1 hi 1996 1 tx 1992 2
1992 NaN mi 1996 1 hi 1996 2
2000 9 NaN 2000 4
2000 4 fl
2000 2 ca
2000 9 NaN Similar to value_counts()
2000 6 sd except that size() does not sort
2000 2 ca the index based on the frequency
of entries.
2000 6 sd 21
groupby.size() and groupby.count()
groupby("year").count()
Using the same example table, .count() returns a DataFrame with the counts of non-missing values in each column:
1992 1 2
1996 2 2
2000 4 3
Filtering by Group
Another common use for groups is to filter data.
● groupby.filter takes an argument func.
● func is a function that:
○ Takes a DataFrame as input.
○ Returns either True or False.
● filter applies func to each group/sub-DataFrame:
○ If func returns True for a group, then all rows belonging to the group are preserved.
○ If func returns False for a group, then all rows belonging to that group are filtered out.
● Notes:
○ Filtering is done per group, not per row. Different from boolean filtering.
○ Unlike agg(), the column we grouped on does NOT become the index!
groupby.filter()
Example: .filter(f) after grouping by the first column, where f = lambda sf: sf["num"].sum() > 10

Original table (first column = group key, second column = num):
A 3
B 1
C 4
A 1
B 5
C 9
A 2
D 5
B 6

Group sums of num:
A: 3 + 1 + 2 = 6   (False: filtered out)
B: 1 + 5 + 6 = 12  (True: kept)
C: 4 + 9 = 13      (True: kept)
D: 5               (False: filtered out)

Result: only the rows belonging to groups B and C remain (B 1, C 4, B 5, C 9, B 6).
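A runnable sketch of the same example; the column name letter is an illustrative assumption (num comes from the diagram above):

import pandas as pd

df = pd.DataFrame({
    "letter": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "num":    [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Keep only rows whose group's num values sum to more than 10 (groups B and C)
df.groupby("letter").filter(lambda sf: sf["num"].sum() > 10)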
Filtering Elections Dataset
Going back to the elections dataset.
Let's keep only election year results where the max '%' is less than 45%.
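One way to express this, sketched under the assumption that elections has "Year" and "%" columns as in the demo:

# Keep only rows from years in which no candidate reached 45% of the vote
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)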
groupby Puzzle
Puzzle: We want to know the best election by each party.
Attempt #1
Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
elections.groupby("Party").max().head(10)
Problem with Attempt #1
Each column's max is computed independently within each group, so a single result row can mix values from different elections. For the Democratic party, "Woodrow Wilson" is the lexicographically largest candidate name and 2020 is the largest Year, but they come from different rows.
Attempt #2: Motivation
● We want to preserve entire rows, so we need an aggregate function that does that.
[Figure: sample election rows (Party, Year, %), shown before and after sorting in descending order of %.]
Attempt #2: Solution
● First sort the DataFrame so that rows are in descending order of %.
● Then group by Party and take the first item of each sub-DataFrame.
● Note: Lab will give you a chance to try this out if you didn't quite follow during lecture.
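A sketch of this approach, assuming the elections DataFrame from the demo:

# Sort so that each party's best result comes first
elections_sorted_by_percent = elections.sort_values("%", ascending=False)

# Take the first row of each party's sub-DataFrame
elections_sorted_by_percent.groupby("Party").first()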
groupby Puzzle - Alternate Approaches
best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]
Here idxmax returns, for each party, the index label of the row with the largest "%", and .loc then retrieves those full rows.
In Pandas, there's more than one way to get to the same answer.
More on the DataFrameGroupBy Object
We can look into DataFrameGroupBy objects in the following ways:
grouped_by_party = elections.groupby("Party")
grouped_by_party.groups
grouped_by_party.get_group("Socialist")
Pivot Tables
Grouping by Multiple Columns
Suppose we want to build a table showing the total number of babies born of each sex in each year. One way is to group by both columns of interest:
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot Tables
A more natural approach is to create a pivot table.
babynames_pivot = babynames.pivot_table(
    index = "Year",      # rows (turned into index)
    columns = "Sex",     # column values
    values = ["Count"],  # field(s) to process in each group
    aggfunc = np.sum,    # group operation
)
babynames_pivot.head(6)
groupby(["Year", "Sex"]) vs. pivot_table
The pivot table more naturally represents our data.
Pivot Table Mechanics
f = sum
Original table (R = row key, C = column key, value):
A F 3
B M 1
C F 4
A M 1
B F 5
C M 9
A F 2
D F 5
B M 6

Group by each (R, C) pair, apply f to the values in each group, then lay the results out with R as the row index and C as the columns:

    F    M
A   5    1
B   5    7
C   4    9
D   5    NaN
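A runnable sketch of the same mechanics; the column names R and C follow the diagram's labels, and val is an illustrative assumption:

import pandas as pd

df = pd.DataFrame({
    "R":   ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "C":   ["F", "M", "F", "M", "F", "M", "F", "F", "M"],
    "val": [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Rows indexed by R, one column per value of C, each cell the sum of val in that group
df.pivot_table(index="R", columns="C", values="val", aggfunc="sum")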
Pivot Tables with Multiple Values
We can include multiple values in our pivot tables.
babynames_pivot = babynames.pivot_table(
    index = "Year",              # rows (turned into index)
    columns = "Sex",             # column values
    values = ["Count", "Name"],  # field(s) to process in each group
    aggfunc = np.max,            # group operation
)
babynames_pivot.head(6)
Joining Tables
Joining Tables
Suppose we want to know the popularity of presidential candidates' first names in 2022.
● Example: Dwight Eisenhower's first name, Dwight, is not popular today, with only 5 babies born with this name in California in 2022.
Creating Table 1: Babynames in 2022
Let's set aside names in California from 2022 first:
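A sketch, assuming babynames holds the California baby-name counts with a "Year" column:

babynames_2022 = babynames[babynames["Year"] == 2022]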
Creating Table 2: Presidents with First Names
To join our tables, we'll also need to set aside the first name of each candidate.
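One possible sketch, assuming elections has a "Candidate" column containing full names:

# Take the first word of each candidate's full name
elections["First Name"] = elections["Candidate"].str.split().str[0]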
Joining Our Tables: Two Options
merged = pd.merge(left = elections, right = babynames_2022,
                  left_on = "First Name", right_on = "Name")
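The other common form, equivalent to the above, is the DataFrame method:

merged = elections.merge(right = babynames_2022,
                         left_on = "First Name", right_on = "Name")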
Data Wrangling and EDA, Part I
Exploratory Data Analysis and its role in the data science lifecycle
[Figure: course arc from a "box of data" (Weeks 1 and 2) to reports, decisions, and solutions (Weeks 2 and 3).]
Structure: Tabular Data
Rectangular and Non-rectangular Data
Data come in many different shapes.
[Figure: examples of non-rectangular data and rectangular data.]
Rectangular Data
We often prefer rectangular data for data analysis (why?):
• Regular structures are easy to manipulate and analyze.
• A big part of data cleaning is about transforming data to be more rectangular.
In rectangular data, the rows are records and the columns are fields (also called attributes or features).
Two kinds of rectangular data: Tables and Matrices.

Demo Slides
What is incidence? Why use it here?

   U.S. jurisdiction   TB cases 2019   …
0  Total               8,900           …
1  Alabama             87              …
Granularity
Structure -- the "shape" of a data file (we'll come back to this)
Granularity: How Fine/Coarse Is Each Datum?
What does each record represent?
● Examples: a purchase, a person, a group of users.
Do all records capture granularity at the same level?
● Some data will include summaries (aka rollups) as records.
If the data are coarse, how were the records aggregated?
● Sampling, averaging, maybe some of both…
[Figure: a spectrum from fine-grained records (one record per event) to coarse-grained records (one record summarizing many events).]
(Covered until here on 9/5.)
To the demo!!
Structure: Variable Types
Structure -- the "shape" of a data file (we're back to this)
• Variable Type
Variables Are Columns
Let's look at records with the same granularity. What does each column represent?

   U.S. jurisdiction   TB cases 2019   …
1  Alabama             87              …
2  Alaska              58              …
…  …                   …               …

Each column, e.g. U.S. jurisdiction, is a variable. A variable is a measurement of a particular concept, and it has two common properties:
● Datatype/Storage type: how each variable value is stored in memory (df[colname].dtype).
○ integer, floating point, boolean, object (string-like), etc.
○ Affects which pandas functions you use.
● Variable type/Feature type: a conceptualized measurement of information (and therefore what values it can take on).
○ To determine it: use expert knowledge, explore the data itself, consult the data codebook (if it exists).
○ Affects how you visualize and interpret the data.
In this class, "variable types" are conceptual!!
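For example, a quick sketch of checking storage types in pandas, assuming df is the TB DataFrame from the demo:

df["TB cases 2019"].dtype   # storage type of a single column
df.dtypes                   # storage types of every column at once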
Variable Feature Types
[Figure: taxonomy of variable feature types: quantitative vs. qualitative (categorical).]

Q1: What is the feature type of each variable?
Variable                        Feature Type
CO2 level (ppm)                 ?
Number of years of education    ?
Yelp Rating                     ?
Variable Types
Q1: What is the feature type of each variable?
Variable                        Feature Type
CO2 level (ppm)                 A. Quantitative Continuous