
LECTURE 5

Pandas, Part III


Advanced Pandas (More on Grouping, Aggregation, Pivot Tables, and Merging)

Data Science, Fall 2023 @ Knowledge Stream


Sana Jabbar

Today's Roadmap

• Pandas, Part III
  • Groupby Review
  • More on Groupby
  • Pivot Tables
  • Joining Tables
• EDA, Part I
  • Structure: Tabular Data
  • Granularity
  • Structure: Variable Types
Groupby Review
Revisiting groupby.agg

dataframe.groupby(column_name).agg(aggregation_function)

babynames.groupby("Year")[["Count"]].agg(sum) computes the total number of babies born in each year.
Revisiting groupby.agg
A groupby operation involves some combination of splitting the object, applying a function, and
combining the results.
● So far, we've seen that df.groupby("year").agg(sum):
○ Split df into sub-DataFrames based on year.
○ Apply the sum function to each column of each sub-DataFrame.
○ Combine the results of sum into a single DataFrame, indexed by year.

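Here is a minimal, runnable sketch of that split-apply-combine pattern on a toy DataFrame (the toy data is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "year":  [2020, 2020, 2021, 2021],
    "count": [10, 20, 30, 40],
})

# Split df into sub-DataFrames by year, apply sum to each column of each
# sub-DataFrame, and combine the results into one DataFrame indexed by year.
totals = df.groupby("year").agg("sum")
print(totals)
#       count
# year
# 2020     30
# 2021     70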
Groupby Review Question

Start with the table below, group by the first column, and apply .agg(f) with f = max.

Original table:

A  3  ak
B  1  tx
C  4  fl
A  1  hi
B  5  mi
C  9  ak
A  2  ca
C  5  sd
B  6  nc

Grouping by the first column gives three sub-DataFrames:

A: (3, ak), (1, hi), (2, ca)
B: (1, tx), (5, mi), (6, nc)
C: (4, fl), (9, ak), (5, sd)

Applying max to each column of each sub-DataFrame yields:

A  3  ??
B  6  ??
C  9  ??

What will go in the ???
Answer

Applying f = max independently to each column of each group gives:

A  3  hi
B  6  tx
C  9  sd

Note that max is applied to each column separately: "hi" is the alphabetically largest state in group A, even though it comes from a different row than num = 3.
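A runnable version of this puzzle (the column names letter/num/state are made up; the values match the example above):

import pandas as pd

df = pd.DataFrame({
    "letter": ["A", "B", "C", "A", "B", "C", "A", "C", "B"],
    "num":    [3, 1, 4, 1, 5, 9, 2, 5, 6],
    "state":  ["ak", "tx", "fl", "hi", "mi", "ak", "ca", "sd", "nc"],
})

# max is computed for each column independently within each group
print(df.groupby("letter").agg("max"))
#         num state
# letter
# A         3    hi
# B         6    tx
# C         9    sd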
Aggregation Functions

What goes inside of .agg( )?
● Any function that aggregates several values into one summary value.
● Common examples:

Built-in Python functions:   .agg(sum)      .agg(max)      .agg(min)
NumPy functions:             .agg(np.sum)   .agg(np.max)   .agg(np.min)   .agg(np.mean)
Built-in pandas functions:   .agg("sum")    .agg("max")    .agg("min")    .agg("mean")   .agg("first")   .agg("last")

Some commonly used aggregation functions can even be called directly, without the explicit use of .agg( ):

babynames.groupby("Year").mean()
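As a quick sketch of the equivalences in the table above, all of the following compute the same per-year totals on the babynames table used throughout this lecture:

babynames.groupby("Year")[["Count"]].agg(sum)     # built-in Python function
babynames.groupby("Year")[["Count"]].agg("sum")   # pandas string alias
babynames.groupby("Year")[["Count"]].sum()        # direct method call, no .agg needed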
Putting Things Into Practice
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.

f_babynames = babynames[babynames["Sex"] == "F"]
f_babynames = f_babynames.sort_values(["Year"])
jenn_counts_series = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]

[Plot: number of Jennifers born in California per year.]
What Is "Popularity"?
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.

How do we define "fallen in popularity?"


● Let’s create a metric: "Ratio to Peak" (RTP).
● The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies
born with that name in any year.

Example for "Jennifer":


● In 1972, we hit peak Jennifer. 6,065 Jennifers were born.
● In 2022, there were only 114 Jennifers.
● RTP is 114 / 6065 = 0.018796372629843364.

Calculating RTP

max_jenn = max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
# 6065

curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
# 114

Remember: f_babynames is sorted by year, so .iloc[-1] means "grab the latest year."

rtp = curr_jenn / max_jenn
# 0.018796372629843364

We can package this logic into a function that works for any name:

def ratio_to_peak(series):
    return series.iloc[-1] / max(series)

jenn_counts_ser = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)
# 0.018796372629843364
Renaming Columns After Grouping

By default, .groupby will not rename any aggregated columns (the column is still named "Count", even though it now represents the RTP).

For better readability, we may wish to rename "Count" to "Count RTP":

rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.

rtp_table.sort_values("Count RTP")

px.line(f_babynames[f_babynames["Name"] == "Debra"],
        x = "Year", y = "Count")

We'll learn about plotting in week 4.


Some Data Science Payoff

We can get the list of the top 10 names and then plot popularity with:

top10 = rtp_table.sort_values("Count RTP").head(10).index

px.line(f_babynames[f_babynames["Name"].isin(top10)],
        x = "Year", y = "Count", color = "Name")
Answer

Before, we saw that the code below generates the Count RTP for all female names:

f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

We use similar logic to compute the summed counts of all baby names:

babynames.groupby("Name")[["Count"]].agg(sum)
or
babynames.groupby("Name")[["Count"]].sum()
Answer
Now, we create groups for each year.

babynames.groupby("Year")[["Count"]].agg(sum)
or
babynames.groupby("Year")[["Count"]].sum()
or
babynames.groupby("Year").sum(numeric_only=True)

Plotting Birth Counts
Plotting the DataFrame we just generated tells an interesting story.

puzzle2 = babynames.groupby("Year")[["Count"]].agg(sum)
px.line(puzzle2, y = "Count")

More on Groupby
Raw GroupBy Objects and Other Methods

The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.
● It is not a DataFrame!

grouped_by_year = elections.groupby("Year")
type(grouped_by_year)
# pandas.core.groupby.generic.DataFrameGroupBy

Given a DataFrameGroupBy object, we can use various functions to generate DataFrames (or Series). agg is only one choice:

df.groupby(col).mean()   df.groupby(col).first()   df.groupby(col).filter()
df.groupby(col).sum()    df.groupby(col).last()
df.groupby(col).min()    df.groupby(col).size()    🤨 What's the difference?
df.groupby(col).max()    df.groupby(col).count()

See https://pandas.pydata.org/docs/reference/groupby.html for a list of DataFrameGroupBy methods.
groupby.size() and groupby.count()

groupby("year").size() returns a Series counting the number of rows in each group, including rows with missing values:

Original table:

1992  3    ak
1996  1    tx
2000  4    fl
1992  NaN  mi
2000  9    NaN
2000  2    ca
1996  1    hi
2000  6    sd

Result of .size():

1992  2
1996  2
2000  4

Similar to value_counts(), except that size() does not sort the index based on the frequency of entries.
groupby("year").count() returns a DataFrame with the counts of non-missing values in each column. On the same table:

1992  1  2
1996  2  2
2000  4  3
Filtering by Group
Another common use for groups is to filter data.
● groupby.filter takes an argument func.
● func is a function that:
○ Takes a DataFrame as input.
○ Returns either True or False.
● filter applies func to each group/sub-DataFrame:
○ If func returns True for a group, then all rows belonging to the group are preserved.
○ If func returns False for a group, then all rows belonging to that group are filtered out.
● Notes:
○ Filtering is done per group, not per row. Different from boolean filtering.
○ Unlike agg(), the column we grouped on does NOT become the index!

groupby.filter()

groupby followed by .filter(f), where f = lambda sf: sf["num"].sum() > 10

Original table:      Group sums:            Result (rows kept, original order):

A  3                 A: 3 + 1 + 2 = 6       B  1
B  1                 B: 1 + 5 + 6 = 12      C  4
C  4                 C: 4 + 9 = 13          B  5
A  1                 D: 5                   C  9
B  5                                        B  6
C  9
A  2
D  5
B  6

Only groups B and C have sums greater than 10, so only their rows survive the filter.
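The same filter as runnable code (the toy column names letter/num match the example above):

import pandas as pd

df = pd.DataFrame({
    "letter": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "num":    [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Keep only rows whose group's "num" values sum to more than 10.
kept = df.groupby("letter").filter(lambda sf: sf["num"].sum() > 10)
print(kept)
#   letter  num
# 1      B    1
# 2      C    4
# 4      B    5
# 5      C    9
# 8      B    6

Note that the original row index is preserved: unlike agg, filter does not move the grouping column into the index.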
Filtering Elections Dataset
Going back to the elections dataset.
Let's keep only election year results where the max '%' is less than 45%.

elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)

groupby Puzzle

Puzzle: We want to know the best election for each party.

● Best election: the election with the highest % of votes.
● For example, the Democrats' best election was in 1964, with candidate Lyndon Johnson winning 61.3% of the vote.
Attempt #1
Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

elections.groupby("Party").max().head(10)

Problem with Attempt #1

Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

elections.groupby("Party").max().head(10)

Every column is calculated independently! Among Democrats:
● Last year they ran: 2020.
● Alphabetically latest candidate name: Woodrow Wilson.
● Highest % of the vote: 61.34%.
Attempt #2: Motivation
● We want to preserve entire rows, so we need an aggregate function that does that.

Attempt #2: Solution

.sort_values("%", ascending=False), then .groupby("Party"), then .first().

Order is preserved in sub-DataFrames! After sorting by % in descending order, the first row of each party's sub-DataFrame is that party's best election:

Sorted by % (descending):      .groupby("Party").first():

Dem    1964  61%               Dem    1964  61%
Dem    1936  60%               Rep    1972  60%
Rep    1972  60%               Green  2000  2.7%
Rep    1920  60%               …
Rep    1984  59%
…
Green  2020  0.2%
Green  2004  0.01%
Attempt #2: Solution
● First sort the DataFrame so that rows are in descending order of %.
● Then group by Party and take the first item of each sub-DataFrame.
● Note: Lab will give you a chance to try this out if you didn't quite follow during lecture.

elections_sorted_by_percent = elections.sort_values("%", ascending=False)


elections_sorted_by_percent.groupby("Party").first()

groupby Puzzle - Alternate Approaches

Using a lambda function

elections_sorted_by_percent = elections.sort_values("%", ascending=False)


elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])

Using idxmax function

best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]

Using drop_duplicates function

best_per_party2 = elections.sort_values("%").drop_duplicates(["Party"], keep="last")


There's More Than One Way to Find the Best Result by Party

In pandas, there's more than one way to get to the same answer.

● Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc.
● It takes a very long time to understand these tradeoffs!
● If you find your current solution to be particularly convoluted or hard to read, try finding another way!
More on the DataFrameGroupBy Object

We can look inside DataFrameGroupBy objects in the following ways:

grouped_by_party = elections.groupby("Party")
grouped_by_party.groups

grouped_by_party.get_group("Socialist")
Pivot Tables
Grouping by Multiple Columns

Suppose we want to build a table showing the total number of babies of each sex born in each year. One way is to group by both columns of interest:

babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)

Note: the resulting DataFrame is multi-indexed; that is, its index has multiple dimensions. We will explore this in a later lecture.
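A short sketch of working with the resulting multi-index (reset_index is standard pandas, shown here as one way to flatten it back to ordinary columns):

counts = babynames.groupby(["Year", "Sex"])[["Count"]].agg("sum")

# counts.index is a MultiIndex with levels (Year, Sex)
flat = counts.reset_index()   # turns the index levels back into columns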
Pivot Tables
A more natural approach is to create a pivot table.

babynames_pivot = babynames.pivot_table(
index = "Year", # rows (turned into index)
columns = "Sex", # column values
values = ["Count"], # field(s) to process in each group
aggfunc = np.sum, # group operation
)
babynames_pivot.head(6)

groupby(["Year", "Sex"]) vs. pivot_table
The pivot table more naturally represents our data.

groupby output pivot_table output

39
Pivot Table Mechanics

With f = sum: rows are grouped by each (R, C) pair, f aggregates the values within each group, and the aggregates are laid out in a grid with R values as the index and C values as the columns.

Original table:    Groups and sums:         Pivot table:

A  F  3            A, F: 3 + 2 = 5               F    M
B  M  1            A, M: 1                  A    5    1
C  F  4            B, F: 5                  B    5    7
A  M  1            B, M: 1 + 6 = 7          C    4    9
B  F  5            C, F: 4                  D    5    NaN
C  M  9            C, M: 9
A  F  2            D, F: 5
D  F  5            D, M: (no rows, so NaN)
B  M  6
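A runnable sketch of these mechanics (the toy column names R/C/val match the diagram above):

import pandas as pd

df = pd.DataFrame({
    "R":   ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "C":   ["F", "M", "F", "M", "F", "M", "F", "F", "M"],
    "val": [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Group by each (R, C) pair, sum the values, and lay the sums out in a grid.
# The (D, M) combination has no rows, so it shows up as NaN.
print(df.pivot_table(index="R", columns="C", values="val", aggfunc="sum"))
# C      F    M
# R
# A    5.0  1.0
# B    5.0  7.0
# C    4.0  9.0
# D    5.0  NaN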
Pivot Tables with Multiple Values
We can include multiple values in our pivot tables.

babynames_pivot = babynames.pivot_table(
index = "Year", # rows (turned into index)
columns = "Sex", # column values
values = ["Count", "Name"],
aggfunc = np.max, # group operation
)
babynames_pivot.head(6)

Joining Tables
Joining Tables

Suppose we want to know the popularity of presidential candidates' first names in 2022.
● Example: Dwight Eisenhower's first name, Dwight, is not popular today; only 5 babies born in California in 2022 were given this name.

To solve this problem, we'll have to join tables.
Creating Table 1: Babynames in 2022

Let's first set aside the California names from 2022:

babynames_2022 = babynames[babynames["Year"] == 2022]
babynames_2022
Creating Table 2: Presidents with First Names

To join our tables, we'll also need to extract the first name of each candidate:

elections["First Name"] = elections["Candidate"].str.split().str[0]
Joining Our Tables: Two Options

merged = pd.merge(left = elections, right = babynames_2022,
                  left_on = "First Name", right_on = "Name")

merged = elections.merge(right = babynames_2022,
                         left_on = "First Name", right_on = "Name")
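A toy sketch of how the merge matches rows (the miniature tables are made up, reusing the Dwight example from earlier; Emma's count is invented):

import pandas as pd

elections_mini = pd.DataFrame({
    "Candidate":  ["Dwight Eisenhower"],
    "First Name": ["Dwight"],
})
babynames_mini = pd.DataFrame({
    "Name":  ["Dwight", "Emma"],
    "Count": [5, 2000],
})

# Rows pair up wherever "First Name" (left) equals "Name" (right);
# unmatched rows like Emma are dropped by the default inner join.
merged = pd.merge(left=elections_mini, right=babynames_mini,
                  left_on="First Name", right_on="Name")
print(merged)
#            Candidate First Name    Name  Count
# 0  Dwight Eisenhower     Dwight  Dwight      5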
Data Wrangling and EDA, Part I
Exploratory Data Analysis and its role in the data science lifecycle

[Slide graphics: a "Box of Data"; EDA guiding principles; the next step.]
Plan for First Few Weeks

[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis -> Prediction and Inference -> Reports, Decisions, and Solutions.]

Weeks 1 and 2: Exploring and Cleaning Tabular Data -- from datascience to pandas.
Weeks 2 and 3: Data Science in Practice -- EDA, data cleaning, text processing (regular expressions).
Structure: Tabular Data
Rectangular and Non-rectangular Data

Data come in many different shapes.

[Images: examples of non-rectangular and rectangular data.]
Rectangular Data

We often prefer rectangular data for data analysis. Why?
● Regular structures are easy to manipulate and analyze.
● A big part of data cleaning is about transforming data to be more rectangular.

Rectangular data is organized into records (rows) and fields/attributes/features (columns).

Two kinds of rectangular data: tables and matrices.

Tables (a.k.a. DataFrames in R/Python and relations in SQL):
● Named columns with different types.
● Manipulated using data transformation languages (map, filter, group by, join, …).

Matrices:
● Numeric data of the same type (float, int, etc.).
● Manipulated using linear algebra.

What are the differences? Why would you use one over the other?
Tuberculosis – United States, 2021

CDC Morbidity and Mortality Weekly Report (MMWR), 03/25/2022.

● What is incidence? Why use it here?
● How was the "9.4% increase" computed?

Question: Can we reproduce these rates using government data?

CDC: source, 2021
CSV: Comma-Separated Values

Tuberculosis in the US [CDC source].

CSV is a very common tabular file format.
● Records (rows) are delimited by a newline: '\n' or "\r\n".
● Fields (columns) are delimited by commas: ','.

Pandas: pd.read_csv(header=...)

Example records (rows) and fields/attributes/features (columns):

   U.S. jurisdiction   TB cases 2019   …
0  Total               8,900           …
1  Alabama             87              …
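A minimal sketch of loading such a file (the filename and the header row position are assumptions for illustration):

import pandas as pd

# If the export has a banner line above the column names, we can skip it
# by telling pandas which row holds the header.
tb_df = pd.read_csv("cdc_tuberculosis.csv", header=1)
print(tb_df.head())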
Granularity
Key Data Properties to Consider in EDA

(we'll come back to this)

● Structure -- the "shape" of a data file
● Granularity -- how fine/coarse is each datum
● Scope -- how (in)complete is the data
● Temporality -- how is the data situated in time
● Faithfulness -- how well does the data capture "reality"
Granularity: How Fine/Coarse Is Each Datum?

What does each record represent?
● Examples: a purchase, a person, a group of users.

Do all records capture granularity at the same level?
● Some data will include summaries (aka rollups) as records.

If the data are coarse, how were the records aggregated?
● Sampling, averaging, maybe some of both…

[Diagram: fine-grained data (one record per observation) vs. coarse-grained data (one record summarizing several observations).]

To the demo!!
Structure: Variable Types
(we're back to this)

Variable type is one aspect of Structure -- the "shape" of a data file (the other key EDA properties are as listed above).
Variables Are Columns

Let's look at records with the same granularity. What does each column represent?

   U.S. jurisdiction   TB cases 2019   …
1  Alabama             87              …
2  Alaska              58              …
…  …                   …               …

A variable is a measurement of a particular concept (e.g., the U.S. jurisdiction variable above). It has two common properties:

● Datatype/storage type: how each variable value is stored in memory (df[colname].dtype).
  ○ integer, floating point, boolean, object (string-like), etc.
  ○ Affects which pandas functions you use.
● Variable type/feature type: the conceptualized measurement of information (and therefore what values it can take on). To determine it:
  ○ use expert knowledge,
  ○ explore the data itself,
  ○ consult the data codebook (if it exists).
  ○ Affects how you visualize and interpret the data.

In this class, "variable types" are conceptual!
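A quick sketch of checking storage types (tb_df and its column name carry over from the CSV example above, as an assumption):

# Storage type of one column
print(tb_df["U.S. jurisdiction"].dtype)   # object (string-like)

# Storage types of all columns at once
print(tb_df.dtypes)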
Variable Feature Types

Variables split into quantitative and qualitative (categorical) types; intervals have meaning only for quantitative variables. Many variables do not sit neatly in one of these categories!

● Quantitative
  ○ Continuous: could be measured to arbitrary precision. Examples: price, temperature.
  ○ Discrete: finite possible values. Examples: number of siblings, years of education.
● Qualitative (categorical)
  ○ Ordinal: categories with ordered levels; no consistent meaning to differences. Examples: preferences, level of education.
  ○ Nominal: categories with no specific ordering. Examples: political affiliation, Cal ID number.

Note that qualitative variables could have numeric levels; conversely, quantitative variables could be stored as strings!
Variable Types Question

What is the feature type (i.e., variable type) of each variable?

1. CO2 level (ppm)
2. Number of siblings
3. GPA
4. Income bracket (low, med, high)
5. Race/Ethnicity
6. Number of years of education
7. Yelp Rating

Answer

1. CO2 level (ppm): quantitative continuous
2. Number of siblings: quantitative discrete
3. GPA: quantitative continuous
4. Income bracket (low, med, high): qualitative ordinal
5. Race/Ethnicity: qualitative nominal
6. Number of years of education: quantitative discrete
7. Yelp Rating: qualitative ordinal

Many of these examples show how "shaggy" these categories are! We will revisit variable types when we learn how to visualize variables.
LECTURE 4

Data Wrangling and EDA, Part I


Content credit: Acknowledgments

