Pandas Handbook
What is Pandas?
Pandas makes working with structured data fast, expressive, and flexible.
Installing Pandas
Install it with pip:
pip install pandas
Importing Pandas
import pandas as pd
Excel offers an easy UI and is great for small data, but it is slow, manual, and not scalable.
Pandas bridges the gap between NumPy performance and Excel-like usability.
Pandas is built on top of NumPy.
Summary
import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)
Output:
0    10
1    20
2    30
3    40
dtype: int64
data = {
    "name": ["Alice", "Bob", "Charlie"]
}
df = pd.DataFrame(data)
print(df)
Output:
      name
0    Alice
1      Bob
2  Charlie
• Fast lookups
• Aligning data
• Merging & joining
• Time series operations
• Selection
• Filtering
• Merging
• Aggregation
Summary
Creating DataFrames
Let’s look at different ways to create a Pandas DataFrame — the core data
structure you’ll be using 90% of the time in data science.
import pandas as pd
data = [
["Alice", 25],
["Bob", 30],
["Charlie", 35]
]
df = pd.DataFrame(data, columns=["Name", "Age"])
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35]
}
df = pd.DataFrame(data)
Each key becomes a column, and each list is the column data.
import numpy as np
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=["A", "B"])
df = pd.read_csv("data.csv")
Use options like sep, header, names, index_col, usecols, nrows, etc.
Example:
df = pd.read_csv("data.csv", index_col=0, nrows=100)
df = pd.read_excel("data.xlsx")
df = pd.read_json("data.json")
import sqlite3
conn = sqlite3.connect("mydb.sqlite")
df = pd.read_sql("SELECT * FROM users", conn)
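As a self-contained sketch, the same pattern works with an in-memory database (the table name and columns here are made up for illustration):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database to query against
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.commit()

# read_sql runs the query and returns the result as a DataFrame
df = pd.read_sql("SELECT * FROM users", conn)
conn.close()
```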
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
df = pd.read_csv(url)
Exploratory Data Analysis (EDA) is an essential first step in any data science project.
It involves taking a deep look at the dataset to understand its structure, spot
patterns, identify anomalies, and uncover relationships between variables. This
process includes generating summary statistics, checking for missing or duplicate
data, and creating visualizations like histograms, box plots, and scatter plots. The
goal of EDA is to get a clear picture of what the data is telling you before applying
any analysis or machine learning models.
By exploring the data thoroughly, you can make better decisions about how to
clean, transform, and model it effectively.
Summary
• You can create DataFrames from lists, dicts, arrays, files, web, and SQL
• Use .head() , .info() , .describe() to quickly explore any dataset
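For instance, on a small made-up dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

print(df.head())      # first rows of the DataFrame
print(df.describe())  # summary statistics for numeric columns
df.info()             # dtypes, non-null counts, memory usage
```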
Selecting Columns
df["Name"]            # single column (returns a Series)
df[["Name", "Age"]]   # multiple columns (returns a DataFrame)
Simple Condition
df[df["Age"] > 25]
The .query() method in pandas lets you filter DataFrame rows using a string
expression — it’s a more readable and often more concise alternative to using
boolean indexing.
col = "Age"
df.query(f"{col} > 25")
Here are the main rules and tips for using .query() in pandas:
df.query("name == 'Harry'")
df.query('city == "Mumbai"')
Use @ to reference Python variables inside the expression:
age_limit = 30
df.query("age > @age_limit")
5. Logical operators
Use these: - and , or , not — instead of & , | , ~ - == , != , < , > , <= , >=
Bad:
df.query("age > 30 & city == 'Delhi'") # ❌
Good:
df.query("age > 30 and city == 'Delhi'")  # ✅
6. Chained comparisons
Just like Python:
df.query("25 < age < 40")
7. Backticks for special names
Column names that contain spaces or clash with Python keywords need backticks:
df.query("`class` == 'Physics'")
8. Case-sensitive
Column names and string values are case-sensitive:
The result is a new DataFrame. Changes won’t affect the original unless reassigned:
filtered = df.query("age < 50")
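Putting the rules together on a small made-up frame (column names and values are for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Harry", "Ron", "Hermione"],
    "age": [40, 35, 55],
    "city": ["Delhi", "Mumbai", "Delhi"],
})

# @ pulls in a Python variable; 'and' combines conditions
age_limit = 50
filtered = df.query("age < @age_limit and city == 'Delhi'")
```

Only Harry matches both conditions; the original df is untouched.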
Summary
df.duplicated() returns a boolean Series where: True means that row is a duplicate
of a previous row. False means it’s the first occurrence (not a duplicate yet).
df.duplicated(subset=["Name", "Age"])
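A quick sketch with a deliberately duplicated row:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 25],
})

mask = df.duplicated()          # third row repeats the first -> True
deduped = df.drop_duplicates()  # keeps only the first occurrence
```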
df["Age"] = df["Age"].astype(int)
df["Date"] = pd.to_datetime(df["Date"])
df["Category"] = df["Category"].astype("category")
• Handle mixed types (e.g., some date strings, some NaT, or missing values).
df.dtypes
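For mixed or dirty values, errors="coerce" converts what it can and turns the rest into NaT (a sketch with made-up values):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2021-01-01", "not a date"], "Age": ["25", "30"]})

# Unparseable dates become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
df["Age"] = df["Age"].astype(int)
```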
Applying Functions
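A minimal sketch of the two workhorses here, .map() and .apply() (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["alice", "bob"], "Salary": [50000, 60000]})

# .map() applies a function element-wise to a Series
df["Name"] = df["Name"].map(str.title)

# .apply() works element-wise on a Series, or row/column-wise on a DataFrame
df["Bonus"] = df["Salary"].apply(lambda s: s * 0.10)
```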
Summary
Data Transformation
Once your data is clean, the next step is to reshape, reformat, and reorder it as
needed for analysis. Pandas gives you plenty of flexible tools to do this.
Sorting & Ranking
Sort by Values
df.sort_values(["Age", "Salary"]) sorts the DataFrame first by the "Age" column, and
if there are ties (i.e., two or more rows with the same "Age"), it sorts by the
"Salary" column.
Reset Index
If you want the index to start from 0 and be sequential, you can reset it using
reset_index()
df.reset_index(drop=True, inplace=True) # Reset the index and drop the old index
Sort by Index
df.sort_index()
The df.sort_index() function is used to sort the DataFrame based on its index
values. If the index is not in a sequential order (e.g., you have dropped rows or
performed other operations that change the index), you can use sort_index() to
restore it to a sorted order.
Ranking
The .rank() function in pandas is used to assign ranks to numeric values in a
column, like scores or points. By default, it gives the average rank to tied values,
which can result in decimal numbers. For example, if two people share the top
score, they both get a rank of 1.5. You can customize the ranking behavior using
the method parameter. One useful option is method='dense', which assigns the
same rank to ties but doesn't leave gaps in the ranking sequence. This is helpful
when you want a clean, consecutive ranking system without skips.
df["Rank"] = df["Score"].rank() # Default: average method
df["Rank"] = df["Score"].rank(method="dense") # 1, 2, 2, 3
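The tie behavior described above can be checked directly (ascending=False so the top score gets rank 1):

```python
import pandas as pd

scores = pd.Series([100, 100, 90, 80])

avg_rank = scores.rank(ascending=False)                    # ties share the average rank
dense_rank = scores.rank(method="dense", ascending=False)  # ties share a rank, no gaps
```

The two 100s split ranks 1 and 2 into 1.5 each; with dense, they both get 1 and the next value gets 2.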
Summary
The melt() method in Pandas is used to unpivot a DataFrame from wide format
to long format. In other words, it takes columns that represent different variables
and combines them into key-value pairs (i.e., long-form data).
• When you have a DataFrame where each row is an observation, and each
column represents a different variable or measurement, and you want to
reshape the data into a longer format for easier analysis or visualization.
Syntax:
df.melt(id_vars=None, value_vars=None, var_name=None, value_name='value')
Parameters:
• id_vars : The columns that you want to keep fixed (these columns will remain
as identifiers).
• value_vars : The columns you want to unpivot (the ones you want to “melt”
into a single column).
• var_name : The name to use for the new column that will contain the names of
the melted columns (default is 'variable' ).
• value_name : The name to use for the new column that will contain the values
from the melted columns (default is 'value' ).
• col_level : Used for multi-level column DataFrames.
Example:
Use this code to generate the DataFrame:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Math': [85, 78, 92],
'Science': [90, 82, 89],
'English': [88, 85, 94]
}
df = pd.DataFrame(data)
Name Math Science English
Alice 85 90 88
Bob 78 82 85
Charlie 92 89 94
Using melt():
df_melted = df.melt(id_vars=["Name"], value_vars=["Math", "Science", "English"],
                    var_name="Subject", value_name="Score")
Name Subject Score
Alice Math 85
Alice Science 90
Alice English 88
Bob Math 78
Bob Science 82
Bob English 85
Charlie Math 92
Charlie Science 89
Charlie English 94
Explanation:
• id_vars=["Name"] : We keep the “Name” column as it is because it’s the
identifier.
• value_vars=["Math", "Science", "English"] : These are the columns we
want to melt.
• var_name="Subject" : The new column containing the names of the subjects.
• value_name="Score" : The new column containing the scores.
This is useful for converting columns into rows — perfect for plotting or tidy data
formats.
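The example above, end to end:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Math": [85, 78, 92],
    "Science": [90, 82, 89],
    "English": [88, 85, 94],
})

# Unpivot the three subject columns into Subject/Score pairs
long_df = df.melt(
    id_vars=["Name"],
    value_vars=["Math", "Science", "English"],
    var_name="Subject",
    value_name="Score",
)
```

Three columns become one Subject column and one Score column, so 3 rows turn into 9.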
pivot() — Long to Wide
The pivot() function in Pandas is used to reshape data, specifically to turn long-
format data into wide-format data. This is the reverse operation of melt() .
How it works:
• pivot() takes a long-format DataFrame and turns it into a wide-format
DataFrame by specifying which columns will become the new columns, the
rows, and the values.
Syntax:
df.pivot(index=None, columns=None, values=None)
Parameters:
• index : The column whose unique values will become the rows of the new
DataFrame.
• columns : The column whose unique values will become the columns of the
new DataFrame.
• values : The column whose values will fill the new DataFrame. These will
become the actual data (values in the table).
Example:
Suppose we have the following long-format DataFrame:
Name Subject Score
Alice Math 85
Alice Science 90
Alice English 88
Bob Math 78
Bob Science 82
Bob English 85
Charlie Math 92
Charlie Science 89
Charlie English 94
df_wide = df.pivot(index="Name", columns="Subject", values="Score")
Resulting DataFrame:
Name English Math Science
Alice 88 85 90
Bob 85 78 82
Charlie 94 92 89
Explanation:
• index="Name" : The unique values in the “Name” column will become the rows
in the new DataFrame.
• columns="Subject" : The unique values in the “Subject” column will become
the columns in the new DataFrame.
• values="Score" : The values from the “Score” column will populate the table.
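Reversing the earlier melt, as a compact runnable sketch:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob", "Bob"],
    "Subject": ["Math", "Science", "Math", "Science"],
    "Score": [85, 90, 78, 82],
})

# Each unique Subject becomes a column; each unique Name becomes a row
wide = long_df.pivot(index="Name", columns="Subject", values="Score")
```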
1. Better data structure: It makes data easier to analyze when you have
categories that you want to split into multiple columns.
2. Easier visualization: Often, you want to represent data in a format where
categories are split across columns (for example, when creating pivot tables
for reporting).
3. Aggregating data: You can perform aggregations (like sum , mean , etc.) to
group values before pivoting.
Important Notes:
1. Duplicate Entries: If you have multiple rows with the same combination of
index and columns , pivot() will raise an error. In such cases, you should use
pivot_table() (which can handle duplicate entries by aggregating them).
Name Subject Score
Alice Math 85
Alice Math 80
Alice Science 90
Bob Math 78
Bob Math 82
We can use pivot_table() to aggregate values (e.g., taking the mean for
duplicate entries):
df.pivot_table(index="Name", columns="Subject", values="Score", aggfunc="mean")
Resulting DataFrame:
Name Math Science
Alice 82.5 90
Bob 80 NaN
In this case, the Math score for Alice is averaged (85 + 80) / 2 = 82.5. If a cell is
empty, it means there was no value for that combination.
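Checking that arithmetic directly:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Alice", "Bob", "Bob"],
    "Subject": ["Math", "Math", "Science", "Math", "Math"],
    "Score": [85, 80, 90, 78, 82],
})

# pivot_table aggregates duplicate index/columns pairs instead of raising like pivot()
wide = df.pivot_table(index="Name", columns="Subject", values="Score", aggfunc="mean")
```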
Summary:
• Use melt() to go long, pivot() to go wide
• pivot() is used to turn long-format data into wide-format by spreading
unique column values into separate columns.
• If there are duplicate values for a given combination of index and columns ,
you should use pivot_table() with an aggregation function to handle the
duplicates.
.groupby() Function
df = pd.DataFrame({
"Department": ["HR", "HR", "IT", "IT", "Marketing", "Marketing", "Sales", "Sales"],
"Team": ["A", "A", "B", "B", "C", "C", "D", "D"],
"Gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
"Salary": [85, 90, 78, 85, 92, 88, 75, 80],
"Age": [23, 25, 30, 22, 28, 26, 21, 27],
"JoinDate": pd.to_datetime([
"2020-01-10", "2020-02-15", "2021-03-20", "2021-04-10",
"2020-05-30", "2020-06-25", "2021-07-15", "2021-08-01"
])
})
df.groupby("Department")["Salary"].mean()
This says:
> “Group by Department, then calculate average Salary for each group.”
df.groupby(["Team", "Gender"])["Salary"].mean()
In pandas, .agg and .aggregate are exactly the same — they’re aliases for the same
method
df.groupby("Team")["Salary"].agg(
avg_score="mean",
high_score="max"
)
df.groupby("Team").agg({
"Salary": "mean",
"Age": "max"
})
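A runnable sketch of named aggregation on a trimmed-down version of the sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["A", "A", "B", "B"],
    "Salary": [85, 90, 78, 85],
})

# Named aggregation: each keyword becomes an output column
stats = df.groupby("Team")["Salary"].agg(avg_score="mean", high_score="max")
```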
.transform() Example:
df["TeamAvg"] = df.groupby("Team")["Salary"].transform("mean")
Now each row gets its team average — great for comparisons!
.filter() Example:
df.groupby("Team").filter(lambda g: g["Salary"].mean() > 80)
This keeps only the rows belonging to teams whose average salary is above 80.
Often, data is split across multiple tables or files. Pandas lets you combine them
just like SQL — or even more flexibly!
Sample DataFrames
employees = pd.DataFrame({
"EmpID": [1, 2, 3],
"Name": ["Alice", "Bob", "Charlie"],
"DeptID": [10, 20, 30]
})
departments = pd.DataFrame({
"DeptID": [10, 20, 40],
"DeptName": ["HR", "Engineering", "Marketing"]
})
Inner Join
pd.merge(employees, departments, on="DeptID")
EmpID Name DeptID DeptName
1 Alice 10 HR
2 Bob 20 Engineering
Left Join
pd.merge(employees, departments, on="DeptID", how="left")
Right Join
pd.merge(employees, departments, on="DeptID", how="right")
Outer Join
pd.merge(employees, departments, on="DeptID", how="outer")
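Running the joins on the sample frames shows how the row counts differ:

```python
import pandas as pd

employees = pd.DataFrame({
    "EmpID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
    "DeptID": [10, 20, 30],
})
departments = pd.DataFrame({
    "DeptID": [10, 20, 40],
    "DeptName": ["HR", "Engineering", "Marketing"],
})

inner = pd.merge(employees, departments, on="DeptID")               # matching keys only
left = pd.merge(employees, departments, on="DeptID", how="left")    # all employees
outer = pd.merge(employees, departments, on="DeptID", how="outer")  # keys from both sides
```

DeptIDs 10 and 20 match; 30 exists only on the left, 40 only on the right.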
Concatenating DataFrames
pd.concat([df1, df2])            # Vertical (stack rows)
pd.concat([df1, df2], axis=1)    # Horizontal (columns)

Method                                    Use
pd.merge() or .join()                     SQL-style joins (merge keys)
pd.concat([df1, df2])                     Stack datasets vertically
pd.concat([df1, df2], axis=1)             Combine different features side-by-side
.join() or merge with right_index=True    Align on index
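A small runnable sketch of both concat directions (the frames are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

# Stack rows; ignore_index rebuilds a clean 0..n-1 index
stacked = pd.concat([df1, df2], ignore_index=True)

# Place frames side by side (rename to avoid a duplicate column name)
side = pd.concat([df1, df2.rename(columns={"A": "B"})], axis=1)
```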
Summary
CSV Files
Read CSV
df = pd.read_csv("data.csv")
Options:
pd.read_csv("data.csv", sep=",", index_col=0, nrows=100)
Write CSV
df.to_csv("output.csv", index=False)
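A quick round-trip sketch, writing to a temporary directory so it runs anywhere:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write without the index column, then read the file back
path = os.path.join(tempfile.mkdtemp(), "output.csv")
df.to_csv(path, index=False)
df2 = pd.read_csv(path)
```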
Excel Files
Read Excel
df = pd.read_excel("data.xlsx")
Options:
pd.read_excel("data.xlsx", sheet_name="Sales")
Write Excel
df.to_excel("output.xlsx", index=False)
Multiple sheets:
with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")
JSON Files
Read JSON
df = pd.read_json("data.json")
Write JSON
df.to_json("output.json")
Summary