Pandas for Data Science - Part 1
Type: Data science masterclass
I. Introduction to Pandas
1.1 What is Pandas?
Overview and History
Pandas is a powerful and flexible open-source data manipulation and analysis
library for Python.
Origins:
Created by Wes McKinney in 2008 while he was working at AQR Capital.
Initially developed to address the need for high-performance, easy-to-use
data structures for financial data analysis.
Evolution:
Over time, Pandas has grown into one of the cornerstone libraries in the
data science ecosystem.
It has become widely adopted in academia, industry, and research due to
its efficiency and ease of use.
Importance in Data Science
Data Structures:
Series: A one-dimensional labeled array capable of holding any data type.
DataFrame: A two-dimensional, size-mutable, heterogeneous tabular data
structure with labeled axes (rows and columns).
Core Functionality:
Simplifies tasks such as data cleaning, transformation, aggregation, and
visualization.
Provides intuitive and powerful tools for data indexing, slicing, and
reshaping.
Productivity and Performance:
Optimized for performance with many operations vectorized (leveraging
NumPy under the hood).
Facilitates rapid prototyping and analysis with concise and readable
syntax.
Real-World Application:
Essential for exploratory data analysis (EDA), feature engineering, and
preparation of data for machine learning models.
Widely used in sectors like finance, healthcare, marketing, and research.
1.2 Installation and Setup
Installing via pip/conda
Using pip:
Open your terminal or command prompt and run:
pip install pandas
This command downloads and installs Pandas along with its dependencies
(like NumPy).
Using conda:
If you are using the Anaconda distribution, run:
conda install pandas
Conda automatically manages the environment and dependencies.
Verification:
After installation, open a Python interpreter or script and type:
import pandas as pd
print(pd.__version__)
This should display the installed version of Pandas, confirming a
successful installation.
Setting up Your Development Environment (Jupyter Notebooks,
IDEs)
Jupyter Notebooks:
Why Jupyter?
Interactive, browser-based environment ideal for data analysis and
visualization.
Allows for inline code execution, rich text annotations, and visual
output display.
Installation:
Install via pip:
pip install notebook
Launch by typing jupyter notebook in your terminal.
Usage:
Create notebooks that mix code, visualizations, and markdown text to
document your analysis.
Integrated Development Environments (IDEs):
Popular IDEs:
Visual Studio Code (with Python extension)
PyCharm
Spyder
Setup Tips:
Ensure the IDE is configured to use the correct Python interpreter
where Pandas is installed.
Many IDEs offer features like interactive consoles, debugging, and
integrated terminal support.
Some IDEs (e.g., VS Code) now support notebook-like interfaces or
interactive Python sessions.
Best Practices:
Virtual Environments:
Create a virtual environment for your projects using venv , virtualenv , or
conda environments.
This practice isolates your project dependencies and avoids version
conflicts.
Dependency Management:
Maintain a requirements.txt (for pip) or environment.yml (for conda) file to
ensure reproducibility.
Example for pip:
pip freeze > requirements.txt
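Putting these practices together, a minimal project setup might look like this (a sketch assuming a Unix-like shell; the activation command differs on Windows):
python -m venv .venv # create an isolated environment in the .venv folder
source .venv/bin/activate # activate it (on Windows: .venv\Scripts\activate)
pip install pandas # install Pandas into the environment
pip freeze > requirements.txt # record exact versions for reproducibility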
1.3 Pandas in the Data Science Ecosystem
Integration with NumPy, Matplotlib, Seaborn, etc.
NumPy:
Pandas is built on top of NumPy and leverages its powerful array
computations.
Many Pandas operations are vectorized, enabling high-performance
calculations on large datasets.
NumPy arrays are used internally to store data in Series and DataFrames,
providing speed and efficiency.
Matplotlib:
Visualization:
Pandas integrates seamlessly with Matplotlib, enabling quick plotting
directly from DataFrames.
Example: df.plot() produces a variety of plot types (line, bar, histograms)
with minimal code.
Customization:
For more advanced visualizations, you can combine Pandas’ built-in
plotting with Matplotlib’s customization options.
Seaborn:
Enhanced Statistical Plots:
Seaborn is built on top of Matplotlib and works well with Pandas
DataFrames.
It provides a high-level interface for creating informative and attractive
statistical graphics.
Use Cases:
Ideal for visualizing distributions, relationships, and trends in data with
less boilerplate code.
Other Libraries:
Scikit-learn:
The standard machine learning library; Pandas is often the go-to tool for
preprocessing and cleaning data before feeding it into ML models.
Statsmodels:
Useful for statistical modeling and hypothesis testing, often using data
processed with Pandas.
Dask:
For handling larger-than-memory datasets, Dask provides a Pandas-
like interface that scales computations.
Plotly:
For interactive visualizations, Plotly works well with Pandas data
structures, offering dynamic plots for web applications.
Use Cases in Data Analysis and Machine Learning
Data Cleaning and Preparation:
Missing Data:
Pandas provides functions to detect, handle, and impute missing data.
Techniques include using methods like dropna() , fillna() , and
interpolation.
Data Transformation:
Functions for converting data types, normalizing, and transforming
data are readily available.
String manipulation and categorical data management are simplified
with Pandas.
Exploratory Data Analysis (EDA):
Descriptive Statistics:
Quickly generate summary statistics with methods like describe() , info() ,
and value_counts() .
Data Visualization:
Use plotting functions to explore data distributions, detect outliers, and
identify trends.
Integrated plotting with Matplotlib and Seaborn enhances the
exploratory process.
Feature Engineering:
Creation of New Features:
Use Pandas to derive new columns based on existing data (e.g.,
extracting date parts, computing ratios).
Transformation Techniques:
Apply transformations such as scaling, encoding categorical variables,
and creating interaction features.
Time Series Handling:
Pandas offers robust tools for dealing with date/time data, making it a
natural choice for time series analysis.
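As a quick illustration of these date/time tools (a minimal sketch with made-up daily sales data), dates can be parsed with pd.to_datetime , date parts extracted with the .dt accessor, and values aggregated with resample :
import pandas as pd
# Hypothetical daily sales data
ts = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02', '2023-01-08', '2023-01-09'],
    'sales': [100, 120, 90, 110]
})
ts['date'] = pd.to_datetime(ts['date'])  # parse strings into datetime64
ts['weekday'] = ts['date'].dt.day_name()  # derive a date part as a new feature
weekly = ts.set_index('date').resample('W')['sales'].sum()  # weekly totals
print(weekly)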
Machine Learning Workflows:
Data Preprocessing:
Clean and prepare datasets for training machine learning models.
Merge, join, and reshape data to fit model input requirements.
Pipeline Integration:
Combine with libraries like Scikit-learn to create seamless workflows
from data cleaning to model evaluation.
Visualization of Model Results:
Use Pandas alongside visualization libraries to interpret and
communicate model performance and insights.
II. Core Data Structures
Understanding Pandas' core data structures is fundamental to effectively
manipulating and analyzing data. The primary structures are Series, DataFrame,
and Index Objects.
2.1 Series
A Series is a one-dimensional labeled array capable of holding any data type
(integers, floats, strings, Python objects, etc.). It is similar to a one-dimensional
NumPy array but comes with additional functionalities like custom indexing.
Creation and Basic Operations
Creating a Series:
From a List:
import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)
From a Dictionary:
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(s)
Accessing Data:
Values:
print(s.values) # Returns the underlying numpy array
Index:
print(s.index) # Returns the index labels
Basic Operations:
Vectorized Operations: Series operations are applied element-wise.
s2 = s * 2 # Each element is multiplied by 2
print(s2)
Common Methods:
s.head() – returns the first few elements.
s.tail() – returns the last few elements.
s.describe() – provides summary statistics.
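For example, applied to the Series s created above:
print(s.head(2)) # first two elements
print(s.tail(2)) # last two elements
print(s.describe()) # count, mean, std, min, quartiles, max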
Indexing, Slicing, and Arithmetic Operations
Indexing:
Label-based Indexing: Access elements using their index labels.
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s['a']) # Accesses the element with label 'a'
Position-based Indexing: Use .iloc for positional indexing.
print(s.iloc[0]) # Accesses the first element
Slicing:
Label Slicing: The slice is inclusive of the end label.
print(s['a':'b']) # Returns elements from 'a' to 'b' (inclusive)
Position Slicing: Standard Python slicing rules (end-exclusive).
print(s.iloc[0:2]) # Returns the first two elements
Arithmetic Operations:
Element-wise Arithmetic: Operations align by index labels.
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
result = s1 + s2 # [5, 7, 9]
print(result)
Alignment: If indices do not match, the result contains NaN for non-
matching labels.
s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s4 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s3 + s4) # 'a' and 'd' will have NaN
2.2 DataFrame
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data
structure with labeled axes (rows and columns). It is one of the most commonly
used structures in Pandas for data manipulation and analysis.
Creating DataFrames from Various Sources
From a List of Lists:
data = [
[1, 2, 3],
[4, 5, 6]
]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
From a Dictionary:
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
print(df)
From a CSV File:
df = pd.read_csv('path/to/your/file.csv')
print(df.head())
Other Sources:
Excel files via pd.read_excel()
SQL databases via pd.read_sql()
Understanding Rows, Columns, and Indexes
Rows and Columns:
Rows: Represent individual records or observations.
Columns: Represent variables or features.
Index:
Default Index: By default, DataFrame rows are labeled with a RangeIndex
starting at 0.
Custom Index: You can set a custom index from one of the columns.
df.set_index('Name', inplace=True)
print(df)
Accessing Columns:
Using dictionary-like notation:
print(df['Age'])
Using attribute notation (if column names are valid identifiers):
print(df.Age)
Additional Metadata:
Shape of DataFrame: df.shape returns a tuple (rows, columns).
Column Labels: df.columns returns an index object containing the column
names.
Row Labels: df.index returns the index object for rows.
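For example, for the df defined above (after setting 'Name' as the index):
print(df.shape) # (3, 2) -> 3 rows, 2 columns
print(df.columns) # Index(['Age', 'City'], dtype='object')
print(df.index) # the row labels: 'Alice', 'Bob', 'Charlie'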
2.3 Index Objects
Indexes in Pandas play a crucial role in aligning data, selecting subsets, and
enhancing performance through optimized lookups.
Role of Indexes in Data Selection and Alignment
Indexing and Alignment:
Index objects hold the labels for the rows (and columns) and are
immutable.
They enable efficient lookups, reindexing, and alignment during arithmetic
operations.
When combining or performing operations on Series/DataFrames, Pandas
aligns data based on index labels.
Data Selection:
Indexes simplify data selection, allowing you to use labels to retrieve data.
# Assuming a DataFrame with a custom index:
df.loc['Alice']
They facilitate operations such as merging, joining, and reshaping by
ensuring proper alignment of data.
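A minimal sketch of this alignment when joining two small, made-up DataFrames on their indexes:
ages = pd.DataFrame({'age': [25, 30]}, index=['Alice', 'Bob'])
cities = pd.DataFrame({'city': ['New York', 'Paris']}, index=['Bob', 'Charlie'])
# join() aligns rows on index labels; non-matching labels get NaN
combined = ages.join(cities, how='outer')
print(combined)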
Setting and Resetting Indexes
Setting an Index:
Use the set_index method to designate one (or more) columns as the index.
df = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})
df.set_index('id', inplace=True)
print(df)
Benefits:
Improves performance for lookup operations.
Gives each row a meaningful label (ideally unique) for selection with loc .
Resetting an Index:
To revert the index back to a default integer index, use reset_index .
df.reset_index(inplace=True)
print(df)
When to Reset:
When you need the index values as a regular column again.
When preparing data for operations that require a default index.
Hierarchical Indexes (MultiIndex):
Pandas supports MultiIndex for handling higher-dimensional data in a 2D
DataFrame.
df = pd.DataFrame({
'state': ['CA', 'CA', 'NY', 'NY'],
'city': ['San Francisco', 'Los Angeles', 'New York', 'Buffalo'],
'population': [883305, 3990456, 8398748, 261310]
})
df = df.set_index(['state', 'city'])
print(df)
Usage:
Useful when working with data that has a natural hierarchical
structure.
Provides more advanced slicing and grouping capabilities.
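For example, with the MultiIndex DataFrame above you can select an entire state, a single (state, city) pair, or group by one index level:
print(df.loc['CA']) # all rows for California
print(df.loc[('NY', 'Buffalo')]) # a single (state, city) row
print(df.groupby(level='state').sum()) # total population per state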
III. Data Input/Output
Pandas provides a rich set of functions for reading from and writing to various
data formats. This section covers the essentials for loading data from common file
types and exporting your DataFrames, along with best practices for handling
different file encodings, delimiters, and file management.
3.1 Reading Data
Pandas makes it straightforward to load data from various sources. Below are the
common methods and best practices for reading data.
Loading CSV, Excel, JSON, and SQL Data
CSV Files:
Basic Usage:
import pandas as pd
# Read a CSV file into a DataFrame
df_csv = pd.read_csv('path/to/your/file.csv')
print(df_csv.head())
Parameters:
sep : Specify a delimiter other than a comma.
df_csv = pd.read_csv('path/to/your/file.csv', sep=';')
header : Define the row number to use as column names (default is 0).
df_csv = pd.read_csv('path/to/your/file.csv', header=0)
Excel Files:
Basic Usage:
# Read an Excel file into a DataFrame
df_excel = pd.read_excel('path/to/your/file.xlsx')
print(df_excel.head())
Specifying a Sheet:
Use the sheet_name parameter to read a particular sheet.
df_excel = pd.read_excel('path/to/your/file.xlsx', sheet_name='Sheet1')
JSON Files:
Basic Usage:
# Read a JSON file into a DataFrame
df_json = pd.read_json('path/to/your/file.json')
print(df_json.head())
Handling Complex JSON:
For nested JSON, consider using the json_normalize function from
Pandas.
from pandas import json_normalize
import json
# Load JSON data
with open('path/to/your/file.json') as f:
    data = json.load(f)
df_json_nested = json_normalize(data)
print(df_json_nested.head())
SQL Databases:
Basic Usage:
import sqlite3
# Establish a connection to the SQLite database
conn = sqlite3.connect('path/to/your/database.db')
# Read data using a SQL query
query = "SELECT * FROM your_table"
df_sql = pd.read_sql(query, conn)
print(df_sql.head())
# Always close the connection when done
conn.close()
Other Databases:
For databases like PostgreSQL or MySQL, you may use libraries such
as SQLAlchemy for connection management.
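A short sketch using SQLAlchemy (the connection string, credentials, and table name below are placeholders; the appropriate database driver must also be installed):
from sqlalchemy import create_engine
# Hypothetical PostgreSQL connection string: adjust user, password, host, and database
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
df_pg = pd.read_sql('SELECT * FROM your_table', engine)
print(df_pg.head())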
Handling Different File Encodings and Delimiters
File Encodings:
Some files might use encodings other than the default UTF-8.
df_csv = pd.read_csv('path/to/your/file.csv', encoding='ISO-8859-1')
Always verify the file encoding if you encounter errors related to character
decoding.
Custom Delimiters:
When dealing with files that use non-standard delimiters (e.g., semicolons,
tabs):
# For a semicolon-delimited file
df_csv = pd.read_csv('path/to/your/file.csv', sep=';')
# For a tab-delimited file
df_csv = pd.read_csv('path/to/your/file.csv', sep='\t')
Parsing Dates:
If your data contains dates, you can instruct Pandas to parse them:
df_csv = pd.read_csv('path/to/your/file.csv', parse_dates=['date_column'])
3.2 Writing Data
After processing and analyzing data with Pandas, you’ll often need to export your
DataFrame to share your results or for further use. Here are the common methods
for writing data.
Exporting DataFrames to CSV, Excel, and Other Formats
Exporting to CSV:
Basic Export:
# Write DataFrame to a CSV file
df_csv.to_csv('path/to/save/file.csv', index=False)
Additional Options:
index=False prevents Pandas from writing row indices to the file.
sep can be used to specify a different delimiter.
df_csv.to_csv('path/to/save/file.csv', sep=';', index=False)
Exporting to Excel:
Basic Export:
# Write DataFrame to an Excel file
df_excel.to_excel('path/to/save/file.xlsx', index=False)
Multiple Sheets:
Use ExcelWriter to export multiple DataFrames to different sheets within
the same workbook.
with pd.ExcelWriter('path/to/save/file.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
Exporting to JSON:
Basic Export:
# Write DataFrame to a JSON file
df_json.to_json('path/to/save/file.json', orient='records', lines=True)
Parameters:
orient : Controls the format of the JSON string. Common values include
'records' , 'split' , 'index' , etc.
lines=True : Writes JSON objects separated by newline characters, which
is useful for large datasets.
Other Formats:
HTML:
df_html.to_html('path/to/save/file.html')
Pickle:
df_pickle.to_pickle('path/to/save/file.pkl')
Parquet:
df_parquet.to_parquet('path/to/save/file.parquet')
Best Practices for Data Export and File Handling
File Naming and Organization:
Use descriptive file names and organize exports in a structured folder
hierarchy.
Consider adding timestamps or version numbers to file names for version
control.
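For example, one common pattern (a sketch using Python's standard datetime module; the folder and file name are placeholders) embeds a timestamp in the file name:
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
# Assumes the 'exports' folder already exists
df_csv.to_csv(f'exports/sales_{timestamp}.csv', index=False)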
Handling Large Datasets:
For very large datasets, consider exporting to binary formats like Parquet
or using compression.
df_csv.to_csv('path/to/save/file.csv.gz', compression='gzip', index=False)
When working with large Excel files, be aware of Excel’s row and column
limitations.
Ensuring Data Integrity:
Always verify the output file by reloading it to ensure data integrity.
df_check = pd.read_csv('path/to/save/file.csv')
print(df_check.head())
Avoiding Data Loss:
Be cautious when using parameters like index=False if the row index holds
important information.
If your DataFrame contains sensitive data, ensure that files are stored in
secure directories with proper access controls.
Documentation:
Comment your export code to indicate the purpose of each file, especially
when working in collaborative environments.
IV. Data Exploration and Summary
Effective data exploration is a critical first step in any data analysis workflow.
Pandas offers several built-in methods to inspect your DataFrames and quickly
understand the underlying structure and summary statistics. Additionally, Pandas
integrates well with visualization libraries like Matplotlib and Seaborn, enabling
you to visualize your data effortlessly.
4.1 Inspecting DataFrames
Inspecting your DataFrame helps you understand its structure, data types, and the
presence of missing values. Here are some key methods:
Using .head() , .tail() , and .info()
.head()
Displays the first few rows of the DataFrame.
Default: Returns the first 5 rows.
Example:
import pandas as pd
# Sample DataFrame creation
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'Age': [25, 30, 35, 40, 45, 50],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia']
})
# Display the first 5 rows
print(df.head())
.tail()
Displays the last few rows of the DataFrame.
Default: Returns the last 5 rows.
Example:
# Display the last 5 rows
print(df.tail())
.info()
Provides a concise summary of the DataFrame.
Details include:
Number of rows and columns.
Data types of each column.
Non-null count for each column.
Memory usage.
Example:
# Display summary information about the DataFrame
df.info()
Descriptive Statistics with .describe()
Purpose:
Offers a quick statistical summary for numeric columns.
Statistics include:
Count, mean, standard deviation, min, max, and quartiles.
Example:
# Display descriptive statistics for numeric columns
print(df.describe())
Additional Options:
To include all columns (numeric and non-numeric):
print(df.describe(include='all'))
For percentiles other than the default quartiles, you can specify:
print(df.describe(percentiles=[0.1, 0.5, 0.9]))
4.2 Data Visualization Integration
Visualization is a powerful tool for understanding patterns, distributions, and
relationships in your data. Pandas provides basic plotting functions, and its
integration with libraries like Matplotlib and Seaborn enhances visualization
capabilities.
Basic Plotting with Pandas
Built-in Plotting:
DataFrames and Series come with a built-in .plot() method that acts as a
wrapper around Matplotlib.
Example – Line Plot:
import matplotlib.pyplot as plt
# Create a simple DataFrame
df_plot = pd.DataFrame({
'Year': [2016, 2017, 2018, 2019, 2020],
'Sales': [250, 300, 350, 400, 450]
})
# Set the 'Year' column as the index for better plotting
df_plot.set_index('Year', inplace=True)
# Create a line plot
df_plot.plot(kind='line', marker='o')
plt.title('Yearly Sales Trend')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid(True)
plt.show()
Other Plot Types:
Bar Plot:
df_plot.plot(kind='bar', legend=False)
plt.title('Sales by Year')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Histogram:
df_plot['Sales'].plot(kind='hist', bins=5, alpha=0.7)
plt.title('Distribution of Sales')
plt.xlabel('Sales')
plt.show()
Scatter Plot:
# For scatter plot, ensure you have two numerical columns
df_scatter = pd.DataFrame({
'Advertising': [10, 20, 30, 40, 50],
'Sales': [15, 25, 35, 45, 55]
})
df_scatter.plot(kind='scatter', x='Advertising', y='Sales')
plt.title('Sales vs. Advertising')
plt.xlabel('Advertising Budget')
plt.ylabel('Sales')
plt.show()
Integration with Matplotlib/Seaborn for Enhanced Visualizations
Matplotlib Customizations:
Since Pandas plotting uses Matplotlib under the hood, you can use
Matplotlib functions to further customize your plots.
Example – Customizing Plot Appearance:
ax = df_plot.plot(kind='line', marker='o', figsize=(8, 5))
ax.set_title('Yearly Sales Trend', fontsize=16)
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Sales', fontsize=12)
ax.grid(True)
plt.show()
Seaborn Integration:
Seaborn is a statistical data visualization library built on top of Matplotlib.
It works seamlessly with Pandas DataFrames and provides attractive and
informative plots.
Example – Seaborn Bar Plot:
import seaborn as sns
# Using the same df_plot DataFrame
sns.set(style="whitegrid")
sns.barplot(x='Year', y='Sales', data=df_plot.reset_index())
plt.title('Sales by Year')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Additional Seaborn Examples:
Box Plot:
# Create a sample DataFrame with multiple groups
df_box = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Value': [10, 15, 10, 20, 15, 25]
})
sns.boxplot(x='Category', y='Value', data=df_box)
plt.title('Box Plot of Values by Category')
plt.show()
Heatmap:
# Create a correlation matrix for demonstration
df_corr = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [5, 4, 3, 2, 1],
'C': [2, 3, 4, 3, 2]
})
corr = df_corr.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
V. Data Cleaning and Preprocessing
Data cleaning and preprocessing are critical steps in any data analysis or machine
learning workflow. Clean, well-structured data leads to more reliable insights and
robust models. This section covers four key areas:
5.1 Handling Missing Data
Missing data can occur for various reasons—errors during data collection, manual
entry mistakes, or system issues. Pandas provides powerful methods to detect,
count, and handle missing values.
Identifying and Counting Missing Values
Detecting Missing Values:
isnull() / isna() :
Returns a DataFrame or Series of Boolean values ( True if the value is
missing).
Example:
import pandas as pd
# Sample DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 35, 40],
'Salary': [50000, 60000, None, 80000]
}
df = pd.DataFrame(data)
# Detect missing values
missing_mask = df.isnull() # or df.isna()
print("Missing Values Mask:")
print(missing_mask)
Counting Missing Values:
Per Column:
Use df.isnull().sum() to count missing values in each column.
missing_counts = df.isnull().sum()
print("Missing Values Per Column:")
print(missing_counts)
Per Row:
Count missing values across rows using axis=1 .
missing_per_row = df.isnull().sum(axis=1)
print("Missing Values Per Row:")
print(missing_per_row)
Techniques for Imputation or Removal
Removing Missing Data:
dropna() Method:
Drop Rows: Remove rows that contain any missing values.
df_drop_rows = df.dropna()
print("DataFrame After Dropping Rows with Any Missing Values:")
print(df_drop_rows)
Drop Columns: Remove columns that contain any missing values.
df_drop_cols = df.dropna(axis=1)
print("DataFrame After Dropping Columns with Any Missing Value
s:")
print(df_drop_cols)
Threshold Option:
Keep rows (or columns) that have at least a minimum number of
non-missing values.
df_thresh = df.dropna(thresh=2)  # Keep rows with at least 2 non-null values
print("DataFrame After Dropping Rows Not Meeting the Threshold:")
print(df_thresh)
Filling Missing Data:
fillna() Method:
Fill with a Constant Value:
df_filled_const = df.fillna(0)
print("DataFrame After Filling Missing Values with 0:")
print(df_filled_const)
Forward Fill (Propagate Last Valid Observation):
df_ffill = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas versions
print("DataFrame After Forward Fill:")
print(df_ffill)
Backward Fill (Use Next Valid Observation):
df_bfill = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas versions
print("DataFrame After Backward Fill:")
print(df_bfill)
Fill with a Computed Value (e.g., Mean or Median):
# Filling missing values in the 'Age' column with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
print("DataFrame After Filling Missing 'Age' with Mean:")
print(df)
Interpolation:
interpolate() Method:
Estimate missing values using various interpolation methods (e.g.,
linear, time-based).
# For a time series or ordered numerical data (apply to numeric columns)
df_interpolated = df[['Age', 'Salary']].interpolate(method='linear')
print("DataFrame After Linear Interpolation:")
print(df_interpolated)
Best Practices and Considerations
Understand the Data Context:
Assess the significance of missing values and decide whether removal or
imputation is appropriate.
Document Imputation Strategies:
Keep records of the methods used (e.g., why a particular value was
chosen) to ensure reproducibility.
Combine Strategies if Needed:
Sometimes, a combination of dropping and imputing is necessary based
on the nature and volume of missing data.
Mind Data Types:
Ensure that the chosen method (especially imputation) respects the data
types in your DataFrame.
5.2 Data Transformation
Data transformation involves converting data into a format suitable for analysis.
This includes converting data types, formatting values, and applying functions to
modify or derive new features.
Changing Data Types and Formatting
Type Conversion:
Using astype() :
Convert a column to a specific data type.
# Convert 'Age' from float to integer (make sure the column has no missing values first)
df['Age'] = df['Age'].astype('int')
print("DataFrame After Converting 'Age' to int:")
print(df.dtypes)
Datetime Conversion:
Convert strings or numeric values to datetime objects using
pd.to_datetime() .
# Example DataFrame with date strings
df_dates = pd.DataFrame({
'date_str': ['2023-01-01', '2023-02-15', '2023-03-30']
})
df_dates['date'] = pd.to_datetime(df_dates['date_str'], format='%Y-%m-%d')
print("DataFrame with Converted Datetime:")
print(df_dates)
Categorical Data:
Convert columns to the 'category' dtype for memory efficiency and
performance.
df['Name'] = df['Name'].astype('category')
print("DataFrame After Converting 'Name' to Category:")
print(df.dtypes)
Applying Functions to Columns and Rows
Using apply() :
Column-wise Application:
Apply a function to each element in a column.
# Define a simple function to double a number
def double_value(x):
    return x * 2
# Apply the function to the 'Age' column
df['Age_doubled'] = df['Age'].apply(double_value)
print("DataFrame After Doubling 'Age':")
print(df)
Row-wise Application:
Use axis=1 to apply a function across each row.
# Define a function that combines multiple columns
def combine_info(row):
    return f"{row['Name']} is {row['Age']} years old"
# Apply the function row-wise
df['Info'] = df.apply(combine_info, axis=1)
print("DataFrame with Combined Information:")
print(df)
Using map() :
Mapping Values in a Series:
Replace values using a mapping dictionary or a function.
# Create a mapping dictionary for renaming or transforming values
mapping = {'Alice': 'Alicia', 'Bob': 'Robert', 'Charlie': 'Charles'}
df['Name'] = df['Name'].map(mapping)
print("DataFrame After Mapping Names:")
print(df)
Using applymap() for DataFrames:
Element-wise Application on the Entire DataFrame:
Useful for applying a function to every element. (In pandas 2.1+, applymap() is deprecated in favor of the equivalent DataFrame.map() .)
# For example, converting all numeric values to string representations with a suffix
df_transformed = df.applymap(lambda x: str(x) + "_value" if isinstance(x, (int, float)) else x)
print("DataFrame After Applying Element-wise Transformation:")
print(df_transformed)
5.3 String Manipulation
Working with text data is common in preprocessing. Pandas provides a set of
vectorized string functions under the .str accessor, enabling efficient string
operations on Series.
Using Vectorized String Methods for Text Data
Basic Operations:
Lowercase / Uppercase:
df['Name_lower'] = df['Name'].str.lower()
print("Names in Lowercase:")
print(df['Name_lower'])
Splitting Strings:
# Split the 'Name' column into lists based on whitespace
df['Name_split'] = df['Name'].str.split()
print("Names Split into Lists:")
print(df['Name_split'])
Replacing Substrings:
df['Name_replaced'] = df['Name'].str.replace('Alicia', 'Alice', regex=False)
print("Names After Replacement:")
print(df['Name_replaced'])
Advanced Operations:
Extracting Substrings:
Use .str.extract() with a regular expression to capture parts of strings.
# Extract the first word from the 'Name' column
df['First_Name'] = df['Name'].str.extract(r'(\w+)', expand=False)
print("Extracted First Names:")
print(df['First_Name'])
String Length and Contains:
df['Name_length'] = df['Name'].str.len()
df['Contains_a'] = df['Name'].str.contains('a', case=False)
print("Name Lengths and Contains Check:")
print(df[['Name_length', 'Contains_a']])
Regular Expressions in Pandas
Using Regex with .str :
Pattern Matching and Extraction:
# Example: Extract numeric values from a string column (treating 'Salary' as a string)
df['Salary_str'] = df['Salary'].astype('str')
df['Extracted_Salary'] = df['Salary_str'].str.extract(r'(\d+)', expand=False)
print("Extracted Numeric Salary Values:")
print(df[['Salary_str', 'Extracted_Salary']])
Replacing Patterns with Regex:
# Remove all non-numeric characters from 'Salary_str'
df['Clean_Salary'] = df['Salary_str'].str.replace(r'\D', '', regex=True)
print("Cleaned Salary Values:")
print(df[['Salary_str', 'Clean_Salary']])
Finding All Matches:
# Find all digit sequences in 'Salary_str'
df['All_Digits'] = df['Salary_str'].str.findall(r'\d+')
print("All Digit Sequences in Salary:")
print(df[['Salary_str', 'All_Digits']])
Best Practices for String Manipulation
Use Vectorized Methods:
Always prefer the .str accessor over Python loops for efficiency.
Test Regex Patterns:
Utilize regex testing tools to ensure your patterns work as intended.
Handle Missing Data:
Consider filling missing values (e.g., with empty strings) before applying
string operations if necessary.
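For instance (a small sketch with a made-up Series), filling missing names with an empty string keeps .str operations from propagating NaN:
names = pd.Series(['alice ', None, 'charlie'])
cleaned = names.fillna('').str.strip().str.title()
print(cleaned) # 'Alice', '', 'Charlie'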
5.4 Dealing with Outliers
Outliers are data points that differ significantly from other observations. They can
skew analyses and adversely affect model performance. Detecting and treating
outliers is essential for robust data analysis.
Detection Methods
Statistical Methods:
Z-Score:
Calculate the Z-score to determine how many standard deviations a
value is from the mean.
from scipy import stats
import numpy as np
# Convert 'Salary' to numeric (if not already)
df['Salary_numeric'] = pd.to_numeric(df['Salary'], errors='coerce')
# Calculate Z-scores (nan_policy='omit' skips missing values while keeping the original length)
df['Z_score'] = np.abs(stats.zscore(df['Salary_numeric'], nan_policy='omit'))
# Identify outliers where Z-score > 3 (common threshold)
outliers_z = df[df['Z_score'] > 3]
print("Outliers Detected Using Z-score:")
print(outliers_z)
Interquartile Range (IQR):
Compute the IQR and flag values below Q1 - 1.5×IQR or above Q3 +
1.5×IQR.
Q1 = df['Salary_numeric'].quantile(0.25)
Q3 = df['Salary_numeric'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df[(df['Salary_numeric'] < lower_bound) | (df['Salary_numeric'] > upper_bound)]
print("Outliers Detected Using IQR:")
print(outliers_iqr)
Visual Methods:
Box Plots:
Visualize distributions and easily spot outliers.
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x=df['Salary_numeric'])
plt.title("Box Plot of Salary")
plt.show()
Scatter Plots:
Identify outliers in the context of other variables.
sns.scatterplot(data=df, x='Age', y='Salary_numeric')
plt.title("Scatter Plot: Age vs. Salary")
plt.show()
Strategies for Outlier Treatment
Removal:
Drop Outliers:
Remove outlier rows if they are clearly erroneous or not relevant.
df_no_outliers = df[(df['Salary_numeric'] >= lower_bound) & (df['Salary_numeric'] <= upper_bound)]
print("DataFrame After Removing Outliers:")
print(df_no_outliers)
Capping/Clipping:
Cap Extreme Values:
Replace values beyond the threshold with the boundary value.
df['Salary_capped'] = df['Salary_numeric'].clip(lower_bound, upper_bound)
print("DataFrame After Capping Outliers in Salary:")
print(df[['Salary_numeric', 'Salary_capped']])
Transformation:
Log Transformation:
Apply a logarithmic transformation to reduce the impact of extreme
values.
df['Salary_log'] = np.log(df['Salary_numeric'].replace(0, np.nan))
print("DataFrame After Log Transformation of Salary:")
print(df[['Salary_numeric', 'Salary_log']])
Imputation:
Replace Outliers with Statistical Values:
For some analyses, replacing outlier values with the median or mean
may be preferable.
median_salary = df['Salary_numeric'].median()
df.loc[(df['Salary_numeric'] < lower_bound) | (df['Salary_numeric'] > upper_bound), 'Salary_numeric'] = median_salary
print("DataFrame After Replacing Outliers with Median Salary:")
print(df[['Salary_numeric']])
Best Practices for Handling Outliers
Analyze Context:
Not every outlier is an error; understand your data's domain context before
treating them.
Visual Inspection:
Always complement statistical methods with visualizations (box plots,
scatter plots) to inspect outliers.
Document Changes:
Record any modifications or removals to maintain transparency in your
data preprocessing.
Sensitivity Analysis:
Consider analyzing your data both with and without outlier treatment to
understand their impact.
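A simple way to do this (continuing with the Salary example above) is to compare summary statistics before and after treatment:
print(df['Salary_numeric'].describe()) # with outliers
print(df_no_outliers['Salary_numeric'].describe()) # after removing outliers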