2,3. Introduction Pandas & Matplotlib - Copy

The document provides an introduction to the Pandas library for data analysis and manipulation in Python, covering its core features, data structures, and basic operations. It also introduces Matplotlib for data visualization, detailing various plot types and customization options. Hands-on examples with code are included for both libraries to facilitate understanding and practical application.

Introduction to Pandas and Matplotlib
Dataset analysis and visualization

Content:
Introduction to Pandas
• What is Pandas?
• Core Features of Pandas
• Pandas Data Structures: Series & DataFrame
• Importing Data
• Basic Data Manipulation Operations
• Example Use Cases
• Hands-on Examples with Code
Introduction to Matplotlib
• What is Matplotlib?
• Why Matplotlib for Data Visualization?
• Basic Components of a Plot
• Common Plot Types
• Customizing Plots
• Hands-on Examples with Code
Introduction to pandas
• Pandas is a powerful Python library for working with data sets, used for data analysis and manipulation.
• It is built on top of NumPy and provides easy-to-use data structures and operations.
• It has functions for analysing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
Pandas core features
Key Features:
• Loading data into DataFrame and Series structures
• Handling missing data
• Data manipulation (filtering, aggregation, imputation, removal)
• Filtering data based on conditions
• Creating new columns based on existing data
• It is an open-source library

Key structures:
• Series (1D labeled array)
• DataFrame (2D labeled data, like a table)
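
As a minimal sketch of the two structures listed above, the following builds a Series and a DataFrame by hand. The names and salary figures are made-up illustration data, not taken from any dataset used later.

```python
import pandas as pd

# A Series is a one-dimensional array whose values carry labels (the index).
salaries = pd.Series(
    [12000, 10000, 4500],
    index=["ahmad", "mahmood", "khalil"],
    name="salary",
)

# A DataFrame is a two-dimensional table; each column is itself a Series.
employees = pd.DataFrame({
    "name": ["ahmad", "mahmood", "khalil"],
    "salary": [12000, 10000, 4500],
})

print(salaries["mahmood"])        # look up a value by its label
print(employees.shape)            # (number of rows, number of columns)
print(type(employees["salary"]))  # selecting one column gives a Series
```

Selecting a single column from a DataFrame returns a Series, which is why most column-level operations in the later examples are Series methods.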

Pandas codebase: https://github.com/pandas-dev/pandas


Importing and installation
Installation
• Open command prompt
• Run the command `pip install pandas`

Importing pandas
• To use the library, import it into the project with one of the following statements
• `import pandas` or `import pandas as pd`
Importing example
Importing data to pandas
Data Importing:
• Pandas can easily import data from different file formats, such as CSV, Excel, and JSON.
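
A small sketch of the readers mentioned above. To keep it runnable without any files on disk, the CSV and JSON content is held in in-memory strings (an assumption made only for this illustration); with real files you would pass a path instead.

```python
import io
import pandas as pd

# Made-up sample content standing in for files on disk.
csv_text = "EEID,Name,salary\nE001,Ahmad,12000\nE002,Mahmood,10000\n"
json_text = '[{"EEID": "E001", "salary": 12000}, {"EEID": "E002", "salary": 10000}]'

# read_csv and read_json accept a path or any file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)

# Excel files are read the same way with pd.read_excel('file-name.xlsx'),
# which needs an Excel engine such as openpyxl to be installed.
```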
Exploring dataset
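import pandas as pd

# data = pd.read_csv('file-name.csv')    # path to a CSV file
# data = pd.read_excel('file-name.xlsx') # path to an Excel file
data = pd.read_excel('ESD.xlsx')

# Uncomment any of the following lines to explore the dataset
# print(data.head())          # first 5 rows
# print(data.tail())          # last 5 rows
# data.info()                 # column names, dtypes and non-null counts (prints directly)
# print(data.describe())      # summary statistics for numeric columns
# print(data.isnull().sum())  # number of missing values per column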
Handling duplicate data
import pandas as pd

# handling duplicate values
data = pd.read_csv('company1.csv')

# flag each row whose EEID has already appeared earlier
print(data['EEID'].duplicated())
print(data)

# count the duplicated values
print(data['EEID'].duplicated().sum())
# count the non-null values
print(data['salary'].count())
# drop rows with a duplicate EEID (the first occurrence is kept)
print(data.drop_duplicates('EEID'))

# in case we want to replace duplicate values with a missing-value marker
data['EEID'] = data['EEID'].where(~data['EEID'].duplicated(), other=pd.NA)
# equivalent: data.loc[data['EEID'].duplicated(), 'EEID'] = pd.NA
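
The same duplicate-handling calls can be tried without `company1.csv`. The sketch below uses a small made-up table (the `EEID` and `salary` column names follow the slide; the values are invented) so the effect of each call is easy to see.

```python
import pandas as pd

# Made-up data: employee E002 appears twice.
data = pd.DataFrame({
    "EEID": ["E001", "E002", "E002", "E003"],
    "salary": [12000, 10000, 10000, 9000],
})

# duplicated() marks the second and later occurrences, not the first one.
dup_mask = data["EEID"].duplicated()
n_dups = int(dup_mask.sum())

# drop_duplicates keeps the first occurrence by default;
# keep="last" keeps the final occurrence instead.
deduped_first = data.drop_duplicates(subset="EEID")
deduped_last = data.drop_duplicates(subset="EEID", keep="last")

print(dup_mask.tolist())
print(n_dups)
print(deduped_first)
print(deduped_last)
```

Note that `drop_duplicates` returns a new DataFrame and leaves `data` unchanged unless the result is assigned back.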
Handling missing data
import numpy as np
import pandas as pd

hmd = pd.read_csv('company1.csv')

# shows missing data (True where a value is missing)
hmd.isnull()

# counts missing data per column
hmd.isnull().sum()

# drop rows with null values (returns a new DataFrame; hmd itself is unchanged)
hmd.dropna()

# replace null values with a custom value (also returns a new DataFrame)
hmd.replace(np.nan, 'default_value')

# replace null values in a specific column
hmd['Name'] = hmd['Name'].replace(np.nan, 'no-name')
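
One detail worth seeing in isolation: `dropna`, `replace`, and `fillna` return new objects rather than modifying the DataFrame in place. The sketch below uses a made-up table and `fillna` with a per-column dictionary, which is an alternative to the `replace(np.nan, ...)` calls above rather than the slide's own approach.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing name and one missing salary.
hmd = pd.DataFrame({
    "Name": ["Ahmad", None, "Khalil"],
    "salary": [12000.0, np.nan, 4500.0],
})

missing_per_column = hmd.isnull().sum()   # count of missing values per column
complete_rows = hmd.dropna()              # only rows with no missing values

# fillna accepts a dict mapping column name -> replacement value.
filled = hmd.fillna({"Name": "no-name", "salary": 0})

print(missing_per_column)
print(complete_rows)
print(filled)
print(hmd.isnull().sum().sum())  # the original still has its missing values
```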
Handling missing data (continued)
Sometimes you can’t just drop the missing data; you can fill it with a representative value instead.
Mean = the average value (the sum of all values divided by the number of values).
Median = the value in the middle after sorting all values in ascending order.
Mode = the value that appears most frequently.

mean_salary = hmd['salary'].mean() # Calculate the mean
median_salary = hmd['salary'].median() # Calculate the median
mode_salary = hmd['salary'].mode()[0] # Calculate the mode (most frequent salary value)

# Print the calculated values
print(f"\nMean Salary: {mean_salary}")
print(f"Median Salary: {median_salary}")
print(f"Mode Salary: {mode_salary}")

hmd['salary'] = hmd['salary'].replace(np.nan, mode_salary) # Replace NaN with mode in the 'salary' column

print(hmd)
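
To make the difference between the three statistics concrete, here is a small worked sketch on invented salary values that include one large outlier. It fills the gap with the median; that choice is only an illustration, since which statistic is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up salaries: one missing value and one outlier (16000).
salary = pd.Series([4000.0, 5000.0, 5000.0, np.nan, 16000.0], name="salary")

# NaN values are skipped by default when computing these statistics.
mean_salary = salary.mean()      # (4000 + 5000 + 5000 + 16000) / 4
median_salary = salary.median()  # middle of the sorted non-missing values
mode_salary = salary.mode()[0]   # most frequent value

filled_with_median = salary.fillna(median_salary)

print(mean_salary, median_salary, mode_salary)
print(filled_with_median.tolist())
```

The mean is pulled upward by the outlier, while the median and mode stay at the typical value.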
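Handling missing data (continued)
Filling missing data
• Forward filling: copy the last valid value forward into the gap
• Backward filling: copy the next valid value backward into the gap

For example, we cannot compute a mean or median for a categorical column such as gender, so we fill from neighbouring rows instead.

hmd['gender'] = hmd['gender'].bfill() # backward fill
hmd['gender'] = hmd['gender'].ffill() # forward fill (covers gaps left at the end by bfill)
print(hmd)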
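
The direction of each fill is easiest to see on a tiny Series. This is a sketch on made-up values, showing that a backward fill cannot fill a gap at the very end (and a forward fill cannot fill one at the very start).

```python
import numpy as np
import pandas as pd

# Made-up categorical data with gaps in the middle and at the end.
gender = pd.Series(["F", np.nan, np.nan, "M", np.nan], name="gender")

forward = gender.ffill()   # each gap takes the previous valid value
backward = gender.bfill()  # each gap takes the next valid value

print(forward.tolist())
print(backward.tolist())   # the last entry stays missing: nothing follows it
```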
Data Transformation
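esd = pd.read_excel('ESD.xlsx')

# label each employee according to whether they receive a bonus
esd.loc[esd['Bonus %'] == 0, "GetBonus"] = "No bonus"
esd.loc[esd['Bonus %'] > 0, "GetBonus"] = "bonus"

print(esd.head(10))

# another example
esd = pd.read_excel('ESD.xlsx')

# build a description string from existing columns
esd['describe_employee'] = esd['EEID'] + ' ' + esd['Full Name'].str.upper() + ' ' + esd['Job Title']
# note: this value is the salary left after a 10% deduction, not the deducted amount itself
esd['tax'] = esd['Annual Salary'] - ((esd['Annual Salary'] / 100) * 10)

print(esd.head())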
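
A self-contained sketch of the same kind of transformation, on a made-up table that borrows the column names `Full Name`, `Annual Salary`, and `Bonus %`. It uses `np.where` as a one-step alternative to the two `.loc` assignments; the `bonus_amount` and `name_upper` columns are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

# Made-up rows; the real ESD.xlsx is not needed to run this.
esd = pd.DataFrame({
    "Full Name": ["Ahmad Khan", "Sara Noor"],
    "Annual Salary": [50000, 80000],
    "Bonus %": [0.0, 0.25],
})

# np.where(condition, value_if_true, value_if_false) labels every row at once.
esd["GetBonus"] = np.where(esd["Bonus %"] > 0, "bonus", "No bonus")

# New columns computed element-wise from existing ones.
esd["bonus_amount"] = esd["Annual Salary"] * esd["Bonus %"]
esd["name_upper"] = esd["Full Name"].str.upper()

print(esd)
```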
Dataset summarization
sum(): Returns the sum of the values.
mean(): Returns the average of the values.
count(): Returns the number of non-NA/null observations.
max(): Returns the maximum value.
min(): Returns the minimum value.
median(): Returns the median value.

esd = pd.read_excel('ESD.xlsx')

agg_1 = esd.groupby(['Department', 'Gender']).agg({"EEID": "count"})
agg_2 = esd.groupby(['Department', 'Ethnicity']).agg({"EEID": "count"})

print(agg_1)
print(agg_2)
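
Several of the summary functions listed above can be applied in a single `groupby` call. The sketch below runs on invented rows and uses named aggregation, where each keyword is an output column name chosen for this example and each tuple is (source column, function).

```python
import pandas as pd

# Made-up employee rows.
esd = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "HR"],
    "Gender": ["F", "M", "F", "F", "M"],
    "Annual Salary": [60000, 50000, 40000, 44000, 42000],
})

# One row per department, one column per statistic.
summary = esd.groupby("Department").agg(
    headcount=("Annual Salary", "count"),
    mean_salary=("Annual Salary", "mean"),
    max_salary=("Annual Salary", "max"),
)

# size() counts the rows in each (Department, Gender) group.
by_dept_gender = esd.groupby(["Department", "Gender"]).size()

print(summary)
print(by_dept_gender)
```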
Merge and join
import pandas as pd

employee = {
"id": [1,2,3,4],
'names': ['ahmad' , 'mahmood' , 'khalil' , 'khanwali']
}
employee2 = {
"id": [1,2,3,4],
'salary': [12000,10000,4500,8000]
}

df = pd.DataFrame(employee)
df1 = pd.DataFrame(employee2)

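# inner join on the shared 'id' column
emp = pd.merge(df, df1, on='id')
print(emp)

# left join: keep every row of df (the common column 'id' is used as the key)
emp = pd.merge(df, df1, how='left')
print(emp)

# concat stacks the two frames row-wise; columns missing from one frame become NaN
emp = pd.concat([df, df1])
print(emp)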
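
In the example above both tables contain the same ids, so every join type gives the same rows. The sketch below changes the made-up data so the ids only partly overlap, which makes the difference between `inner`, `left`, and `outer` joins visible.

```python
import pandas as pd

names = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "names": ["ahmad", "mahmood", "khalil", "khanwali"],
})
# ids 3 and 4 have no salary row; id 5 has no name row.
pay = pd.DataFrame({"id": [1, 2, 5], "salary": [12000, 10000, 7000]})

inner = pd.merge(names, pay, on="id", how="inner")  # ids present in both
left = pd.merge(names, pay, on="id", how="left")    # every id from names
outer = pd.merge(names, pay, on="id", how="outer")  # every id from either

# concat does not match on keys; it simply stacks rows.
stacked = pd.concat([names, pay], ignore_index=True)

print(inner)
print(left)
print(outer)
print(stacked)
```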
Introduction to Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
• Built on NumPy and designed for easy and flexible plotting.
• Matplotlib was created by John D. Hunter.
• Matplotlib is open source and free to use.
• Documentation: https://matplotlib.org/stable/
• Codebase: https://github.com/matplotlib/matplotlib
Why matplotlib
Advantages:
• Versatile: Supports a variety of plots (line, scatter, bar, etc.).
• Customizable: Extensive options for styling and formatting.
• Integration: Works well with Jupyter notebooks, other libraries like Pandas, and GUI applications.
Installation & usage:
• To install it: `pip install matplotlib`
• To use it: `import matplotlib.pyplot as plt`
Plot types
Common Plot Types are as follows:
• Line Plot: `plt.plot()`
• Scatter Plot: `plt.scatter()`
• Bar Plot: `plt.bar()`
• Histogram: `plt.hist()`
• Box Plot: `plt.boxplot()`
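
As a quick side-by-side sketch of the five plot types listed above, the following draws each one on its own Axes using made-up data. It uses the Axes methods (`ax.plot`, `ax.scatter`, and so on), which mirror the `plt.*` functions, and selects the non-interactive "Agg" backend so it runs without a display; both choices are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for on-screen use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(50, 10, 200)  # made-up sample data

fig, axs = plt.subplots(1, 5, figsize=(18, 3.5))

axs[0].plot(range(10), np.arange(10) ** 2)
axs[0].set_title("Line: plot()")

axs[1].scatter(values[:50], values[50:100])
axs[1].set_title("Scatter: scatter()")

axs[2].bar(["A", "B", "C"], [3, 7, 5])
axs[2].set_title("Bar: bar()")

axs[3].hist(values, bins=20)
axs[3].set_title("Histogram: hist()")

axs[4].boxplot(values)
axs[4].set_title("Box: boxplot()")

fig.tight_layout()
fig.canvas.draw()  # render the figure; call plt.show() instead in an interactive session
```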
Scatter plot
import numpy as np
import matplotlib.pyplot as pt

x_axis = np.random.random(50) * 100
y_axis = np.random.random(50) * 100

# default scatter plot
pt.scatter(x_axis, y_axis)
pt.show()

# customized color, marker shape and marker size
pt.scatter(x_axis, y_axis, color='#0000ff', marker="1", s=100)
pt.show()
Line plot or chart
# line chart
years = [2000 + x for x in range(24)] # x axis
home_price = np.random.random(24) * 1000 # y axis

years1 = [2000 + x for x in range(24)] # x axis
home_price1 = np.random.random(24) * 500 # y axis

# pt.plot(years, home_price) # plain version with default styling
pt.plot(years, home_price, c='red', lw=4, label='line-1') # optionally add linestyle='--'
pt.plot(years1, home_price1, label='line-2')
pt.legend(loc='upper right') # legend position is set with loc; 'top right' is not a valid name
pt.show()
Barchart
import matplotlib.pyplot as plt

# Data for the bar chart
products = ['Product A', 'Product B', 'Product C', 'Product D']
sales = [120, 300, 250, 450]

# Create a bar chart
plt.bar(products, sales, color='skyblue', edgecolor='black', width=0.6)

# Add title and labels
plt.title('Sales of Products in Q1 2024', fontsize=16, fontweight='bold', color='darkblue')
plt.xlabel('Products', fontsize=12, fontweight='bold')
plt.ylabel('Sales (in Units)', fontsize=12, fontweight='bold')

# Add horizontal gridlines for better readability
plt.grid(True, which='both', axis='y', linestyle='--', linewidth=0.7)

# Write each bar's value just above it
for i, value in enumerate(sales):
    plt.text(i, value + 10, str(value), ha='center', fontsize=11)

plt.gca().set_facecolor('whitesmoke') # Add a background color to the chart
plt.ylim(0, 1000) # Set the limits for the y-axis
plt.tight_layout() # Adjust spacing so labels are not clipped
plt.show() # Display the bar chart
Histogram chart
# Data for a histogram (continuous data)
# A small hand-written sample:
# ages = [22, 23, 25, 26, 28, 30, 32, 35, 36, 37, 40, 42, 45, 48, 50]
# A larger synthetic sample drawn from a normal distribution (mean 20, standard deviation 1.5):
ages = np.random.normal(20, 1.5, 12000)

# Creating a histogram
# pt.hist(ages, bins=5, color='lightgreen', edgecolor='black')
pt.hist(ages, bins=112, cumulative=True) # cumulative: each bar includes all bins to its left
pt.xlabel('Age')
pt.ylabel('Cumulative frequency')
pt.title('Age Distribution')
pt.show()
Pie chart
# Company names and their market caps (in billions)
companies = [
'Apple', 'Microsoft', 'Nvidia', 'Saudi Aramco', 'Alphabet (Google)', 'Amazon',
'Meta Platforms (Facebook)', 'Berkshire Hathaway', 'TSMC', 'Eli Lilly',
'JPMorgan Chase', 'Tesla', 'Visa', 'Johnson & Johnson', 'ExxonMobil',
'Samsung', 'Chevron', 'Walmart', 'Pfizer', 'Procter & Gamble',
'Mastercard', 'Alibaba', 'Boeing', 'Cisco', 'IBM',
'Shell', 'American Express', 'Qualcomm', 'Verizon', 'Morgan Stanley'
]

market_caps = [
3100, 3100, 2200, 2100, 1700, 1400, 768, 768, 500, 500, 500, 800, 500, 450, 400,
320, 330, 648, 240, 320, 350, 240, 150, 216, 215, 211, 196, 189, 180, 179
]

# Create the pie chart
pt.figure(figsize=(10, 10))
pt.pie(market_caps, labels=companies, autopct='%2.2f%%', startangle=140, colors=pt.cm.Paired.colors)

# Add a title
pt.title('Market Share of Top 30 Companies by Market Cap (2024)')

# Display the pie chart
pt.show()
Boxplot or boxchart
# Sample data for the salaries in different departments
salaries = [
[45000, 48000, 52000, 55000, 58000, 60000, 62000, 67000, 70000], # Department A
[40000, 43000, 47000, 50000, 53000, 56000, 59000], # Department B
[30000, 35000, 40000, 42000, 45000, 47000, 49000, 50000], # Department C
[60000, 62000, 65000, 67000, 70000, 75000], # Department D
[50000, 52000, 54000, 55000, 58000, 60000, 62000, 65000] # Department E
]

# Create the box plot
plt.figure(figsize=(8, 6))
# Note: Matplotlib 3.9 and later rename the `labels` parameter to `tick_labels`
plt.boxplot(salaries, labels=['Dept A', 'Dept B', 'Dept C', 'Dept D', 'Dept E'])

# Add a title and labels
plt.title('Salary Distribution by Department')
plt.ylabel('Salary ($)')
plt.xlabel('Departments')

# Display the plot
plt.show()
Multiple Figures
import matplotlib.pyplot as plt
import numpy as np

# Generating sample data for 1 year (12 months)


months = [
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
]

# Sample prices for each cryptocurrency (in USD)


btc_prices = [
32000, 33000, 34000, 35000, 36000, 37000,
38000, 39000, 40000, 41000, 42000, 43000
] # Bitcoin (BTC)

eth_prices = [
2000, 2100, 2200, 2300, 2400, 2500,
2600, 2700, 2800, 2900, 3000, 3100
] # Ethereum (ETH)

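# Figure 1: create the figure, plot the Bitcoin prices, then add a title and labels
plt.figure(1)
plt.plot(months, btc_prices, marker='o', label='Bitcoin (BTC)', color='orange')
plt.title('Cryptocurrency Prices Over 1 Year')
plt.xlabel('Months')
plt.ylabel('Price (USD)')
plt.legend()

# Figure 2: assumed continuation (the original slide is cut off after the first plot)
plt.figure(2)
plt.plot(months, eth_prices, marker='o', label='Ethereum (ETH)', color='blue')
plt.xlabel('Months')
plt.ylabel('Price (USD)')
plt.legend()

plt.show() # opens both figures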
Subplots and saving
import matplotlib.pyplot as plt
import numpy as np

# Generate x values
x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)

# Calculate y values for each function
y_cos = np.cos(x) # Cosine wave
y_sin = np.sin(x) # Sine wave
y_tan = np.tan(x) # Tangent wave

# Create a figure with 3 subplots stacked vertically (3 rows, 1 column)
fig, axs = plt.subplots(3, 1, figsize=(10, 12))

# Cosine Wave
axs[0].plot(x, y_cos, color='blue', label='Cosine Wave')
axs[0].set_title('Cosine Wave')
axs[0].set_ylabel('cos(x)')
axs[0].set_ylim(-1.5, 1.5) # Limit y-axis for better visibility
axs[0].grid(True)
axs[0].legend()
Subplots and saving (continued)
# Sine Wave
axs[1].plot(x, y_sin, color='orange', label='Sine Wave')
axs[1].set_title('Sine Wave')
axs[1].set_ylabel('sin(x)')
axs[1].set_ylim(-1.5, 1.5) # Limit y-axis for better visibility
axs[1].grid(True)
axs[1].legend()

# Tangent Wave
axs[2].plot(x, y_tan, color='green', label='Tangent Wave')
axs[2].set_title('Tangent Wave')
axs[2].set_ylabel('tan(x)')
axs[2].set_ylim(-10, 10) # Limit y-axis for better visibility
axs[2].grid(True)
axs[2].legend()

# Adjust layout to prevent overlap
plt.tight_layout()
# Saved as PNG because JPEG has no alpha channel, so transparent=True has no effect on a .jpeg file
plt.savefig('subplots.png', dpi=300, transparent=True)
# plt.show()
3d plotting
import numpy as np
import matplotlib.pyplot as pt

# Create a grid of x and y values
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)

# Calculate z values based on the mathematical formula
z = np.sin(np.sqrt(x**2 + y**2))

# Create a 3D plot
fig = pt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot the surface
surface = ax.plot_surface(x, y, z, cmap='viridis', edgecolor='none')

# Add a color bar which maps values to colors
fig.colorbar(surface, shrink=0.5, aspect=10)

# Set titles and labels
ax.set_title('3D Plot of z = sin(sqrt(x^2 + y^2))')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')

# Show the plot
pt.show()
3d plotting second example: scatter plot
import numpy as np
import matplotlib.pyplot as pt

# Generate random data for the scatter plot
num_points = 100
x = np.random.rand(num_points) * 10 # X values
y = np.random.rand(num_points) * 10 # Y values
z = np.random.rand(num_points) * 10 # Z values

# Create a 3D scatter plot
fig = pt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Scatter the data points
scatter = ax.scatter(x, y, z, c='g', marker='o', alpha=0.7)

# Set titles and labels
ax.set_title('3D Scatter Plot', fontsize=16)
ax.set_xlabel('X axis', fontsize=12)
ax.set_ylabel('Y axis', fontsize=12)
ax.set_zlabel('Z axis', fontsize=12)

# Show the plot
pt.show()
Using both pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt

esd = pd.read_excel('ESD.xlsx')
agg_ = esd.groupby(['Ethnicity']).agg({"EEID": "count"})

ethnicity_counts = agg_.reset_index() # Reset index to make 'Ethnicity' a column
ethnicity_counts.columns = ['Ethnicity', 'Count'] # Rename columns for clarity

plt.figure(figsize=(8, 6)) # Set figure size
plt.pie(ethnicity_counts['Count'], labels=ethnicity_counts['Ethnicity'], autopct='%1.1f%%', startangle=140)
plt.title('Employee Distribution by Ethnicity')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
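
The same pattern (aggregate with pandas, then hand the result to Matplotlib) works for other chart types. The sketch below is a variation on made-up rows, not the ESD.xlsx data: it averages salary per department and draws a bar chart. The "Agg" backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for on-screen use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up employee rows.
esd = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "HR"],
    "Annual Salary": [60000, 50000, 40000, 44000, 42000],
})

# Aggregate with pandas: mean salary per department, smallest first.
mean_by_dept = esd.groupby("Department")["Annual Salary"].mean().sort_values()

# Plot with Matplotlib: the index supplies the bar labels, the values the heights.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(mean_by_dept.index.tolist(), mean_by_dept.values, color="skyblue", edgecolor="black")
ax.set_title("Mean Annual Salary by Department")
ax.set_xlabel("Department")
ax.set_ylabel("Mean annual salary")

fig.tight_layout()
fig.canvas.draw()  # render the figure; call plt.show() instead in an interactive session
```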
Datasets resources
Datasets
• https://archive.ics.uci.edu/ - UCI dataset repository
• https://datasetsearch.research.google.com/ - Google Dataset Search
• https://data.un.org/ - UN datasets
• https://www.statista.com
