2,3. Introduction to Pandas & Matplotlib
Key structures:
• Series (1D labeled array)
• DataFrame (2D labeled data, like a table)
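The two structures can be sketched in a few lines. The names and values below are hypothetical sample data, chosen only to show how labels work:

```python
import pandas as pd

# A Series is a one-dimensional labeled array (hypothetical sample values)
ages = pd.Series([25, 32, 47], index=["ahmad", "mahmood", "khalil"], name="age")

# A DataFrame is a two-dimensional labeled table; each column is a Series
people = pd.DataFrame({
    "name": ["ahmad", "mahmood", "khalil"],
    "age": [25, 32, 47],
})

print(ages["mahmood"])       # access a Series value by its label
print(people.shape)          # (rows, columns)
print(type(people["age"]))   # selecting one column returns a Series
```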
Importing pandas
• To use the library, import it into the project with one of the following commands:
• `import pandas` or `import pandas as pd`
Importing example
Importing data to pandas
Data Importing:
• Pandas can easily import data from different file formats, such as CSV, Excel, and JSON.
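As a minimal sketch, the readers can be tried without any files on disk by passing in-memory text. The sample content is hypothetical; in a real project a file path would normally be passed instead:

```python
import io

import pandas as pd

# In-memory stand-ins for files on disk (hypothetical sample content)
csv_text = "id,name,salary\n1,ahmad,12000\n2,mahmood,10000\n"
json_text = '[{"id": 1, "name": "ahmad"}, {"id": 2, "name": "mahmood"}]'

# read_csv and read_json accept a file path or a file-like object
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

`pd.read_excel` follows the same pattern for spreadsheets, though it usually needs an extra engine package such as openpyxl installed.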
Exploring dataset
import pandas as pd
# data = pd.read_csv('file-name.csv')
# data = pd.read_excel('file-name.xlsx') # the path to file
data = pd.read_excel('ESD.xlsx')
# print(data.head())
# print(data.tail())
# print(data.info())
# print(data.describe())
# print(data.isnull().sum())
Handling duplicate data
import pandas as pd
print(data['EEID'].duplicated())
print(data)
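A small sketch of finding and removing duplicates, assuming a hypothetical table in which one `EEID` appears twice:

```python
import pandas as pd

# Hypothetical employee table with one repeated EEID
data = pd.DataFrame({
    "EEID": ["E001", "E002", "E001", "E003"],
    "Full Name": ["Ahmad", "Mahmood", "Ahmad", "Khalil"],
})

# duplicated() marks every repeat of an earlier value as True
mask = data["EEID"].duplicated()
print(mask.tolist())  # [False, False, True, False]

# drop_duplicates() keeps the first occurrence of each EEID by default
unique_rows = data.drop_duplicates(subset="EEID")
print(unique_rows)
```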
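Handling missing data
import numpy as np
hmd = pd.read_csv('company1.csv')
mode_salary = hmd['salary'].mode()[0]  # most frequent value in the 'salary' column
hmd['salary'] = hmd['salary'].replace(np.nan, mode_salary)  # Replace NaN with mode in the 'salary' column
print(hmd)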
Handling missing data (continued)
Filling missing data
• Forward filling
• Backward filling
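For example, the mean and median cannot be computed for a categorical column such as gender (only the mode can), so forward or backward filling is a common alternative.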
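Forward and backward filling can be sketched on a short, hypothetical gender column with gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with missing entries
gender = pd.Series(["M", np.nan, "F", np.nan, np.nan, "M"])

forward = gender.ffill()   # copy the last known value downward
backward = gender.bfill()  # copy the next known value upward

print(forward.tolist())    # ['M', 'M', 'F', 'F', 'F', 'M']
print(backward.tolist())   # ['M', 'F', 'F', 'M', 'M', 'M']
```

A value at the very start (for forward filling) or the very end (for backward filling) has no neighbor to copy from and stays missing.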
print(esd.head(10))
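# Another example: creating new columns from existing ones
esd = pd.read_excel('ESD.xlsx')
# Build a text description from the ID, the upper-cased full name, and the job title
esd['describe_employee'] = esd['EEID'] + ' ' + esd['Full Name'].str.upper() + ' ' + esd['Job Title']
# Annual salary minus 10%; note that the column is named 'tax' but holds the salary left after the deduction
esd['tax'] = esd['Annual Salary'] - ((esd['Annual Salary'] / 100) * 10)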
print(esd.head())
Dataset summarization
sum(): Returns the sum of the values.
mean(): Returns the average of the values.
count(): Returns the number of non-NA/null observations.
max(): Returns the maximum value.
min(): Returns the minimum value.
median(): Returns the median value.
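The functions above can be tried on a small, hypothetical salary column that contains one missing value:

```python
import pandas as pd

# Hypothetical salaries; None becomes a missing value (NaN)
salary = pd.Series([12000, 10000, 4500, 8000, None], name="salary")

summary = {
    "sum": salary.sum(),        # missing values are skipped by default
    "mean": salary.mean(),
    "count": salary.count(),    # counts only the non-missing observations
    "max": salary.max(),
    "min": salary.min(),
    "median": salary.median(),
}
print(summary)
```

Here `count()` returns 4 rather than 5, because the missing entry is not counted.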
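esd = pd.read_excel('ESD.xlsx')
# agg_1 and agg_2 were not defined on the slide; the two definitions below are assumptions
agg_1 = esd.groupby(['Ethnicity']).agg({"EEID": "count"})
agg_2 = esd['Annual Salary'].agg(['sum', 'mean', 'max', 'min', 'median'])
print(agg_1)
print(agg_2)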
Merge and join
import pandas as pd
employee = {
"id": [1,2,3,4],
'names': ['ahmad' , 'mahmood' , 'khalil' , 'khanwali']
}
employee2 = {
"id": [1,2,3,4],
'salary': [12000,10000,4500,8000]
}
df = pd.DataFrame(employee)
df1 = pd.DataFrame(employee2)
emp = pd.merge(df, df1, on='id')  # join the two tables on the shared 'id' column
print(emp)
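A merge combines two tables on a shared key column, and the `how` parameter of `pd.merge` decides which keys survive. The sketch below uses hypothetical tables whose `id` values only partly overlap, so the difference between join types is visible:

```python
import pandas as pd

# Hypothetical tables: ids 1-4 have names, ids 3-5 have salaries
names = pd.DataFrame({"id": [1, 2, 3, 4],
                      "names": ["ahmad", "mahmood", "khalil", "khanwali"]})
salaries = pd.DataFrame({"id": [3, 4, 5],
                         "salary": [4500, 8000, 9000]})

inner = pd.merge(names, salaries, on="id", how="inner")  # ids present in both tables
left = pd.merge(names, salaries, on="id", how="left")    # every id from the left table
outer = pd.merge(names, salaries, on="id", how="outer")  # every id from either table

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side are kept in the left and outer joins, with missing values (NaN) in the columns that could not be filled.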
Introduction to Matplotlib
• Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
• Built on NumPy and designed for easy and flexible plotting.
• Matplotlib was created by John D. Hunter.
• Matplotlib is open source and we can use it freely.
• https://matplotlib.org/stable/ -- documentation
• https://github.com/matplotlib/matplotlib -- codebase
Why matplotlib
Advantages:
• Versatile: Supports a variety of plots (line, scatter, bar, etc.).
• Customizable: Extensive options for styling and formatting.
• Integration: Works well with Jupyter notebooks, other libraries like Pandas, and GUI applications.
Installation & usage:
• To install it: `pip install matplotlib`
• To use it: `import matplotlib.pyplot as plt`
Plot types
Common Plot Types are as follows:
• Line Plot: `plt.plot()`
• Scatter Plot: `plt.scatter()`
• Bar Plot: `plt.bar()`
• Histogram: `plt.hist()`
• Box Plot: `plt.boxplot()`
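The five plot types can be compared side by side in one figure. This is a rough sketch with hypothetical sample values; it uses the Axes methods (`ax.plot`, `ax.scatter`, and so on), which mirror the `plt.*` functions listed above:

```python
import matplotlib.pyplot as plt

# Hypothetical sample values
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

# One row of five axes, one per plot type
fig, axs = plt.subplots(1, 5, figsize=(15, 3))

axs[0].plot(x, y)
axs[0].set_title("Line")

axs[1].scatter(x, y)
axs[1].set_title("Scatter")

axs[2].bar(x, y)
axs[2].set_title("Bar")

axs[3].hist(y, bins=3)
axs[3].set_title("Histogram")

axs[4].boxplot(y)
axs[4].set_title("Box")

fig.tight_layout()
```

Calling `plt.show()` afterwards opens the figure in a window or renders it in a notebook.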
Scatter plot
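import numpy as np
import matplotlib.pyplot as pt
# x_axis and y_axis were not defined on the slide; the sample values below are assumed
x_axis = np.arange(1, 11)
y_axis = x_axis ** 2
pt.scatter(x_axis, y_axis)
pt.show()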
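Line plot
# years, home_price, years1 and home_price1 were not defined on the slide; the sample values below are assumed
years = [2019, 2020, 2021, 2022]
home_price = [200, 215, 240, 260]
years1 = [2019, 2020, 2021, 2022]
home_price1 = [180, 190, 210, 225]
# pt.plot(years, home_price)  # basic version without styling
pt.plot(years, home_price, c='red', lw=4, label='line-1')  # linestyle='--' gives a dashed line
pt.plot(years1, home_price1, label='line-2')
pt.legend(loc='upper right')  # the position is set with loc, not with a positional string
pt.show()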
Barchart
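A minimal bar chart sketch, assuming hypothetical department names and headcounts:

```python
import matplotlib.pyplot as plt

# Hypothetical categories and values
departments = ["IT", "Sales", "HR"]
headcount = [12, 7, 4]

# One bar per category; the bar height is the value
bars = plt.bar(departments, headcount, color="skyblue", edgecolor="black")
plt.xlabel("Department")
plt.ylabel("Employees")
plt.title("Headcount by Department")
```

A bar chart compares values across categories, whereas a histogram shows how one numeric variable is distributed. Calling `plt.show()` afterwards displays the chart.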
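Histogram
import matplotlib.pyplot as plt
# Creating a histogram
# ages was not defined on the slide; the sample values below are assumed
ages = [22, 25, 25, 31, 34, 38, 41, 41, 47, 52, 58, 63]
# plt.hist(ages, bins=5, color='lightgreen', edgecolor='black')
plt.hist(ages, bins=112, cumulative=True)  # cumulative histogram
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()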
Pie chart
# Company names and their market caps (in billions)
companies = [
'Apple', 'Microsoft', 'Nvidia', 'Saudi Aramco', 'Alphabet (Google)', 'Amazon',
'Meta Platforms (Facebook)', 'Berkshire Hathaway', 'TSMC', 'Eli Lilly',
'JPMorgan Chase', 'Tesla', 'Visa', 'Johnson & Johnson', 'ExxonMobil',
'Samsung', 'Chevron', 'Walmart', 'Pfizer', 'Procter & Gamble',
'Mastercard', 'Alibaba', 'Boeing', 'Cisco', 'IBM',
'Shell', 'American Express', 'Qualcomm', 'Verizon', 'Morgan Stanley'
]
market_caps = [
    3100, 3100, 2200, 2100, 1700, 1400, 768, 768, 500, 500,
    500, 800, 500, 450, 400, 320, 330, 648, 240, 320,
    350, 240, 150, 216, 215, 211, 196, 189, 180, 179
]
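# Draw the pie chart (this call was missing from the slide; the labels and autopct settings are assumed)
pt.pie(market_caps, labels=companies, autopct='%1.1f%%')
# Add a title
pt.title('Market Share of Top 30 Companies by Market Cap (2024)')
pt.show()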
eth_prices = [
2000, 2100, 2200, 2300, 2400, 2500,
2600, 2700, 2800, 2900, 3000, 3100
] # Ethereum (ETH)
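Subplots and saving
# Generate x values
x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
y_cos = np.cos(x)
y_sin = np.sin(x)
y_tan = np.tan(x)
# Create three stacked subplots (this call was missing from the slide; the figure size is assumed)
fig, axs = pt.subplots(3, 1, figsize=(10, 12))
# Cosine Wave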
axs[0].plot(x, y_cos, color='blue', label='Cosine Wave')
axs[0].set_title('Cosine Wave')
axs[0].set_ylabel('cos(x)')
axs[0].set_ylim(-1.5, 1.5) # Limit y-axis for better visibility
axs[0].grid(True)
axs[0].legend()
Subplots and saving (continued)
# Sine Wave
axs[1].plot(x, y_sin, color='orange', label='Sine Wave')
axs[1].set_title('Sine Wave')
axs[1].set_ylabel('sin(x)')
axs[1].set_ylim(-1.5, 1.5) # Limit y-axis for better visibility
axs[1].grid(True)
axs[1].legend()
# Tangent Wave
axs[2].plot(x, y_tan, color='green', label='Tangent Wave')
axs[2].set_title('Tangent Wave')
axs[2].set_ylabel('tan(x)')
axs[2].set_ylim(-10, 10) # Limit y-axis for better visibility
axs[2].grid(True)
axs[2].legend()
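The saving step can be sketched as follows. This is a shortened two-panel version of the wave figure; the output file name and the temporary folder are hypothetical choices:

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)

# Two stacked axes that share the x-axis
fig, axs = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
axs[0].plot(x, np.cos(x), color="blue", label="Cosine Wave")
axs[1].plot(x, np.sin(x), color="orange", label="Sine Wave")
for ax in axs:
    ax.grid(True)
    ax.legend()

fig.tight_layout()  # keep titles and labels from overlapping

# Save to a temporary folder; the file name is a hypothetical choice
out_path = os.path.join(tempfile.gettempdir(), "waves.png")
fig.savefig(out_path, dpi=150)
print(out_path)
```

The file extension passed to `savefig` selects the output format, so `waves.pdf` or `waves.svg` would produce vector output instead of a PNG image.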
# Create a 3D plot
fig = pt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
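One way a 3D axes can be filled is with a surface plot. The sketch below assumes a hypothetical function, z = sin(sqrt(x² + y²)), since the slide does not say what was plotted:

```python
import matplotlib.pyplot as plt
import numpy as np

# Grid of (x, y) points and a hypothetical height function
X, Y = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
Z = np.sin(np.sqrt(X**2 + Y**2))

# Create a 3D axes and draw the surface
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")
surface = ax.plot_surface(X, Y, Z, cmap="viridis")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
fig.colorbar(surface, shrink=0.5)  # color scale for the surface heights
```

Calling `plt.show()` afterwards displays the figure; in an interactive window the surface can be rotated with the mouse.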
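esd = pd.read_excel('ESD.xlsx')
agg_ = esd.groupby(['Ethnicity']).agg({"EEID": "count"})
# Plot the employee count per ethnicity (this call was missing from the slide; a bar chart is assumed)
agg_.plot(kind='bar', legend=False)
plt.show()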
Datasets resources
Datasets
• https://archive.ics.uci.edu/ - UCI dataset repository
• https://datasetsearch.research.google.com/ - Google Dataset Search
• https://data.un.org/ - UN datasets
• https://www.statista.com