Pandas

The document provides a comprehensive overview of the Pandas library in Python, focusing on its data structures, including Series and DataFrames, and their functionalities for data manipulation, cleaning, and querying. It covers operations such as indexing, arithmetic operations, data loading, and handling missing values, as well as basic plotting techniques using Matplotlib. Additionally, it introduces data visualization with Seaborn for correlation analysis and demonstrates various methods for preprocessing data.

Uploaded by

Ali

Pandas

In this section we look more closely at how Python can be used to manipulate,
clean, and query data with the pandas data toolkit.

 The pandas Series

 The Series is the base data structure of pandas. A Series is similar to a NumPy
array, but it differs by having an index, which allows for much richer lookup of
items than a plain zero-based array position.

import pandas as pd
d=pd.Series([11,12,13,14])
d
 Multiple items can be retrieved by specifying their labels in a Python list.

import pandas as pd
d[[1,3]]
Pandas: Series

d=pd.Series([11,12,13,14],index=['a','b','c','d'])
d[['a','b']]      # select by label
d.iloc[[0,1]]     # select the same items by position

 We can examine the index of a Series using the index property:

d.index

 Two Series objects can be combined with an arithmetic operation; values are matched by index label

d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,5],index=['a','b','c','d'])
diff=d1-d2
print(diff)

diff.mean()
diff
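
The label matching described above matters most when the two indexes differ. The following is a minimal sketch with made-up values (s1 and s2 are illustrative names, not from the slides) showing what happens when only some labels are shared:

```python
import pandas as pd

# Hypothetical data: the two Series share only the labels 'b' and 'c'.
s1 = pd.Series([11, 12, 13], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Subtraction aligns on labels, not on position.
diff = s1 - s2
print(diff)

# Labels present in only one operand produce NaN.
print(diff.isna().sum())

# mean() skips NaN by default, so it averages only 'b' and 'c'.
print(diff.mean())
```

The result index is the union of both indexes, so no label is silently dropped.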
Pandas: DataFrame

 A pandas series can only have a single value associated with each index label.

 To have multiple values per index label we can use a data frame. A data frame
represents one or more Series objects aligned by index label.

 Each series will be a column in the data frame, and each column can have an
associated name

d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,4],index=['a','b','c','d'])
temp_df=pd.DataFrame({'value1':d1,'value2':d2})
temp_df
 Columns in a DataFrame object can be accessed using an array indexer with the name
of the column or a list of column names

temp_df['value1']
temp_df[['value1','value2']]

 Passing a list to the [] operator of a DataFrame retrieves the specified columns,
whereas passing a list to a Series returns rows.

 A new column can be added to a DataFrame simply by assigning another Series to a
column using the array indexer [] notation

temp_dfs=pd.DataFrame()
g=temp_df['value1']-temp_df['value2']
print(g)
temp_df['diff']=temp_df['value1']-temp_df['value2']
temp_df
 The names of the columns in a DataFrame are accessible via the columns
property
temp_df.columns

 The DataFrame and Series objects can be sliced to retrieve specific rows

temp_df[0:3]

temp_df.value1[0:3]

 Entire rows from a data frame can be retrieved using the .loc and .iloc properties.
.loc looks rows up by index label, whereas .iloc uses the 0-based position.

temp_df.loc['a']

temp_df.iloc[0]

temp_df.iloc[[1,3]].value1     # rows at positions 1 and 3, then the value1 column
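
The difference between .loc and .iloc is easiest to see when the index labels are integers that do not start at zero. A minimal sketch, assuming a small made-up frame (the labels 10, 20, 30 are illustrative only):

```python
import pandas as pd

# Hypothetical frame whose integer labels are deliberately not 0, 1, 2.
df = pd.DataFrame({'value1': [11, 12, 13]}, index=[10, 20, 30])

by_label = df.loc[10]        # the row labelled 10
by_position = df.iloc[0]     # the first row, whatever its label

print(by_label['value1'], by_position['value1'])

# df.loc[0] raises KeyError because no row is labelled 0.
try:
    df.loc[0]
except KeyError:
    print('no row labelled 0')
```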
 The following expression tests which values in the IMO column are greater than 7 and returns a Boolean Series (df2 is loaded in the next step)

df2.IMO>7

 Loading data from files into a DataFrame

import pandas as pd
df2 = pd.read_excel('2010.xlsx')   # reading .xlsx files needs the openpyxl package
df2 = pd.read_csv('2010.csv')      # alternatively, load from a CSV file

 Get type of column

type(df2.IMO[0])
 To transpose a DataFrame (swap its rows and columns), use the T attribute

df2=df2.T

 Selecting rows by label (after the transpose, the former column names are the row labels)

df2.loc[['IYR','IMO']]

df2.loc['IYR'][0]
 Deleting data from DataFrames using drop for rows or del for columns

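df2.drop('IYR')                       # returns a new frame without the row labelled 'IYR'
del df2['IMO']                        # removes the column 'IMO' in place
df2 = df2.drop(['IMO'], axis=1)       # alternative to del (use one or the other); axis=1 means columns, the default axis=0 means rows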
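
Because drop returns a new object while del changes the frame in place, the two are easy to confuse. A small sketch on a made-up frame (the row labels r0 to r2 are hypothetical) that contrasts them:

```python
import pandas as pd

# Hypothetical frame reusing the column names from the text.
df = pd.DataFrame({'IYR': [2010, 2010, 2010], 'IMO': [1, 7, 12]},
                  index=['r0', 'r1', 'r2'])

no_row = df.drop('r1')                 # axis=0 (default): drop a row by label
no_col = df.drop(['IMO'], axis=1)      # axis=1: drop a column by name

# drop returns a new object; df itself is unchanged until reassigned.
print(df.shape, no_row.shape, no_col.shape)

del df['IMO']                          # del modifies df in place
print(list(df.columns))
```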
 Add column to DataFrames

df2['IMO']=0

 Read and update a column of a DataFrame

df2['IMO']=df2['IMO']+2
Query for DataFrames
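 To find the accidents in months later than 6, first build a Boolean mask:

df2['IMO']>6

 Now apply the mask with the where method, which keeps every row and fills the non-matching ones with NaN:

dfbigger=df2.where(df2['IMO']>6)

dfbigger=df2[(df2['IMO']>6) & (df2['DAY']>10)]   # Boolean indexing keeps only the rows that match both conditions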

dfbigger
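
The two query styles above return differently shaped results. The sketch below uses a made-up four-row frame (the values are illustrative, only the column names IMO and DAY come from the text) to show the contrast:

```python
import pandas as pd

# Hypothetical accident records; IMO is the month and DAY the day of month.
df = pd.DataFrame({'IMO': [3, 7, 9, 12], 'DAY': [5, 20, 8, 15]})

mask = df['IMO'] > 6

masked = df.where(mask)                    # same shape, NaN where mask is False
filtered = df[mask & (df['DAY'] > 10)]     # only rows satisfying both conditions

print(masked)
print(filtered)
```

where is useful when the original row positions must be preserved; Boolean indexing is usually the better fit when only the matching rows are needed.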

 Set or reset index for DataFrames

dfbigger=dfbigger.set_index('IYR')
print(dfbigger)
dfbigger=dfbigger.reset_index('IYR')
dfbigger
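
As a rough illustration of why one might set an index at all, the sketch below (made-up values, same column names) uses IYR as the row label for a lookup and then restores it as a column:

```python
import pandas as pd

# Hypothetical data with one row per year.
df = pd.DataFrame({'IYR': [2009, 2010, 2011], 'IMO': [7, 8, 9]})

indexed = df.set_index('IYR')       # IYR becomes the row labels
print(indexed.loc[2010, 'IMO'])     # label-based lookup by year

restored = indexed.reset_index()    # IYR returns as an ordinary column
print(list(restored.columns))
```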
DataFrames: preProcess
 Count non-NA cells for each column or row

df2.count(axis=0, numeric_only=False)

 Get numeric columns or object columns

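df2.dtypes                                      # dtype of each column
df2.select_dtypes(include=['number']).columns   # names of the numeric columns
df2.select_dtypes(include=['object'])           # only the object (text) columns

df4=df2.select_dtypes(include=['object'])
df2[~df2.isin(df4)]                             # blanks out (NaN) the cells that belong to the object columns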

 Find empty cells and replace them with NaN

DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

import numpy as np
df2 = df2.replace(r'^\s*$',np.nan, regex=True)   # ^\s*$ matches only cells that are empty or all whitespace
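
The choice of pattern matters here. The sketch below, on a made-up one-column frame, compares an anchored pattern with an unanchored one; the column name city and its values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing real text, an empty string, and a blank string.
df = pd.DataFrame({'city': ['New York', '', '   ', 'Boston']})

# Anchored pattern: only cells that are empty or whitespace-only match.
cleaned = df.replace(r'^\s*$', np.nan, regex=True)
print(cleaned['city'].isna().tolist())

# Unanchored pattern: any cell that contains whitespace matches,
# so 'New York' is lost as well.
too_greedy = df.replace(r'\s', np.nan, regex=True)
print(too_greedy['city'].isna().tolist())
```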

Plot

 matplotlib.pyplot is a collection of command-style functions that make matplotlib
work like MATLAB

import matplotlib.pyplot as plt
plt.plot([1,2,3], [1,2,3], 'go-', linewidth=2)
plt.plot([1,2,3], [1,4,9], 'rs', markersize=14)
plt.show()

 Another sample: three curves drawn with a single plot call

import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.title('some values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['t','t**2','t**3'])
plt.show()

 Another sample: plotting the columns of a DataFrame, each with its own line style

t=np.arange(0,5,0.2)
df=pd.DataFrame({0:t , 1:t**1.5 , 2:t**2 , 3:t**2.5 , 4:t**3})
legend_labels=['Solid' , 'Dashed' , 'Dotted' , 'Dot-dashed' , 'Points']

df.plot(style=['r-','g--', 'b:', 'm-.' , 'k.'])   # 'k.' draws point markers, matching the 'Points' label

plt.legend(legend_labels )
plt.show()
matplotlib.pyplot.subplots

matplotlib.pyplot.subplots returns a Figure instance and either a single Axes or an array of Axes,
depending on the number of subplots requested
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True, ...)
import matplotlib.pyplot as plt
import numpy as np

# Simple data to display in various forms

x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)

plt.close('all')

# Just a figure and one subplot

f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')
plt.show()

 A scatter plot displays the relationship between a pair of variables

 Define two subplots

f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)
plt.show()

 Define two subplots in one row

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)
plt.show()

 Define three subplots sharing both x and y axes

f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)

ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
plt.show()

 Define four axes, returned as a 2-D array

f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')
plt.show()
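
The examples above unpack the second return value of plt.subplots in three different ways. The following sketch checks what that value looks like in each case; the Agg backend is an assumption made only so the code runs without a display:

```python
import matplotlib
matplotlib.use('Agg')   # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

f1, ax = plt.subplots()            # one subplot: a single Axes object
f2, col = plt.subplots(2)          # two rows: 1-D array of shape (2,)
f3, grid = plt.subplots(2, 2)      # 2x2 grid: 2-D array of shape (2, 2)

print(type(ax).__name__, col.shape, grid.shape)

# squeeze=False always returns a 2-D array, which simplifies generic loops.
f4, always_2d = plt.subplots(1, 1, squeeze=False)
print(always_2d.shape)

plt.close('all')
```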
Calculating correlation with the seaborn package

1- Load the data
colNames = ["Age", "type_employer", "fnlwgt", "Education", "Education-Num", "Martial","Occupation",
"Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
"H-per-week", "Country", "Label"]
data = pd.read_csv("adult-data.txt", names=colNames,delimiter=',',header=None)
data

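2- Install seaborn: conda install seaborn

3- Draw the correlation heatmap
import seaborn as sns
%matplotlib inline                          # Jupyter notebook magic; omit in a plain script
sns.heatmap(data.corr(numeric_only=True))   # numeric_only=True skips the text columns
plt.show()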
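
The heatmap is only a picture of the correlation matrix, so it can help to look at that matrix first. A minimal sketch on a made-up miniature of the data (the values are invented; numeric_only is available in pandas 1.5 and later):

```python
import pandas as pd

# Hypothetical miniature of the adult data set: two numeric columns, one text column.
data = pd.DataFrame({
    'Age': [25, 38, 52, 46],
    'H-per-week': [40, 50, 45, 60],
    'Sex': ['Male', 'Female', 'Male', 'Female'],
})

# numeric_only=True restricts the calculation to numeric columns.
corr = data.corr(numeric_only=True)
print(corr)
```

The resulting square DataFrame is the kind of input that sns.heatmap expects.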
 Read data
from sklearn import preprocessing
import numpy as np
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
df2

 Show Numeric Columns

df2.select_dtypes(include=[np.number])

 Replace empty cells with NaN

df2 = df2.replace(r'^\s*$',np.nan, regex=True)
 Drop all empty columns, then fill the remaining NaN values with the column means
df2=df2.dropna(axis='columns', how='all')
#df2.isnull().mean()
df2.fillna(df2.mean(numeric_only=True),inplace=True)

 Keep only the columns whose fraction of missing values is below the 0.5 threshold

#df2.columns[df2.isnull().mean() < 0.8]

df2=df2[df2.columns[df2.isnull().mean() < 0.5]]
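
The threshold filter relies on the fact that the mean of a Boolean column is the fraction of True values. A short sketch with a made-up frame (the column names are hypothetical) makes the intermediate result visible:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'mostly_missing' is 75% NaN, 'complete' has no NaN.
df = pd.DataFrame({
    'complete': [1.0, 2.0, 3.0, 4.0],
    'some_missing': [1.0, np.nan, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
})

missing_fraction = df.isnull().mean()     # fraction of NaN per column
print(missing_fraction)

kept = df[df.columns[missing_fraction < 0.5]]
print(list(kept.columns))
```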
Find Missing values

 Now let's see if we have any missing values

df2.isnull()
df2.notnull()
df2.isnull()[15:20]
 It is possible to drop rows that contain NaN values:

df2 = df2.dropna()                          # drop every row that has at least one NaN
df2=df2.dropna(axis='index', how='all')     # or drop only the rows in which every value is NaN

 If a column like IMO2 is all NaN, we can drop it:

df2 = df2.drop(['IMO2'], axis=1)

 Show a summary of the null values in each column

df2.isnull().sum()
Delete Missing values or replace

 Fill the NaN values in every numeric column with the column mean

df2.fillna(df2.mean(numeric_only=True),inplace=True)

 Suppose the IYR value of some accidents is NaN in our dataset. Let's
change those NaN values to the mean of the IYR column

df2.loc[df2.index[[1, 2, 3]], 'IYR'] = np.nan   # or: df2.loc[[0,11,12,13,14,15,16], 'IYR'] = np.nan

df2=df2.fillna({'IYR': df2['IYR'].mean()})
df2[1:10]
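
The same steps can be tried on a small made-up frame where the result is easy to check by hand. This is only a sketch; the values below are invented and only the column names come from the text:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the five IYR values are missing.
df = pd.DataFrame({'IYR': [2008.0, np.nan, 2010.0, np.nan, 2012.0],
                   'IMO': [1, 2, 3, 4, 5]})

print(df.isnull().sum())                       # NaN count per column

# Fill only the IYR column; the mean is computed over the non-NaN values.
filled = df.fillna({'IYR': df['IYR'].mean()})
print(filled['IYR'].tolist())
```

Passing a dictionary to fillna limits the replacement to the named columns, so other columns keep their NaN values if they have any.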
