Pandas
We're going to deepen our investigation into how Python can be used to manipulate,
clean, and query data by looking at the pandas data toolkit.
The Series is the base data structure of pandas. A Series is similar to a NumPy
array, but it differs by having an index, which allows for much richer lookup of
items instead of just a zero-based array index value.
import pandas as pd
d=pd.Series([11,12,13,14])
d
Multiple items can be retrieved by specifying their labels in a Python list.
import pandas as pd
d[[1,3]]
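A minimal sketch of the lookup described above, using the same values as the slide. The variable names labels, values, and picked are only illustrative.

```python
import pandas as pd

# With no index argument, pandas assigns the labels 0, 1, 2, 3.
d = pd.Series([11, 12, 13, 14])

labels = d.index.tolist()    # the index labels
values = d.tolist()          # the stored values
picked = d[[1, 3]]           # a list of labels returns a new Series

print(labels, values)
print(picked)
```

The result of d[[1,3]] is itself a Series, so it keeps the labels 1 and 3 next to the selected values.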
Pandas: Series
d=pd.Series([11,12,13,14],index=['a','b','c','d'])
d[['a','b']]    # select by label
d.iloc[[0,1]]   # select the same items by position
d.index         # the index labels
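When the index holds string labels, it can help to state explicitly whether a lookup is by label or by position. A small sketch, assuming the same Series as above; by_label and by_position are illustrative names.

```python
import pandas as pd

d = pd.Series([11, 12, 13, 14], index=['a', 'b', 'c', 'd'])

by_label = d.loc[['a', 'b']]     # explicit label-based lookup
by_position = d.iloc[[0, 1]]     # explicit position-based lookup

print(by_label.equals(by_position))
print(d.index.tolist())
```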
d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,5],index=['a','b','c','d'])
diff=d1-d2
print(diff)
diff.mean()
diff
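Arithmetic between two Series matches values by index label, not by position. The sketch below illustrates this with a shuffled copy of the second Series; d2_shuffled and partial are made-up names for this example.

```python
import pandas as pd

d1 = pd.Series([11, 12, 13, 14], index=['a', 'b', 'c', 'd'])
# Same label/value pairs as d2 above, but stored in reverse order.
d2_shuffled = pd.Series([5, 3, 2, 1], index=['d', 'c', 'b', 'a'])

diff = d1 - d2_shuffled          # values are paired by label
print(diff.sort_index())
print(diff.mean())

# A label that exists on only one side produces NaN in the result.
partial = d1 - pd.Series([1], index=['a'])
print(partial)
```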
Pandas: DataFrame
A pandas series can only have a single value associated with each index label.
To have multiple values per index label we can use a data frame. A data frame
represents one or more Series objects aligned by index label.
Each series will be a column in the data frame, and each column can have an
associated name.
d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,4],index=['a','b','c','d'])
temp_df=pd.DataFrame({'value1':d1,'value2':d2})
temp_df
Columns in a DataFrame object can be accessed using an array indexer with the name
of the column or a list of column names.
temp_df['value1']
temp_df[['value1','value2']]
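The two forms of column access return different types. A short sketch, assuming the same temp_df as above; one_column and sub_frame are illustrative names.

```python
import pandas as pd

d1 = pd.Series([11, 12, 13, 14], index=['a', 'b', 'c', 'd'])
d2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
temp_df = pd.DataFrame({'value1': d1, 'value2': d2})

one_column = temp_df['value1']      # a single name returns a Series
sub_frame = temp_df[['value1']]     # a list of names returns a DataFrame

print(type(one_column).__name__, type(sub_frame).__name__)
print(temp_df.shape)
```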
Pandas: DataFrame
temp_dfs=pd.DataFrame()
g=temp_df['value1']-temp_df['value2']
print(g)
temp_df['diff']=temp_df['value1']-temp_df['value2']
temp_df
The names of the columns in a DataFrame are accessible via the columns
property
temp_df.columns
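Assigning to a column name that does not exist yet adds a new column. The sketch below rebuilds temp_df directly from lists (an assumption made to keep it self-contained) and then lists the column names.

```python
import pandas as pd

temp_df = pd.DataFrame(
    {'value1': [11, 12, 13, 14], 'value2': [1, 2, 3, 4]},
    index=['a', 'b', 'c', 'd'],
)

# Column arithmetic is element-wise; the result is stored as a new column.
temp_df['diff'] = temp_df['value1'] - temp_df['value2']

column_names = list(temp_df.columns)
print(column_names)
print(temp_df)
```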
Pandas: DataFrame
The DataFrame and Series objects can be sliced to retrieve specific rows
temp_df[0:3]
temp_df.value1[0:3]
Entire rows from a data frame can be retrieved using the .loc and .iloc properties.
.loc looks rows up by index label, whereas .iloc uses the 0-based position.
temp_df.loc['a']
temp_df.iloc[0]
temp_df.iloc[[1,3]].value1    # rows at positions 1 and 3, then a single column
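A sketch comparing the lookups above on the same small frame. One detail worth noting: a label slice with .loc includes its end label, while a position slice with .iloc excludes its end position. The variable names are illustrative.

```python
import pandas as pd

temp_df = pd.DataFrame(
    {'value1': [11, 12, 13, 14], 'value2': [1, 2, 3, 4]},
    index=['a', 'b', 'c', 'd'],
)

first_three = temp_df[0:3]           # slice of rows by position
by_label = temp_df.loc['a']          # one row, looked up by label
by_position = temp_df.iloc[0]        # the same row, looked up by position

label_slice = temp_df.loc['a':'c']   # includes the row labelled 'c'
position_slice = temp_df.iloc[0:2]   # stops before position 2

print(by_label.equals(by_position))
print(len(label_slice), len(position_slice))
```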
Pandas: DataFrame
import pandas as pd
df2 = pd.read_excel('2010.xlsx')    # for a CSV file instead: df2 = pd.read_csv('2010.csv')
type(df2.IMO[0])
The following code marks, row by row, whether the value in the IMO column is greater than 7
df2.IMO>7
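A comparison on a column produces a boolean Series with one True or False per row. The sketch below uses a small made-up frame in place of 2010.xlsx; only the column names IYR and IMO come from the slides, the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the accident data (values are invented).
df2 = pd.DataFrame({'IYR': [2010] * 5, 'IMO': [1, 6, 7, 8, 12]})

mask = df2.IMO > 7            # one boolean per row
print(mask.tolist())
print(int(mask.sum()))        # True counts as 1, so this is the number of matches
```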
Pandas: DataFrame
To transpose a DataFrame (swap its rows and columns), use the T attribute
df2=df2.T
df2.loc[['IYR','IMO']]
df2.loc['IYR'][0]
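After a transpose, the former column names become row labels, which is why .loc can be used with 'IYR' and 'IMO'. A sketch on a made-up frame (the values are invented):

```python
import pandas as pd

# Hypothetical stand-in for the accident data.
df = pd.DataFrame({'IYR': [2010, 2010, 2011], 'IMO': [3, 7, 9]})

transposed = df.T
print(transposed.index.tolist())     # former column names
print(transposed.columns.tolist())   # former row labels
print(transposed.loc['IYR', 0])      # value of IYR in the original row 0
```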
Pandas: DataFrame
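Delete data from a DataFrame using drop for rows or del for columns
df2.drop('IYR')    # returns a copy without the row labelled 'IYR'
del df2['IMO']     # removes the column 'IMO' in place
df2 = df2.drop(['IMO'], axis=1)    # alternative to del: axis=1 targets columns, the default axis=0 targets rows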
Add column to DataFrames
df2['IMO']=0
df2['IMO']=df2['IMO']+2
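A sketch of dropping and adding on a small made-up frame. It shows that drop returns a new object and leaves the original untouched unless the result is assigned back. The column name COUNT is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the accident data.
df = pd.DataFrame({'IYR': [2010, 2010, 2010], 'IMO': [1, 2, 3]})

without_first_row = df.drop(0)               # drop the row labelled 0
without_imo = df.drop(['IMO'], axis=1)       # drop a column
print(list(df.columns))                      # the original still has both columns

df['COUNT'] = 0                              # new column filled with a constant
df['COUNT'] = df['COUNT'] + 2                # element-wise update
print(df)
```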
Query for DataFrames
To select the accidents that occurred in months after June (IMO greater than 6), filter the rows with a boolean mask:
dfbigger=df2[df2['IMO']>6]
dfbigger=dfbigger.set_index('IYR')
print(dfbigger)
dfbigger=dfbigger.reset_index('IYR')
dfbigger
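The query steps above can be sketched end to end on a small made-up frame (the values are invented). set_index moves a column into the index and reset_index moves it back.

```python
import pandas as pd

# Hypothetical stand-in for the accident data.
df2 = pd.DataFrame({'IYR': [2009, 2010, 2010, 2011], 'IMO': [3, 7, 9, 12]})

dfbigger = df2[df2['IMO'] > 6]          # keep rows where the mask is True
indexed = dfbigger.set_index('IYR')     # IYR becomes the index
restored = indexed.reset_index()        # IYR is a regular column again

print(indexed)
print(restored)
```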
DataFrames: preProcess
Count non-NA cells for each column or row
df2.count(axis=0, numeric_only=False)
df4=df2.select_dtypes(include=['object'])
df2[~df2.isin(df4)]
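A sketch of counting non-NA cells and splitting columns by type on a made-up frame. Note that it selects by 'number' rather than 'object' as on the slide; that is a substitution made here so the numeric and non-numeric columns are easy to separate. The column STATE is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing cells.
df = pd.DataFrame({
    'IYR': [2010, 2010, np.nan],
    'IMO': [1, np.nan, np.nan],
    'STATE': ['TX', None, 'CA'],
})

per_column = df.count(axis=0)     # non-NA cells in each column
per_row = df.count(axis=1)        # non-NA cells in each row

numeric_part = df.select_dtypes(include=['number'])
other_part = df.select_dtypes(exclude=['number'])

print(per_column)
print(per_row)
```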
another sample
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.title('some values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['t','t**2','t**3'])
plt.show()
plot
Plot dataframe
another sample
t=np.arange(0,5,0.2)
df=pd.DataFrame({0:t , 1:t**1.5 , 2:t**2 , 3:t**2.5 , 4:t**3})
legend_labels=['Solid' , 'Dashed' , 'Dotted' , 'Dot-dashed' , 'Points']
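The slide defines the data and the legend labels but stops before plotting. One possible way to finish it is sketched below. The mapping from each label to a matplotlib style string is an assumption, as is the non-interactive Agg backend, which is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

t = np.arange(0, 5, 0.2)
df = pd.DataFrame({0: t, 1: t**1.5, 2: t**2, 3: t**2.5, 4: t**3})
legend_labels = ['Solid', 'Dashed', 'Dotted', 'Dot-dashed', 'Points']

# Assumed style per column, in the same order as legend_labels.
styles = ['-', '--', ':', '-.', '.']

ax = df.plot(style=styles)    # one line per column
ax.legend(legend_labels)      # replace the default column-name legend
```

In an interactive session, plt.show() would display the figure.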
matplotlib.pyplot.subplots returns a Figure instance and either a single Axes or an array
of Axes, depending on the number of subplots requested
matplotlib.pyplot.subplots(nrows=1, ncols=1, ...)
import matplotlib.pyplot as plt
import numpy as np
plt.close('all')
x = np.linspace(0, 2 * np.pi, 400)    # sample data, assumed for illustration
y = np.sin(x ** 2)
f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')
plt.show()
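The shape of the second return value depends on the grid that was requested. A small sketch, assuming a non-interactive backend so it runs without a display; the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

fig_single, ax_single = plt.subplots()       # one subplot: a single Axes
fig_row, ax_row = plt.subplots(1, 3)         # one row: a 1-D array of Axes
fig_grid, ax_grid = plt.subplots(2, 2)       # grid: a 2-D array of Axes

print(isinstance(ax_single, np.ndarray))
print(ax_row.shape, ax_grid.shape)

plt.close('all')
```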
Plot the correlation matrix with the seaborn package
1-
colNames = ["Age", "type_employer", "fnlwgt", "Education", "Education-Num", "Martial","Occupation",
"Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
"H-per-week", "Country", "Label"]
data = pd.read_csv("adult-data.txt", names=colNames,delimiter=',',header=None)
data
3-
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.heatmap(data.corr(numeric_only=True))    # correlate the numeric columns only
plt.show()
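The matrix passed to the heatmap comes from pandas itself. A sketch on a tiny made-up frame (the values are invented; only the column names are taken from colNames), assuming pandas 1.5 or later for the numeric_only argument:

```python
import pandas as pd

# Hypothetical rows; the real data comes from adult-data.txt.
data = pd.DataFrame({
    "Age": [25, 38, 47, 52, 61],
    "H-per-week": [40, 45, 50, 38, 20],
    "Sex": ["Male", "Female", "Male", "Female", "Male"],
})

# Text columns such as Sex are left out of the correlation matrix.
corr = data.corr(numeric_only=True)
print(corr)
```

The result is a square DataFrame with the numeric column names on both axes, which is the form sns.heatmap expects.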
Read data
from sklearn import preprocessing
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
df2
df2.isnull()
df2.notnull()
df2.isnull()[15:20]
Rows that contain NaN values can be dropped:
df2 = df2.dropna()
df2=df2.dropna(axis='columns', how='all')    # drop columns in which every value is NaN
df2.isnull().sum()
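A sketch of the missing-value checks on a small made-up frame. The column EMPTY is hypothetical and exists only to show how='all'.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one fully empty column and a few missing cells.
df = pd.DataFrame({
    'IYR': [2010, np.nan, 2010],
    'IMO': [1, 2, np.nan],
    'EMPTY': [np.nan, np.nan, np.nan],
})

missing_per_column = df.isnull().sum()                   # NaN count per column
no_empty_cols = df.dropna(axis='columns', how='all')     # drops only EMPTY
complete_rows = no_empty_cols.dropna()                   # keeps rows with no NaN

print(missing_per_column)
print(complete_rows)
```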
Delete or replace missing values
df2.fillna(df2.mean(numeric_only=True),inplace=True)    # fill each numeric column with its own mean
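A sketch of mean imputation on a made-up frame, assuming that only numeric columns should be filled. The column STATE is hypothetical; it shows that columns without a mean are left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing year and one missing text value.
df = pd.DataFrame({
    'IYR': [2009.0, np.nan, 2011.0],
    'STATE': ['TX', 'CA', None],
})

column_means = df.mean(numeric_only=True)    # a Series indexed by column name
filled = df.fillna(column_means)             # each column is filled with its own mean

print(filled)
```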
Suppose a column such as IYR is NaN for some accidents in our dataset. Let's
change NaN to the mean value of