A demo of pandas: data cleaning
data_pandas.csv:
Models,col-0,col-1,col-2,col-3
A,0.1,,0.4,0.5
B,0.2,0.6,,0.1
C,0.6,,0.2,0.8
D,0.1,0.2,0.3,0.4
import pandas as pd
import numpy as np
#
# n = 30
# m = 5
# index = ["line-{}".format(i) for i in range(n)]
# columns = ["col-{}".format(i) for i in range(m)]
# df = pd.DataFrame(
#     np.random.randn(n, m),
#     index=index,
#     columns=columns
# )
df = pd.read_csv("data_pandas.csv")
print(df)
print(df['col-1'].isnull())
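# Candidate fill values for col-1. Each assignment overwrites the previous one,
# so only the last (the mode) is actually used.
filldata = df['col-1'].mean()
filldata = df['col-1'].median()
filldata = df['col-1'].mode()  # mode() returns a Series, not a scalar
# NOTE: DataFrame.fillna matches a Series argument against column labels.
# The mode Series is indexed 0, 1, ..., so no column matches and the NaNs
# remain (see the terminal output below). To fill col-1 with its mean, use
# df['col-1'] = df['col-1'].fillna(df['col-1'].mean()) instead.
df.fillna(filldata, inplace=True)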
df.loc[3,'col-0'] = 100
# df.dropna(subset=['col-1'], inplace=True) # drop rows that contain NaN in col-1
# df.dropna(inplace=True)
print(df.info())  # info() prints its own report and returns None, hence the "None" line in the output
for i in df.index:
    for j in ['col-{}'.format(x) for x in range(4)]:
        if df.loc[i, j] < 0.3:
            df.loc[i, j] = 200
print(df)
# newdf = df['col-1'].dropna()
# newdf = df.dropna() # drop rows that contain missing data
# print(newdf)
# print(df.columns)
# print(df.loc[['line-0','line-1']])
# # print(df[['col-0','col-1']])
# print("thisisit")
# print(df.loc[['line-1']]['col-0'])
# df.loc[['line-1']]['col-0'] = 100
# print(df.loc[['line-0','line-1']])
# terminal:
  Models  col-0  col-1  col-2  col-3
0      A    0.1    NaN    0.4    0.5
1      B    0.2    0.6    NaN    0.1
2      C    0.6    NaN    0.2    0.8
3      D    0.1    0.2    0.3    0.4
0 True
1 False
2 True
3 False
Name: col-1, dtype: bool
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Models  4 non-null      object
 1   col-0   4 non-null      float64
 2   col-1   2 non-null      float64
 3   col-2   3 non-null      float64
 4   col-3   4 non-null      float64
dtypes: float64(4), object(1)
memory usage: 288.0+ bytes
None
  Models  col-0  col-1  col-2  col-3
0      A  200.0    NaN    0.4    0.5
1      B  200.0    0.6    NaN  200.0
2      C    0.6    NaN  200.0    0.8
3      D  100.0  200.0    0.3    0.4
Process finished with exit code 0
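
In the output above, col-1 still contains NaN after the fillna call, and the nested loop visits the cells one at a time. Below is a minimal sketch of a vectorized variant. It assumes the same rows as data_pandas.csv, inlined through io.StringIO so the block runs on its own. The helper name clean and its parameters (fill_col, threshold, replacement) are hypothetical and not part of the original script. The df.loc[3,'col-0'] = 100 step is left out on purpose.

```python
import io

import pandas as pd

# Same rows as data_pandas.csv, inlined so the example is self-contained.
CSV_TEXT = """Models,col-0,col-1,col-2,col-3
A,0.1,,0.4,0.5
B,0.2,0.6,,0.1
C,0.6,,0.2,0.8
D,0.1,0.2,0.3,0.4
"""


def clean(df, fill_col="col-1", threshold=0.3, replacement=200.0):
    """Hypothetical helper: fill one column's NaNs with its mean,
    then overwrite every value below `threshold` with `replacement`."""
    out = df.copy()  # leave the caller's DataFrame untouched
    value_cols = [c for c in out.columns if c.startswith("col-")]

    # Series.fillna with a scalar fills only the chosen column.
    out[fill_col] = out[fill_col].fillna(out[fill_col].mean())

    # A boolean mask stands in for the nested loop: cells below the
    # threshold are replaced, all others (including NaN) are kept.
    values = out[value_cols]
    out[value_cols] = values.mask(values < threshold, replacement)
    return out


df = pd.read_csv(io.StringIO(CSV_TEXT))
cleaned = clean(df)
print(cleaned)
```

Only col-1 is filled here, matching the intent of the comment in the script. The NaN in col-2 is left alone, because a comparison against NaN is False and the mask therefore skips it.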