pandas Chapter 6: Missing Data

陈述c

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/weixin_44617944/article/details/109259300

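This article explains in detail how to handle missing values in a CSV file with pandas, covering the types of missing observations, the isna and notna methods, and techniques such as filling, dropping, and interpolation. It gives practical steps for data analysis, from data cleaning to keeping data complete.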

import numpy as np
import pandas as pd

df = pd.read_csv('data/table_missing.csv')
df.head()

  School Class      ID Gender   Address  Height  Weight  Math Physics      
0    S_1   C_1     NaN      M  street_1     173     NaN  34.0      A+      
1    S_1   C_1     NaN      F  street_2     192     NaN  32.5      B+      
2    S_1   C_1  1103.0      M  street_2     186     NaN  87.2      B+      
3    S_1   NaN     NaN      F  street_2     167    81.0  80.4     NaN      
4    S_1   C_1  1105.0    NaN  street_4     159    64.0  84.8      A-

1. Missing Observations and Their Types

Getting to know the missing information

The isna and notna methods

On a Series
isna

df['Physics'].isna().head()

0    False
1    False
2    False
3     True
4    False
Name: Physics, dtype: bool

notna

df['Physics'].notna().head()

0     True
1     True
2     True
3    False
4     True
Name: Physics, dtype: bool

On a DataFrame

df.isna()

   School  Class     ID  Gender  Address  Height  Weight   Math  Physics   
0   False  False   True   False    False   False    True  False    False   
1   False  False   True   False    False   False    True  False    False   
2   False  False  False   False    False   False    True  False    False   
3   False   True   True   False    False   False   False  False     True   
4   False  False  False    True    False   False   False  False    False

For a DataFrame, what usually matters more is how many missing values each column has

df.isna().sum()

School      0
Class       4
ID          6
Gender      7
Address     0
Height      0
Weight     13
Math        5
Physics     4
dtype: int64
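
The counting idea above can be sketched on a small frame. The frame `toy` and its values are made up for illustration and are not taken from table_missing.csv. Because `isna()` returns booleans, `sum()` gives the count of missing values per column and `mean()` gives their share.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the article's table_missing.csv
toy = pd.DataFrame({
    "Height": [173, 192, 186, 167],
    "Weight": [np.nan, np.nan, 64.0, 81.0],
    "Physics": ["A+", None, "B+", "B-"],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column
print(missing_count)
print(missing_share)
```

The share is often easier to read than the raw count when columns are long.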

The info method also shows this, as non-null counts per column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   School   35 non-null     object
 1   Class    31 non-null     object
 2   ID       29 non-null     float64
 3   Gender   28 non-null     object
 4   Address  35 non-null     object
 5   Height   35 non-null     int64
 6   Weight   22 non-null     float64
 7   Math     30 non-null     float64
 8   Physics  31 non-null     object
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB

View the rows in which a given column is missing

df[df['Physics'].isna()]

   School Class      ID Gender   Address  Height  Weight  Math Physics     
3     S_1   NaN     NaN      F  street_2     167    81.0  80.4     NaN     
8     S_1   C_2  1204.0      F  street_5     162    63.0  33.8     NaN     
13    S_1   C_3  1304.0    NaN  street_2     195    70.0  85.2     NaN     
22    S_2   C_2  2203.0      M  street_4     155    91.0  73.8     NaN

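Select the rows that contain no missing values

all keeps rows in which every value is non-missing; any keeps rows with at least one non-missing value

df[df.notna().all(axis=1)]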

   School Class      ID Gender   Address  Height  Weight  Math Physics     
5     S_1   C_2  1201.0      M  street_5     159    68.0  97.0      A-     
6     S_1   C_2  1202.0      F  street_4     176    94.0  63.5      B-     
12    S_1   C_3  1303.0      M  street_7     188    82.0  49.7       B     
17    S_2   C_1  2103.0      M  street_4     157    61.0  52.5      B-     
21    S_2   C_2  2202.0      F  street_7     194    77.0  68.5      B+     
25    S_2   C_3  2301.0      F  street_4     157    78.0  72.3      B+     
27    S_2   C_3  2303.0      F  street_7     190    99.0  65.9       C     
28    S_2   C_3  2304.0      F  street_6     164    81.0  95.5      A-     
29    S_2   C_3  2305.0      M  street_4     187    73.0  48.9       B
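
A minimal sketch of the difference between all and any, on a made-up two-column frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 is partly missing, row 2 is empty
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [2.0, 5.0, np.nan],
})

complete_rows = toy[toy.notna().all(axis=1)]  # every value in the row is present
partial_rows = toy[toy.notna().any(axis=1)]   # at least one value in the row is present
print(complete_rows)
print(partial_rows)
```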

Three missing-value symbols

np.nan

It is not equal to anything, including itself

np.nan == np.nan
False

However, a comparison with the equals method treats np.nan values in the same position as equal

df.equals(df)
True
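
As a hedged illustration of the two comparisons, the sketch below uses two made-up Series `a` and `b` with identical content. It also shows `np.isnan`, which is a safer test for NaN than `==`.

```python
import numpy as np
import pandas as pd

# Two hypothetical Series with the same values, including a NaN
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

elementwise = (a == b)         # NaN == NaN is False at that position
same = a.equals(b)             # equals treats NaN in the same position as equal
nan_is_nan = np.isnan(np.nan)  # explicit NaN test instead of ==
print(elementwise.tolist(), same, nan_is_nan)
```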

Its type is float, so a numeric Series that contains it becomes float64

pd.Series([1,np.nan,3]).dtype
float64

For a Series created with bool dtype, an np.nan entry is automatically converted to True

pd.Series([1,np.nan,3],dtype='bool')
0    True
1    True
2    True
dtype: bool

But when np.nan is assigned into an existing boolean Series, the dtype of the Series changes instead of the value becoming True

s = pd.Series([True,False],dtype='bool')
s[1]=np.nan
s

0    1.0
1    NaN
dtype: float64

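After any table is read, the default missing value is np.nan, whatever kind of data a column holds.
As a result, integer columns become float; string columns cannot be converted to float and are stored as object dtype ('O'); columns that were already float keep their dtype.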
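
A small sketch of this promotion, reading a made-up CSV from memory (the text in `csv_text` and the column names are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV: each column has exactly one empty cell
csv_text = "id,score,name\n1,1.5,a\n,2.5,\n3,,c\n"
df_toy = pd.read_csv(io.StringIO(csv_text))

# id would be integer without the gap, but is read as float64 because of NaN;
# score was float already; name is a text column with one missing entry
print(df_toy.dtypes)
print(df_toy.isna().sum())
```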

None

It is equal to itself

None == None
True

Its boolean value is False

pd.Series([None],dtype='bool')

0    False
dtype: bool

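Assigning None into a boolean Series does not change the dtype; when None is passed into a numeric Series it is automatically converted to np.nan; and None is not skipped by the equals method, so a comparison that involves it can return False.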

NaT

NaT is the missing value for time series. It is built into pandas and can be regarded as the time-series version of np.nan.

s_time = pd.Series([pd.Timestamp('20120101'),pd.NaT])
s_time
0   2012-01-01
1          NaT
dtype: datetime64[ns]

s_time = pd.Series([pd.Timestamp('20120101'),np.nan])
s_time
0   2012-01-01
1          NaT
dtype: datetime64[ns]

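Nullable types and the NA symbol

These were introduced to resolve the confusion described above and to unify how missing values are handled

Nullable integer

The dtype name is written with a capital first letter, Int (for example Int64)

s_new = pd.Series([1, 2], dtype="Int64")
s_new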

0    1
1    2
dtype: Int64

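All three missing-value symbols mentioned earlier are replaced by the unified NA, and the dtype does not change

s_new[1] = np.nan  # likewise for None and NaT
s_new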

0       1
1    <NA>
dtype: Int64
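
To make the contrast with the default behaviour concrete, here is a minimal sketch that builds the same data twice (the names `plain` and `nullable` are made up):

```python
import pandas as pd

plain = pd.Series([1, None])                    # default inference: float64 with NaN
nullable = pd.Series([1, None], dtype="Int64")  # stays integer, missing shown as <NA>

print(plain.dtype, nullable.dtype)
print(nullable)
```

With the nullable dtype the integer 1 is not turned into 1.0 just because a neighbour is missing.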

Nullable boolean

s_new = pd.Series([0, 1], dtype="boolean")
s_new[0] = np.nan
s_new

0    <NA>
1    True
dtype: boolean

The string type

It separates text from the ambiguous object dtype, and is in essence also a Nullable type.

s = pd.Series(['dog','cat'],dtype='string')
s

0    dog
1    cat
dtype: string

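The difference from object is that string methods called on a string Series return a Nullable type, whereas for object the result dtype varies with the kind of missing value and the data.
string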

s = pd.Series(["a", None, "b"], dtype="string")
s.str.count('a')

0       1
1    <NA>
2       0
dtype: Int64

object

s2 = pd.Series(["a", None, "b"], dtype="object")
s2.str.count("a")

0    1.0
1    NaN
2    0.0
dtype: float64

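Properties of NA

Logical operations

If the result of a logical operation depends on the value of NA, the result is NA; if it does not, the actual result is returned.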

pd.NA & False
False
False | pd.NA
<NA>
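
The four basic cases of this rule can be laid out side by side. This is only a sketch; the dictionary `results` is a made-up container for the comparison.

```python
import pandas as pd

results = {
    "NA | True": pd.NA | True,    # True whatever NA stands for
    "NA | False": pd.NA | False,  # depends on the unknown value, so NA
    "NA & True": pd.NA & True,    # depends on the unknown value, so NA
    "NA & False": pd.NA & False,  # False whatever NA stands for
}
for expr, value in results.items():
    print(expr, "->", value)
```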

Arithmetic and comparison operations

There are two special cases; every other result is NA

pd.NA ** 0
1
1 ** pd.NA
1
pd.NA == pd.NA
<NA>
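
A short sketch of the general rule next to the two special cases. Since `pd.NA == pd.NA` is itself NA, `pd.isna` is used here to test for missingness; the variable names are made up.

```python
import pandas as pd

added = pd.NA + 1        # ordinary arithmetic propagates NA
compared = pd.NA > 1     # comparisons propagate NA as well
power_zero = pd.NA ** 0  # special case: anything to the power 0 is 1
one_power = 1 ** pd.NA   # special case: 1 to any power is 1

# == cannot be used to detect NA, so test with pd.isna instead
is_missing = pd.isna(pd.NA == pd.NA)
print(added, compared, power_zero, one_power, is_missing)
```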

The convert_dtypes method

Converts the data to Nullable types right after it is read

pd.read_csv('data/table_missing.csv').convert_dtypes().dtypes

School      string
Class       string
ID           Int64
Gender      string
Address     string
Height       Int64
Weight       Int64
Math       float64
Physics     string
dtype: object

2. Operations and Grouping with Missing Data

Rules for addition and multiplication

In a sum, a missing value counts as 0

s = pd.Series([2,3,np.nan,4])
s.sum()
9.0

In a product, a missing value counts as 1

s.prod()
24.0

In cumulative functions, missing values are skipped automatically (their own position stays NaN)

s.cumsum()

0    2.0
1    5.0
2    NaN
3    9.0
dtype: float64
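
The skipping behaviour is controlled by the `skipna` and `min_count` parameters of `sum`. The sketch below reuses the Series from above; the variable names are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4])

total_skip = s.sum()                # NaN counts as 0
total_strict = s.sum(skipna=False)  # NaN propagates to the result

# A sum over nothing but NaN is 0.0 unless at least one valid value is required
all_missing_sum = pd.Series([np.nan]).sum()
all_missing_min1 = pd.Series([np.nan]).sum(min_count=1)
print(total_skip, total_strict, all_missing_sum, all_missing_min1)
```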

Missing values in the groupby method

df_g = pd.DataFrame({'one':['A','B','C','D',np.nan],'two':np.random.randn(5)})
df_g.groupby('one').groups

{'A': Int64Index([0], dtype='int64'), 'B': Int64Index([1], dtype='int64'), 
'C': Int64Index([2], dtype='int64'), 'D': Int64Index([3], dtype='int64')}
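
In the output above the row with a missing key belongs to no group. As a sketch, assuming a pandas version that has the `dropna` parameter of `groupby` (added in pandas 1.1), the missing key can be kept as its own group. The data here is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third row has a missing group key
df_g = pd.DataFrame({"one": ["A", "B", np.nan, "A"], "two": [1.0, 2.0, 3.0, 4.0]})

dropped = df_g.groupby("one")["two"].sum()             # the NaN key is left out
kept = df_g.groupby("one", dropna=False)["two"].sum()  # the NaN key forms a group
print(dropped)
print(kept)
```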

3. Filling and Dropping

The fillna method

Filling with a value, and forward/backward filling

df['Physics'].fillna('missing').head()

0	 A+
1	 B+
2	 B+
3	 missing
4	 A-
Name: Physics, dtype: object

Forward fill

df['Physics'].ffill().head()  # same as fillna(method='ffill'), which newer pandas deprecates

0    A+
1    B+
2    B+
3    B+
4    A-
Name: Physics, dtype: object

Backward fill

df['Physics'].bfill().head()  # backward-fill call matching the output shown

0    A+
1    B+
2    B+
3    A-
4    A-
Name: Physics, dtype: object

Alignment when filling

df_f = pd.DataFrame({'A':[1,3,np.nan],'B':[2,4,np.nan],'C':[3,5,np.nan]})
df_f.fillna(df_f.mean())

     A    B    C
0  1.0  2.0  3.0
1  3.0  4.0  5.0
2  2.0  3.0  4.0
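
The alignment is by column label, which becomes visible when the fill values cover only some columns. In this sketch, only A and B receive fill values, so C is left untouched.

```python
import numpy as np
import pandas as pd

df_f = pd.DataFrame({"A": [1, 3, np.nan], "B": [2, 4, np.nan], "C": [3, 5, np.nan]})

# Fill values exist only for A and B; C has no matching label and stays NaN
filled = df_f.fillna(df_f.mean()[["A", "B"]])
print(filled)
```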

The dropna method

df_d = pd.DataFrame({'A':[np.nan,np.nan,np.nan],'B':[np.nan,3,2],'C':[3,2,1]})

The axis parameter

df_d.dropna(axis=0)

Empty DataFrame
Columns: [A, B, C]
Index: []

df_d.dropna(axis=1)

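The how parameter

Either 'all' or 'any': 'all' drops only when every value is missing, 'any' drops when at least one value is missing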

df_d.dropna(axis=1,how='all')
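
A sketch comparing the two settings on the same frame as above (the result names are made up):

```python
import numpy as np
import pandas as pd

df_d = pd.DataFrame({"A": [np.nan, np.nan, np.nan], "B": [np.nan, 3, 2], "C": [3, 2, 1]})

any_cols = df_d.dropna(axis=1)             # default how='any': only C has no NaN
all_cols = df_d.dropna(axis=1, how="all")  # only A is entirely NaN, so B and C remain
all_rows = df_d.dropna(axis=0, how="all")  # no row is entirely NaN, so all rows remain
print(any_cols.columns.tolist(), all_cols.columns.tolist(), len(all_rows))
```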

The subset parameter

Looks for missing values only within a given set of columns

df_d.dropna(axis=0,subset=['B','C'])

    A    B  C
1 NaN  3.0  2
2 NaN  2.0  1

4. Interpolation

Linear interpolation

Linear interpolation independent of the index

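import matplotlib.pyplot as plt

s = pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s.interpolate().plot()
plt.show()

# Changing the index does not change the interpolated values: the default method ignores it
s.index = np.sort(np.random.randint(50,300,8))
s.interpolate().plot()
plt.show()

Linear interpolation that depends on the index

s.interpolate(method='index').plot()
plt.show()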
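
The difference between the two methods is easier to see in numbers than in a plot. A minimal sketch, with a made-up Series whose index is unevenly spaced:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the gap sits at index 1, between index 0 and index 10
s_idx = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

by_position = s_idx.interpolate()             # treats points as equally spaced
by_index = s_idx.interpolate(method="index")  # uses the index values as x coordinates
print(by_position.loc[1], by_index.loc[1])
```

By position the gap is the midpoint of its neighbours; by index it lies one tenth of the way from 0 to 10.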

Limit parameters of interpolate

limit

The maximum number of consecutive missing values to fill

s = pd.Series([1,np.nan,np.nan,np.nan,5])
s.interpolate(limit=2)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
dtype: float64

limit_direction

The direction of interpolation: forward, backward, or both; the default is forward

s = pd.Series([np.nan,np.nan,1,np.nan,np.nan,np.nan,5,np.nan,np.nan,])
s.interpolate(limit_direction='backward')

0    1.0
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    NaN
8    NaN
dtype: float64
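
limit and limit_direction can be combined. The sketch below, on a made-up Series, fills at most one value per gap, first in the forward direction only and then from both sides.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one leading gap, a gap of three, and one trailing gap
s_lim = pd.Series([np.nan, 1, np.nan, np.nan, np.nan, 5, np.nan])

forward = s_lim.interpolate(limit=1)                      # one step after each valid value
both = s_lim.interpolate(limit=1, limit_direction="both") # one step on each side
print(forward.tolist())
print(both.tolist())
```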

limit_area

The region to interpolate: inside or outside; the default is None

s = pd.Series([np.nan,np.nan,1,np.nan,np.nan,np.nan,5,np.nan,np.nan,])
s.interpolate(limit_area='inside')

0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    NaN
8    NaN
dtype: float64

5. Exercises

df = pd.read_csv('data/Missing_data_one.csv')
df[df['C'].isna()]

          A      B   C
1   not_NaN  0.700 NaN
5   not_NaN  0.972 NaN
11  not_NaN  0.736 NaN
19  not_NaN  0.684 NaN
21  not_NaN  0.913 NaN

min_b = df['B'].min()
# Keep A with a probability that shrinks as B grows; otherwise set it to NaN
df['A'] = pd.Series(list(zip(df['A'].values, df['B'].values))).apply(
    lambda x: x[0] if np.random.rand() > 0.25 * x[1] / min_b else np.nan)

print(df.head())

         A      B     C
0      NaN  0.922     4
1      NaN  0.700  <NA>
2  not_NaN  0.503     8
3  not_NaN  0.938     4
4      NaN  0.952    10