2. Pandas Quick Reference

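This article walks through common pandas operations for data analysis: filtering rows and columns, grouping, sorting, plotting, and kernel density estimation, each shown on a small example dataset.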

Pandas CheatSheet

import numpy as np
import pandas as pd
data={
    'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
    'year': [2017, 2017, 2017, 2017, 2017], 
    'salary': [40000, 24000, 31000, 20000, 30000],
    'pair':[{'a':1},{'a':2},{'a':3},{'a':4},{'a':5}]
}
d=pd.DataFrame(data)
d
      name      pair  salary  year
0    Alice  {'a': 1}   40000  2017
1      Bob  {'a': 2}   24000  2017
2  Charles  {'a': 3}   31000  2017
3    David  {'a': 4}   20000  2017
4     Eric  {'a': 5}   30000  2017
d.pair
0    {'a': 1}
1    {'a': 2}
2    {'a': 3}
3    {'a': 4}
4    {'a': 5}
Name: pair, dtype: object
type(d.pair)
pandas.core.series.Series
d.pair.tolist()
[{'a': 1}, {'a': 2}, {'a': 3}, {'a': 4}, {'a': 5}]

Row operations

Who earns more than 30000?

d.query('salary>30000')
      name      pair  salary  year
0    Alice  {'a': 1}   40000  2017
2  Charles  {'a': 3}   31000  2017
d.salary>30000
0     True
1    False
2     True
3    False
4    False
Name: salary, dtype: bool
# Keep only the rows where the boolean mask is True
d[d.salary>30000]
      name      pair  salary  year
0    Alice  {'a': 1}   40000  2017
2  Charles  {'a': 3}   31000  2017

What is Eric's record?

# Three equivalent ways to look up a row by value
d.query("name=='Eric'")
   name      pair  salary  year
4  Eric  {'a': 5}   30000  2017
d[d.name=='Eric']
   name      pair  salary  year
4  Eric  {'a': 5}   30000  2017
d.loc[d.name=='Eric']
   name      pair  salary  year
4  Eric  {'a': 5}   30000  2017

Combined conditions: named Bob and earning more than 20000

d.query("name=='Bob' and salary>20000")
  name      pair  salary  year
1  Bob  {'a': 2}   24000  2017
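
The same combined filter can also be written with boolean masks instead of a query string. The following is a minimal sketch that rebuilds the sample data (without the pair column, for brevity); the names mask, by_mask, and by_query are illustrative only.

```python
import pandas as pd

# Same sample data as `d` above, minus the `pair` column
d = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
    'year': [2017, 2017, 2017, 2017, 2017],
    'salary': [40000, 24000, 31000, 20000, 30000],
})

# Each comparison needs its own parentheses because & binds tighter than == and >
mask = (d['name'] == 'Bob') & (d['salary'] > 20000)
by_mask = d[mask]

# The query-string form of the same filter
by_query = d.query("name == 'Bob' and salary > 20000")

print(by_mask)
```

Masks use & and | rather than the Python keywords and / or, which do not work element-wise on a Series.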

Column operations

d.filter

The familiar SQL statement SELECT name, year, salary FROM table selects columns from a second-order tensor (a table). d.filter does the same for a DataFrame.

d.filter(items=['name','salary','pair'])
      name  salary      pair
0    Alice   40000  {'a': 1}
1      Bob   24000  {'a': 2}
2  Charles   31000  {'a': 3}
3    David   20000  {'a': 4}
4     Eric   30000  {'a': 5}
# Shorthand
d[['name','salary']]
      name  salary
0    Alice   40000
1      Bob   24000
2  Charles   31000
3    David   20000
4     Eric   30000

Fuzzy matching

d.filter(like='2', axis=0) # rows whose index label contains '2'
      name      pair  salary  year
2  Charles  {'a': 3}   31000  2017
d.filter(like='ea', axis=1) # columns whose name contains 'ea'
   year
0  2017
1  2017
2  2017
3  2017
4  2017
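
Besides like, DataFrame.filter also accepts a regex argument for pattern-based matching on labels. A short sketch on the same sample columns; the pattern itself is just an illustration.

```python
import pandas as pd

d = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
    'year': [2017, 2017, 2017, 2017, 2017],
    'salary': [40000, 24000, 31000, 20000, 30000],
})

# Keep the columns whose name starts with "name" or "sal"
selected = d.filter(regex='^(name|sal)', axis=1)

print(selected.columns.tolist())
```

Note that filter matches on labels only (column names or index labels), never on cell values.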

Grouping

Typical uses: clustering, segment-wise prediction

df1 = pd.DataFrame({
    "Name" : ["Alice", "Ada", "Mallory", "Mallory", "Billy" , "Mallory"],
    "City" : ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]
})
df1
     City     Name
0  Sydney    Alice
1  Sydney      Ada
2   Paris  Mallory
3  Sydney  Mallory
4  Sydney    Billy
5   Paris  Mallory

How many people are there in each city?

df1.groupby(['City']).count()
        Name
City        
Paris      2
Sydney     4
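
Two other common ways to get per-group row counts are GroupBy.size() and Series.value_counts(). The sketch below rebuilds df1 so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Ada", "Mallory", "Mallory", "Billy", "Mallory"],
    "City": ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"],
})

# size() counts rows per group and returns a Series indexed by City
counts_size = df1.groupby('City').size()

# value_counts() gives the same numbers, sorted from most to least frequent
counts_vc = df1['City'].value_counts()

print(counts_size)
print(counts_vc)
```

count() reports non-null values per column, whereas size() counts rows regardless of missing values, so the two can differ when the data contains NaN.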

Summary statistics

Describing the numerical variables

d.describe()
             salary    year
count      5.000000     5.0
mean   29000.000000  2017.0
std     7615.773106     0.0
min    20000.000000  2017.0
25%    24000.000000  2017.0
50%    30000.000000  2017.0
75%    31000.000000  2017.0
max    40000.000000  2017.0

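Statistical functions

# Pin the column order: the positional examples later on (iloc, iat, row assignment)
# assume data1, data2, key1, key2. Older pandas sorted dict keys alphabetically;
# current versions keep insertion order unless columns= is given.
df2 = pd.DataFrame({
    'key1':['a', 'a', 'b', 'b', 'a'],
    'key2':['one', 'two', 'one', 'two', 'one'],
    'data1':np.random.randn(5),
    'data2':np.random.randn(5)
}, columns=['data1', 'data2', 'key1', 'key2'])
df2
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1 -1.918583  1.469552    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
4 -0.243838  0.547075    a  one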

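What is the mean of each key1 group, a and b?

The result covers data1 and data2 but not key2.

# key2 is left out because it is not numeric; pandas 2.0+ needs numeric_only=True to skip it
df2.groupby(['key1']).mean(numeric_only=True)
         data1     data2
key1                    
a    -0.871843  1.451617
b    -0.321498  0.662898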

Grouping by two keys: 'a-one', 'a-two', 'b-one', 'b-two'

# Group by several discrete columns at once
df2.groupby(['key1', 'key2']).mean()
              data1     data2
key1 key2                    
a    one  -0.348474  1.442650
     two  -1.918583  1.469552
b    one   0.338384 -0.387233
     two  -0.981380  1.713030
df2.count() # number of non-null values in each column
data1    5
data2    5
key1     5
key2     5
dtype: int64
df2.groupby(['key1']).count()
      data1  data2  key2
key1                    
a         3      3     3
b         2      2     2
df2.groupby(['key1', 'key2']).count()
           data1  data2
key1 key2              
a    one       2      2
     two       1      1
b    one       1      1
     two       1      1
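
When several statistics are needed per group, GroupBy.agg can compute them in one call. This is a small sketch with fixed numbers in place of the random data above, so the result is reproducible; the frame and the chosen statistics are illustrative.

```python
import pandas as pd

# Fixed values stand in for np.random.randn so the output is deterministic
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# One row per key1 group, one column per requested statistic
stats = df.groupby('key1')['data1'].agg(['mean', 'max', 'count'])

print(stats)
```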

Sorting

df2
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1 -1.918583  1.469552    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
4 -0.243838  0.547075    a  one
df2.data2.sort_values() # sort a single column
2   -0.387233
4    0.547075
1    1.469552
3    1.713030
0    2.338225
Name: data2, dtype: float64
# Sort the whole table by the values of one column.
# This output and the next appear to come from a separate run, so the random values differ from df2 above.
df2.sort_values(by='key2')
      data1     data2 key1 key2
0 -1.491823  0.084684    a  one
2  0.950192 -0.235840    b  one
4 -0.608515  0.970593    a  one
1 -2.103263  0.432440    a  two
3  0.493304 -0.907539    b  two

Sorting by multiple columns

df2.sort_values(by=['key1','data1']) # ties in key1 are broken by data1
      data1     data2 key1 key2
1 -2.103263  0.432440    a  two
0 -1.491823  0.084684    a  one
4 -0.608515  0.970593    a  one
3  0.493304 -0.907539    b  two
2  0.950192 -0.235840    b  one
df2.sort_values(by=['key2'], ascending=False) # sort the whole table by one column, descending
      data1     data2 key1 key2
1 -1.918583  1.469552    a  two
3 -0.981380  1.713030    b  two
0 -0.453110  2.338225    a  one
2  0.338384 -0.387233    b  one
4 -0.243838  0.547075    a  one
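
The ascending argument also accepts a list, one flag per sort column, which allows mixing directions. A minimal sketch with fixed values (the frame is illustrative, not the random df2 above):

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# key1 ascending, then data1 descending within each key1 group
ordered = df.sort_values(by=['key1', 'data1'], ascending=[True, False])

print(ordered)
```

sort_values returns a new DataFrame and leaves the original row order untouched.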

Creating, reading, updating, and deleting data in a DataFrame

Selecting a single row

df2.iloc[0]
data1   -0.45311
data2    2.33823
key1           a
key2         one
Name: 0, dtype: object
df2.iloc[0]['key1']
'a'
df2.loc[0].key1
'a'
df2
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1 -1.918583  1.469552    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
4 -0.243838  0.547075    a  one
df2.groupby(['key1']).count()
      data1  data2  key2
key1                    
a         3      3     3
b         2      2     2
df2.groupby(['key1']).count().iloc[0]
data1    3
data2    3
key2     3
Name: a, dtype: int64
df2.groupby(['key1']).count().loc['a']
data1    3
data2    3
key2     3
Name: a, dtype: int64

Using iloc to get a specific row and column of a DataFrame

df2
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1 -1.918583  1.469552    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
4 -0.243838  0.547075    a  one
# First column
df2.iloc[:,0]
0   -0.453110
1   -1.918583
2    0.338384
3   -0.981380
4   -0.243838
Name: data1, dtype: float64
df2.iloc[1,3]
'two'
# Row slices work like array slices
df2.iloc[:2]
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1 -1.918583  1.469552    a  two
df2.iloc[0:-1] # every row except the last
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1 -1.918583  1.469552    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
# First two rows, last two columns
df2.iloc[:2, -2:]
  key1 key2
0    a  one
1    a  two
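
Because df2 uses the default integer index, loc and iloc look interchangeable in the examples above. The sketch below uses a hypothetical frame with string labels to show where they differ: loc selects by label, iloc by position, and their slices treat the end point differently.

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions cannot be confused
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['b', 'value']   # row labelled 'b'
by_position = df.iloc[1, 0]       # second row, first column

label_slice = df.loc['a':'c']     # label slices include the end label
position_slice = df.iloc[0:2]     # position slices exclude the end position

print(len(label_slice), len(position_slice))
```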

Modifying data

.at[]

df2.at[1,'data1']=2
df2
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1  2.000000  1.469552    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
4 -0.243838  0.547075    a  one

.iat[]

df2.iat[1,1] = -2.0
df2
      data1     data2 key1 key2
0 -0.453110  2.338225    a  one
1  2.000000 -2.000000    a  two
2  0.338384 -0.387233    b  one
3 -0.981380  1.713030    b  two
4 -0.243838  0.547075    a  one

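Adding rows

.append() / pd.concat()

Each row of a DataFrame represents one object (or vector), which can be written as a JSON-like dict. DataFrame.append was removed in pandas 2.0, so the call below uses pd.concat, which likewise returns a new DataFrame. The outputs in this section appear to come from a separate run, so the random values differ from the earlier tables.

# On pandas < 2.0 the original call was: df2.append({...}, ignore_index=True)
new_row = pd.DataFrame([{'data1':1.2,'data2':1.4,'key1':'b','key2':'two'}])
df3=pd.concat([df2, new_row], ignore_index=True)
df3
      data1     data2 key1 key2
0 -1.156629 -0.479173    a  one
1  1.252690  1.652271    a  two
2  1.812755  1.407725    b  one
3 -1.132984 -0.462333    b  two
4  1.173084 -1.641891    a  one
5  1.200000  1.400000    b  two
df2 # the original frame is unchanged
      data1     data2 key1 key2
0 -1.156629 -0.479173    a  one
1  1.252690  1.652271    a  two
2  1.812755  1.407725    b  one
3 -1.132984 -0.462333    b  two
4  1.173084 -1.641891    a  one
# Overwrite the 6th row (label 5); values are given in column order
df3.loc[5]=[2,1, 'c','three']
df3
      data1     data2 key1   key2
0 -1.156629 -0.479173    a    one
1  1.252690  1.652271    a    two
2  1.812755  1.407725    b    one
3 -1.132984 -0.462333    b    two
4  1.173084 -1.641891    a    one
5  2.000000  1.000000    c  three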

Adding columns

.assign()

df4=df3.assign(key3=[1,2,3,4,5,6])
df4
      data1     data2 key1   key2  key3
0 -1.156629 -0.479173    a    one     1
1  1.252690  1.652271    a    two     2
2  1.812755  1.407725    b    one     3
3 -1.132984 -0.462333    b    two     4
4  1.173084 -1.641891    a    one     5
5  2.000000  1.000000    c  three     6
# Replace a whole column; df4['key3'] = ... swaps in the new column, so the dtype can change from int to float
df4['key3']=np.random.randn(6)
df4
      data1     data2 key1   key2      key3
0 -1.156629 -0.479173    a    one  0.140732
1  1.252690  1.652271    a    two  1.357048
2  1.812755  1.407725    b    one -1.065623
3 -1.132984 -0.462333    b    two -0.589690
4  1.173084 -1.641891    a    one  0.117112
5  2.000000  1.000000    c  three -0.683054
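
assign also accepts a callable, which is handy when the new column is computed from existing ones. A minimal sketch with made-up numbers; the column name total is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data1': [1.0, 2.0, 3.0], 'data2': [0.5, 0.5, 0.5]})

# The lambda receives the DataFrame and returns the values of the new column
with_total = df.assign(total=lambda frame: frame['data1'] + frame['data2'])

print(with_total)
```

As with the list form above, assign returns a new DataFrame and does not modify df.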

Built-in DataFrame plotting

# Number of papers published
data3 = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data3, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3
             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014
df3['reports']
Cochice        4
Pima          24
Santa Cruz    31
Maricopa       2
Yuma           3
Name: reports, dtype: int64
df3['reports'].plot.bar()
[Figure: bar chart of the reports column]

# %matplotlib inline
# Plotting skips the non-numeric columns
df3.plot.bar()

[Figure: bar chart of the numeric columns of df3]

df3
             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014

Density estimation

Kernel density estimation

import seaborn as sns
sns.kdeplot(df2['data1'])

[Figure: kernel density estimate of data1]

sns.kdeplot(df2['data2'])

[Figure: kernel density estimate of data2]

sns.kdeplot(df2['data1'])
sns.kdeplot(df2['data2'])

[Figure: kernel density estimates of data1 and data2 on the same axes]

# fill=True replaces the deprecated shade=True (seaborn 0.11+)
sns.kdeplot(df2['data1'], fill=True, color='r')
sns.kdeplot(df2['data2'], fill=True, color='g')

[Figure: filled kernel density estimates of data1 (red) and data2 (green)]

Generate some data and estimate its density

x=np.random.rand(5000)
sns.kdeplot(x, fill=True, color='y')

[Figure: filled kernel density estimate of 5000 uniform random samples]

type(x)
numpy.ndarray
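
To make clear what a kernel density estimate computes, here is a hand-rolled Gaussian KDE in plain NumPy: one Gaussian bump is centred on every sample and the bumps are averaged. This is only a sketch of the general technique, not seaborn's implementation; the function name gaussian_kde_1d and the fixed bandwidth of 0.05 are assumptions (seaborn chooses its bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = scaled distance between grid point i and sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
x = rng.random(5000)                    # uniform samples on [0, 1), as above
grid = np.linspace(-0.5, 1.5, 401)      # evaluation points
density = gaussian_kde_1d(x, grid, bandwidth=0.05)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth follows the samples more closely and gives a bumpier curve; a larger one smooths more but blurs the edges at 0 and 1.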