Efficient Programming with Python's Mutable Data Types for Data Analysis

Article link: https://blog.csdn.net/weixin_43909804/article/details/117190254

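This article focuses on efficient programming with lists, dicts, and sets for data analysis in Python. It covers how to filter data, count element frequencies, sort a dict by value, find the keys common to several dicts, and keep a dict ordered. The examples use filter, list comprehensions, Counter, sorted, and set operations.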


Data analysis in Python rests mainly on core Python plus the "big three" libraries: NumPy, Pandas, and Matplotlib. Let's get the Python basics down first.

Python 3 has six standard data types:

Immutable types: Number, String, Tuple
Mutable types: List, Dictionary, Set

Immutable types are fairly straightforward to work with; mutable types allow many more tricks.
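To make the distinction concrete, here is a minimal sketch of how the two kinds of type behave when a second name refers to the same object. The variable names (`point`, `scores`, and so on) are made up for illustration and do not come from the rest of the article.

```python
# Immutable: "modifying" a tuple actually builds a new object.
point = (1, 2)
alias = point
point += (3,)            # rebinds `point` to a new tuple; `alias` is untouched

# Mutable: a list is changed in place, so every name bound to it sees the change.
scores = [90, 85]
shared = scores
scores += [70]           # extends the same list object in place

# Take an explicit copy when you need an independent list.
independent = list(scores)
independent.append(60)   # `scores` is unaffected

print(alias, point)
print(shared, scores, independent)
```

This in-place behavior is what makes lists, dicts, and sets flexible, and also why copying them deliberately matters.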

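You can pick up Python's syntax quickly, and with that knowledge you can probably work your way to the data you want by trial and error, but it is well worth learning a few more advanced techniques.

These techniques not only speed up your coding but also make the code cleaner, which is what efficient programming comes down to.

This article takes the three mutable types and distills five frequently asked questions, each with a technique worth mastering.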

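How to filter data in a list, dict, or set by a condition
How to count how often each element occurs in a sequence
How to sort the items of a dict by value
How to quickly find the keys common to several dicts
How to keep a dict ordered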

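1. How to filter data in a list, dict, or set by a condition

Lists
- filter function: filter(lambda x: x >= 0, data)
- List comprehension: [x for x in data if x >= 0]
Dictionaries
- Dict comprehension: {k: v for k, v in d.items() if v > 90}
Sets
- Set comprehension: {x for x in s if x % 3 == 0}

Lists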

>>> from random import randint
>>> data = [randint(-10, 10) for _ in range(10)]
>>> data
[2, -9, 2, -2, -10, -2, -1, 8, 3, 3]
>>> list(filter(lambda x: x >= 0, data))
[2, 2, 8, 3, 3]
>>> [x for x in data if x >= 0]
[2, 2, 8, 3, 3]

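Whichever is faster in a given case (with a lambda, filter pays a function call per element and is often no faster; measure with timeit if it matters), the list comprehension is cleaner and reads more naturally. It is the form most people reach for at work, and the recommended one here.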
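If speed matters for your data, it is easy to measure rather than guess. The following is a rough sketch using the standard-library `timeit` module; the helper names `with_filter` and `with_comprehension`, the fixed seed, and the data size are assumptions chosen for this illustration.

```python
from random import randint, seed
from timeit import timeit

seed(0)  # fixed seed so the sample data is reproducible (an assumption for this sketch)
data = [randint(-10, 10) for _ in range(1000)]

def with_filter():
    # filter returns a lazy iterator in Python 3, so wrap it in list()
    return list(filter(lambda x: x >= 0, data))

def with_comprehension():
    return [x for x in data if x >= 0]

# Both approaches must select exactly the same elements.
assert with_filter() == with_comprehension()

# Timings vary by machine and Python version, so no expected numbers are shown.
t_filter = timeit(with_filter, number=200)
t_comp = timeit(with_comprehension, number=200)
print(f"filter: {t_filter:.4f}s  comprehension: {t_comp:.4f}s")
```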

Dictionaries and sets

# Dictionary
>>> d = {x: randint(60, 100) for x in range(1, 11)}
>>> d
{1: 61, 2: 71, 3: 86, 4: 79, 5: 64, 6: 88, 7: 71, 8: 88, 9: 94, 10: 65}
>>> {k: v for k, v in d.items() if v > 90}
{9: 94}

# Set (the order in which elements are displayed may differ)
>>> data
[2, -9, 2, -2, -10, -2, -1, 8, 3, 3]
>>> s = set(data)
>>> s
{2, 3, 8, -2, -10, -9, -1}
>>> {x for x in s if x % 3 == 0}
{3, -9}

As you can see, the comprehension syntax reads naturally, and the same logic works for any iterable.
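As a small illustration of that point, the sketch below applies one predicate to a list, a dict, and a set, and then to a generator expression. The predicate `is_passing` and the sample scores are hypothetical, invented for this example.

```python
def is_passing(score):
    """Hypothetical predicate: a score of 60 or above passes."""
    return score >= 60

scores_list = [55, 72, 60, 48, 91]
scores_dict = {"Ann": 55, "Ben": 72, "Cy": 60}
scores_set = {55, 72, 60, 48, 91}

# The filtering clause is identical; only the brackets and the element shape change.
passed_list = [s for s in scores_list if is_passing(s)]
passed_dict = {name: s for name, s in scores_dict.items() if is_passing(s)}
passed_set = {s for s in scores_set if is_passing(s)}

# A generator expression applies the same filter lazily, without building a list.
passed_count = sum(1 for s in scores_list if is_passing(s))

print(passed_list, passed_dict, passed_set, passed_count)
```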

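2. How to count how often each element occurs in a sequence

Count element frequencies in a random sequence
Word frequency: find the 5 most frequent words and their counts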

>>> from random import randint
>>> data = [randint(0, 5) for _ in range(10)]
>>> data
[5, 3, 4, 4, 1, 0, 5, 2, 0, 2]
# Counting by hand, without any library
>>> c = dict.fromkeys(data, 0)
>>> c
{5: 0, 3: 0, 4: 0, 1: 0, 0: 0, 2: 0}
>>> for x in data: c[x] += 1
...
>>> c
{5: 2, 3: 1, 4: 2, 1: 1, 0: 2, 2: 2}
>>> sorted(c.items(), key=lambda item: item[1])
[(3, 1), (1, 1), (5, 2), (4, 2), (0, 2), (2, 2)]

# Using collections.Counter
>>> from collections import Counter
>>> c2 = Counter(data)
>>> c2
Counter({5: 2, 4: 2, 0: 2, 2: 2, 3: 1, 1: 1})
>>> c2.most_common(3)
[(5, 2), (4, 2), (0, 2)]

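Counter handles the frequency-counting problem in a single call, which makes it handy for word-frequency work in natural language processing.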

>>> import re
# Text whose word frequencies we want to count
>>> txt="As we all know, environment pollution and energy waste have become some of the most important topics in this world. The best way to solve these problems is to live a low-carbon life. "
>>> c3 = Counter(re.split(r'\W+', txt))
>>> c3
Counter({'to': 2, 'As': 1, 'we': 1, 'all': 1, 'know': 1, 'environment': 1, 'pollution': 1, 'and': 1, 'energy': 1, 'waste': 1, 'have': 1, 'become': 1, 'some': 1, 'of': 1, 'the': 1, 'most': 1, 'important': 1, 'topics': 1, 'in': 1, 'this': 1, 'world': 1, 'The': 1, 'best': 1, 'way': 1, 'solve': 1, 'these': 1, 'problems': 1, 'is': 1, 'live': 1, 'a': 1, 'low': 1, 'carbon': 1, 'life': 1, '': 1})
>>> c3.most_common(5)
[('to', 2), ('As', 1), ('we', 1), ('all', 1), ('know', 1)]
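The split-based count above keeps an empty string (from the trailing punctuation) and treats "The" and "the" as different words. One possible refinement, sketched here as a variation rather than the article's original method, is to lowercase the text and collect words with `re.findall`:

```python
import re
from collections import Counter

txt = ("As we all know, environment pollution and energy waste have become "
       "some of the most important topics in this world. The best way to "
       "solve these problems is to live a low-carbon life. ")

# \w+ matches runs of word characters, so no empty strings are produced;
# lower() folds "The" and "the" into one entry.
words = re.findall(r"\w+", txt.lower())
freq = Counter(words)

# The two most frequent words with their counts.
top = freq.most_common(2)
print(top)
```

With this normalization both "the" and "to" occur twice, and every other word occurs once.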

3. How to sort the items of a dict by value

Use zip to turn the dict's items into (value, key) tuples, then sort them
Pass a key argument to sorted

>>> d = {x: randint(60, 100) for x in 'xyzabc'}
>>> d
{'x': 64, 'y': 85, 'z': 83, 'a': 66, 'b': 65, 'c': 77}
# Sorting a dict directly returns its keys in sorted order
>>> sorted(d)
['a', 'b', 'c', 'x', 'y', 'z']

# Use zip to turn the key-value pairs into (value, key) tuples
>>> list(zip(d.values(), d.keys()))
[(64, 'x'), (85, 'y'), (83, 'z'), (66, 'a'), (65, 'b'), (77, 'c')]
>>> sorted(zip(d.values(), d.keys()))
[(64, 'x'), (65, 'b'), (66, 'a'), (77, 'c'), (83, 'z'), (85, 'y')]

# Pass the key argument to sorted
>>> d.items()
dict_items([('x', 64), ('y', 85), ('z', 83), ('a', 66), ('b', 65), ('c', 77)])
>>> sorted(d.items(), key=lambda item: item[1])
[('x', 64), ('b', 65), ('a', 66), ('c', 77), ('z', 83), ('y', 85)]

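A good command of the built-in functions makes everyday work far more efficient. You can't memorize every one of them, though; what matters is knowing what kinds of things they can do, because a standard procedure almost always has a standard implementation.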
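As one example of leaning on the standard library, the sketch below sorts by value in descending order with `operator.itemgetter` in place of a lambda and rebuilds a dict from the result. The scores are fixed sample values chosen for reproducibility, and relying on dict order assumes Python 3.7 or later.

```python
from operator import itemgetter

# Fixed sample scores so the result is reproducible (illustrative values).
d = {'x': 64, 'y': 85, 'z': 83, 'a': 66, 'b': 65, 'c': 77}

# itemgetter(1) picks the value out of each (key, value) pair.
ranked = sorted(d.items(), key=itemgetter(1), reverse=True)

# dict() keeps insertion order on Python 3.7+, so this dict iterates by descending value.
by_value = dict(ranked)

# The three highest-scoring keys.
top3 = [k for k, _ in ranked[:3]]

print(ranked)
print(top3)
```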

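4. How to quickly find the keys common to several dicts

Iterate and check by hand
Call keys() on each dict to get a set-like view of its keys, then intersect the views with &
Use map to get the keys view of every dict, then use reduce to intersect them all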

>>> from random import randint, sample
>>> s1 = {x: randint(1, 4) for x in sample('abcdefg', randint(3, 6))}
>>> s1
{'b': 2, 'd': 4, 'c': 1, 'e': 3, 'a': 1}
>>> s2 = {x: randint(1, 4) for x in sample('abcdefg', randint(3, 6))}
>>> s3 = {x: randint(1, 4) for x in sample('abcdefg', randint(3, 6))}
>>> s2
{'g': 3, 'e': 4, 'd': 3, 'a': 4}
>>> s3
{'d': 1, 'f': 1, 'e': 4, 'g': 4, 'b': 1}

# Iterating and checking by hand
>>> res = []
>>> for k in s1:
...     if k in s2 and k in s3:
...         res.append(k)
...
>>> res
['d', 'e']

# Intersecting the key views with &
>>> s1.keys() & s2.keys() & s3.keys()
{'d', 'e'}

# map + reduce
>>> map(dict.keys, [s1, s2, s3])
<map object at 0x1028d2a90>
>>> list(map(dict.keys, [s1, s2, s3]))
[dict_keys(['b', 'd', 'c', 'e', 'a']), dict_keys(['g', 'e', 'd', 'a']), dict_keys(['d', 'f', 'e', 'g', 'b'])]
>>> from functools import reduce
>>> reduce(lambda a, b: a & b, map(dict.keys, [s1, s2, s3]))
{'d', 'e'}
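The same idea can be wrapped in a small helper that accepts any number of dicts. This is a sketch using `set.intersection` instead of `reduce`; the function name `common_keys` is hypothetical, and the sample dicts repeat the values shown above.

```python
def common_keys(*dicts):
    """Return the set of keys present in every given dict (empty set if none are given)."""
    if not dicts:
        return set()
    first, *rest = dicts
    # set.intersection accepts any iterables; iterating a dict yields its keys.
    return set(first).intersection(*rest)

s1 = {'b': 2, 'd': 4, 'c': 1, 'e': 3, 'a': 1}
s2 = {'g': 3, 'e': 4, 'd': 3, 'a': 4}
s3 = {'d': 1, 'f': 1, 'e': 4, 'g': 4, 'b': 1}

print(common_keys(s1, s2, s3))
```

Handling the empty case explicitly avoids the TypeError that `reduce` raises on an empty sequence with no initial value.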

5. How to keep a dict ordered

>>> d = {}
>>> d['Jim'] = (1, 35)
>>> d['Leo'] = (2, 37)
>>> d['Bob'] = (3, 40)
>>> d
{'Jim': (1, 35), 'Leo': (2, 37), 'Bob': (3, 40)}
>>> for k in d:
...     print(k)
...
Jim
Leo
Bob

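In Python 3.7 and later, the built-in dict is guaranteed to keep insertion order, so the output above is reliable. In older versions (before CPython 3.6) a plain dict was unordered, and there was no guarantee that entries would come out in ranking order, even if they happened to in an example like this one. For code that must run on older versions, or that needs order-aware features, use OrderedDict: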

>>> from collections import OrderedDict
>>> d = OrderedDict()
>>> d['Jim'] = (1, 35)
>>> d['Leo'] = (2, 37)
>>> d['Bob'] = (3, 40)
>>> d
OrderedDict([('Jim', (1, 35)), ('Leo', (2, 37)), ('Bob', (3, 40))])
>>> for k in d:
...     print(k)
...
Jim
Leo
Bob
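Beyond preserving insertion order, OrderedDict has a few order-aware features that a plain dict lacks. The sketch below reuses the ranking data from above; `same_items` is a name introduced only for this illustration.

```python
from collections import OrderedDict

d = OrderedDict()
d['Jim'] = (1, 35)
d['Leo'] = (2, 37)
d['Bob'] = (3, 40)

# Equality between two OrderedDicts is order-sensitive; between plain dicts it is not.
same_items = OrderedDict([('Leo', (2, 37)), ('Jim', (1, 35)), ('Bob', (3, 40))])
print(d == same_items)              # False: same items, different order
print(dict(d) == dict(same_items))  # True

# move_to_end reorders an existing key; popitem(last=False) removes the oldest entry.
d.move_to_end('Jim')
print(list(d))                      # ['Leo', 'Bob', 'Jim']
first = d.popitem(last=False)
print(first)                        # ('Leo', (2, 37))
```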