Python Requests Encoding Detection Bug

aini_zlr2008

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/aini_zlr2008/article/details/84796330

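This article looks at an encoding detection problem that occurs when fetching web pages with Python's Requests library. When a server sends Content-Type: text/html without a charset, Requests falls back to ISO-8859-1, so r.text comes out garbled for pages that are really UTF-8 or GB2312. Two concrete sessions show the mismatch, followed by a general workaround that sets the correct encoding.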

>>> import requests
>>> r = requests.get('http://cn.python-requests.org/en/latest/')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> # On Python 3, pass r.text: the helper applies str regexes and rejects bytes.
>>> requests.utils.get_encodings_from_content(r.text)
['utf-8']

>>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'gb2312'
>>> requests.utils.get_encodings_from_content(r.text)
['gb2312']
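
Both sessions show the same mismatch: the header yields ISO-8859-1 while the page itself is in another encoding. The sketch below reproduces the effect offline with only the standard library. The sample markup is made up for illustration and is not taken from either site.

```python
# Hypothetical sample page: its bytes are GB2312, as in the second session.
raw = '<meta charset="gb2312"><title>读者</title>'.encode('gb2312')

# Decoding with the ISO-8859-1 fallback never raises, but the Chinese
# characters turn into mojibake. The ASCII markup survives untouched.
wrong = raw.decode('iso-8859-1')

# ISO-8859-1 maps each byte to exactly one code point, so the mistake
# can be undone by encoding back to bytes and decoding correctly.
recovered = wrong.encode('iso-8859-1').decode('gb2312')

print(repr(wrong))
print(recovered)
```

Because the ASCII part of the markup survives the wrong decode, the charset declared in a meta tag can still be read from the garbled text.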

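Ultimate fix:
# Requests falls back to ISO-8859-1 when Content-Type is text/* without a charset.
if r.encoding == 'ISO-8859-1':
    # Pass r.text on Python 3: the helper applies str regexes and rejects bytes.
    encodings = requests.utils.get_encodings_from_content(r.text)
    if encodings:
        r.encoding = encodings[0]          # charset declared in the page itself
    else:
        r.encoding = r.apparent_encoding   # charset guessed from the raw bytes
# Re-encode the body as UTF-8, then keep r.encoding in sync so r.text stays correct.
r._content = r.content.decode(r.encoding, 'replace').encode('utf8', 'replace')
r.encoding = 'utf-8'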
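
The same idea can be sketched as a standalone helper that avoids both the private r._content attribute and get_encodings_from_content, which Requests marks as deprecated. This is a simplified variant, not the method above. The name decode_body, the regular expression, and the UTF-8 default are all assumptions made for illustration, and the UTF-8 default stands in for the byte-level guess that r.apparent_encoding provides.

```python
import re

# Hypothetical helper, not part of Requests. The pattern looks for a
# charset declared inside a <meta> tag in the raw bytes.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_body(content: bytes, header_encoding=None) -> str:
    """Decode a body, preferring an in-page charset over the ISO-8859-1 fallback."""
    encoding = header_encoding
    if encoding is None or encoding.upper() == 'ISO-8859-1':
        match = _META_CHARSET.search(content)
        # Assumption: default to UTF-8 when the page declares no charset.
        encoding = match.group(1).decode('ascii') if match else 'utf-8'
    try:
        return content.decode(encoding, 'replace')
    except LookupError:
        # The page named a codec Python does not know; fall back to UTF-8.
        return content.decode('utf-8', 'replace')
```

With a Requests response it could be called as decode_body(r.content, r.encoding). Like the original check, it cannot tell a server that really declared ISO-8859-1 apart from the library fallback.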

http://www.tuicool.com/articles/vEJzMv