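Problem and investigation: urlopen(url).read() returns garbled text when fetching a web page
I started with Python 2.7. When the output came back garbled, I tried encode and decode without success and assumed the cause was Python 2.7's poor support for Chinese (which was indeed part of the problem). I then switched to Python 3.8 and still got garbled output, although it was no longer completely unreadable; the output is shown below. Checking the data with chardet returned {'confidence': 0.0, 'language': None, 'encoding': None}. After a lot of googling, I found that the page returned by the server was compressed and had to be decompressed with the gzip library.
Under Python 2.7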
import urllib2
import chardet
url = 'https://devapi.qweather.com/v7/weather/now?location=101060110&key=cd6ce864556a4c099b3342a674a87d5e'
html = urllib2.urlopen(url)
html_text = html.read()
print html
# <addinfourl at 50103944L whose fp = <socket._fileobject object at 0x0000000002F0A930>>
print html_text
"""
e��n�@�_ym�����يԡbH/�bq���.�x��ݫ�A��Ny���v���������JH��������H��l��T�2����jvK2v�!��t�6��fe*��K6��� �����E�����]��yLE�~I��1���K��ξ]`8�}2l��٭��sy�{r��_����*}^��=bͼ`n��j{�<g�*�����������<�_��[2.��C�)�]5�0Nᘀ�]����)�'x�*��Fw-�
� hRh<ƞ�#e���J�������
"""
print chardet.detect(html_text)
# {'confidence': 0.0, 'language': None, 'encoding': None}
Under Python 3.8
import urllib.request
import chardet
url = 'https://devapi.qweather.com/v7/weather/now?location=101060110&key=cd6ce864556a4c099b3342a674a87d5e'
html = urllib.request.urlopen(url)
html_text = html.read()
print(html)
# <http.client.HTTPResponse object at 0x0000025A34B61488>
print(html_text)
# b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00e\x8f=n\xc30\x0c\x85\xafbp\xad\x13\xcbv\x9c\xc4^\xdb1[\x03t(2\xb82\x03\x13\x91%C\x92\x9b\x04A\xae\xd0\xb9{\xd1\x1bt\xefu\x9a[\x94\xcaO\x97n\xfc\x1e\x1f\x1f\xc9\x03H\xd3 T\x90\t\x011\x0c}S{\\Rw\x91\xb2t$\xf2QV.3Q\xa5\xb3;1\xaf\xce\xae\xf5nAz\xc3\x8e\xd6\xfb\xbeJ\x92v\xbd\x1b+V\x92\xd4\xab}\xca\x06m\xb6P\x1d\xc0\xbc\xb8\xffQiY\x15\xe2/\xcac\xd7s\x7f\x16R\x11\x95[\xd0&\xf8'\xcc$\x8d\xe62-&g\xdf\xce3\x9c\xde\xbf\x18\xb6\xa4\x9b|*B\xee\xb4\xb8\xf2\x03Y\xe6\x9f\xcf\xef\xd3\xc7\xdbUz\x94\xb5:/\xbfq\x8f\xd80\xcf\x99\xdb\xa1\xa3\x86\xfc\x9e\xb1\x08\x11\xbdEI\xe1\x121\x16\x17tn\xb0a\xba\x9c\x85\xfe+\xb9pK\xce\xa5Tf\x081e\xb8\xabA\xfe\x14F\x19\x1cc\xb0\xb8F\x1b\xfevf\xb0\x12y\xe0\x19\x9e\xb0\xf6-\xda\xe8\xbe%]\xc3*\x06E\x12\xb5\xc3\xd0\xd3&\x92\xa6\xeb\xd0J\xaaU4\xb0\xb8:\x1e\x7f\x01\x94q\xecQ\x91\x01\x00\x00"
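The first two bytes of the output above, \x1f\x8b, are the gzip magic number, which gives a quick way to tell a compressed body apart from text in an unknown encoding. A minimal sketch of that check, assuming a made-up sample payload in place of the real API response; looks_gzipped is a hypothetical helper name, not part of any library:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(data: bytes) -> bool:
    """Hypothetical helper: True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Made-up payload standing in for the response body (assumption, not real API data).
sample = '{"code":"200","text":"阴"}'.encode("utf-8")
compressed = gzip.compress(sample)

print(looks_gzipped(compressed))  # True
print(looks_gzipped(sample))      # False
```

This kind of check would explain the chardet result: the bytes are compressed binary data, not text in any character encoding.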
Solution
import urllib.request
from io import BytesIO
import gzip
url = 'https://devapi.qweather.com/v7/weather/now?location=101060110&key=cd6ce864556a4c099b3342a674a87d5e'
html = urllib.request.urlopen(url)
html_text = html.read()  # raw bytes, still gzip-compressed
buff = BytesIO(html_text)  # wrap the bytes in a file-like object
f = gzip.GzipFile(fileobj=buff)  # reader that decompresses on the fly
html_text = f.read().decode('utf-8')  # decompress, then decode as UTF-8
print(html_text)
# {"code":"200","updateTime":"2021-03-29T20:27+08:00","fxLink":"http://hfx.link/1tly1","now":{"obsTime":"2021-03-29T20:00+08:00","temp":"7","feelsLike":"4","icon":"154","text":"阴","wind360":"359","windDir":"风","windScale":"2","windSpeed":"8","humidity":"57","precip":"0.0","pressure":"975","vis":"13","cloud":"94","dew":"-2"},"refer":{"sources":["Weather China"],"license":["no commercial use"]}}
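The solution above always decompresses, which would fail on a response that is not gzip-compressed. A slightly more defensive variant could decompress only when the Content-Encoding header or the magic bytes indicate gzip. This is a sketch under assumptions: decode_body and fetch_text are hypothetical helper names, UTF-8 is assumed as the text encoding, and the demonstration at the bottom uses a made-up payload so that it runs without network access.

```python
import gzip
import urllib.request


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Hypothetical helper: decompress raw if it is gzip data, then decode it to text."""
    if content_encoding.lower() == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url: str) -> str:
    """Hypothetical helper: fetch url and return its body as text, handling gzip."""
    with urllib.request.urlopen(url) as resp:
        encoding = resp.headers.get("Content-Encoding") or ""
        return decode_body(resp.read(), encoding)


# Offline demonstration with a made-up payload (assumption, not real API data).
plain = '{"code":"200"}'.encode("utf-8")
print(decode_body(gzip.compress(plain), "gzip"))  # compressed input
print(decode_body(plain))                         # uncompressed input passes through
```

With this in place, fetch_text(url) could replace the urlopen, BytesIO and GzipFile steps; gzip.decompress does the same work as wrapping the bytes in BytesIO and reading through GzipFile.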