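The first pitfall I hit while writing a crawler:
UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 29531: illegal multibyte sequence
The few lines of code below are only meant to fetch the source of the Baidu home page, yet they kept raising the error in the title. After a lot of searching online, I finally got it working.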
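The error can be reproduced without any network access. The snippet below is a minimal sketch, not part of the original script: `try_encode` and `sample` are hypothetical names introduced here, and `'\xbb'` is simply the character named in the traceback above. It shows that the GBK codec rejects that character in strict mode, drops it silently with `'ignore'`, and that UTF-8 encodes it without trouble.

```python
def try_encode(text, codec, errors='strict'):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(codec, errors)
    except UnicodeEncodeError as exc:
        return str(exc)


# One character GBK can encode, followed by the one from the traceback.
sample = '中\xbb'

print(try_encode(sample, 'gbk'))            # the error message, as a string
print(try_encode(sample, 'gbk', 'ignore'))  # '\xbb' is dropped from the result
print(try_encode(sample, 'utf-8'))          # UTF-8 can represent both characters
```

Writing a `str` to a file opened in text mode performs this same encoding step, using whatever encoding the file object was opened with.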
import urllib.request
import time
import platform
import os
import sys
import io

def clear():
    '''Clear the screen.'''
    print('Lots of output; the page will turn in 3 seconds')
    time.sleep(3)
    OS = platform.system()
    if (OS == 'Windows'):
        os.system('cls')
    else:
        os.system('clear')

def linkBaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url,timeout=3)
        result = response.read().decode('utf-8','ignore')
        #result = result.encode('GBK','ignore')
    except Exception as e:
        print("Bad network address")
        exit()
    # No encoding is passed, so open() uses the platform default (GBK here);
    # fp.write() is where the UnicodeEncodeError is raised.
    with open('baidu.txt', 'w') as fp:
        fp.write(result)
    print("URL : response.geturl() : %s" %response.geturl())
    print("Status code : response.getcode() : %s" %response.getcode())
    print("Response headers : response.info() : %s" %response.info())
    print("The page content has been saved to baidu.txt in the current directory")

if __name__ == '__main__':
    linkBaidu()
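The fix is to re-encode the text after decoding it, then convert the resulting bytes to a string:
# Step 1: characters GBK cannot represent are dropped by 'ignore'
result = result.encode('GBK','ignore')
# Step 2: str() on a bytes object yields its b'...' literal as text
fp.write(str(result))
After that it worked.
The output is as follows: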
URL : response.geturl() : http://www.baidu.com
Status code : response.getcode() : 200
Response headers : response.info() : Bdpagetype: 1
Bdqid: 0xec36b3870004e4e7
Cache-Control: private
Content-Type: text/html
Cxy_all: baidu+2a11a08485d2a15f9348ad46d5b91ff9
Date: Thu, 28 Mar 2019 13:26:04 GMT
Expires: Thu, 28 Mar 2019 13:25:05 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=FF75FD7466E5ADC2E8EBAB609E90671E:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=FF75FD7466E5ADC2E8EBAB609E90671E; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1553779564; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1429_28777_21084_28771_28724_28557_28697_28584_26350_28519_28626_22158; path=/; domain=.baidu.com
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked
The page content has been saved to baidu.txt in the current directory
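One caveat about the encode/str workaround described above: because `str()` on a bytes object returns its `b'...'` literal, the saved file holds escaped bytes rather than readable page text, and any character GBK cannot represent is lost. A possible alternative, sketched here under the assumption that the goal is a readable copy of the page, is to pass `encoding='utf-8'` to `open()` so the write does not depend on the platform default. The names `save_text`, `load_text`, and `page` are hypothetical, and `page` is a short stand-in string, not real Baidu output.

```python
import os
import tempfile


def save_text(path, text):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(text)


def load_text(path):
    """Read the file back using the same explicit encoding."""
    with open(path, 'r', encoding='utf-8') as fp:
        return fp.read()


# Stand-in for the decoded page; it contains the character from the traceback.
page = '<title>百度一下\xbb</title>'

path = os.path.join(tempfile.mkdtemp(), 'baidu.txt')
save_text(path, page)
restored = load_text(path)
print(restored == page)  # the text survives the round trip unchanged

# For comparison: what the encode/str workaround would write to the file.
workaround = str(page.encode('GBK', 'ignore'))
print(workaround)  # a b'...' literal with escaped bytes, '\xbb' dropped
```

In the original script this would amount to changing only the `open('baidu.txt', 'w')` call; the file then needs to be opened as UTF-8 by whatever editor or program reads it.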