robots.txt
Disallow: paths that may not be crawled
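The Disallow rules in robots.txt can also be checked from Python; a minimal sketch with the standard library's urllib.robotparser (example.com stands in for a real site):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.example.com/robots.txt')  # example site, not a real target
rp.read()  # download and parse robots.txt
# can_fetch() reports whether the given User-Agent may crawl the given path
print(rp.can_fetch('*', 'https://www.example.com/some/page'))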
Classification
General-purpose web crawlers
Focused web crawlers
F12 (open the browser developer tools)
Dynamic data may not show up in the Sources panel
URL
HTTP protocol; HTTPS is the encrypted version of HTTP: an SSL layer is added beneath HTTP
Request Method is the request type (GET, POST)
Remote Address: xxx.xxx.xxx.xxx:hhh
hhh is the port number: HTTP uses 80, HTTPS uses 443
https://xx.xxx.com/s? --> URL
/s --> URI
? --> request parameters
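These URL parts can be split apart with the standard library; a minimal sketch using urllib.parse (the search URL is only an illustration):

from urllib.parse import urlparse, parse_qs

url = 'https://www.baidu.com/s?wd=python&pn=10'  # illustrative search URL
parts = urlparse(url)
print(parts.scheme)           # https
print(parts.netloc)           # www.baidu.com
print(parts.path)             # /s  -> the URI part
print(parse_qs(parts.query))  # {'wd': ['python'], 'pn': ['10']}  -> request parameters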
cookie
AJAX requests
Then parse the response on json.cn
To tell whether something is an AJAX request: if data shown in Elements cannot be found in Sources, it came from an AJAX request; if it can be found, it is part of the HTML itself
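Once the AJAX request is spotted in the Network panel, its JSON response can usually be fetched and parsed directly instead of pasting it into json.cn; a minimal sketch, with a hypothetical endpoint URL:

import requests

api_url = 'https://www.example.com/api/list?page=1'  # hypothetical AJAX endpoint
headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(api_url, headers=headers)
data = resp.json()  # parse the JSON body into Python dicts/lists
print(data)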
First right-click the page and view the page source to confirm the page encoding (gbk, utf-8)
Parse the data with BeautifulSoup
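A minimal sketch of that flow, against an example URL; resp.encoding would be set to 'gbk' instead if the page source declares gbk:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.example.com/', headers={'User-Agent': 'Mozilla/5.0'})
resp.encoding = 'utf-8'  # match the encoding seen in the page source (utf-8 or gbk)
bs = BeautifulSoup(resp.text, 'html.parser')
print(bs.title)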
urllib
urllib.request
Generally the site responds with 418, which signals anti-crawling
Fix: set the User-Agent header
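With urllib.request, the usual fix for 418 is to attach a browser-like User-Agent to the Request object; a minimal sketch against an example URL:

import urllib.request

url = 'https://www.example.com/'  # example URL, not a real target
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = urllib.request.Request(url, headers=headers)  # request carrying the User-Agent
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode('utf-8')
print(html[:200])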
Working with Excel files in Python
Reading web page data
import requests
from pyquery import PyQuery
url='https://www.qidian.com/rank/yuepiao/chn0/year2024-month05-page2/'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0','Cookie':'_yep_uuid=85a81d49-29f0-b252-5ef0-99a7605962f6; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A5%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22qd_C44%22%7D; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; newstatisticUUID=1716961289_537816331; fu=979453169; supportwebp=true; _csrfToken=2gzZLO1pAUrsmTpALsz7qGFxGFCkCunLxWF7b2yT; _gid=GA1.2.1149782972.1717063257; traffic_utm_referer=https%3A//www.baidu.com/link; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A11%2C%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A118%22%2C%22l2%22%3A1%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A3%2C%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A78%22%7D; Hm_lvt_f00f67093ce2f38f215010b699629083=1716961290,1717063256,1717066204; x-waf-captcha-referer=; Hm_lpvt_f00f67093ce2f38f215010b699629083=1717067134; _gat_gtag_UA_199934072_2=1; _ga_FZMMH98S83=GS1.1.1717063256.4.1.1717067133.0.0.0; _ga=GA1.1.389148743.1716961290; _ga_PFYW0QLV3P=GS1.1.1717063256.4.1.1717067134.0.0.0; w_tsfp=ltvgWVEE2utBvS0Q6KzgkkqsHj87Z2R7xFw0D+M9Os09B6MnVZ+A0oF+t9fldCyCt5Mxutrd9MVxYnGGVtAifBQXQsmTb5tH1VPHx8NlntdKRQJtA57VXVMcJO0m62YQLWhdcUHmiW9+JNVAxeMzilBa4iR337ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgX2NuwDuLi09Ru8CiRTNwCppXSB94QS5AvtBJ03/fZHtBbxsvznhwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZViygr42Z7z4brN24nJcAb5FWw1G+gkYs7Zt+hdICyjsYiHbUP97sQYBQqBY+cv4eyiWxNc='}
resp=requests.get(url,headers=headers)
# print(resp.text)
# initialize the PyQuery object
doc=PyQuery(resp.text)  # build the PyQuery object from the HTML string
# a_tag=doc('h2 a')
# print(a_tag)
names=[a.text for a in doc('h2 a')]
authors=doc('p.author a')
author_lst=[]
for index in range(len(authors)):
    if index%2==0:  # keep every other link under p.author, which is the author name
        author_lst.append(authors[index].text)
# print(author_lst)
# print(names)
for name,author in zip(names,author_lst):
    print(name,':',author)
First import the two libraries.
url holds the page to be read.
headers holds the page's User-Agent string and its Cookie.
Check whether the page uses a GET or a POST request.
Print the response to confirm that the page was read successfully.
Initialize the PyQuery object from the HTML string.
Select the elements you want to read.
Use a for loop to drop the elements that are not needed (see the sketch below).
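One possible alternative to the index filter is a more specific CSS selector together with PyQuery's .items(); this is only a sketch on made-up HTML, since the real Qidian markup may not use an a.name class:

from pyquery import PyQuery

html = '<p class="author"><a class="name">AuthorA</a><a>Lv.5</a></p>'  # made-up snippet
doc = PyQuery(html)
# select only the links with class "name" instead of keeping every other element
authors = [a.text() for a in doc('p.author a.name').items()]
print(authors)  # ['AuthorA']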
Writing web page data to an Excel spreadsheet
# Written: 2024/5/30 22:31
# Classroom example - scraping recipe data from Xiachufang
import requests
from bs4 import BeautifulSoup
import openpyxl
def send_request():
    url='https://www.xiachufang.com/explore/'
    header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'}
    resp=requests.get(url,headers=header)
    return resp.text
1. Fetch the page source in the same way as before.
def parse_html(html):
    # parse the data
    count=0
    bs=BeautifulSoup(html,'html.parser')
    lst_name=bs.find_all('p',class_='name')
    lst_category=bs.find_all('p',class_='ing ellipsis')
    # print(lst_name)
    # print(lst_category)
    lst=[]  # list that will hold one row per recipe
    for i in range(len(lst_name)):
        count+=1
        food_url='https://www.xiachufang.com/'+lst_name[i].find('a')['href']
        print(food_url)
        lst.append([count,lst_name[i].text[18:-14],lst_category[i].text[1:-1],food_url])
    print(lst)
2. count makes it easy to number how many recipes there are.
BeautifulSoup parses the data (choose which parser to use).
Choose which tag to select and what its class name is (for example, the p tag whose class attribute is 'name').
Same as above.
Loop over the recipes to build each one's URL; the text[...] slicing removes the characters that are not needed.
    save(lst)  # last statement of parse_html(): hand the rows over to save()
def save(lst):
    wb=openpyxl.Workbook()
    sheet=wb.active
    for row in lst:
        sheet.append(row)
    wb.save('下厨房美食.xlsx')
def start():
    result=send_request()
    parse_html(result)
if __name__ == '__main__':
    start()
3. Create a workbook, then perform the operations Excel needs.
Write the data into the file.
Operations
Reading data from a web page:
xxx=item.find('tag name',class_='class name').text
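A minimal sketch of that pattern on made-up HTML (the tag and class names are illustrative, not from a real page):

from bs4 import BeautifulSoup

html = '<div class="recipe"><p class="name"><a href="/recipe/1">Tomato and Egg</a></p></div>'  # made-up snippet
item = BeautifulSoup(html, 'html.parser')
name = item.find('p', class_='name').text  # tag name first, then the class via class_
print(name.strip())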