Web Scraping 123

robots.txt

Paths listed under Disallow must not be crawled.
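A minimal sketch of checking robots.txt in code with the standard library's urllib.robotparser; the site and paths here are made-up examples:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (hypothetical example site)
rp = RobotFileParser('https://example.com/robots.txt')
rp.read()

# can_fetch() returns False for paths covered by a Disallow rule
print(rp.can_fetch('*', 'https://example.com/admin/'))
print(rp.can_fetch('*', 'https://example.com/index.html'))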

Classification

General-purpose web crawlers

Focused web crawlers

F12 (open the browser's developer tools)

Dynamically loaded data may not appear under the Sources tab.

URL

The HTTP protocol; HTTPS is the encrypted version of HTTP, adding an SSL/TLS layer on top of HTTP.

Request Method is the HTTP method (GET, POST).

Remote Address: xxx.xxx.xxx.xxx:hhh

hhh is the port number: HTTP uses 80, HTTPS uses 443.

https://xx.xxx.com/s? --> URL

/s --> URI

? --> the request parameters follow
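A quick sketch of splitting a URL into these parts with the standard library's urllib.parse; the URL and its query parameters are made-up examples:

from urllib.parse import urlparse, parse_qs

url = 'https://xx.xxx.com/s?wd=python&page=2'  # hypothetical example URL
parts = urlparse(url)

print(parts.scheme)           # https
print(parts.netloc)           # xx.xxx.com
print(parts.path)             # /s  (the URI part)
print(parse_qs(parts.query))  # {'wd': ['python'], 'page': ['2']}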

Cookie

AJAX requests

Then parse the JSON response at json.cn.

To tell whether data comes from an AJAX request: if data you can see in Elements cannot be found in Sources, it was loaded via AJAX; if it can be found there, it is part of the HTML itself.
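Once you have located the AJAX request in the Network tab (filter by XHR/Fetch), you can usually call its URL directly and parse the JSON in code instead of pasting it into json.cn. A minimal sketch, assuming a hypothetical endpoint that returns JSON:

import requests

# Hypothetical AJAX endpoint copied from DevTools' Network tab
ajax_url = 'https://example.com/api/list?page=1'
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(ajax_url, headers=headers)
data = resp.json()  # parse the JSON response directly into Python objects
print(data)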

First, right-click the page and choose View Page Source to confirm the page's encoding (GBK or UTF-8).
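With requests you can also confirm and fix the encoding in code; a minimal sketch against a placeholder URL:

import requests

resp = requests.get('https://example.com/')  # placeholder URL
print(resp.encoding)           # encoding declared in the response headers
print(resp.apparent_encoding)  # encoding guessed from the page content

# If the page is GBK but was decoded with the wrong default, set it explicitly
resp.encoding = resp.apparent_encoding
html = resp.text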

Parsing data with BeautifulSoup

urllib

urllib.request

Typically you may get an HTTP 418 response, i.e. the site's anti-scraping measures kicked in.

The fix: send a User-Agent header.
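A minimal sketch of setting a User-Agent with urllib.request to avoid the 418 response; the target URL is a placeholder:

from urllib.request import Request, urlopen

url = 'https://example.com/'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Without a browser-like User-Agent, some sites answer 418 (anti-scraping)
req = Request(url, headers=headers)
with urlopen(req) as resp:
    html = resp.read().decode('utf-8')  # decode per the page's encoding
print(html[:200])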

Working with Excel files in Python

Reading data from a web page

import requests
from pyquery import PyQuery

url = 'https://www.qidian.com/rank/yuepiao/chn0/year2024-month05-page2/'
# User-Agent and Cookie copied from the browser's request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0',
    'Cookie': '_yep_uuid=85a81d49-29f0-b252-5ef0-99a7605962f6; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A5%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22qd_C44%22%7D; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; newstatisticUUID=1716961289_537816331; fu=979453169; supportwebp=true; _csrfToken=2gzZLO1pAUrsmTpALsz7qGFxGFCkCunLxWF7b2yT; _gid=GA1.2.1149782972.1717063257; traffic_utm_referer=https%3A//www.baidu.com/link; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A11%2C%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A118%22%2C%22l2%22%3A1%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A3%2C%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A78%22%7D; Hm_lvt_f00f67093ce2f38f215010b699629083=1716961290,1717063256,1717066204; x-waf-captcha-referer=; Hm_lpvt_f00f67093ce2f38f215010b699629083=1717067134; _gat_gtag_UA_199934072_2=1; _ga_FZMMH98S83=GS1.1.1717063256.4.1.1717067133.0.0.0; _ga=GA1.1.389148743.1716961290; _ga_PFYW0QLV3P=GS1.1.1717063256.4.1.1717067134.0.0.0; w_tsfp=ltvgWVEE2utBvS0Q6KzgkkqsHj87Z2R7xFw0D+M9Os09B6MnVZ+A0oF+t9fldCyCt5Mxutrd9MVxYnGGVtAifBQXQsmTb5tH1VPHx8NlntdKRQJtA57VXVMcJO0m62YQLWhdcUHmiW9+JNVAxeMzilBa4iR337ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgX2NuwDuLi09Ru8CiRTNwCppXSB94QS5AvtBJ03/fZHtBbxsvznhwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZViygr42Z7z4brN24nJcAb5FWw1G+gkYs7Zt+hdICyjsYiHbUP97sQYBQqBY+cv4eyiWxNc='
}
resp = requests.get(url, headers=headers)
# print(resp.text)  # debug: confirm the fetch succeeded

# Initialize the PyQuery object from the response string
doc = PyQuery(resp.text)
# a_tag = doc('h2 a')  # debug: inspect the selected <a> tags
# print(a_tag)
# Book titles: the text of each <a> inside an <h2>
names = [a.text for a in doc('h2 a')]

# Under p.author, every other <a> is the author link
authors = doc('p.author a')

author_lst = []
for index in range(len(authors)):
    if index % 2 == 0:
        author_lst.append(authors[index].text)
# print(author_lst)
# print(names)

for name, author in zip(names, author_lst):
    print(name, ':', author)

First, import the two libraries.

Set url to the page you want to fetch.

Put the page's User-Agent header and its Cookie into headers.

Check whether the page uses a GET or a POST request.

Print the response to confirm the fetch succeeded.

Initialize the PyQuery object from the string.

Select the elements you want to read.

Use a for loop to filter out the elements you don't need.

Writing web-page data to an Excel spreadsheet

# Created: 2024/5/30 22:31
# Class example: scraping recipe data from xiachufang.com
import requests
from bs4 import BeautifulSoup
import openpyxl

def send_request():
    url = 'https://www.xiachufang.com/explore/'
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'}
    resp = requests.get(url, headers=header)
    return resp.text

1. Fetch the page's source code the same way as before.

def parse_html(html):
    # Parse the data
    count = 0
    bs = BeautifulSoup(html, 'html.parser')
    lst_name = bs.find_all('p', class_='name')               # recipe names
    lst_category = bs.find_all('p', class_='ing ellipsis')   # ingredient lists
    # print(lst_name)
    # print(lst_category)
    lst = []  # build a list of rows
    for i in range(len(lst_name)):
        count += 1
        food_url = 'https://www.xiachufang.com/' + lst_name[i].find('a')['href']
        print(food_url)
        # the slices strip surrounding whitespace and unwanted characters
        lst.append([count, lst_name[i].text[18:-14], lst_category[i].text[1:-1], food_url])
    print(lst)

2. count numbers the recipes so you can tell how many there are.

BeautifulSoup parses the data (choose which parser to use).

Select the tag to match and its class value (e.g., a p tag whose class attribute is name).

As above.

Loop to build the URL for each recipe; the text[...] slices strip the unwanted characters.

    save(lst)  # final line of parse_html above: hand the rows to save()

def save(lst):
    wb = openpyxl.Workbook()
    sheet = wb.active
    for row in lst:
        sheet.append(row)
    wb.save('下厨房美食.xlsx')

def start():
    result = send_request()
    parse_html(result)

if __name__ == '__main__':
    start()

3. Create a workbook, then set up the operations Excel needs.

Write the data into the file.

Operations

Reading data from a web page:

xxx = item.find('tag name', class_='class attribute value').text
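A small self-contained example of that pattern, using a hypothetical HTML fragment:

from bs4 import BeautifulSoup

# Hypothetical HTML fragment to illustrate the find pattern
html = '<div class="item"><p class="name"><a href="/r/1">Tomato Eggs</a></p></div>'
soup = BeautifulSoup(html, 'html.parser')

item = soup.find('div', class_='item')
name = item.find('p', class_='name').text  # tag name + class attribute value
print(name)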
