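How to scrape the links on a web page with Playwright's page.eval_on_selector_all() method, as the basis for a crawler

Introduction to page.eval_on_selector_all

Playwright's page.eval_on_selector_all() method can collect every link on a page, which is the basic building block of a crawler. It is a batch method: it finds all elements that match the given selector, passes them as an array to a JavaScript function, and returns that function's result to Python.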

results = page.eval_on_selector_all(
    selector, 
    expression, 
    arg=None 
)

Parameters

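Parameter  | Type | Description
-----------|------|------------
selector   | str  | A CSS or XPath selector. Playwright infers the engine from the syntax; an explicit prefix such as css=button or xpath=//div also works.
expression | str  | JavaScript expression to evaluate in the page. When it evaluates to a function, that function is called with two arguments: elements (the array of matched elements) and arg (the optional extra value passed from Python).
arg        | Any  | Optional value passed to the JavaScript function; it must be serializable (for example, JSON-compatible values).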
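
To make the role of arg concrete, here is a minimal sketch that passes a URL prefix from Python into the JavaScript function. The helper name links_with_prefix, the demo() function, and the inline HTML are illustrative assumptions and not part of the original example; the only Playwright calls used are page.eval_on_selector_all and page.set_content.

```python
def links_with_prefix(page, prefix):
    """Return the href of every <a> on `page` that starts with `prefix`.

    `page` is assumed to be a Playwright sync-API Page object.
    """
    return page.eval_on_selector_all(
        "a",
        # The second JavaScript parameter receives the Python `arg` value.
        "(elements, prefix) => elements"
        ".map(e => e.href)"
        ".filter(href => href.startsWith(prefix))",
        prefix,
    )


def demo():
    """Hypothetical demo on inline HTML (needs Playwright and a browser)."""
    from playwright.sync_api import sync_playwright

    html = (
        '<a href="https://example.com/a">A</a>'
        '<a href="http://example.org/b">B</a>'
        '<a href="mailto:someone@example.com">Mail</a>'
    )
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.set_content(html)
        links = links_with_prefix(page, "https://")
        browser.close()
    return links
```

Calling demo() should return only the links whose resolved href begins with the prefix. Passing the prefix through arg avoids building the JavaScript source by string formatting on the Python side.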

Code example

Collecting all links on the Baidu home page

import time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch the browser and open the Baidu home page
    print("start")
    browser = p.chromium.launch(headless=True)  # set headless=False to watch the browser window
    page = browser.new_page()
    page.goto("https://www.baidu.com")
    time.sleep(5)  # crude fixed wait so dynamically inserted links can render

    # Use eval_on_selector_all to collect the href of every link
    all_links = page.eval_on_selector_all(
        selector="a",  # select all <a> tags
        expression="""(elements) => {
            return elements
                .map(e => e.href)          // extract href
                .filter(href => href !== '')  // drop empty values
                .filter(href => href.startsWith('http')); // keep only http(s) links
        }"""
    )

    # Print the results
    print(f"Found {len(all_links)} links:")
    for idx, url in enumerate(all_links, 1):
        print(f"{idx}. {url}")

    # Close the browser
    browser.close()

Output:

Found 60 links:
1. https://www.baidu.com/
2. https://passport.baidu.com/v2/?login&tpl=mn&u=https%3A%2F%2F2.zoppoz.workers.dev%3A443%2Fhttp%2Fwww.baidu.com%2F&sms=5
3. http://news.baidu.com/
4. https://www.hao123.com/?src=from_pc
5. http://map.baidu.com/
6. http://tieba.baidu.com/
7. https://haokan.baidu.com/?sfrom=baidu-top
8. http://image.baidu.com/
9. https://pan.baidu.com/?from=1026962h
10. https://wenku.baidu.com/?fr=bdpcindex
11. https://chat.baidu.com/search?isShowHello=1&pd=csaitab&setype=csaitab&extParamsJson=%7B%22enter_type%22%3A%22home_tab%22%7D
12. http://www.baidu.com/more/
13. http://fanyi.baidu.com/
14. http://xueshu.baidu.com/
15. https://baike.baidu.com/
16. https://zhidao.baidu.com/
17. https://jiankang.baidu.com/widescreen/home
18. http://e.baidu.com/ebaidu/home?refer=887
19. https://live.baidu.com/
20. http://music.taihe.com/
21. https://cp.baidu.com/?sa=bdindex
22. http://www.baidu.com/more/
23. https://passport.baidu.com/v2/?login&tpl=mn&u=https%3A%2F%2F2.zoppoz.workers.dev%3A443%2Fhttp%2Fwww.baidu.com%2F&sms=5
24. https://www.baidu.com/
25. https://chat.baidu.com/search?isShowHello=1&pd=csaitab&setype=csaitab&extParamsJson=%7B%22enter_type%22%3A%22ai_explore_home%22%7D&usedModel=%7B%22modelName%22%3A%22DeepSeek-R1%22%7D
26. https://top.baidu.com/board?platform=pc&sa=pcindex_entry
27. https://www.baidu.com/s?wd=%E5%9C%A8%E6%96%B0%E6%97%B6%E4%BB%A3%E7%BB%A7%E6%89%BF%E5%92%8C%E5%BC%98%E6%89%AC%E4%BC%9F%E5%A4%A7%E6%8A%97%E6%88%98%E7%B2%BE%E7%A5%9E&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
28. https://www.baidu.com/s?wd=%E7%BD%91%E5%8F%8B%E5%B8%AE%E7%94%B7%E5%A4%A7%E5%AD%A6%E7%94%9F%E6%8B%8D%E7%85%A7+%E5%9B%9E%E5%AE%B6%E5%8F%91%E7%8E%B0%E6%98%AF%E6%9D%8E%E7%8E%B0&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
29. https://www.baidu.com/s?wd=%E5%85%A8%E7%90%83%E8%82%A1%E5%B8%82%E5%B7%A8%E9%9C%87&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
30. https://www.baidu.com/s?wd=%E8%80%90%E5%85%8B%E7%AD%89%E5%93%81%E7%89%8C%E6%88%96%E5%B0%86%E8%A2%AB%E8%BF%AB%E6%8F%90%E4%BB%B7&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
31. https://www.baidu.com/s?wd=%E7%89%B9%E6%9C%97%E6%99%AE%E9%A1%BE%E9%97%AE%EF%BC%9A%E4%B8%8D%E5%8D%96%E6%8E%89%E8%82%A1%E7%A5%A8%E5%B0%B1%E4%B8%8D%E4%BC%9A%E4%BA%8F%E9%92%B1&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
32. https://www.baidu.com/s?wd=%E7%89%B9%E6%9C%97%E6%99%AE%E6%94%BF%E5%BA%9C%E9%98%B5%E8%84%9A%E5%BC%80%E5%A7%8B%E4%B9%B1%E4%BA%86&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
33. https://www.baidu.com/s?wd=%E6%B0%91%E8%90%A5%E4%BC%81%E4%B8%9A%E5%BC%80%E5%90%AF%E2%80%9C%E5%8A%A0%E9%80%9F%E8%B7%91%E2%80%9D&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
34. https://www.baidu.com/s?wd=%E6%B2%AA%E6%8C%87%E8%B7%8C%E8%B6%858%25+%E5%88%9B%E4%B8%9A%E6%9D%BF%E6%8C%87%E8%B7%8C%E8%B6%8514%25&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
35. https://www.baidu.com/s?wd=%E9%87%91%E4%BB%B7%E4%B8%BA%E4%BD%95%E5%BC%80%E5%A7%8B%E8%B7%8C%E4%BA%86&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
36. https://www.baidu.com/s?wd=%E4%B8%9C%E9%83%A8%E6%88%98%E5%8C%BA%EF%BC%9A%E5%AD%90%E5%A4%9C%E5%8D%87%E7%A9%BA+%E5%B1%95%E5%BC%80%E6%88%98%E6%96%97&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1
37. https://home.baidu.com/
38. http://ir.baidu.com/
39. https://www.baidu.com/duty
40. https://help.baidu.com/question?prod_id=1
41. https://e.baidu.com/?refer=1271
42. http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
43. https://beian.miit.gov.cn/
44. https://www.baidu.com/licence/
45. http://ir.baidu.com/
46. https://www.baidu.com/duty
47. https://help.baidu.com/question?prod_id=1
48. https://e.baidu.com/?refer=1271
49. http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
50. https://beian.miit.gov.cn/
51. https://www.baidu.com/licence/
52. https://chat.baidu.com/search?pd=csaitab&setype=csaitab&extParamsJson=%7B%22enter_type%22%3A%22search_a_tab%22%2C%22sa%22%3A%22vs_tab%22%7D
53. http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8
54. https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8
55. http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8
56. http://www.baidu.com/s?pd=note&rpf=pc
57. https://map.baidu.com/?newmap=1&ie=utf-8&from=pstab&s=s
58. http://tieba.baidu.com/f?fr=wwwt&ie=utf-8
59. http://wenku.baidu.com/search?lm=0&od=0&ie=utf-8
60. http://www.baidu.com/more/
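
A list of links like the one above is what a crawler feeds on: each collected URL becomes a candidate page to visit next. The following is only a rough sketch of that idea under stated assumptions. The names crawl, get_links, make_playwright_get_links, and max_pages are hypothetical, and the traversal is a plain breadth-first loop with no politeness delay, robots.txt handling, or error handling.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=10):
    """Breadth-first crawl starting at `start_url`.

    `get_links(url)` is any callable that returns the links found on `url`.
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


def make_playwright_get_links(page):
    """Build a `get_links` callable backed by a Playwright Page (assumed)."""
    def get_links(url):
        page.goto(url)
        return page.eval_on_selector_all(
            "a",
            "els => els.map(e => e.href).filter(h => h.startsWith('http'))",
        )
    return get_links
```

With a real page object, crawl(start_url, make_playwright_get_links(page), max_pages=5) would visit at most five pages. Keeping get_links pluggable also lets the loop be exercised without a browser.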

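The comments above explain the Python side, but many readers may be less familiar with JavaScript arrow function (=>) syntax, so here is a closer look.

Basic arrow function (=>) syntax

=> marks an arrow function in JavaScript.

  • Core symbol: => is the mark of an arrow function; the parameter list is on its left and the function body on its right.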

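  • Relationship to a traditional function (roughly equivalent, apart from how this is bound):

    // Arrow function
    (param1, param2) => { /* function body */ }

    // Roughly equivalent traditional function
    function(param1, param2) {
      // function body
    }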
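
For readers coming from Python, a loose analogy may help: an arrow function with a single-expression body plays a role similar to a lambda, and the traditional form is similar to a def. The sketch below is only an analogy; the dictionary standing in for an element is a made-up example, and the two languages differ in details.

```python
# Anonymous function, comparable in spirit to the arrow function  e => e.href
get_href = lambda e: e["href"]


# Equivalent named function, comparable to  function(e) { return e.href; }
def get_href_named(e):
    return e["href"]


link = {"href": "https://example.com/"}  # stand-in for an <a> element
print(get_href(link) == get_href_named(link))
```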

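The entry function (elements) => { ... }

  • Purpose
    Playwright passes every DOM element that matches the selector a into this function as the array elements.

  • Parameter
    elements is a JavaScript array containing the DOM element objects for all <a> tags on the page.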

 .map(e => e.href)

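  • The map method
    Iterates over every element of the elements array, applies e => e.href to each one, and returns a new array of the results.

  • The arrow function e => e.href

    • e: the <a> element currently being visited.

    • e.href: the element's full URL (relative paths are resolved to absolute ones automatically, for example /about → https://www.baidu.com/about).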
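
The relative-to-absolute resolution that e.href performs in the browser can be approximated on the Python side with the standard library's urljoin. This is a sketch for intuition only: the base URL and the sample paths are hypothetical, and the browser's own resolution also takes things like a <base> tag into account.

```python
from urllib.parse import urljoin

base = "https://example.com/docs/index.html"  # hypothetical page URL

# A root-relative path, a document-relative path, and an absolute URL
for raw in ["/about", "guide.html", "https://other.example/x"]:
    print(raw, "->", urljoin(base, raw))
```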

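Implicit return

  • An arrow function whose body is a single expression without braces (such as e => e.href) returns that expression's value automatically; no return keyword is needed.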

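Chained calls

  • .map().filter().filter(): chaining calls array methods one after another, so the data is processed step by step and the code stays compact.

  • The chain is equivalent to this step-by-step version: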

    const hrefs = elements.map(e => e.href);
    const nonEmptyHrefs = hrefs.filter(href => href !== '');
    const httpHrefs = nonEmptyHrefs.filter(href => href.startsWith('http'));
    return httpHrefs;
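
The same filtering can also be done in Python after the call returns, which is a convenient place to drop the repeated URLs visible in the output above (entries 1 and 24, for instance, are identical). This is a minimal sketch; the function name clean_links is an assumption made for illustration.

```python
def clean_links(hrefs):
    """Keep non-empty http(s) links and remove duplicates, preserving order."""
    http_links = [h for h in hrefs if h and h.startswith("http")]
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(http_links))


sample = [
    "https://a.example/",
    "",
    "javascript:void(0)",
    "http://b.example/",
    "https://a.example/",
]
print(clean_links(sample))
```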