使用selenium进行动态网页爬取详解-CSDN博客

本文链接：https://blog.csdn.net/cckavin/article/details/104943885

本文详细介绍了为何使用selenium进行动态网页爬取，包括安装selenium及浏览器驱动，如何创建WebDriver对象，使用各种查找元素的方法，以及如何通过WebElement对象进行页面操作。还提醒了在使用过程中注意不要直接打开浏览器。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

为什么使用selenium

因为有些网页是动态渲染的，如果使用传统的请求库进行爬虫，可能得不到所需要的内容，所以使用selenium库。

安装

1、selenium

selenium的安装方式详见参考资料[2]。

pip install selenium

2、浏览器

可以安装谷歌，火狐，edge等浏览器。

3、浏览器对应的驱动

浏览器驱动可以到淘宝镜像站(详见参考资料[4])下载。

下载的时候注意浏览器驱动版本最好和浏览器版本一致。

下载后最好将浏览器驱动防止到Python安装目录的根目录下，不然可能会提示selenium.common.exceptions.WebDriverException 错误。

序号	浏览器	浏览器驱动
1	谷歌浏览器	ChromaDriver
2	火狐浏览器	geckodriver

使用

1、WebDriver对象

创建浏览器对象方式参考各个浏览器驱动的说明文档（如ChromeDriver的文档,详见参考资料[5]）。

# -*- coding: utf-8 -*-
from selenium import webdriver

# 创建WebDriver对象
browser = webdriver.Chrome()
print(type(browser))  # <class 'selenium.webdriver.chrome.webdriver.WebDriver'>

webdriver.Chrome()会创建一个WebDriver对象。

2、WebDriver对象的方法

调用浏览器对象的get()/post()方法发起请求。然后再调用浏览器对象的其它方法即可获取页面内容。

# -*- coding: utf-8 -*-
from selenium import webdriver

# 创建WebDriver对象
browser = webdriver.Chrome()

# 调用WebDriver对象的get()方法发起请求
url = 'https://voice.baidu.com/act/newpneumonia/newpneumonia/?from=osari_pc_1#tab2'
browser.get(url)
# 调用WebDriver对象page_source()方法获取页面源代码
browser.page_source()

浏览器对象的所有方法详见：https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webdriver.html#module-selenium.webdriver.remote.webdriver

3、WebElement对象

WebDriver对象有很多种查找元素的方法，调用这些查找元素方法方法会创建一个WebElement 类型对象。

这些查找元素的方法可以分为两类：

（1）find_element_*

find_element_*方法返回一个 WebElement 类型的字符串，代表页面中匹配查询的第一个元素。

# -*- coding: utf-8 -*-
from selenium import webdriver

# 不要打开浏览器
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sanbox')

# 创建Webdriver象
browser = webdriver.Chrome(options=options)

url = 'https://voice.baidu.com/act/virussearch/virussearch?from=osari_map&tab=0&infomore=1'
browser.get(url)

# 点击查看更多
# btn = browser.find_element_by_css_selector('.VirusHot_1-5-4_1Fqxy-')
# btn.click()


# 调用WebDriver的方法，生成WebElement对象
element = browser.find_element_by_xpath(
    '//*[@id="ptab-0"]/div/div[2]/section/a/div/span[2]')

print(type(element))  # str

（2）find_elements_*

# -*- coding: utf-8 -*-
from selenium import webdriver

# 不要打开浏览器
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sanbox')

# 创建Webdriver象
browser = webdriver.Chrome(options=options)

url = 'https://voice.baidu.com/act/virussearch/virussearch?from=osari_map&tab=0&infomore=1'
browser.get(url)

# 点击查看更多
# btn = browser.find_element_by_css_selector('.VirusHot_1-5-4_1Fqxy-')
# btn.click()


# 调用WebDriver的方法，生成WebElement对象
elements = browser.find_elements_by_xpath(
    '//*[@id="ptab-0"]/div/div[2]/section/a/div/span[2]')

print(type(elements))  # list

find_elements_* 方法返回 WebElement类型的列表，包含页面中所有匹配的元素。

除了*_by_tag_name()方法，所有方法的参数都是区分大小写的。

如果页面上没有元素匹配该方法要查找的元素，selenium 模块就会抛出 NoSuchElement 异常。如
果你不希望这个异常让程序崩溃，就在代码中添加 try 和 except 语句

4、WebElement对象的方法

调用WebElement对象的方法和属性就可以在页面进行相关操作——如WebElement.txt属性获取元素内的文本。

# -*- coding: utf-8 -*-
from selenium import webdriver

# 不要打开浏览器
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sanbox')

# 创建Webdriver象
browser = webdriver.Chrome(options=options)

url = 'https://voice.baidu.com/act/virussearch/virussearch?from=osari_map&tab=0&infomore=1'
browser.get(url)

# 点击查看更多
# btn = browser.find_element_by_css_selector('.VirusHot_1-5-4_1Fqxy-')
# btn.click()


# 调用WebDriver的方法，生成WebElement对象
elements = browser.find_elements_by_xpath(
    '//*[@id="ptab-0"]/div/div[2]/section/a/div/span[2]')

print(type(elements))  # list


# 调用WebElement对象的属性获取数据
for ele in elements:
    print(ele.text)
browser.close()

WebElement对象的方法和属性详见：https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webelement.html#module-selenium.webdriver.remote.webelement

说明

1、不要打开浏览器

# -*- coding: utf-8 -*-
from selenium import webdriver

# 不要打开浏览器
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sanbox')

# 创建Webdriver象
browser = webdriver.Chrome(options=options)