Scraping Douban movie information with selenium and lxml

License: CC 4.0 BY-SA

Tags: #python #selenium

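This article describes a way to scrape data from the Douban movie site with Python and Selenium. PhantomJS loads the full page, and XPath expressions then extract each movie's title, rank, and image link.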

Environment
Key code explained
Complete code

Environment

Python 3.5
CentOS 7.2

Key code explained

Load the page with selenium:

driver = webdriver.PhantomJS()
driver.get("https://movie.douban.com/")
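PhantomJS is no longer maintained, and Selenium 4 dropped `webdriver.PhantomJS`, so the two lines above only run on an older Selenium 3 install. As a rough sketch of an equivalent on a current install, assuming Chrome and a matching driver are available (this is not part of the original setup), a headless Chrome session could be opened like this. The function name `open_headless_chrome` is made up for illustration.

```python
def open_headless_chrome(url):
    """Open `url` in headless Chrome and return the driver (Selenium 4 API)."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Run without a visible window, which is the role PhantomJS played.
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    return driver
```

Calling `driver = open_headless_chrome("https://movie.douban.com/")` would then give a driver that can be used in place of the PhantomJS one, and `driver.quit()` shuts it down when finished.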

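Interact with the page through selenium so that it loads completely, by clicking the "more" button until it can no longer be found:

has_more = True
while has_more:
    try:
        # Look up the "more" button and click it to load further entries
        more_button = driver.find_element_by_class_name("more")
        more_button.click()
    except Exception:
        # Any failure (typically the button no longer exists) ends the loop
        print("No 'more' element left.")
        has_more = False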
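The loop above stops only when the lookup or the click fails, so a button that never disappears would keep it running forever. A minimal sketch of the same pattern with an upper bound on the number of clicks is shown below. The names `click_until_gone`, `FakePage` and `FakeButton` are hypothetical, and the fake classes exist only so the loop can be tried without a browser.

```python
def click_until_gone(find_button, max_clicks=50):
    """Click the element returned by find_button() until lookup or click fails.

    Returns the number of successful clicks. max_clicks guards against a
    button that never disappears.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            find_button().click()
        except Exception:
            # With Selenium this is typically NoSuchElementException.
            break
        clicks += 1
    return clicks


class FakeButton:
    """Stand-in for a WebElement, used only to exercise the loop offline."""

    def __init__(self, page):
        self.page = page

    def click(self):
        self.page.remaining -= 1


class FakePage:
    """Pretends to have a 'more' button that vanishes after a few clicks."""

    def __init__(self, remaining):
        self.remaining = remaining

    def find_more(self):
        if self.remaining <= 0:
            raise LookupError("no 'more' button")
        return FakeButton(self)


page = FakePage(remaining=3)
print(click_until_gone(page.find_more))  # prints 3
```

With a real driver the call would look like `click_until_gone(lambda: driver.find_element_by_class_name("more"))`.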

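Get the page source that contains the movie information:

movis = driver.page_source
# quit() ends the whole browser session, not just the current window
driver.quit()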

Parse the page source with XPath:

html = etree.HTML(movis)
titles = html.xpath("//a[@class='item']")
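To see what this selection returns without starting a browser, the same two calls can be run against a small static HTML string. The markup below is invented to match the XPath expressions used in this article. It is an assumption about the page structure, not a copy of Douban's actual HTML.

```python
from lxml import etree

# Hypothetical markup: two movie entries and one unrelated link.
SAMPLE = """
<html><body>
  <a class="item" href="#">
    <div><img src="a.jpg"/></div>
    <p>
      Movie A
      <strong>9.1</strong>
    </p>
  </a>
  <a class="item" href="#">
    <div><img src="b.jpg"/></div>
    <p>
      Movie B
      <strong>8.7</strong>
    </p>
  </a>
  <a class="other" href="#">not a movie</a>
</body></html>
"""

html = etree.HTML(SAMPLE)
# Only <a> elements whose class attribute is exactly "item" are selected.
titles = html.xpath("//a[@class='item']")
print(len(titles))

# A path starting with "./" is evaluated relative to the selected element.
for item in titles:
    print(item.xpath("./div/img/@src"), item.xpath("./p/strong/text()"))
```

Each element in `titles` is itself queryable, which is why the extraction step can call `.xpath()` on `titles[i]` with paths relative to that entry.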

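Extract the required content:

for item in titles:
    # Paths are relative to the current <a class="item"> element
    url_img = item.xpath("./div/img/@src")
    title_moive = item.xpath("./p/text()")
    rank_movie = item.xpath("./p/strong/text()")
    # Remove all whitespace from the title text; fall back to "" if it is missing
    title_moive = re.sub(r"\s+", "", title_moive[0]) if title_moive else ""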
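The `re.sub` call is there because the title text node comes back with the line breaks and indentation of the page around it. A small sketch of what the substitution does, using made-up input strings and a hypothetical helper name `clean_title`:

```python
import re


def clean_title(raw):
    """Remove every run of whitespace, as the article's re.sub call does."""
    return re.sub(r"\s+", "", raw)


def collapse_spaces(raw):
    """Alternative: trim the ends and keep single spaces between words."""
    return " ".join(raw.split())


print(clean_title("\n      肖申克的救赎\n      "))
print(clean_title("\n  The Matrix \n"))
print(collapse_spaces("\n  The Matrix \n"))
```

Removing all whitespace works well for Chinese titles, but it also joins the words of a title that contains spaces, so the second variant may be the better fit for such titles.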

Complete code

from selenium import webdriver
from lxml import etree
import re

# Start a headless PhantomJS browser and open the Douban movie page
driver = webdriver.PhantomJS()
driver.get("https://movie.douban.com/")

# Keep clicking the "more" button until it can no longer be found or clicked
has_more = True
while has_more:
    try:
        more_button = driver.find_element_by_class_name("more")
        more_button.click()
    except Exception:
        print("No 'more' element left.")
        has_more = False

# Grab the fully expanded page source, then shut the browser down
movis = driver.page_source
driver.quit()

print(type(movis))
html = etree.HTML(movis)
# Each movie entry is an <a class="item"> element
titles = html.xpath("//a[@class='item']")

for item in titles:
    url_img = item.xpath("./div/img/@src")
    title_moive = item.xpath("./p/text()")
    rank_movie = item.xpath("./p/strong/text()")
    # Remove all whitespace from the title text; fall back to "" if it is missing
    title_moive = re.sub(r"\s+", "", title_moive[0]) if title_moive else ""
    print(url_img, "===", title_moive, "===", rank_movie)
    print("****************************************************************************")