Scraping Web Page Information with a Python Crawler

桑梓。

Published 2021-09-28 22:38:18

Category: Python    Tags: python, crawler, html

Article link: https://blog.csdn.net/Lin_YG/article/details/120539059

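This article presents a way to write a web crawler in Python and walks through scraping data from a classical Chinese poetry website: importing the required libraries, fetching the page, converting its format, parsing page elements, saving the data, and retrieving the poem themes and page counts, so that the full data set can be collected.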
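The save step named in the summary could look roughly like the sketch below. It is only an illustration under assumptions: the function name `save_rows`, the column names, and the sample rows are hypothetical, and the author's actual storage format is not known.

```python
import csv
import io


def save_rows(rows, fileobj):
    """Write (quote, source) pairs as CSV rows below a header line."""
    writer = csv.writer(fileobj)
    writer.writerow(["quote", "source"])  # hypothetical column names
    writer.writerows(rows)


# Hypothetical sample data standing in for scraped results.
sample_rows = [("床前明月光", "李白《静夜思》"), ("春眠不觉晓", "孟浩然《春晓》")]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
save_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the buffer can be replaced with `open("quotes.csv", "w", newline="", encoding="utf-8")`; passing `newline=""` is what the `csv` module documentation recommends for files handed to a writer.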

Scraping a classical Chinese poetry website with Python

Import the libraries

import requests  # fetches web pages
from lxml import etree  # parses web pages

Function 1: Fetch the page

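def get_html(k, type_v1):
    # k + 1 fills the p query parameter, type_v1 fills the c query parameter
    url = "https://so.gushiwen.cn/mingju/default.aspx?p={}&c={}&t=".format(k + 1, type_v1)  # target URL
    print(url)
    # The header value must not repeat the "User-Agent:" name, and its parenthesis must be closed
    ua = {
        'User-Agent': "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    resp = requests.get(url, headers=ua)
    print(resp.status_code)  # print the status code
    # print(resp.text)   # print the page source
    return resp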
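The URL construction in `get_html` can also be separated from the request itself, which makes it easy to check without touching the network. The following is a minimal sketch, not the author's code: `build_url` and `page_urls` are hypothetical helpers, the theme value "春天" is made up, and reading `p` as the page number and `c` as the theme is an assumption based on the article summary.

```python
from urllib.parse import urlencode

# Base address copied from get_html.
BASE_URL = "https://so.gushiwen.cn/mingju/default.aspx"


def build_url(k, type_v1, base_url=BASE_URL):
    """Return the listing URL for index k and the value type_v1."""
    # urlencode percent-encodes non-ASCII values such as Chinese theme names.
    query = urlencode({"p": k + 1, "c": type_v1, "t": ""})
    return "{}?{}".format(base_url, query)


def page_urls(page_count, type_v1):
    """Yield the URLs for the first page_count pages of one theme."""
    for k in range(page_count):
        yield build_url(k, type_v1)


# "春天" is a made-up theme value used only for illustration.
urls = list(page_urls(3, "春天"))
for url in urls:
    print(url)
```

Each generated URL could then be passed to `requests.get(url, headers=ua, timeout=10)`; the `timeout` argument is optional in Requests, but without it a stalled server can block the crawler indefinitely.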

Function 2: Convert the format

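def change_html(resp):
    # 2. Parse elements
    html = etree.HTML(resp.text)  # build an element tree that XPath can query
    # etree.tostring(html) would serialize the repaired tree back to bytes;
    # that is useful for debugging, but the result is not needed here
    return html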
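To see what `etree.HTML` returns and how XPath queries run against it, here is a small sketch on an inline string. The markup, class name, and XPath expressions are hypothetical and chosen only for illustration; the real page structure may differ.

```python
from lxml import etree

# Hypothetical markup; note that the second <div> is never closed.
SAMPLE_HTML = """
<div class="cont"><a href="/a">first quote</a><a href="/b">source A</a></div>
<div class="cont"><a href="/c">second quote</a><a href="/d">source B</a>
"""

# The HTML parser is lenient: it wraps the fragment in <html><body>
# and closes the dangling <div> for us.
html = etree.HTML(SAMPLE_HTML)

# a[1] and a[2] select the first and second <a> child of each matching <div>.
quotes = html.xpath('//div[@class="cont"]/a[1]/text()')
sources = html.xpath('//div[@class="cont"]/a[2]/text()')

for quote, source in zip(quotes, sources):
    print(quote, "|", source)
```

Because `xpath` always returns a list, an expression that matches nothing yields an empty list rather than raising an error, so it is worth checking the length before indexing into the result.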

Function 3: Parse the elements

def elem_def(html):
    # 2.1 Parse the title
    title_v1 = html.xpath
