Baidu News Web Crawler: Crawler Code for Scraping Baidu News (CSDN Blog)

Article link: https://blog.csdn.net/rvgekj/article/details/146171212

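Baidu News Web Crawler: Development Notes

Software Requirements Analysis

This project builds a simple crawler for the Baidu News page. It must extract news titles and links from the page and display the titles in a table. To parse the page correctly it has to handle the page's character encoding, and its CSS selector must be adjusted to the actual page structure so that news items are extracted accurately.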

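Technology Selection

Programming Language

Python is the development language: its rich library ecosystem and concise syntax make it well suited to writing a crawler quickly.

Libraries

requests: sends the HTTP request and retrieves the content of the Baidu News page.
BeautifulSoup: parses the HTML so that news titles and links can be extracted.
plotly.graph_objects: draws the table that displays the extracted news titles.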

Coding and Testing

Importing the Required Libraries

web_crawler.py imports requests, BeautifulSoup, and plotly.graph_objects.

import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go

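Encapsulating the Request and Parsing Logic

The fetch_baidu_news function encapsulates the logic for requesting and parsing the Baidu News page. It works as follows:

Send an HTTP request to fetch the page content.
Read the charset from the response headers and set the response encoding.
Check the response status code to confirm that the request succeeded.
Parse the HTML content with BeautifulSoup.
Select the news items with a CSS selector matched to the actual page structure, and extract each title and link.
Draw a plotly table that displays the news titles.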

def fetch_baidu_news(url):
    # Send an HTTP request to fetch the page content;
    # the timeout keeps a stalled connection from hanging the script
    response = requests.get(url, timeout=10)
    # Read the charset from the Content-Type response header
    content_type = response.headers.get('Content-Type')
    encoding = content_type.split('charset=')[-1] if content_type and 'charset=' in content_type else None
    # Use the declared charset if there is one, otherwise fall back to apparent_encoding
    if encoding:
        response.encoding = encoding
    else:
        response.encoding = response.apparent_encoding
    # Check the response status code
    if response.status_code == 200:
        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # This selector is only an example; update it to match the actual
        # structure of the Baidu News page
        news_items = soup.select('div#pane-news a')
        news_titles = []
        news_links = []
        for item in news_items:
            # strip=True removes the whitespace surrounding the link text
            title = item.get_text(strip=True)
            link = item.get('href')
            if title and link:
                print(f'Title: {title}, Link: {link}')
                # Format the title as a hyperlink for the table cell
                news_titles.append(f'<a href="{link}">{title}</a>')
                news_links.append(link)
            else:
                print('Title or link not found')
        # Display the titles in a plotly table
        fig = go.Figure(data=[go.Table(header=dict(values=['Title']),
                                       cells=dict(values=[news_titles]))])
        fig.show()
    else:
        print(f'Request failed, status code: {response.status_code}')
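
The loop above uses each href exactly as it appears in the page. As a minimal sketch, assuming that some links may be relative, duplicated, or javascript: pseudo-links (the original code does not check for any of these), the extracted pairs could be cleaned up before they reach the table. The helper name clean_news_items, the example.com base URL, and the sample data are hypothetical; only standard-library calls are used.

```python
from urllib.parse import urljoin, urlparse


def clean_news_items(pairs, base_url):
    """Normalize (title, href) pairs.

    Hypothetical helper for illustration; not part of the original script.
    """
    seen = set()
    cleaned = []
    for title, href in pairs:
        title = (title or '').strip()
        href = (href or '').strip()
        # Skip entries that lack a title or a link
        if not title or not href:
            continue
        # Resolve relative links against the page URL
        link = urljoin(base_url, href)
        # Keep only http(s) links, which drops javascript: and mailto: entries
        if urlparse(link).scheme not in ('http', 'https'):
            continue
        # Keep the first occurrence of each link
        if link in seen:
            continue
        seen.add(link)
        cleaned.append((title, link))
    return cleaned


# Made-up sample data standing in for (item.get_text(), item.get('href'))
sample = [
    ('  Headline A ', '/news/a'),
    ('Headline A again', '/news/a'),
    ('Menu', 'javascript:void(0)'),
    ('', 'https://example.com/empty'),
    ('Headline B', 'https://example.com/b?x=1'),
]
items = clean_news_items(sample, 'https://example.com/')
```

In fetch_baidu_news, the pairs would come from the news_items loop and base_url would be the url argument; whether deduplicating by link is desirable depends on the page being scraped.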

Calling the Function

At the end of the script, call fetch_baidu_news with the Baidu News URL.

# URL of the target page
url = 'https://news.baidu.com/'

# Call the function
fetch_baidu_news(url)

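Problems Solved

Encoding

Fetching the page requires handling its character encoding. The code reads the charset from the Content-Type response header and uses it if present; otherwise it falls back to apparent_encoding, the encoding that requests detects from the response body.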

content_type = response.headers.get('Content-Type')
encoding = content_type.split('charset=')[-1] if content_type and 'charset=' in content_type else None
if encoding:
    response.encoding = encoding
else:
    response.encoding = response.apparent_encoding
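
The snippet above takes everything after charset= as the encoding name. As a rough sketch, assuming that a header may carry a quoted value, a differently cased parameter name, or extra parameters after the charset, the standard library's email.message.Message can parse the header instead. The function names charset_from_content_type and decode_body are hypothetical and not part of the original script, and this version falls back to UTF-8 rather than apparent_encoding, which is a simplification.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset declared in a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lower-cases the value and strips surrounding quotes
    return msg.get_content_charset()


def decode_body(raw, content_type, fallback='utf-8'):
    """Decode response bytes with the declared charset, else the fallback."""
    charset = charset_from_content_type(content_type) or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or mismatched charset: decode leniently instead of failing
        return raw.decode(fallback, errors='replace')


declared = charset_from_content_type('text/html; Charset="GBK"; foo=bar')
text = decode_body('新闻'.encode('gbk'), 'text/html; charset=gbk')
```

With requests, the raw bytes are available as response.content, so a call such as decode_body(response.content, response.headers.get('Content-Type')) could stand in for setting response.encoding and reading response.text.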