Crawler Application Development

Latest recommended article published 2025-08-19 15:08:56

w000006

CC 4.0 BY-SA license

Article tags: crawler

Article link: https://blog.csdn.net/w000006/article/details/149766930

Basic Workflow of Crawler Development

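1. Define the target and requirements: specify the type of data to crawl (text, images, tables, etc.), the source website, and the exact fields needed.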

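2. Analyze the target website:

◦ Inspect the page structure: static HTML or dynamically loaded content (dynamic pages usually require analyzing the AJAX requests behind them).

◦ Check the robots protocol (robots.txt, the site's rules restricting crawlers).

◦ Identify anti-crawling mechanisms (CAPTCHAs, IP restrictions, request rate limits, etc.).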

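3. Choose development tools and technologies:

◦ Programming language: Python (the common choice, with a rich library ecosystem), Java, JavaScript, etc.

◦ Core libraries/frameworks:

    ◦ Python: Requests (sends HTTP requests), BeautifulSoup (parses HTML), Scrapy (a dedicated crawling framework), Selenium (drives a real browser to handle dynamic pages).

    ◦ Others: Cheerio for Node.js, Jsoup for Java, etc.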

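4. Write the crawler:

◦ Send requests to fetch the page data.

◦ Parse the data and extract the target information.

◦ Handle anti-crawling measures (set request headers, use proxy IPs, randomize the interval between requests).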

5. Store and process the data: save the crawled data to a database (MySQL, MongoDB, etc.) or to files (CSV, JSON, etc.), or clean and analyze it.

6. Test and optimize: check data completeness and program stability, then tune crawling efficiency and the anti-crawling strategy.
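
As a minimal sketch of the data-completeness check in step 6, the hypothetical helper below splits parsed rows into valid and rejected ones. The five-field row layout and the 0 to 10 score range are assumptions made for illustration, not rules taken from any particular site.

```python
def validate_rows(rows, expected_len=5):
    """Split rows into (valid, rejected) using simple completeness checks."""
    valid, rejected = [], []
    for row in rows:
        # A usable row has the expected number of fields, none of them blank.
        complete = len(row) == expected_len and all(str(f).strip() for f in row)
        score_ok = False
        if complete:
            try:
                # Assumption: the second field is a numeric score from 0 to 10.
                score_ok = 0.0 <= float(row[1]) <= 10.0
            except (TypeError, ValueError):
                score_ok = False
        (valid if complete and score_ok else rejected).append(row)
    return valid, rejected
```

Counting the rejected rows after each run gives a quick signal that the page layout may have changed or that an anti-crawling page was served instead of real content.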

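Notes

• Legality and compliance: follow the site's robots protocol, avoid crawling copyrighted or private data, and respect the site's access rate limits.

• Handling anti-crawling measures: set reasonable request intervals so the crawler does not put load on the target server; when necessary, use techniques such as proxy pools and User-Agent pools.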
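
The request-interval and User-Agent pool ideas can be sketched with the standard library alone. Everything named here (`USER_AGENTS`, `random_headers`, `Throttle`, and the 1 to 3 second gap) is hypothetical and chosen for illustration; a suitable interval depends on the target site.

```python
import random
import time

# Assumption: a tiny illustrative pool; real projects maintain their own list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def random_headers(pool=USER_AGENTS):
    """Return request headers with a User-Agent picked at random from pool."""
    return {"User-Agent": random.choice(pool)}


class Throttle:
    """Enforce a randomized minimum gap between consecutive requests."""

    def __init__(self, min_gap=1.0, max_gap=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.max_gap = max_gap
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep until the chosen gap has passed; return the seconds slept."""
        gap = random.uniform(self.min_gap, self.max_gap)
        waited = 0.0
        if self._last is not None:
            waited = max(0.0, gap - (self._clock() - self._last))
            if waited > 0:
                self._sleep(waited)
        self._last = self._clock()
        return waited
```

In a crawl loop, calling `throttle.wait()` before each `requests.get(url, headers=random_headers(), timeout=30)` keeps the request rate bounded while varying the User-Agent.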

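Crawlers are widely used for data collection, market analysis, public opinion monitoring, search engine indexing, and similar scenarios. Development requires balancing efficiency, compliance, and stability.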
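
On the compliance side, the robots protocol check can be sketched with Python's standard `urllib.robotparser`. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are assumptions for illustration; the rules are parsed from a string so that no network request is made.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a made-up robots.txt used only to demonstrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so nothing is downloaded here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.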

Below is the code for crawling a certain website:

import requests
from lxml import etree
import csv
import pymysql

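def get_html(url, time=30):
    """Fetch url and return the decoded page text, or None on failure."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/80.0.3987.132 Safari/537.36"}
    try:
        r = requests.get(url, headers=headers, timeout=time)
        # Fail on 4xx/5xx responses before doing encoding detection.
        r.raise_for_status()
        # Guess the encoding from the body; the declared one can be missing or wrong.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as error:
        print(error)
        return None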

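def parser(html):
    """Extract [title, score, author, press, pubdate] rows from the list page."""
    doc = etree.HTML(html)
    out_list = []
    for row in doc.xpath("//*[@id='content']/div[2]/div[1]/ul/li"):
        title = row.xpath("./div[2]/h2/a/text()")
        score = row.xpath("./div[2]/div/span[2]/text()")
        info = row.xpath("./div[2]/p[1]/text()")
        # Skip entries with a missing field instead of raising IndexError.
        if not (title and score and info):
            continue
        info = [part.strip() for part in info[0].split("/")]
        if len(info) < 2:
            continue
        # The first part is the author; the last two are press and pubdate.
        row_list = [title[0].strip(), score[0].strip(), info[0], info[-2], info[-1]]
        out_list.append(row_list)
    return out_list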

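def save_mysql(sql, val, **dbinfo):
    """Bulk-insert the rows in val; roll back if anything fails."""
    connect = None
    try:
        connect = pymysql.connect(**dbinfo)
        with connect.cursor() as cursor:
            cursor.executemany(sql, val)
        connect.commit()
    except Exception as err:
        # connect is still None when the connection attempt itself failed.
        if connect is not None:
            connect.rollback()
        print(err)
    finally:
        if connect is not None:
            connect.close()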

if __name__ == "__main__":
    url = "https://book.1111111.com/latest"
    html = get_html(url)
    # get_html returns None when the request fails, so stop before parsing.
    if html is None:
        raise SystemExit("failed to fetch " + url)
    out_list = parser(html)
    params = {
        "host": "127.0.0.1",
        "user": "root",
        "password": "123456",
        "db": "spider",
        "charset": "utf8",
        "cursorclass": pymysql.cursors.DictCursor
    }
    sql = "insert into bookinfo(BookName,Score,Autor,Press,Pubdate) VALUES(%s,%s,%s,%s,%s)"
    save_mysql(sql, out_list, **params)