Use Scrapy and Selenium to automatically crawl selected fields of the detail records for all trust products listed on the yanglee.com product page (http://www.yanglee.com/Product/Index.aspx).
Combining Scrapy with Selenium makes it possible to scrape dynamically rendered pages. On sites such as yanglee.com, the product detail content is generated by JavaScript, so plain Scrapy requests may not return it. The rough steps are as follows:
1. **Install dependencies**:
- Install Scrapy: `pip install scrapy`
- Install Selenium: `pip install selenium`; you also need a browser driver such as ChromeDriver that matches your installed Chrome version
- Optionally, install `webdriver-manager` to download and manage the driver automatically: `pip install webdriver-manager` (see the sketch below)
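If you would rather not manage ChromeDriver by hand, webdriver-manager can fetch a matching driver for you. A minimal sketch using the Selenium 4 `Service` API (assumes Selenium 4 and webdriver-manager are installed):
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download (and cache) a ChromeDriver matching the local Chrome, then start the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('http://www.yanglee.com/Product/Index.aspx')
print(driver.title)
driver.quit()
```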
2. **Create a Scrapy project**:
```
scrapy startproject yanglee_crawler
cd yanglee_crawler
```
3. **Set up a Scrapy middleware**:
In the project's `middlewares.py` (created by `scrapy startproject`), add a `SeleniumMiddleware` that renders pages in a real browser and hands the rendered HTML back to Scrapy:
```python
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    def __init__(self):
        # Starts a Chrome instance; ChromeDriver must be on PATH
        # (or be wired in via webdriver-manager as shown above)
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info("Spider opened")

    def process_request(self, request, spider):
        # Only render requests that explicitly ask for Selenium via request.meta
        if request.meta.get('use_selenium', False):
            self.driver.get(request.url)
            html = self.driver.page_source
            # Returning an HtmlResponse short-circuits the normal download,
            # so the spider receives the rendered page
            return HtmlResponse(url=request.url, body=html, encoding='utf-8', request=request)
        return None

    def spider_closed(self, spider):
        self.driver.quit()
```
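Because the product list is filled in by JavaScript, grabbing `page_source` immediately after `driver.get()` can return an incomplete page. One option is an explicit wait; the sketch below is a drop-in replacement for the `process_request` method above, and the CSS selector is only an assumption that must be adjusted to the real page:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

    # replaces SeleniumMiddleware.process_request from the previous listing
    def process_request(self, request, spider):
        if request.meta.get('use_selenium', False):
            self.driver.get(request.url)
            # Wait up to 10 seconds for the product container to appear
            # ('div.product-detail' is a placeholder selector)
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-detail'))
            )
            return HtmlResponse(url=request.url, body=self.driver.page_source,
                                encoding='utf-8', request=request)
        return None
```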
4. **Configure Scrapy**:
In `settings.py`, enable the middleware (it implements `process_request`, so it must be registered as a downloader middleware) and throttle the request rate:
```python
DOWNLOADER_MIDDLEWARES = {
    'yanglee_crawler.middlewares.SeleniumMiddleware': 543,
}
DOWNLOAD_DELAY = 1  # throttle requests to keep the crawl polite
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
# Requests that need browser rendering are flagged individually in the spider
# with meta={'use_selenium': True}, rather than by a URL pattern
```
5. **Define the Spider**:
Create a spider in the `spiders` directory, e.g. `yanglee_product_spider.py`:
```python
import scrapy
from bs4 import BeautifulSoup


class YangleeProductSpider(scrapy.Spider):
    name = "yanglee_product"
    start_urls = ['http://www.yanglee.com/Product/Index.aspx']

    def start_requests(self):
        # Mark these requests so the Selenium middleware renders them
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'use_selenium': True})

    def parse(self, response):
        # The middleware already returned the rendered HTML, so response.text can be
        # parsed directly; the selectors below are examples and must be adjusted
        # to the site's actual markup
        soup = BeautifulSoup(response.text, 'lxml')
        products = soup.find_all('div', class_='product-detail')
        for product in products:
            title = product.find('h2').text
            yield {'title': title}  # extract other fields here, e.g. rate and term
```
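If the goal is the per-product detail fields rather than just the index rows, `parse` can be extended to follow each product's link and render the detail page with Selenium as well. A hypothetical sketch (`parse_detail`, the link selector, and the detail fields are all assumptions):
```python
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for product in soup.find_all('div', class_='product-detail'):
            link = product.find('a')
            if link and link.get('href'):
                # Render the detail page through the Selenium middleware too
                yield response.follow(link['href'],
                                      callback=self.parse_detail,
                                      meta={'use_selenium': True})

    def parse_detail(self, response):
        # Placeholder extraction; selectors depend on the real detail-page markup
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }
```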
6. **Run the spider**:
```
scrapy crawl yanglee_product
```
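To write the scraped items straight to a file, Scrapy's built-in feed export flag can be used, for example:
```
scrapy crawl yanglee_product -o products.csv
```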
Note that the actual parsing depends on yanglee.com's page structure; inspect the live page to determine the correct CSS selectors or XPath expressions. Also, driving a real browser with Selenium consumes far more time and resources than plain HTTP requests, so monitor the crawl and keep the request rate low.
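Beyond the fixed `DOWNLOAD_DELAY` above, one way to keep the rate under control is Scrapy's AutoThrottle extension combined with a single concurrent request, since one shared WebDriver instance can only render one page at a time. A possible addition to `settings.py`:
```python
# Optional settings.py additions
AUTOTHROTTLE_ENABLED = True   # adapt the delay to observed server response times
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS = 1       # a single shared WebDriver cannot render pages in parallel
```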