For this code:

```
import requests
from bs4 import BeautifulSoup
import pprint
import json

def download_all_htmls():
    """Download the HTML of every listing page for later analysis."""
    htmls = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    for idx in range(24):
        url = f"https://www.utusan.com.my/page/{idx+1}"
        print("craw html:", url)
        r = requests.get(url)
        if r.status_code != 200:
            raise Exception("error")
        htmls.append(r.text)
    return htmls

htmls = download_all_htmls()
print(htmls[0])
```

add code that saves each scraped page's data to a separate txt file named after its page number, and also add code that scrapes 20 pages at a time with a 10-second pause in between.
The code that scrapes each page and saves its data to a txt file is as follows:
```
import requests
from bs4 import BeautifulSoup
import pprint
import json
import time
def download_all_htmls():
    """Download the HTML of every listing page for later analysis."""
    htmls = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    for idx in range(24):
        url = f"https://www.utusan.com.my/page/{idx+1}"
        print("crawl html:", url)
        r = requests.get(url, headers=headers)
        if r.status_code != 200:
            raise Exception(f"request failed: {url}")
        htmls.append(r.text)
        # Save each page's HTML to its own txt file, named after the page number
        with open(f"page{idx+1}.txt", "w", encoding="utf-8") as f:
            f.write(r.text)
    return htmls
htmls = download_all_htmls()
print(htmls[0])
```
In this example, we use a `with` statement to open a file named `page{idx+1}.txt` and write each page's HTML to that file.
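Since `BeautifulSoup` is imported but not yet used, the saved txt files can later be parsed offline. Here is a minimal sketch, assuming you want to list the `<h2>` headings of a saved page; the tag to select is an assumption and should be adjusted to the site's actual markup:

```
from bs4 import BeautifulSoup

# Re-open one of the files written above; "page1.txt" holds the first listing page.
with open("page1.txt", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Hypothetical selector: print every <h2> heading; adjust the tag/class for the real site.
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))
```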
The code that pauses for 10 seconds after every 20 pages is as follows:
```
import requests
from bs4 import BeautifulSoup
import pprint
import json
import time
def download_all_htmls():
    """Download the HTML of every listing page for later analysis."""
    htmls = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    for idx in range(24):
        url = f"https://www.utusan.com.my/page/{idx+1}"
        print("crawl html:", url)
        r = requests.get(url, headers=headers)
        if r.status_code != 200:
            raise Exception(f"request failed: {url}")
        htmls.append(r.text)
        # Save each page's HTML to its own txt file, named after the page number
        with open(f"page{idx+1}.txt", "w", encoding="utf-8") as f:
            f.write(r.text)
        # Pause for 10 seconds after every 20 pages
        if (idx+1) % 20 == 0:
            print("Sleep for 10 seconds...")
            time.sleep(10)
    return htmls
htmls = download_all_htmls()
print(htmls[0])
```
In this example, we use the `sleep()` function from the `time` module to pause the program for 10 seconds after every 20 pages it scrapes. This lowers the crawler's request rate and reduces the risk of being blocked.
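An alternative rate-limiting pattern is to sleep a short, fixed interval after every single request instead of a longer pause every 20 pages. Below is a sketch under that assumption; the 1-second delay and the `download_page` helper are illustrative, not part of the original code:

```
import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

def download_page(idx, delay=1.0):
    """Fetch one listing page, then wait `delay` seconds before returning."""
    url = f"https://www.utusan.com.my/page/{idx}"
    r = requests.get(url, headers=HEADERS)
    r.raise_for_status()   # raises for any non-2xx status instead of checking manually
    time.sleep(delay)      # fixed pause between consecutive requests
    return r.text
```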