Scraping Boss直聘 with a Python Crawler and Visualizing the Results
Date: 2025-07-01 11:43:31
### Scraping Boss直聘 Job Data with Python
Because the listing pages are rendered dynamically with JavaScript, Selenium is used to drive a real browser. Install the required packages before starting the project:
```bash
pip install selenium pandas matplotlib beautifulsoup4 pyecharts echarts-countries-pypkg echarts-china-provinces-pypkg mysql-connector-python requests
```
#### Initializing Selenium
Point Selenium at a ChromeDriver binary and create a WebDriver instance.
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without a visible browser window
driver_path = 'path_to_chromedriver'  # replace with your chromedriver location
# Selenium 4 passes the driver path via a Service object,
# not the removed executable_path argument
browser = webdriver.Chrome(service=Service(driver_path), options=options)

url = "https://www.zhipin.com/job_detail/?query=大数据&city=101010100"
browser.get(url)
time.sleep(3)  # crude wait for the page to finish loading
```
#### Collecting the Job-Detail Links
Use an XPath locator to extract the URL from each listing and store the URLs in a list so the individual job pages can be visited later.
```python
from selenium.webdriver.common.by import By

job_links = []
# Selenium 4 replaces find_elements_by_xpath with find_elements(By.XPATH, ...)
elements = browser.find_elements(By.XPATH, '//div[@class="info-primary"]/h3/a')
for element in elements:
    link = element.get_attribute('href')
    job_links.append(link)
print(f"Found {len(job_links)} listings")
```
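Selenium is needed to render the page, but once the rendered HTML is in hand (`browser.page_source`), the same links can also be extracted offline. A minimal sketch using only the standard library's `html.parser`; the class name `JobLinkParser` and the sample markup are illustrative, not taken from the live site:

```python
from html.parser import HTMLParser

class JobLinkParser(HTMLParser):
    """Collect href attributes of <a> tags that sit inside a
    <div class="info-primary"> block (exact class match only)."""
    def __init__(self):
        super().__init__()
        self.div_stack = []  # True for each open div with class info-primary
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            self.div_stack.append(attrs.get('class') == 'info-primary')
        elif tag == 'a' and any(self.div_stack) and 'href' in attrs:
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_stack:
            self.div_stack.pop()

sample = '<div class="info-primary"><h3><a href="/job_detail/abc.html">大数据开发</a></h3></div>'
parser = JobLinkParser()
parser.feed(sample)
print(parser.links)  # ['/job_detail/abc.html']
```

This variant is handy for unit-testing the extraction logic against saved HTML snapshots without launching a browser.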
#### Parsing a Single Job Page
After opening each collected link, parse the HTML and pull fields such as work location and salary range into a dictionary; where several items sit under the same `<div>`, BeautifulSoup handles the finer splitting[^2].
```python
from bs4 import BeautifulSoup

def parse_job_page(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    location_tag = soup.select_one('.location-address')
    location = location_tag.text.strip() if location_tag else ''
    salary_tag = soup.select_one('.salary')
    salary_range = salary_tag.text.replace('\n', '').strip() if salary_tag else ''
    detail_div = soup.find('div', {'class': ['job-detail']})
    description_lines = detail_div.get_text(separator='\n').splitlines() if detail_div else []

    responsibilities = []
    requirements = []
    in_requirements = False
    for line in description_lines:
        stripped_line = line.strip()
        if not stripped_line:
            continue  # skip blank lines without switching sections
        if stripped_line.startswith(('任职资格:', '岗位要求:')):
            in_requirements = True  # everything after this heading is a requirement
            continue
        if in_requirements and stripped_line.startswith(('工作职责:', '主要负责')):
            break  # a responsibilities heading after the requirements: stop
        if in_requirements:
            requirements.append(stripped_line)
        else:
            responsibilities.append(stripped_line)

    return {
        'location': location,
        'salary': salary_range,
        'responsibilities': '\n'.join(responsibilities),
        'requirements': '\n'.join(requirements),
    }
```
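The keyword-based split between responsibilities and requirements is the fragile part of the parser, so it helps to pull it out as a standalone function that can be tested against sample descriptions. A sketch under the assumption that postings use headings like 岗位职责/任职资格 (the function name and marker lists are illustrative; real postings vary):

```python
# Heading prefixes that switch the current section; startswith on the
# prefix alone tolerates both full-width ':' and ASCII ':' colons.
RESP_MARKERS = ('工作职责', '岗位职责', '主要负责')
REQ_MARKERS = ('任职资格', '岗位要求', '任职要求')

def split_description(lines):
    """Split description lines into (responsibilities, requirements),
    switching buckets whenever a section heading is seen."""
    responsibilities, requirements = [], []
    in_requirements = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines never change the current section
        if stripped.startswith(REQ_MARKERS):
            in_requirements = True
            continue
        if stripped.startswith(RESP_MARKERS):
            in_requirements = False
            continue
        (requirements if in_requirements else responsibilities).append(stripped)
    return responsibilities, requirements

sample = ['岗位职责:', '负责数据平台开发', '', '任职资格:', '熟悉Hadoop']
print(split_description(sample))  # (['负责数据平台开发'], ['熟悉Hadoop'])
```

Unlike a one-way flag, tracking the current section both ways keeps the split correct even when a posting lists requirements before responsibilities.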
#### Storing the Results in MySQL
Connect to a local or remote MySQL instance, define the table structure, and batch-insert the collected records so they can be queried and aggregated later.
```sql
CREATE TABLE IF NOT EXISTS jobs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    company_name VARCHAR(255),
    work_location TEXT,
    salary_estimate VARCHAR(64),
    duty_description LONGTEXT,
    requirement_specification LONGTEXT
);
```
```python
import mysql.connector as mc

db_config = {
    'host': 'localhost',
    'port': 3306,
    'user': 'root',
    'password': '',  # fill in your own credentials
    'database': 'boss_zhipin',
}
conn = mc.connect(**db_config)
cursor = conn.cursor()

insert_sql = """
INSERT INTO jobs(title, company_name, work_location, salary_estimate, duty_description, requirement_specification)
VALUES (%s, %s, %s, %s, %s, %s)
"""
# `items` holds the listing summaries and `results` the dictionaries
# returned by parse_job_page, collected in the same order
data_tuples = [
    (item['positionName'], item['companyShortName'],
     result['location'], result['salary'],
     result['responsibilities'], result['requirements'])
    for item, result in zip(items, results)
]
try:
    cursor.executemany(insert_sql, data_tuples)
    conn.commit()  # commit only if the batch insert succeeded
finally:
    cursor.close()
    conn.close()
```
### Visualizing the Data
The final step is to chart the results with ECharts (here via pyecharts) or a similar tool. Below is a simple bar chart comparing average monthly salaries across cities.
```python
import pyecharts.options as opts
from pyecharts.charts import Bar

# assumed helper returning a {city: average_monthly_salary_in_yuan} mapping
average_salaries_per_city = calculate_average_salary_groupby_cities()

bar_chart = (
    Bar(init_opts=opts.InitOpts(width='800px', height='400px'))
    .add_xaxis(list(average_salaries_per_city.keys()))
    .add_yaxis("", list(average_salaries_per_city.values()), category_gap="60%")
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各城市平均月薪对比"),
        yaxis_opts=opts.AxisOpts(name="月薪/元", type_="value"),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45), name="城市"),
    )
)
bar_chart.render('salary_by_city.html')  # or bar_chart.render_notebook() in Jupyter
```
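The chart above calls `calculate_average_salary_groupby_cities()`, which the article never defines. A minimal sketch, assuming salary strings in Boss直聘's usual `15-25K` / `15-25K·13薪` format and taking `(city, salary_text)` records as input (the record shape and both function names are assumptions; the article's version presumably reads the `jobs` table itself):

```python
import re

def parse_salary_to_monthly_yuan(salary_text):
    """Turn a string like '15-25K·13薪' into an average monthly salary
    in yuan; return None when the format is unrecognised."""
    match = re.search(r'(\d+)\s*-\s*(\d+)\s*[Kk]', salary_text)
    if not match:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2 * 1000  # midpoint of the range, K -> yuan

def calculate_average_salary_groupby_cities(records):
    """records: iterable of (city, salary_text); returns {city: avg_yuan},
    silently skipping records whose salary cannot be parsed."""
    totals = {}  # city -> [count, sum]
    for city, salary_text in records:
        monthly = parse_salary_to_monthly_yuan(salary_text)
        if monthly is None:
            continue
        count_sum = totals.setdefault(city, [0, 0.0])
        count_sum[0] += 1
        count_sum[1] += monthly
    return {city: s / n for city, (n, s) in totals.items()}

print(calculate_average_salary_groupby_cities(
    [('北京', '15-25K'), ('北京', '20-30K·13薪'), ('上海', '10-20K')]))
# {'北京': 22500.0, '上海': 15000.0}
```

Taking the midpoint of the advertised range is a simplification; postings quoting a thirteenth-month salary ("13薪") are averaged on the monthly figure only.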