Provide the complete Python code required by the document
Date: 2025-06-15 17:20:32
Since the tasks in the document are fairly complex and involve several steps and different data-collection and processing methods, below are Python code examples for each task. For brevity, I provide a basic framework and explain in each part how to implement it.
### 1. Analyze the page structure and extract tag text and attribute values
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.5iai.com/#/index'
# Note: this is a hash-routed single-page app, so requests only receives the
# initial HTML shell; content rendered by JavaScript will not appear here.
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the links in the page header
top_tags = soup.select('.header a')
for tag in top_tags:
    print(f'Text: {tag.text.strip()}, href: {tag.get("href")}')

# Extract the links in the page footer
bottom_tags = soup.select('.footer a')
for tag in bottom_tags:
    print(f'Text: {tag.text.strip()}, href: {tag.get("href")}')
```
### 2. Simulated login
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://www.5iai.com/#/index')

# Assume the login button's CSS selector is .login-button
login_button = driver.find_element(By.CSS_SELECTOR, '.login-button')
login_button.click()
time.sleep(2)  # wait for the page to load

# Enter the username and password (the field IDs are assumptions;
# inspect the page to confirm them)
username_field = driver.find_element(By.ID, 'username')
password_field = driver.find_element(By.ID, 'password')
username_field.send_keys('your_username')
password_field.send_keys('your_password')
password_field.send_keys(Keys.RETURN)
time.sleep(5)  # wait for the login to complete

driver.quit()
```
### 3. Extract the banner images
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.5iai.com/#/index'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the banner images
banner_images = soup.select('.banner img')
for img in banner_images:
    print(f'Src: {img.get("src")}')
```
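If you want to save the banner images rather than just print their URLs, each one needs a local filename. One way is to derive it from the `src` URL's path with the standard library; the helper and example URL below are illustrative only:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url, default='image.jpg'):
    """Derive a local filename from an image URL's path component,
    falling back to a default when the path has no basename."""
    name = os.path.basename(urlparse(url).path)
    return name or default

print(filename_from_url('https://www.5iai.com/static/banner/banner1.png'))
# → banner1.png
# Each image could then be fetched with requests and written in binary mode.
```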
### 4. Extract partner-company and partner-university information
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.5iai.com/#/about'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the partner companies
cooperate_companies = soup.select('.cooperate-companies .company')
for company in cooperate_companies:
    title = company.get('title')
    img_src = company.find('img')['src']
    print(f'Title: {title}, Img Src: {img_src}')

# Extract the partner universities
cooperate_universities = soup.select('.cooperate-universities .university')
for university in cooperate_universities:
    title = university.get('title')
    img_src = university.find('img')['src']
    print(f'Title: {title}, Img Src: {img_src}')
```
### 5. Scrape job listings
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# The ?page= query parameter on a hash route is an assumption; check the real
# pagination scheme (it may be an XHR/JSON API) in the browser dev tools.
base_url = 'https://www.5iai.com/#/jobList'
data = []

def extract_job_info(soup):
    job_list = soup.select('.job-list .job-item')
    for job in job_list:
        company_name = job.select_one('.company-name').text.strip()
        position_name = job.select_one('.position-name').text.strip()
        location = job.select_one('.location').text.strip()
        salary = job.select_one('.salary').text.strip()
        employee_count = job.select_one('.employee-count').text.strip()
        industry = job.select_one('.industry').text.strip()
        position_keywords = job.select_one('.position-keywords').text.strip()
        skill_requirements = job.select_one('.skill-requirements').text.strip()
        position_description = job.select_one('.position-description').text.strip()
        data.append({
            'Company Name': company_name,
            'Position Name': position_name,
            'Location': location,
            'Salary': salary,
            'Employee Count': employee_count,
            'Industry': industry,
            'Position Keywords': position_keywords,
            'Skill Requirements': skill_requirements,
            'Position Description': position_description
        })

# Crawl the first 50 pages
for page in range(1, 51):
    url = f'{base_url}?page={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    extract_job_info(soup)

df = pd.DataFrame(data)
df.to_csv('job_list.csv', index=False)
```
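Text pulled out with `.text.strip()` frequently still contains internal newlines and runs of whitespace, which makes the CSV and the later word clouds noisy. A small normalization helper (purely illustrative) that could be applied to each field before appending to `data`:

```python
import re

def clean_text(value):
    """Collapse runs of whitespace (including newlines and tabs)
    into single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', value).strip()

print(clean_text('  数据分析\n  Python、SQL\t熟练  '))
# → 数据分析 Python、SQL 熟练
```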
### 6. Data analysis
```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba

# Load the scraped data
df = pd.read_csv('job_list.csv')

# For Chinese text, WordCloud needs a font that covers CJK characters;
# point font_path at one available on your system (this path is an example)
FONT_PATH = 'simhei.ttf'

# 1. Word cloud of the skill requirements (segment Chinese text with jieba)
skill_requirements_text = ' '.join(jieba.cut(' '.join(df['Skill Requirements'].dropna())))
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      font_path=FONT_PATH).generate(skill_requirements_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Skill Requirements Word Cloud')
plt.show()

# 2. Word cloud of the position keywords
position_keywords_text = ' '.join(jieba.cut(' '.join(df['Position Keywords'].dropna())))
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      font_path=FONT_PATH).generate(position_keywords_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Position Keywords Word Cloud')
plt.show()

# 3. Educational requirements in big-data postings:
# count how many descriptions mention each degree level
degrees = ['本科', '大专', '硕士', '博士']
degree_counts = pd.Series(
    {d: df['Position Description'].str.contains(d, na=False).sum() for d in degrees})
degree_counts.plot(kind='bar')
plt.title('Degree Requirements Distribution')
plt.xlabel('Degree')
plt.ylabel('Count')
plt.show()

# 4. Keywords from the position descriptions
position_descriptions = ' '.join(df['Position Description'].dropna())
position_descriptions_cut = ' '.join(jieba.cut(position_descriptions))
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      font_path=FONT_PATH).generate(position_descriptions_cut)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Position Descriptions Word Cloud')
plt.show()

# 5. Demand for big-data talent by city
city_counts = df['Location'].value_counts()
city_counts.plot(kind='bar')
plt.title('City Demand Distribution')
plt.xlabel('City')
plt.ylabel('Count')
plt.show()

# 6. Salary by position and by industry
# .mean() needs numbers; scraped salaries are usually strings, so coerce
# first, dropping values that fail to parse (see the parser sketch below
# for range strings such as '10k-15k')
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
salary_by_position = df.groupby('Position Name')['Salary'].mean().sort_values(ascending=False)
salary_by_position.plot(kind='bar')
plt.title('Average Salary by Position')
plt.xlabel('Position')
plt.ylabel('Average Salary')
plt.show()

salary_by_industry = df.groupby('Industry')['Salary'].mean().sort_values(ascending=False)
salary_by_industry.plot(kind='bar')
plt.title('Average Salary by Industry')
plt.xlabel('Industry')
plt.ylabel('Average Salary')
plt.show()
```
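Job boards usually publish salary as a range string such as `'10k-15k'` rather than a number, so the `.mean()` calls above need numeric input. One possible parser maps a range to its midpoint in thousands (the format is an assumption; adapt the regex to what the site actually returns):

```python
import re

def parse_salary(text):
    """Return the midpoint of a '10k-15k'-style range in thousands,
    or None when no 'Nk' figures are present."""
    if not isinstance(text, str):
        return None
    nums = [float(n) for n in re.findall(r'(\d+(?:\.\d+)?)\s*[kK]', text)]
    if not nums:
        return None
    return sum(nums) / len(nums)

print(parse_salary('10k-15k'))  # → 12.5
print(parse_salary('面议'))      # → None (salary "negotiable")
```

It could be applied with `df['Salary'] = df['Salary'].map(parse_salary)` before the grouping steps.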
The code above covers all the tasks required by the document: data collection, simulated login, extraction of specific information, multi-page scraping, and data analysis. You can adjust and optimize it further as needed.