Read the given web access log Web_Visiting_Log.csv into a Spark DataFrame, group the rows by country (Country), compute the mean age (Age) of the users in each country, and draw a boxplot of user age (Age) by country (Country).
Time: 2025-06-15 17:21:07 Views: 10
First, load the CSV file into a Spark DataFrame. Then group by country (Country) to compute the mean age (Age), and finally draw a boxplot of the ages in each country.
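The mean is a single number per country, whereas a boxplot needs every individual age in each group. The following is a minimal plain-Python sketch of that distinction, independent of Spark. The helper `group_ages_by_country` and the sample rows are hypothetical, made up for illustration, and are not taken from Web_Visiting_Log.csv.

```python
from collections import defaultdict
from statistics import mean


def group_ages_by_country(rows):
    """Group (country, age) pairs into {country: [ages]}, skipping missing ages."""
    groups = defaultdict(list)
    for country, age in rows:
        if age is not None:
            groups[country].append(age)
    return dict(groups)


# Hypothetical sample rows (assumption: each row is a (Country, Age) pair)
sample_rows = [
    ("China", 20),
    ("China", 30),
    ("USA", 40),
    ("USA", None),  # a missing age is ignored
    ("USA", 50),
]

# Raw ages per country: this is the input a boxplot needs
ages_by_country = group_ages_by_country(sample_rows)

# One mean per country: this is the grouped average
mean_age_by_country = {c: mean(a) for c, a in ages_by_country.items()}
```

The lists in `ages_by_country` are what a boxplot consumes, while `mean_age_by_country` corresponds to the grouped average. The Spark code produces the same two shapes with `collect_list` and `avg`.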
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import matplotlib.pyplot as plt

# Reuse the active session (e.g. in a notebook) or create a new one
spark = SparkSession.builder.appName("WebVisitingLog").getOrCreate()

# Read the CSV data; with header=True and no schema, every column is loaded as a string
data_df = spark.read.format("csv").option("header", True).load("Web_Visiting_Log.csv")

# Cast Age to an integer (unparseable values become null) and drop incomplete rows
data_df = (
    data_df.withColumn("Age", F.col("Age").cast("int"))
    .dropna(subset=["Country", "Age"])
)

# Compute the mean age per country
avg_age_per_country = (
    data_df.groupBy("Country")
    .agg(F.avg("Age").alias("average_age"))
    .orderBy("Country")
)
avg_age_per_country.show()


# Draw the boxplot; it needs the raw ages of each country, not the averages
def create_boxplot(df):
    rows = (
        df.groupBy("Country")
        .agg(F.collect_list("Age").alias("ages"))
        .orderBy("Country")
        .collect()
    )
    countries = [row["Country"] for row in rows]
    ages = [row["ages"] for row in rows]
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.boxplot(ages, vert=True, patch_artist=True)
    ax.set_xticklabels(countries, rotation=45, ha="right")
    ax.set_title("Age distribution of web users by country")
    ax.set_xlabel("Country")
    ax.set_ylabel("Age")
    fig.tight_layout()
    return fig, ax


fig, ax = create_boxplot(data_df)
plt.show()
```