Apache Spark Exercise 7: Analyzing Cosmetics E-Commerce Data with Spark

I. Source Data

The data analyzed in this chapter comes from the public cosmetics e-commerce dataset on Kaggle (https://www.kaggle.com/datasets/nowingkim/ecommerce-data-cosmetics-shop). It contains the following fields:

| column name | description |
| --- | --- |
| time | Time when the event happened (in UTC). |
| event_name | 4 kinds of value: purchase, cart, view, removefromcart |
| product_id | ID of a product |
| category_id | Product's category ID |
| category_name | Product's category taxonomy (code name) if it was possible to make it. Usually present for meaningful categories and skipped for different kinds of accessories. |
| brand | Downcased string of brand name. |
| price | Float price of a product. |
| user_id | Permanent user ID. |
| session | Temporary user session ID. Same within one session; changed every time the user comes back to the online store after a long pause. |
| category_1 | Largest (top-level) class the product belongs to |
| category_2 | Middle class the product belongs to |
| category_3 | Smallest (most specific) class the product belongs to |

Before starting the exercises below, write all of the rows in the CSV file into the Kafka topic "E_Commerce".
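One way to load the topic, sketched here under the assumptions that the CSV sits at `data/cosmetics.csv` (a hypothetical path) and Kafka listens on localhost:9092, is a small batch Spark job that reads the file as plain text and publishes each line as a message value:

```scala
import org.apache.spark.sql.SparkSession

object LoadToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Load E_Commerce").getOrCreate()
    import spark.implicits._

    spark.read
      .textFile("data/cosmetics.csv")            // hypothetical path to the Kaggle CSV
      .filter(line => !line.startsWith("time"))  // drop the header row (assumed to start with "time")
      .toDF("value")                             // the Kafka sink expects a "value" column
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "E_Commerce")
      .save()

    spark.stop()
  }
}
```

This pushes the whole file at once; the `maxOffsetsPerTrigger` option used in the preprocessing step below then replays it in bounded micro-batches.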

II. Exercises

0. Data Preprocessing

First, we preprocess the raw data from Kafka into a Structured Streaming Dataset.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

val spark = SparkSession
  .builder()
  .appName("Online Shopping")
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
import spark.implicits._

// Load the source data from Kafka
val schema = new StructType()
  .add("time", "timestamp")
  .add("event_name", "string")
  .add("product_id", "string")
  .add("category_id", "string")
  .add("category_name", "string")
  .add("brand", "string")
  .add("price", "double")
  .add("user_id", "string")
  .add("session", "string")
  .add("category_1", "string")
  .add("category_2", "string")
  .add("category_3", "string")
val df = spark.readStream
  .format("kafka")
  // Kafka settings: bootstrap server localhost:9092, topic E_Commerce
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "E_Commerce")
  // To simulate a replay of the full stream, consume from the earliest offsets
  .option("startingOffsets", "earliest")
  // To simulate micro-batch behavior, cap each trigger at 100,000 offsets
  .option("maxOffsetsPerTrigger", 100000)
  .load()
  // Extract the value field of each Kafka message as a string
  .selectExpr("CAST(value AS STRING)")
  // Split the comma-separated value into typed columns
  .select(
    from_csv(
      $"value",
      schema,
      Map("sep" -> ",", "timestampFormat" -> "yyyy-MM-dd HH:mm:ssZ")
    ).as("data")
  )
  .select("data.*")
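Each exercise below only builds a result Dataset named `res`; to actually run one, it must be handed to a streaming sink. A minimal sketch, assuming the console sink is enough for inspection (queries that aggregate generally need "update" or "complete" output mode instead of "append"):

```scala
// Start the streaming query and print each micro-batch to the console
val query = res.writeStream
  .outputMode("append") // switch to "update"/"complete" for aggregating queries
  .format("console")
  .option("truncate", false)
  .start()
query.awaitTermination()
```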

1. Find the purchase time, product ID, and user ID of purchases of furniture products priced above 100

val res = df
  // The task asks for purchases, so filter on the purchase event as well
  .where("event_name = 'purchase' and price > 100 and category_1 = 'furniture'")
  .select("time", "product_id", "user_id")

2. Compute the cumulative sales amount of each brand

val res = df
  .filter($"brand" =!= "Not defined" and $"event_name" === "purchase")
  .groupBy($"brand")
  .sum("price")

3. Find the 10 brands with the highest real-time transaction amount (excluding "Not defined")

val res = df
  .withWatermark("time", "10 minutes")
  .filter($"event_name" === "purchase" and $"brand" =!= "Not defined")
  .groupBy($"brand")
  .sum("price")
  // Sorting and limiting a streaming aggregation is only allowed in complete output mode
  .orderBy($"sum(price)".desc)
  .limit(10)
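Because this query sorts and limits a streaming aggregation, Structured Streaming accepts it only in complete output mode, which re-emits the full top-10 table on every micro-batch. A minimal sketch of starting it, again assuming a console sink:

```scala
// Complete mode: the whole result table (top 10 brands) is rewritten each batch
val query = res.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", false)
  .start()
```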

4. Count the number of user views and user purchases per hour

val res = df
  .withWatermark("time", "10 minutes")
  .filter($"event_name" === "view" or $"event_name" === "purchase")
  .groupBy(window($"time", "1 hour", "1 hour", "0 seconds"))
  .agg(
    sum(when($"event_name" === "view", 1).otherwise(0)).as("total_view"),
    sum(when($"event_name" === "purchase", 1).otherwise(0)).as("total_purchase")
  )
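Since this hourly aggregation carries a watermark, it can run in update output mode, where each micro-batch emits only the windows whose counts changed. A minimal sketch, console sink assumed:

```scala
// Update mode: only windows touched by the current micro-batch are printed
val query = res.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", false)
  .start()
```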

5. Compute the transaction amount per day

val res = df
  .withWatermark("time", "10 minutes")
  .filter($"event_name" === "purchase")
  .groupBy(window($"time", "1 day"))
  .sum("price")

6. Find the product IDs of the 10 most-viewed products per day

val res = df
  .withWatermark("time", "10 minutes")
  .filter($"event_name" === "view")
  .groupBy(window($"time", "1 day"), $"product_id")
  .count()
  // The sort requires complete output mode; note that limit(10) keeps the top 10
  // rows globally (here, those of the most recent day), not 10 rows per day
  .orderBy($"window".desc, $"count".desc)
  .limit(10)

7. Find users who purchased a product within 24 hours of adding it to the cart

val res = df
  .withWatermark("time", "10 minutes")
  .filter($"event_name" === "cart")
  .as("cart")
  .join(
    df.withWatermark("time", "10 minutes")
      .filter($"event_name" === "purchase")
      .as("purchase"),
    expr("""
           |cart.user_id = purchase.user_id AND
           |cart.product_id = purchase.product_id AND
           |purchase.time >= cart.time AND
           |purchase.time <= cart.time + interval 24 hours
           |""".stripMargin),
    joinType = "inner"
  )
  .select("cart.user_id")

8. Find users who viewed a product but did not purchase it within 24 hours

val res = df
  .withWatermark("time", "10 minutes")
  .filter($"event_name" === "view")
  .as("view")
  .join(
    df.withWatermark("time", "10 minutes")
      .filter($"event_name" === "purchase")
      .as("purchase"),
    expr("""
            |view.user_id = purchase.user_id AND
            |view.product_id = purchase.product_id AND
            |purchase.time >= view.time AND
            |purchase.time <= view.time + interval 24 hours
            |""".stripMargin),
    joinType = "leftOuter"
  )
  .where("purchase.user_id IS NULL")
  .select("view.user_id")
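Stream-stream outer joins are append-only: an unmatched view row is emitted only once the watermark passes the end of its 24-hour join window, so the "did not purchase" results arrive with that delay. A minimal sketch of starting the query, console sink assumed:

```scala
// Append mode is the only output mode supported for stream-stream outer joins
val query = res.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .start()
```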
