# **Hive综合应用案例:用户搜索日志分析**
## **1. 案例背景**
假设我们有一个电商平台的用户搜索日志数据,记录了用户的搜索行为,包括:
- **用户ID**、**搜索关键词**、**搜索时间**、**设备类型**、**搜索结果点击情况**等。
**目标**:使用Hive进行数据分析,挖掘用户搜索行为特征,优化搜索推荐系统。
---
## **2. 数据准备**
### **(1) 数据样例(CSV格式)**
```
user_id,search_keyword,search_time,device_type,clicked
1001,"智能手机",2023-10-01 09:15:23,"mobile",1
1002,"笔记本电脑",2023-10-01 10:30:45,"desktop",0
1001,"无线耳机",2023-10-01 11:20:12,"mobile",1
1003,"运动鞋",2023-10-01 12:45:33,"mobile",0
1002,"机械键盘",2023-10-01 14:10:56,"desktop",1
...
```
### **(2) 创建Hive表**
```sql
-- 创建外部表(数据存储在HDFS)
CREATE EXTERNAL TABLE user_search_logs (
user_id INT,
search_keyword STRING,
search_time TIMESTAMP,
device_type STRING,
clicked INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/search_logs';
-- 加载数据(如果数据在本地)
LOAD DATA LOCAL INPATH '/path/to/search_logs.csv' INTO TABLE user_search_logs;
```
---
## **3. 数据分析实战**
### **(1) 基础统计分析**
```sql
-- 1. 总搜索量
SELECT COUNT(*) AS total_searches FROM user_search_logs;
-- 2. 每日搜索量趋势
SELECT
DATE(search_time) AS search_date,
COUNT(*) AS daily_searches
FROM user_search_logs
GROUP BY DATE(search_time)
ORDER BY search_date;
-- 3. 设备类型分布
SELECT
device_type,
COUNT(*) AS searches,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM user_search_logs), 2) AS percentage
FROM user_search_logs
GROUP BY device_type;
```
### **(2) 用户行为分析**
```sql
-- 1. 最活跃用户(搜索次数Top 10)
SELECT
user_id,
COUNT(*) AS search_count
FROM user_search_logs
GROUP BY user_id
ORDER BY search_count DESC
LIMIT 10;
-- 2. 用户搜索关键词偏好
SELECT
user_id,
search_keyword,
COUNT(*) AS keyword_count
FROM user_search_logs
GROUP BY user_id, search_keyword
ORDER BY user_id, keyword_count DESC;
-- 3. 点击率(CTR)分析
SELECT
search_keyword,
COUNT(*) AS total_searches,
SUM(clicked) AS clicks,
ROUND(SUM(clicked) * 100.0 / COUNT(*), 2) AS ctr
FROM user_search_logs
GROUP BY search_keyword
ORDER BY ctr DESC
LIMIT 20;
```
### **(3) 热门搜索词 & 长尾分析**
```sql
-- 1. 热门搜索词(Top 20)
SELECT
search_keyword,
COUNT(*) AS search_count
FROM user_search_logs
GROUP BY search_keyword
ORDER BY search_count DESC
LIMIT 20;
-- 2. 长尾关键词(搜索量低但有点击)
SELECT
search_keyword,
COUNT(*) AS search_count,
SUM(clicked) AS clicks
FROM user_search_logs
GROUP BY search_keyword
HAVING COUNT(*) < 10 AND SUM(clicked) > 0
ORDER BY clicks DESC;
```
### **(4) 时间维度分析**
```sql
-- 1. 每小时搜索量分布
SELECT
HOUR(search_time) AS hour_of_day,
COUNT(*) AS searches
FROM user_search_logs
GROUP BY HOUR(search_time)
ORDER BY hour_of_day;
-- 2. 用户搜索时间段偏好(早晨/下午/晚上)
SELECT
CASE
WHEN HOUR(search_time) BETWEEN 6 AND 11 THEN 'Morning'
WHEN HOUR(search_time) BETWEEN 12 AND 17 THEN 'Afternoon'
WHEN HOUR(search_time) BETWEEN 18 AND 23 THEN 'Evening'
ELSE 'Night'
END AS time_period,
COUNT(*) AS searches
FROM user_search_logs
GROUP BY
CASE
WHEN HOUR(search_time) BETWEEN 6 AND 11 THEN 'Morning'
WHEN HOUR(search_time) BETWEEN 12 AND 17 THEN 'Afternoon'
WHEN HOUR(search_time) BETWEEN 18 AND 23 THEN 'Evening'
ELSE 'Night'
END;
```
---
## **4. 高级分析(UDF/窗口函数)**
### **(1) 用户搜索行为序列分析**
```sql
-- 使用LAG计算用户两次搜索的时间间隔
SELECT
user_id,
search_keyword,
search_time,
LAG(search_time) OVER (PARTITION BY user_id ORDER BY search_time) AS prev_search_time,
UNIX_TIMESTAMP(search_time) - UNIX_TIMESTAMP(LAG(search_time) OVER (PARTITION BY user_id ORDER BY search_time)) AS time_diff_seconds
FROM user_search_logs;
```
### **(2) 关键词共现分析(用户连续搜索)**
```sql
-- 找出用户连续搜索的相关词
WITH user_search_sequence AS (
SELECT
user_id,
search_keyword,
LEAD(search_keyword) OVER (PARTITION BY user_id ORDER BY search_time) AS next_keyword
FROM user_search_logs
)
SELECT
search_keyword,
next_keyword,
COUNT(*) AS co_occurrence_count
FROM user_search_sequence
WHERE next_keyword IS NOT NULL
GROUP BY search_keyword, next_keyword
ORDER BY co_occurrence_count DESC
LIMIT 20;
```
---
## **5. 可视化 & 业务应用**
### **(1) 业务洞察**
- **热门搜索词** → 优化首页推荐、广告投放。
- **低CTR关键词** → 改进搜索算法或商品匹配。
- **用户搜索时间分布** → 调整服务器资源或促销活动时间。
- **设备差异** → 优化移动端/PC端搜索体验。
### **(2) 导出数据到BI工具**
```sql
-- 导出热门关键词数据到HDFS
INSERT OVERWRITE DIRECTORY '/output/popular_keywords'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT search_keyword, COUNT(*) AS search_count
FROM user_search_logs
GROUP BY search_keyword
ORDER BY search_count DESC
LIMIT 100;
```
---
## **6. 总结**
本案例使用Hive完成了:
1. **数据导入**(建表、加载数据)
2. **基础分析**(搜索量、设备分布、CTR)
3. **高级分析**(时间序列、关键词共现)
4. **业务应用**(优化搜索推荐、广告策略)
**扩展方向**:
- 结合**用户画像**数据,进行个性化推荐。
- 使用**机器学习**(如PySpark)预测热门搜索趋势。
- 实时分析(如Flink)监控搜索行为变化。
这个案例展示了Hive在**用户行为分析**中的强大能力,适用于电商、搜索引擎、社交平台等场景。 🚀