1. Code approach
1) Consume the data from Kafka (the real-time data has already been collected into the Kafka message queue);
For details, see: https://blog.csdn.net/weixin_42796403/article/details/113574197?spm=1001.2014.3001.5501
2) Use Redis to filter out the devices that have already been counted in today's DAU;
3) Save the DAU records newly added in each batch to HBase;
4) Query the data back out of HBase and publish it as a data API for the visualization project to call.
2. Configuration
1) config.properties
# Kafka configuration
kafka.broker.list=hadoop102:9092,hadoop103:9092,hadoop104:9092
# Redis configuration
redis.host=hadoop102
redis.port=6379
2) pom.xml
<dependencies>
    <dependency>
        <groupId>com.atguigu</groupId>
        <artifactId>gmall-common</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.4.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    </dependency>
    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.phoenix</groupId>
        <artifactId>phoenix-spark</artifactId>
        <version>5.0.0-HBase-2.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.glassfish</groupId>
                <artifactId>javax.el</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- This plugin compiles the Scala sources into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bind the goals to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
3. Utility classes
1) PropertiesUtil: loads the configuration file
import java.io.InputStreamReader
import java.util.Properties

object PropertiesUtil {
  // load a .properties file from the classpath as UTF-8
  def load(propertieName: String): Properties = {
    val prop = new Properties()
    prop.load(new InputStreamReader(
      Thread.currentThread().getContextClassLoader.getResourceAsStream(propertieName), "UTF-8"))
    prop
  }
}
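A minimal usage sketch (assuming config.properties sits in src/main/resources so it is on the classpath):

val properties: java.util.Properties = PropertiesUtil.load("config.properties")
// read the Kafka broker list defined in config.properties above
val brokerList: String = properties.getProperty("kafka.broker.list")
println(brokerList) // hadoop102:9092,hadoop103:9092,hadoop104:9092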
2) MyKafkaUtil: obtains the Kafka connection (creates the input DStream)
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object MyKafkaUtil {
  //1. load the configuration
  private val properties: Properties = PropertiesUtil.load("config.properties")
  //2. broker addresses used to connect to the cluster
  val broker_list: String = properties.getProperty("kafka.broker.list")
  //3. Kafka consumer configuration
  val kafkaParam = Map(
    "bootstrap.servers" -> broker_list,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    // consumer group
    "group.id" -> "bigdata2020",
    // used when there is no initial offset, or the current offset no longer exists on any server;
    // "latest" resets the offset to the newest records
    "auto.offset.reset" -> "latest",
    // true: offsets are committed automatically in the background, but data can be lost if Kafka goes down
    // false: offsets have to be maintained manually
    "enable.auto.commit" -> (true: java.lang.Boolean)
  )

  // Create a DStream that returns the received input data.
  // LocationStrategies: create consumers for the given topics and cluster addresses
  // LocationStrategies.PreferConsistent: distribute partitions evenly across all executors
  // ConsumerStrategies: choose how Kafka consumers are created and configured on driver and executors
  // ConsumerStrategies.Subscribe: subscribe to a collection of topics
  def getKafkaStream(topic: String, ssc: StreamingContext): InputDStream[ConsumerRecord[String, String]] = {
    val dStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Array(topic), kafkaParam))
    dStream
  }
}
3) RedisUtil: obtains a Redis connection
import java.util.Properties
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtil {
  var jedisPool: JedisPool = _

  def getJedisClient: Jedis = {
    if (jedisPool == null) {
      println("creating the connection pool")
      val config: Properties = PropertiesUtil.load("config.properties")
      val host: String = config.getProperty("redis.host")
      val port: String = config.getProperty("redis.port")
      val jedisPoolConfig = new JedisPoolConfig()
      jedisPoolConfig.setMaxTotal(100)            // maximum number of connections
      jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
      jedisPoolConfig.setMinIdle(20)              // minimum idle connections
      jedisPoolConfig.setBlockWhenExhausted(true) // block when the pool is exhausted
      jedisPoolConfig.setMaxWaitMillis(500)       // maximum wait time when exhausted, in milliseconds
      jedisPoolConfig.setTestOnBorrow(true)       // validate every connection on borrow
      jedisPool = new JedisPool(jedisPoolConfig, host, port.toInt)
    }
    println(s"jedisPool.getNumActive = ${jedisPool.getNumActive}")
    println("borrowed a connection")
    jedisPool.getResource
  }
}
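A minimal usage sketch of the pool; the key and mid values are placeholders, but the sadd/sismember pattern is exactly what the deduplication code below relies on:

// borrow a connection from the pool
val jedis: Jedis = RedisUtil.getJedisClient
try {
  // record a (placeholder) device id under a daily DAU key
  jedis.sadd("DAU:2021-02-03", "mid_001")
  // membership test: true means this mid has already been counted for that day
  println(jedis.sismember("DAU:2021-02-03", "mid_001"))
} finally {
  // close() on a pooled Jedis instance returns it to the pool rather than closing the socket
  jedis.close()
}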
4) Case class
import java.text.SimpleDateFormat
import java.util.Date

case class StartUpLog(mid: String,
                      uid: String,
                      appid: String,
                      area: String,
                      os: String,
                      ch: String,
                      `type`: String,
                      vs: String,
                      var logDate: String = null,
                      var logHour: String = null,
                      var ts: Long) {
  val date: Date = new Date(ts)
  logDate = new SimpleDateFormat("yyyy-MM-dd").format(date)
  logHour = new SimpleDateFormat("HH").format(date)
}
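The main class in the next section also imports GmallConstants from the gmall-common module, which is not shown in this post. A minimal sketch of what that constants object needs to provide; the topic name here is only an assumed placeholder and must match the topic the collection project actually writes to:

package com.gmall.common.constansts

object GmallConstants {
  // Kafka topic that the startup logs are collected into by the upstream collection project.
  // "GMALL_STARTUP" is an assumed placeholder value, not necessarily the real topic name.
  val KAFKA_TOPIC_STARTUP: String = "GMALL_STARTUP"
}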
4. Real-time data processing class (main business class)
package com.gmall.app

import com.alibaba.fastjson.JSON
import com.gmall.beans.StartUpLog
import com.gmall.common.constansts.GmallConstants
import com.gmall.util.{MyKafkaUtil, RedisUtil}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.phoenix.spark._
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

/**
 * DAU (Daily Active User): daily active users, counted per device (mid)!
 * UV (Unique Visitor): unique visitors, counted per IP.
 *
 * The K-V layout stored in Redis needs to be designed:
 * Why Redis: 1. it keeps the streaming job's state; 2. Redis handles on the order of a million
 * ops per second; 3. as an in-memory cache it is fast.
 * Redis is used only for deduplication, so it stores just the dedup key field: mid!
 * HBase stores the complete startup-log records!
 *
 * K: prefix (DAU) + the current date
 * V: a Set (of mids)
 */
object DAUApp {
  def main(args: Array[String]): Unit = {
    val streamingContext: StreamingContext = new StreamingContext("local[*]", "DAUApp", Seconds(5))
    //1. get the DStream from Kafka
    val ds: InputDStream[ConsumerRecord[String, String]] =
      MyKafkaUtil.getKafkaStream(GmallConstants.KAFKA_TOPIC_STARTUP, streamingContext)
    //2. convert DS[ConsumerRecord] to DS[StartUpLog]
    val ds2: DStream[StartUpLog] = ds.map(record => {
      // the JSON string read from Kafka
      val jsonStr: String = record.value()
      // parse the JSON string into a StartUpLog object
      val log: StartUpLog = JSON.parseObject(jsonStr, classOf[StartUpLog])
      log
    })
    //3. deduplicate the current batch against the users already recorded today
    //   (cross-batch dedup: read today's mids from Redis and filter them out)
    val ds3: DStream[StartUpLog] = removeDuplicateMid2(ds2)
    // count after cross-batch deduplication
    ds3.count().print()
    //4. further deduplicate within the batch itself
    val ds4: DStream[StartUpLog] = removeDuplicateMidWithCommonBatch(ds3)
    // count of records written to Redis after in-batch deduplication
    ds4.count().print()
    //5. write the deduplicated mids to Redis
    writeMidToRedis(ds4)
    //6. write the full startup-log records to HBase (via Phoenix)
    ds4.foreachRDD(rdd => rdd.saveToPhoenix(
      "gmall2020_dau".toUpperCase,
      Seq("MID", "UID", "APPID", "AREA", "OS", "CH", "TYPE", "VS", "LOGDATE", "LOGHOUR", "TS"),
      HBaseConfiguration.create,
      Some("hadoop102,hadoop103,hadoop104:2181")
    ))
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Custom helper methods (in practice they live inside the DAUApp object):
A Redis set is used to hold Spark Streaming's state (the mids already counted today), which is what makes the cross-batch deduplication work!
//read the already-active mids from Redis and filter them out of this batch
def removeDuplicateMid(ds: DStream[StartUpLog]): DStream[StartUpLog] = {
  // filter keeps the records for which the function returns true
  val result: DStream[StartUpLog] = ds.filter(log => {
    // get a Redis connection (note: this opens one connection per record)
    val jedisClient: Jedis = RedisUtil.getJedisClient
    // if this log's mid is already in today's DAU set in Redis, return false to drop it
    val exists: Boolean = jedisClient.sismember("DAU:" + log.logDate, log.mid)
    // return the connection to the pool
    jedisClient.close()
    !exists
  })
  result
}
/*
 In Spark a DStream is, in essence, one batch of data discretized into one RDD, processed
 partition by partition, so each partition only needs to open one connection.
 mapPartitions:    has a return value
 foreachPartition: no return value
 foreachRDD:       used to write an RDD out; no return value required!
 transform:        DS[T] --> transformFun: RDD[T] --> RDD[U] --> DS[U]
*/
def removeDuplicateMid2(ds: DStream[StartUpLog]): DStream[StartUpLog] = {
  val result: DStream[StartUpLog] = ds.transform(rdd => {
    rdd.mapPartitions(iter => {
      // iter holds all records of one partition; one Jedis connection per partition
      val jedisClient: Jedis = RedisUtil.getJedisClient
      // materialize the filtered result before returning the connection to the pool,
      // because Iterator.filter is lazy and would otherwise run after close()
      val logs: List[StartUpLog] = iter.filter(log => {
        val exists: Boolean = jedisClient.sismember("DAU:" + log.logDate, log.mid)
        !exists
      }).toList
      jedisClient.close()
      // return the filtered startup logs of this partition
      logs.iterator
    })
  })
  result
}
/**
 * Using a broadcast variable (only suitable when the Redis set is small: well under 1 GB,
 * ideally below 200 MB)
 * 1) each batch opens a single connection (on the driver);
 * 2) that connection pulls the whole Redis set down to the client;
 * 3) membership is then checked locally on the executors against the broadcast copy.
 *
 * Spark is distributed: the tasks of one batch may run on several executors,
 * so the set downloaded from Redis has to be shipped to all of them.
 *
 * Broadcast variable: share one value efficiently across the whole Spark application.
 * Key points: 1) the latest historical mids in Redis must be re-broadcast before every batch is
 *    deduplicated; plain driver code would broadcast only once (the initial mids), so the
 *    broadcast has to happen inside the DStream's transform;
 * 2) do not broadcast inside an RDD operator: broadcasting needs the SparkContext, which would
 *    then have to be serialized into the closure; broadcast outside the operator and let the
 *    broadcast handle be captured by the operator's closure instead.
 *
 * @param ds
 * @return
 */
def removeDuplicateMid3(ds: DStream[StartUpLog]): DStream[StartUpLog] = {
  // requires: import java.time.LocalDate
  // For each batch, read today's mid set from Redis and broadcast it to the cluster.
  // This runs inside transform, i.e. on the driver once per batch, so it is safe to broadcast here.
  val result: DStream[StartUpLog] = ds.transform(rdd => {
    // read the set outside the RDD operator
    val jedisClient: Jedis = RedisUtil.getJedisClient
    val mids: java.util.Set[String] = jedisClient.smembers("DAU:" + LocalDate.now().toString)
    jedisClient.close()
    // broadcast (rdd.sparkContext is available on the driver inside transform)
    val bc = rdd.sparkContext.broadcast(mids)
    // distributed filtering against the broadcast set
    rdd.filter(log => !bc.value.contains(log.mid))
  })
  result
}
After cross-batch deduplication the batch contains all users that are new for the day, but a single batch may still hold several startup records for the same user, so the batch itself must be deduplicated as well; this guarantees that each user is written to HBase only once!
/**
 * Deduplication within a single batch:
 * one batch of data is wrapped in one RDD, so in-batch deduplication is simply deduplicating
 * the RDD's elements with RDD operators, keyed by mid!
 *
 * RDD.distinct()
 * RDD.groupBy
 *
 * If the batch contains several records for the same mid, keep only the earliest startup record!
 */
def removeDuplicateMidWithCommonBatch(ds: DStream[StartUpLog]): DStream[StartUpLog] = {
  // key the data by (mid, logDate): mid for dedup, plus the date in case a batch spans midnight;
  // the value keeps the whole log, because the full record is later written to HBase
  val ds2: DStream[((String, String), StartUpLog)] = ds.transform(rdd => {
    rdd.map(log => ((log.mid, log.logDate), log))
  })
  // group by (mid, date); sort each group's logs by timestamp and keep only the earliest one
  val ds3: DStream[((String, String), Iterable[StartUpLog])] = ds2.groupByKey()
  val result: DStream[StartUpLog] = ds3.flatMap {
    case ((mid, date), iter) => iter.toList.sortBy(_.ts).take(1)
  }
  result
}
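Since only the earliest record per (mid, logDate) survives, the same rule can also be expressed with reduceByKey, which avoids materializing each full group; a minimal alternative sketch (removeDuplicateMidWithCommonBatch2 is just an illustrative name):

def removeDuplicateMidWithCommonBatch2(ds: DStream[StartUpLog]): DStream[StartUpLog] = {
  ds.transform(rdd =>
    rdd
      .map(log => ((log.mid, log.logDate), log))
      // for each (mid, logDate) keep the record with the smallest ts, i.e. the earliest startup
      .reduceByKey((l1, l2) => if (l1.ts <= l2.ts) l1 else l2)
      .map(_._2)
  )
}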
Method that writes the data to Redis:
/**
 * Write the deduplicated mids to Redis.
 */
def writeMidToRedis(ds: DStream[StartUpLog]): Unit = {
  ds.foreachRDD(rdd => {
    rdd.foreachPartition(iter => {
      // one connection per partition
      val jedisClient: Jedis = RedisUtil.getJedisClient
      iter.foreach(log => {
        // add this record's mid to today's DAU set
        jedisClient.sadd("DAU:" + log.logDate, log.mid)
      })
      // return the connection to the pool
      jedisClient.close()
    })
  })
}
5. Saving the data into HBase via Phoenix
5.1 Create the table with Phoenix
create table gmall2020_dau
(
  mid varchar,
  uid varchar,
  appid varchar,
  area varchar,
  os varchar,
  ch varchar,
  type varchar,
  vs varchar,
  logDate varchar,
  logHour varchar,
  ts bigint,
  CONSTRAINT dau_pk PRIMARY KEY (mid, logDate));
5.2 Add the dependencies to pom.xml (already included in the full pom.xml above)
<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-spark</artifactId>
    <version>5.0.0-HBase-2.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.glassfish</groupId>
            <artifactId>javax.el</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
</dependency>
5.3 Code that saves the data
// write the data to HBase through Phoenix
// (requires import org.apache.phoenix.spark._ and import org.apache.hadoop.conf.Configuration)
distinctDstream.foreachRDD { rdd =>
  rdd.saveToPhoenix(
    "GMALL2020_DAU",
    Seq("MID", "UID", "APPID", "AREA", "OS", "CH", "TYPE", "VS", "LOGDATE", "LOGHOUR", "TS"),
    new Configuration,
    Some("hadoop102,hadoop103,hadoop104:2181"))
}
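For step 4 of the approach (querying the data back out of HBase to publish as a data API), the table written above can be read through Phoenix's JDBC driver. A minimal sketch, assuming the same ZooKeeper quorum; the per-hour aggregation query and the date literal are only examples:

import java.sql.DriverManager

// Phoenix thick-client JDBC URL: jdbc:phoenix:<zookeeper quorum>:<port>
val conn = DriverManager.getConnection("jdbc:phoenix:hadoop102,hadoop103,hadoop104:2181")
val stmt = conn.createStatement()
// example: DAU per hour for one (placeholder) day
val rs = stmt.executeQuery(
  "select LOGHOUR, count(*) as CT from GMALL2020_DAU where LOGDATE = '2021-02-03' group by LOGHOUR")
while (rs.next()) {
  println(rs.getString("LOGHOUR") + " -> " + rs.getLong("CT"))
}
rs.close(); stmt.close(); conn.close()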