Spark Streaming x Kafka
Real-time statistics jobs need Spark Streaming + Kafka. There is not much to say about the Spark version; Kafka currently comes mainly in the 0.8.x.x and 0.10.x.x lines, and consuming through the corresponding APIs turns out to differ between the two, so this post records the differences. For creating the Kafka stream we use the common Direct Approach (no receiver), which simplifies parallelism and improves stability when the stream pulls data.
0.8.x.x Maven dependency and consumption
When creating the StreamingContext you can also skip the explicit SparkContext and pass the SparkConf straight to the StreamingContext constructor; here we keep sc around so it can also be used to read other files.
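A minimal sketch of that variant, assuming the same sparkConf and SPARK_STREAMING_INTERVAL values as in the listing below:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// pass the conf straight to StreamingContext; it creates the underlying
// SparkContext itself, still reachable as ssc.sparkContext if needed
val ssc = new StreamingContext(sparkConf, Seconds(SPARK_STREAMING_INTERVAL.toInt))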
Maven
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.8.x.x</version>
</dependency>
<!-- Note: KafkaUtils.createDirectStream below also requires the Spark
     integration artifact for the 0.8 line (spark-streaming-kafka /
     spark-streaming-kafka-0-8, matching your Spark and Scala versions). -->
Consuming the topic
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> KAFKA_BROKERS,
  "group.id" -> KAFKA_GROUP_ID,
  // start from the latest offset when no initial offset is known
  "auto.offset.reset" -> kafka.api.OffsetRequest.LargestTimeString
)
val sparkConf = if (local) {
  // local master (e.g. "local[*]") for development runs
  new SparkConf()
    .setMaster(SPARK_LOCAL_HOST)
    .setAppName(appName)
} else {
  new SparkConf().setAppName(appName)
}
val sc = new SparkContext(sparkConf)
// one micro-batch every SPARK_STREAMING_INTERVAL seconds
val ssc = new StreamingContext(sc, Seconds(SPARK_STREAMING_INTERVAL.toInt))
// topicsSet is assumed here, e.g. built from a comma-separated topic list:
val topicsSet = KAFKA_TOPIC.split(",").toSet
// type parameters: key type, value type, key decoder, value decoder
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
messages.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(line => {
      // the 0.8 direct stream yields (key, value) tuples, so pass the value on
      Execute(line._2)
    })
  })
})
ssc.start()
ssc.awaitTermination()
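One thing the direct approach gives up is receiver-managed offsets in ZooKeeper, so if you want to track consumption progress yourself you can read the offset ranges off each batch's RDD. A minimal sketch against the 0.8 integration (the println is illustrative only):

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // RDDs produced by the direct stream carry their Kafka offset ranges
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} [${o.partition}] ${o.fromOffset} -> ${o.untilOffset}")
  }
}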
0.10.x.x Maven dependency and consumption
The main differences from the 0.8.x.x consumer are the Kafka configuration and the changed DStream-creation API; the main processing logic again just goes in the Execute function.
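Execute itself is not shown in the original; as a hypothetical placeholder it is just a function taking one message value, e.g.:

// hypothetical placeholder: swap the body for your real per-message logic
def Execute(message: String): Unit = {
  if (message != null && message.nonEmpty) {
    println(message) // e.g. parse and aggregate here
  }
}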
Maven
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.x.x</version>
</dependency>
<!-- Note: the LocationStrategies/ConsumerStrategies API below comes from the
     Spark integration artifact spark-streaming-kafka-0-10 (match your Spark
     and Scala versions). -->
Consuming the topic
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParameters = Map[String, Object](
  "bootstrap.servers" -> KAFKA_BROKERS,
  "group.id" -> KAFKA_GROUP_ID,
  // let the consumer commit offsets back to Kafka automatically
  "enable.auto.commit" -> (true: java.lang.Boolean),
  // start from the latest offset when no committed offset exists
  "auto.offset.reset" -> "latest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  // SASL/PLAIN over plaintext; the client also needs a matching JAAS login,
  // e.g. via -Djava.security.auth.login.config=<jaas.conf>
  "security.protocol" -> "SASL_PLAINTEXT",
  // wait until at least this many bytes are available per fetch
  "fetch.min.bytes" -> "4096",
  "sasl.mechanism" -> "PLAIN"
)
val sparkConf = if (local) {
  new SparkConf()
    .setMaster(SPARK_LOCAL_HOST)
    .setAppName(appName)
} else {
  new SparkConf().setAppName(appName)
}
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(SPARK_STREAMING_INTERVAL.toInt))
// PreferConsistent distributes partitions evenly across the executors;
// Subscribe takes the topic list plus the consumer config
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Array(KAFKA_TOPIC), kafkaParameters))
kafkaStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(line => {
      // records are ConsumerRecord[String, String]; hand the value to Execute
      Execute(line.value())
    })
  })
})
ssc.start()
ssc.awaitTermination()
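If you would rather commit offsets only after a batch has actually been processed, set "enable.auto.commit" to false above; the 0.10 integration exposes manual commits through CanCommitOffsets. A minimal sketch:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  // grab the batch's offset ranges before processing
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // then commit the consumed ranges back to Kafka asynchronously
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}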