Project description:
Summary: read data from Kafka, clean it by writing the business logic in spark.sql() over a DataFrame, including resolving IP addresses and parsing phone numbers (implemented by calling Hive UDF functions).
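The business SQL itself is elided in the listing that follows. As a rough illustration only, calling permanent Hive UDFs from spark.sql() looks something like the sketch below; phone_to_region and ip_to_location are hypothetical UDF names standing in for the project's real functions:

# Illustrative sketch only: phone_to_region / ip_to_location are assumed Hive UDF names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

records = spark.createDataFrame(
    [("13800000000", "1.2.3.4")], ["caller_number", "source_ip"])
records.createOrReplaceTempView("records")

# Permanent Hive UDFs registered in the metastore can be called directly in Spark SQL
# once the session is created with enableHiveSupport().
cleaned = spark.sql("""
    SELECT caller_number,
           phone_to_region(caller_number) AS caller_region,   -- hypothetical Hive UDF
           source_ip,
           ip_to_location(source_ip)      AS source_ip_geo    -- hypothetical Hive UDF
    FROM records
""")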
Project code:
# -*- coding: utf-8 -*-
"""
@CreateTime :2020/12/14 18:33
@Author : Liangde
@Description :
Upstream   : Kafka topic topic_sip_full_r1p4
Cleansing  : Spark Structured Streaming (SSS)
Downstream : Kafka topic topic_sip_full_format_r1p4
@Modify:
"""
from pyspark import SparkConf
from pyspark.sql import SparkSession
from conf.setting import KAFKA_CONFIG
"""
Job constants.
Data-loss exceptions should be handled by the job itself, hence:
.option("failOnDataLoss", "false")
Whether to fail the query when it's possible that data is lost
(e.g., topics are deleted, or offsets are out of range).
This may be a false alarm. You can disable it when it
doesn't work as you expected. Batch queries will always fail
if it fails to read any data from the provided offsets due to lost data.
"""
TOPIC = KAFKA_CONFIG["TOPIC_F"]
FORMAT_TOPIC = KAFKA_CONFIG["FORMAT_TOPIC_F"]
MAX_OFFSETS_PER_TRIGGER = KAFKA_CONFIG["MAX_OFFSETS_PER_TRIGGER_F"]
CHECK_POINT_LOCATION = KAFKA_CONFIG["CHECK_POINT_LOCATION_F"]
PROCESSING_TIME = KAFKA_CONFIG["PROCESSING_TIME_F"]
BOOTSTRAP_SERVERS = KAFKA_CONFIG["BOOTSTRAP_SERVERS"]
STARTING_OFFSETS = KAFKA_CONFIG["STARTING_OFFSETS"]
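# NOTE (illustrative assumption, not the actual conf.setting module): KAFKA_CONFIG is
# expected to be a plain dict exposing the keys read above, roughly like:
# KAFKA_CONFIG = {
#     "TOPIC_F": "topic_sip_full_r1p4",
#     "FORMAT_TOPIC_F": "topic_sip_full_format_r1p4",
#     "MAX_OFFSETS_PER_TRIGGER_F": "100000",             # example value
#     "CHECK_POINT_LOCATION_F": "hdfs:///path/to/ckpt",  # example value
#     "PROCESSING_TIME_F": "10 seconds",                 # example value
#     "BOOTSTRAP_SERVERS": "host1:9092,host2:9092",      # example value
#     "STARTING_OFFSETS": "earliest",                    # example value
# }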
"""
Business logic:
1. Deduplication.
2. Parse sipInfo, phone-number region, IP geolocation, and related fields.
"""
if __name__ == '__main__':
"""
初始化SparkConf shuffle 分区设为 60
提交给 yarn 来处理
建立 spark 对象,并支持启用 hive 的 UDF 函数
"""
    conf = SparkConf() \
        .setAppName('structuredStreamingCleanFFile') \
        .set("spark.sql.shuffle.partitions", "60") \
        .set("spark.streaming.stopGracefullyOnShutdown", "true") \
        .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true") \
        .set("spark.executor.memoryOverhead", "1G")
    spark = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()
"""
创建 读取 Kafka Source 流
"""
kafkaSourceFDF = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("subscribe", TOPIC) \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "earliest") \
.option("maxOffsetsPerTrigger",MAX_OFFSETS_PER_TRIGGER) \
.load() \
.selectExpr("CAST(value AS STRING)", "timestamp") \
.withWatermark("timestamp", "3 seconds") \
.dropDuplicates() \
.createOrReplaceTempView("dfTable")
    # Parse the SIP protocol fields and resolve IP / phone information via Hive UDFs.
    cdf = spark.sql("""
    select
        cseq as key,
        cast(concat_ws(',', *) as string) as value
    from
    (
        ----- business-logic SQL (elided)
    )
    """)
"""
Kafka Sink
"""
formatSStreaming = cdf \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.trigger(processingTime=PROCESSING_TIME) \
.format("kafka") \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("topic", FORMAT_TOPIC) \
.option("checkpointLocation", CHECK_POINT_LOCATION) \
.start()
formatSStreaming.awaitTermination()
Pitfalls and lessons learned:
1. When writing Hive UDFs, do not define static global variables; doing so triggers a bug. See: https://issues.apache.org/jira/browse/HIVE-22643
2. Be aware of which operators produce state files and keep long-lived state; as data accumulates this can have serious consequences. For example, when using streaming deduplication I forgot to specify a watermark, so state grew without bound and every historical record had to be considered in the dedup check. A bounded-state sketch is shown after this list.
   See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
3. When writing to Kafka, follow the key/value column structure required by the official Kafka sink.
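For item 2, here is a minimal sketch of the bounded-state pattern. The rate source and column names are assumptions for illustration only; the point is that the watermark bounds how long deduplication state is kept, and the event-time column appears in the dropDuplicates subset:

# Minimal sketch of bounded-state streaming deduplication (illustrative source and columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The built-in "rate" source provides "timestamp" and "value" columns; any source works
# as long as it exposes an event-time column for the watermark.
stream = spark.readStream.format("rate").load()

# Without withWatermark(), dropDuplicates() would keep state for every record ever seen.
# With the watermark, dedup state older than the 3-second threshold can be evicted.
deduped = stream \
    .withWatermark("timestamp", "3 seconds") \
    .dropDuplicates(["value", "timestamp"])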