Two ways to consume a Kafka JSON array with Spark Structured Streaming
The raw message body coming in from Kafka:
{"gamecode":"abcd","resultguid":"81_18148_184_-1699285363_4","startguid":"81_18148_184_1573391420_4","records":[{"cards":[40],"optype":0,"playtime":1573391438014,"type":1,"userid":53435,"waittime":17344},{"cards":[54],"optype":0,"playtime":1573391445155,"type":1,"userid":4354,"waittime":7141},{"optype":1,"playtime":1573391447514,"type":0,"userid":4546,"waittime":2359}]}
1. Configure the Kafka parameters
These settings are easy to ignore while the data volume is small, but once the volume grows, none of them can be treated carelessly.
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Configure the Kafka source
kafka_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)      # Kafka brokers
    .option("subscribe", topic)                                # Kafka topic
    .option("group.id", groupid)                               # consumer group label, handy for bookkeeping but not required
    .option("failOnDataLoss", "false")                         # whether the query fails when data is lost (topic deleted, or offsets out of range)
    .option("startingOffsets", starting_offsets)               # starting offsets (here: consume from the beginning)
    .option("includeTimestamp", True)                          # include the Kafka timestamp
    .option("maxOffsetsPerTrigger", max_offsets_per_trigger)   # max number of records per micro-batch
    .load()
)
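For reference, a minimal sketch of what the placeholder variables above might hold; the concrete values are illustrative assumptions, not taken from the original setup.

bootstrap_servers = "kafka01:9092,kafka02:9092,kafka03:9092"  # hypothetical broker list
topic = "game_records"                                        # hypothetical topic name
groupid = "game_records_etl"                                  # hypothetical group label
starting_offsets = "earliest"                                 # consume from the beginning
max_offsets_per_trigger = 100000                              # cap per micro-batch, tune for your load

The resulting kafka_df always carries the fixed Kafka-source columns key, value, topic, partition, offset, timestamp and timestampType; the JSON payload sits in the binary value column and still has to be parsed.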
2. Process the data
2.1 The schema approach for the JSON body
If the schema does not match the payload, this approach can silently lose data.
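To make that risk concrete, here is a small self-contained sketch (not the post's own code): when the declared schema does not line up with the payload, from_json fills the mismatched field with null instead of raising an error. The schema and field names below are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Deliberately wrong schema: the real payload field is "records", not "record"
wrong_schema = StructType([
    StructField("gamecode", StringType()),
    StructField("resultguid", StringType()),
    StructField("record", StringType()),
])

df = spark.createDataFrame(
    [('{"gamecode":"abcd","resultguid":"r1","records":[{"userid":53435}]}',)],
    ["value"],
)
parsed = df.select(from_json(col("value"), wrong_schema).alias("j")).select("j.*")
parsed.show(truncate=False)
# gamecode=abcd, resultguid=r1, record=null  <- the records array is silently dropped, no error is raised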
# Specify the schema
json_data = '''
{
"resultguid": "123