Flink: syncing Kafka to Paimon, with Doris for accelerated queries
- kafka to paimon
- paimon time travel
- querying paimon
- paimon snapshot compaction commands
- problems encountered and their solutions
- 1. With a Hive metastore configured, after a Paimon table is dropped in Flink SQL, the table metadata stored in Hive cannot be deleted
- 2. Two errors appear at the same time when debugging the Kafka CDC API
- 3. Failed to execute goal on project paimon-common: Could not resolve dependencies for project org.apache.paimon:paimon-common:jar:1.0-SNAPSHOT: Could not find artifact org.apache.paimon:paimon-test-utils:jar:1.0-SNAPSHOT in aliyunmaven (https://maven.aliyun.com/repository/public) -> [Help 1]
- 4. This exception is intentionally thrown after committing the restored checkpoints. By restarting the job we hope that writers can start writing based on these new commits.
- 5. Caused by: java.lang.ClassNotFoundException: org.codehaus.stax2.XMLInputFactory2
- 6. The Paimon command line does not support dropping columns; after running a drop-column statement in Flink SQL, the catalog metadata must be refreshed on the query side before the updated column list shows up
- 7. Caused by: java.lang.ClassNotFoundException: org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.ConsumerRecord
- flink + paimon PG CDC sync
kafka to paimon
yarn-session startup command
Starts a local session cluster that serves Flink SQL queries; using Doris for accelerated queries is recommended.
bin/yarn-session.sh -nm yarn-session2 -tm 6144m -qu flink -d
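Once the session is up, a Paimon catalog can be registered in the Flink SQL client and the synced tables queried directly. A minimal sketch, assuming the same MinIO warehouse and access key used by the sync jobs below (catalog and table names are illustrative, the secret key is a placeholder):
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://dev-bucket-bigdata-flink/paimon',
    's3.endpoint' = 'https://yos.test.com',
    's3.access-key' = 'bigdata-flink-user',
    's3.secret-key' = '******',
    's3.path.style.access' = 'true'
);
USE CATALOG paimon_catalog;
-- ad-hoc reads are usually run in batch mode
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM demo.event_list2 LIMIT 10;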
Regular expression for dynamically discovering Kafka topics during sync:
^debezium.plus.test.test_instance.(?!mc_background_setting$).+$
This excludes the topic debezium.plus.test.test_instance.mc_background_setting.
Notes for running the sync command:
Note: the dynamic topic/partition discovery parameter is properties.partition.discovery.interval.ms=30000
Reference: https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/connectors/datastream/kafka/
scan.topic-partition-discovery.interval is not a Kafka client property (it is the Flink SQL Kafka connector option) and does not apply here.
Paimon with MinIO as the underlying storage
Generic command line for syncing a whole database:
bin/flink run -m yarn-cluster \
-ynm paimon_kafka_sync_database \
-yqu flink \
-ytm 4096m \
-ys 1 \
-p 6 \
-D execution.runtime-mode=batch \
-D execution.buffer-timeout=10ms \
-D taskmanager.memory.managed.fraction=0.4 \
-D table.exec.resource.default-parallelism=6 \
lib/paimon-flink-action-1.0.1.jar \
kafka_sync_database \
--warehouse s3://dev-bucket-bigdata-flink/paimon \
--database demo \
--primary_keys id \
--kafka_conf connector=upsert-kafka \
--kafka_conf 'properties.bootstrap.servers=kafka-test01.com:32295,kafka-test02.com:32295,kafka-test03.com:32295' \
--kafka_conf 'properties.security.protocol=SASL_PLAINTEXT' \
--kafka_conf 'properties.sasl.mechanism=PLAIN' \
--kafka_conf 'properties.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="username" password="password";' \
--kafka_conf 'topic=debezium.plus.test_breeding.test_breeding.feed_message_report' \
--kafka_conf "properties.group.id=cid_yz.bigdata.paimon_sync_$(date +%s)" \
--kafka_conf 'properties.partition.discovery.interval.ms=30000' \
--kafka_conf 'properties.max.poll.records=1000' \
--kafka_conf 'scan.startup.mode=earliest-offset' \
--kafka_conf 'key.format=debezium-json' \
--kafka_conf 'value.format=debezium-json' \
--catalog_conf metastore=filesystem \
--catalog_conf 's3.endpoint=https://yos.test.com' \
--catalog_conf 's3.access-key=bigdata-flink-user' \
--catalog_conf 's3.secret-key=25dhHosSbUxcsQJINzKZr8D' \
--catalog_conf 's3.path.style.access=true' \
--catalog_conf 's3.connection.maximum=50' \
--catalog_conf 's3.threads.max=20' \
--table_conf bucket=6 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=6 \
--table_conf 'compaction.trigger=num_commits' \
--table_conf 'compaction.num_commits=10' \
--table_conf schema.automerge=false \
--table_conf auto-create-table=true \
--computed_column 'compute_time__=now() STORED' \
--table_conf 'snapshot.time-retained=1h' \
--table_conf 'snapshot.num-retained.min=1' \
--table_conf 'snapshot.num-retained.max=5' \
--table_conf tag.automatic-creation=process-time \
--table_conf tag.creation-period=hourly
Option notes:
- computed_column 'compute_time__=now() STORED': adds a computed processing-time column
- snapshot.time-retained=1h: how long snapshots are retained
- snapshot.num-retained.min=1: minimum number of snapshots to retain
- snapshot.num-retained.max=5: maximum number of snapshots to retain
- tag.automatic-creation=process-time: tags are created based on processing time, preserving full snapshot versions of the data
- tag.creation-period=hourly: tag creation period
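With tag.automatic-creation=process-time and tag.creation-period=hourly, Paimon creates an hourly tag that pins a full snapshot of each table. A minimal Flink SQL sketch for inspecting and reading those tags (table and tag names are illustrative; hourly tag names typically look like 'yyyy-MM-dd HH'):
-- list the tags created for a synced table
SELECT * FROM demo.`feed_message_report$tags`;
-- time travel: read the table as of a tag or a snapshot
SELECT * FROM demo.feed_message_report /*+ OPTIONS('scan.tag-name' = '2024-01-01 00') */;
SELECT * FROM demo.feed_message_report /*+ OPTIONS('scan.snapshot-id' = '3') */;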
Syncing a single topic to a single table
bin/flink run -m yarn-cluster -ynm paimon_kafka_sync_database -ytm 3172m -ys 1 -yqu flink -d \
-D execution.runtime-mode=batch \
lib/paimon-flink-action-0.8.2.jar \
kafka_sync_table \
--warehouse s3://dev-bucket-bigdata-flink/paimon \
--database demo \
--table event_list2 \
--partition_keys org_id \
--primary_keys org_id,id \
--kafka_conf connector=upsert-kafka \
--kafka_conf properties.bootstrap.servers=kafka-test01.com:32295,kafka-test02.com:32295,kafka-test03.com:32295 \
--kafka_conf properties.security.protocol=SASL_PLAINTEXT \
--kafka_conf properties.sasl.mechanism=PLAIN \
--kafka_conf properties.sasl.jaas.config='org.apache.kafka.common.security.plain.PlainLoginModule required username="username" password="password";' \
--kafka_conf topic=debezium.plus.test_citus.test_citus.event_list \
--kafka_conf scan.startup.mode=earliest-offset \
--kafka_conf key.format=debezium-json \
--kafka_conf value.format=debezium-json \
--catalog_conf metastore=filesystem \
--catalog_conf s3.endpoint=https://yos.test.com \
--catalog_conf s3.access-key='bigdata-flink-user' \
--catalog_conf s3.secret-key='25dhHosSbUtx9XQJINzKZr8D' \
--catalog_conf s3.path.style.access='true' \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
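For the accelerated queries mentioned at the top, the same warehouse can also be mounted in Doris as a Paimon catalog and queried directly. A minimal sketch, assuming a Doris version with Paimon catalog support; the property names follow the Doris lake-catalog convention and should be checked against your version (the secret key is a placeholder):
CREATE CATALOG paimon_minio PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "warehouse" = "s3://dev-bucket-bigdata-flink/paimon",
    "s3.endpoint" = "https://yos.test.com",
    "s3.access_key" = "bigdata-flink-user",
    "s3.secret_key" = "******"
);
SWITCH paimon_minio;
SELECT count(*) FROM demo.event_list2;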
Paimon with HDFS as the underlying storage
Command line that supports Kafka topics whose messages include the schema:
Partial sample of the message format:
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "struct",
        "fields": [
          {
            "type": "int64",
            "optional": false,
            "field": "id"
          },
          {
            "type": "string",
            "optional": true,
            "field": "create_user"
          }
        ],
        "optional": false,
        "name": "debezium.plus.test_metrics.test_metrics.ods_event_list.Envelope"
      }
    ]
  },
  "payload": {
    "before": null,
    "after": {
      "id": 781465173331251450,
      "create_user": "708978104768487425"
    }
  }
}