Flink: syncing Kafka to Paimon, with Doris for accelerated queries
- kafka to paimon
- paimon time travel
- querying paimon
- paimon snapshot compaction commands
- problems encountered and their solutions
- 1. With a Hive metastore configured, after a Paimon table is dropped in Flink SQL, the table metadata stored in Hive cannot be deleted
- 2. Two errors appear at the same time when debugging the Kafka CDC API
- 3. Failed to execute goal on project paimon-common: Could not resolve dependencies for project org.apache.paimon:paimon-common:jar:1.0-SNAPSHOT: Could not find artifact org.apache.paimon:paimon-test-utils:jar:1.0-SNAPSHOT in aliyunmaven (https://maven.aliyun.com/repository/public) -> [Help 1]
- 4. This exception is intentionally thrown after committing the restored checkpoints. By restarting the job we hope that writers can start writing based on these new commits.
- 5. Caused by: java.lang.ClassNotFoundException: org.codehaus.stax2.XMLInputFactory2
- 6. The Paimon command line does not support dropping columns; after running a drop-column statement in Flink SQL, the catalog metadata must be refreshed on the query side before the updated column list shows up
- 7. Caused by: java.lang.ClassNotFoundException: org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.ConsumerRecord
- flink + paimon PG CDC sync
kafka to paimon
yarn-session startup command
Starts a local session cluster that serves Flink SQL queries; using Doris for accelerated queries is recommended.
bin/yarn-session.sh -nm yarn-session2 -tm 6144m -qu flink -d
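Once the session is up, a Paimon catalog can be registered in the Flink SQL client and the synced tables queried directly. A minimal sketch, assuming the same MinIO warehouse and access key used by the sync jobs below (catalog and table names are illustrative, the secret key is a placeholder):
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://dev-bucket-bigdata-flink/paimon',
    's3.endpoint' = 'https://yos.test.com',
    's3.access-key' = 'bigdata-flink-user',
    's3.secret-key' = '******',
    's3.path.style.access' = 'true'
);
USE CATALOG paimon_catalog;
-- ad-hoc reads are usually run in batch mode
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM demo.event_list2 LIMIT 10;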
Regular expression for dynamically discovering Kafka topics during sync:
^debezium.plus.test.test_instance.(?!mc_background_setting$).+$
This excludes the topic debezium.plus.test.test_instance.mc_background_setting.
Notes for running the sync command:
Note: the dynamic topic/partition discovery parameter is properties.partition.discovery.interval.ms=30000
Reference: https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/connectors/datastream/kafka/
scan.topic-partition-discovery.interval is not a Kafka client property (it is the Flink SQL Kafka connector option) and does not apply here.
Paimon with MinIO as the underlying storage
Generic command line for syncing a whole database:
bin/flink run -m yarn-cluster \
-ynm paimon_kafka_sync_database \
-yqu flink \
-ytm 4096m \
-ys 1 \
-p 6 \
-D execution.runtime-mode=batch \
-D execution.buffer-timeout=10ms \
-D taskmanager.memory.managed.fraction=0.4 \
-D table.exec.resource.default-parallelism=6 \
lib/paimon-flink-action-1.0.1.jar \
kafka_sync_database \
--warehouse s3://dev-bucket-bigdata-flink/paimon \
--database demo \
--primary_keys id \
--kafka_conf connector=upsert-kafka \
--kafka_conf 'properties.bootstrap.servers=kafka-test01.com:32295,kafka-test02.com:32295,kafka-test03.com:32295' \
--kafka_conf 'properties.security.protocol=SASL_PLAINTEXT' \
--kafka_conf 'properties.sasl.mechanism=PLAIN' \
--kafka_conf 'properties.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="username" password="password";' \
--kafka_conf 'topic=debezium.plus.test_breeding.test_breeding.feed_message_report' \
--kafka_conf "properties.group.id=cid_yz.bigdata.paimon_sync_$(date +%s)" \
--kafka_conf 'properties.partition.discovery.interval.ms=30000' \
--kafka_conf 'properties.max.poll.records=1000' \
--kafka_conf 'scan.startup.mode=earliest-offset' \
--kafka_conf 'key.format=debezium-json' \
--kafka_conf 'value.format=debezium-json' \
--catalog_conf metastore=filesystem \
--catalog_conf 's3.endpoint=https://yos.test.com' \
--catalog_conf 's3.access-key=bigdata-flink-user' \
--catalog_conf 's3.secret-key=25dhHosSbUxcsQJINzKZr8D' \
--catalog_conf 's3.path.style.access=true' \
--catalog_conf 's3.connection.maximum=50' \
--catalog_conf 's3.threads.max=20' \
--table_conf bucket=6 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=6 \
--table_conf 'compaction.trigger=num_commits' \
--table_conf 'compaction.num_commits=10' \
--table_conf schema.automerge=false \
--table_conf auto-create-table=true \
--computed_column 'compute_time__=now() STORED' \
--table_conf 'snapshot.time-retained=1h' \
--table_conf 'snapshot.num-retained.min=1' \
--table_conf 'snapshot.num-retained.max=5' \
--table_conf tag.automatic-creation=process-time \
--table_conf tag.creation-period=hourly
Option notes:
- computed_column 'compute_time__=now() STORED': adds a computed processing-time column
- snapshot.time-retained=1h: how long snapshots are retained
- snapshot.num-retained.min=1: minimum number of snapshots to retain
- snapshot.num-retained.max=5: maximum number of snapshots to retain
- tag.automatic-creation=process-time: tags are created based on processing time, preserving full snapshot versions of the data
- tag.creation-period=hourly: tag creation period
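With tag.automatic-creation=process-time and tag.creation-period=hourly, Paimon creates an hourly tag that pins a full snapshot of each table. A minimal Flink SQL sketch for inspecting and reading those tags (table and tag names are illustrative; hourly tag names typically look like 'yyyy-MM-dd HH'):
-- list the tags created for a synced table
SELECT * FROM demo.`feed_message_report$tags`;
-- time travel: read the table as of a tag or a snapshot
SELECT * FROM demo.feed_message_report /*+ OPTIONS('scan.tag-name' = '2024-01-01 00') */;
SELECT * FROM demo.feed_message_report /*+ OPTIONS('scan.snapshot-id' = '3') */;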
Syncing a single topic to a single table
bin/flink run -m yarn-cluster -ynm paimon_kafka_sync_database -ytm 3172m -ys 1 -yqu flink -d \
-D execution.runtime-mode=batch \
lib/paimon-flink-action-0.8.2.jar \
kafka_sync_table \
--warehouse s3://dev-bucket-bigdata-flink/paimon \
--database demo \
--table event_list2 \
--partition_keys org_id \
--primary_keys org_id,id \
--kafka_conf connector=upsert-kafka \
--kafka_conf properties.bootstrap.servers=kafka-test01.com:32295,kafka-test02.com:32295,kafka-test03.com:32295 \
--kafka_conf properties.security.protocol=SASL_PLAINTEXT \
--kafka_conf properties.sasl.mechanism=PLAIN \
--kafka_conf properties.sasl.jaas.config='org.apache.kafka.common.security.plain.PlainLoginModule required username="username" password="password";' \
--kafka_conf topic=debezium.plus.test_citus.test_citus.event_list \
--kafka_conf scan.startup.mode=earliest-offset \
--kafka_conf key.format=debezium-json \
--kafka_conf value.format=debezium-json \
--catalog_conf metastore=filesystem \
--catalog_conf s3.endpoint=https://yos.test.com \
--catalog_conf s3.access-key='bigdata-flink-user' \
--catalog_conf s3.secret-key='25dhHosSbUtx9XQJINzKZr8D' \
--catalog_conf s3.path.style.access='true' \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
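For the accelerated queries mentioned at the top, the same warehouse can also be mounted in Doris as a Paimon catalog and queried directly. A minimal sketch, assuming a Doris version with Paimon catalog support; the property names follow the Doris lake-catalog convention and should be checked against your version (the secret key is a placeholder):
CREATE CATALOG paimon_minio PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "warehouse" = "s3://dev-bucket-bigdata-flink/paimon",
    "s3.endpoint" = "https://yos.test.com",
    "s3.access_key" = "bigdata-flink-user",
    "s3.secret_key" = "******"
);
SWITCH paimon_minio;
SELECT count(*) FROM demo.event_list2;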
Paimon with HDFS as the underlying storage
Command line that supports Kafka topics whose messages include the schema:
Partial sample of the message format:
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "struct",
        "fields": [
          {
            "type": "int64",
            "optional": false,
            "field": "id"
          },
          {
            "type": "string",
            "optional": true,
            "field": "create_user"
          }
        ],
        "optional": false,
        "name": "debezium.plus.test_metrics.test_metrics.ods_event_list.Envelope"
      }
    ]
  },
  "payload": {
    "before": null,
    "after": {
      "id": 781465173331251450,
      "create_user": "708978104768487425"
    }
  }
}