Exploring how Flink streaming incrementally reads an Iceberg table

小饭大人

Licensed under CC 4.0 BY-SA

Category: flink. Tags: flink, big data, iceberg, data lake

Article link: https://blog.csdn.net/nmyphp/article/details/121266705

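This article examines in detail how Flink streaming reads an Iceberg table incrementally through StreamingReaderOperator. Starting from the official documentation, it analyzes the role of startSnapshotId, then digs into the source code to explain how StreamingMonitorFunction works, how FlinkInputSplit objects are generated, and how incremental data is obtained from a FlinkInputSplit. By tracing DataIterator.next(), it shows the underlying file-reading process, including support for file formats such as PARQUET, AVRO, and ORC.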

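The official Iceberg documentation gives the following introduction:
[Image: excerpt from the Iceberg documentation]
The example program sets startSnapshotId, and the documentation says that incremental data can be read starting from the specified snapshot ID. That raises the author's question: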

How does Flink streaming incrementally read an Iceberg table?
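
Before opening the source, it may help to pin down what "incremental" could mean here. The following is a minimal Java sketch, not Iceberg code: `Snapshot` and `filesAfter` are hypothetical names, the table is modeled as an ordered log of snapshots that each append some files, and the start snapshot is assumed to be exclusive (only snapshots committed after it are read).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a table's snapshot log. Hypothetical; this is not the Iceberg API. */
final class IncrementalScanSketch {

    /** Hypothetical stand-in for a snapshot: an id plus the data files it appended. */
    static final class Snapshot {
        final long id;
        final List<String> addedFiles;

        Snapshot(long id, String... addedFiles) {
            this.id = id;
            this.addedFiles = Arrays.asList(addedFiles);
        }
    }

    /**
     * Collects the files appended by every snapshot that follows startSnapshotId
     * in commit order. Assumption: the start snapshot itself is excluded.
     * The log is walked in order instead of comparing ids numerically, so the
     * result does not depend on ids being monotonically increasing.
     */
    static List<String> filesAfter(List<Snapshot> log, long startSnapshotId) {
        List<String> result = new ArrayList<>();
        boolean pastStart = false;
        for (Snapshot snapshot : log) {
            if (pastStart) {
                result.addAll(snapshot.addedFiles);
            } else if (snapshot.id == startSnapshotId) {
                pastStart = true;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Snapshot> log = Arrays.asList(
            new Snapshot(101L, "a.parquet"),
            new Snapshot(205L, "b.parquet", "c.parquet"),
            new Snapshot(309L, "d.parquet"));
        // Prints [b.parquet, c.parquet, d.parquet]
        System.out.println(filesAfter(log, 101L));
    }
}
```

A streaming job would repeat this kind of scan periodically, each time using the last snapshot it has already processed as the new starting point.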

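Flink itself certainly cannot read Iceberg incrementally; it only provides the framework-level building blocks. In the source directory iceberg/flink/src/main/java/org/apache/iceberg/flink/source/ there is a class StreamingReaderOperator.java, which extends Flink's AbstractStreamOperator. We will try to start reading the source code from here.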

/**
 * The operator that reads the {@link FlinkInputSplit splits} received from the preceding {@link
 * StreamingMonitorFunction}. Contrary to the {@link StreamingMonitorFunction} which has a parallelism of 1,
 * this operator can have multiple parallelism.
 *
 * <p>As soon as a split descriptor is received, it is put in a queue, and use {@link MailboxExecutor}
 * read the actual data of the split. This architecture allows the separation of the reading thread from the one split
 * processing the checkpoint barriers, thus removing any potential back-pressure.
 */
 public class StreamingReaderOperator extends AbstractStreamOperato
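
The Javadoc above describes splits being put in a queue and then read through a MailboxExecutor. A toy, single-threaded model of that idea is sketched below. It is not the operator's actual code: `ReaderOperatorSketch`, `submit`, and `drain` are hypothetical names, the mailbox is modeled as a plain queue of `Runnable` rather than Flink's `MailboxExecutor`, and "reading a split" is reduced to writing a log entry. The point it tries to show is that scheduling one split per mail lets other mails, such as a checkpoint, run between splits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of a queue-plus-mailbox reader. Hypothetical; not Flink or Iceberg code. */
final class ReaderOperatorSketch {
    private final Queue<String> splits = new ArrayDeque<>();    // pending split descriptors
    private final Queue<Runnable> mailbox = new ArrayDeque<>(); // stand-in for a mailbox
    final List<String> log = new ArrayList<>();                 // records what ran, in order
    private boolean readScheduled = false;

    /** Called for each split received from upstream: queue it, then schedule a read. */
    void processElement(String split) {
        splits.add(split);
        scheduleReadIfIdle();
    }

    /** Lets other work (for example a checkpoint action) share the same mailbox. */
    void submit(Runnable mail) {
        mailbox.add(mail);
    }

    /** Keeps at most one read mail pending at any time. */
    private void scheduleReadIfIdle() {
        if (!readScheduled && !splits.isEmpty()) {
            readScheduled = true;
            mailbox.add(this::readOneSplit);
        }
    }

    /** Reads exactly one split, then re-schedules itself if more splits are waiting. */
    private void readOneSplit() {
        log.add("read " + splits.poll());
        readScheduled = false;
        scheduleReadIfIdle();
    }

    /** Runs mails in FIFO order until the mailbox is empty. */
    void drain() {
        Runnable mail;
        while ((mail = mailbox.poll()) != null) {
            mail.run();
        }
    }

    public static void main(String[] args) {
        ReaderOperatorSketch op = new ReaderOperatorSketch();
        op.processElement("split-1");
        op.processElement("split-2");
        op.submit(() -> op.log.add("checkpoint"));
        op.drain();
        // Prints [read split-1, checkpoint, read split-2]
        System.out.println(op.log);
    }
}
```

In this model the checkpoint mail is handled after the first split instead of waiting for every queued split to finish, which is the kind of interleaving the Javadoc alludes to.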
