JobGraph Source Code Walkthrough
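The previous post covered the StreamGraph: it is generated on the client, mainly through the Stream API, as nodes (StreamNode) and edges (StreamEdge) that represent the job topology. This post covers how the JobGraph is generated, using YARN cluster mode as the example.
First, the JobGraph is an optimized form of the StreamGraph (the optimization includes checkpoint settings, the slot sharing group strategy, memory fractions, and so on). The most important step is chaining multiple eligible StreamNodes into a single node, which reduces the serialization, deserialization, and network transfer cost of moving data between nodes.
In short, JobGraph generation combines eligible operators into operator chains, creates the corresponding JobVertex, IntermediateDataSet, and JobEdge objects, and uses JobEdge to connect each IntermediateDataSet to its consuming JobVertex. At this stage only a coarse-grained logical structure of the user code is built (for example, its data structures). The user's actual data is exchanged only later, through the ResultSubPartition and InputGate that are constructed when Tasks are created.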
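The chaining idea can be illustrated on a linear pipeline. The following is a minimal sketch, not Flink's implementation: the `Node` class and the `chain` method are hypothetical, and the rule used here (a forward connection plus equal parallelism) is a simplifying assumption. Flink's actual check involves more conditions than these two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChainingSketch {

    /** Hypothetical stand-in for a StreamNode in a linear pipeline. */
    static final class Node {
        final String name;
        final int parallelism;
        // true if data arrives from the previous node over a plain forward connection
        final boolean forwardFromPrevious;

        Node(String name, int parallelism, boolean forwardFromPrevious) {
            this.name = name;
            this.parallelism = parallelism;
            this.forwardFromPrevious = forwardFromPrevious;
        }
    }

    /** Groups consecutive nodes into chains; each chain would become one vertex. */
    static List<List<String>> chain(List<Node> pipeline) {
        List<List<String>> chains = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Node previous = null;
        for (Node node : pipeline) {
            // Assumed rule: chain only across forward edges with equal parallelism.
            boolean chainable = previous != null
                    && node.forwardFromPrevious
                    && node.parallelism == previous.parallelism;
            if (!chainable && !current.isEmpty()) {
                chains.add(current);
                current = new ArrayList<>();
            }
            current.add(node.name);
            previous = node;
        }
        if (!current.isEmpty()) {
            chains.add(current);
        }
        return chains;
    }

    public static void main(String[] args) {
        List<Node> pipeline = Arrays.asList(
                new Node("source", 4, false),
                new Node("map", 4, true),
                new Node("window", 4, false), // repartitioned input, so a new chain starts
                new Node("sink", 2, true));   // parallelism differs, so a new chain starts
        System.out.println(chain(pipeline)); // prints [[source, map], [window], [sink]]
    }
}
```

Records passed from `source` to `map` stay inside one chain and need no serialization, while the two chain boundaries are where data would actually be shipped between tasks.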
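Core JobGraph Objects
1. JobVertex
In the StreamGraph, each operator corresponds to one StreamNode. In the JobGraph, multiple eligible StreamNodes are merged into one JobVertex, so a JobVertex contains one or more operators.
2. JobEdge
In the StreamGraph, connections between StreamNodes are represented by StreamEdge; in the JobGraph, connections between JobVertex instances are represented by JobEdge. A JobEdge acts as a data channel in the JobGraph: its upstream side is an IntermediateDataSet, which is the input data set of the JobEdge, and its downstream consumer is a JobVertex.
A JobEdge stores its target JobVertex but not its source JobVertex; it stores the source IntermediateDataSet instead.
3. IntermediateDataSet
An IntermediateDataSet is the data set produced by an operator, a source, or any intermediate operation. It represents the output of a JobVertex.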
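The relationship between the three objects can be sketched with a few plain classes. This is a simplified model under stated assumptions, not Flink's classes: `Vertex`, `DataSet`, `Edge`, and `connect` are hypothetical names that stand in for JobVertex, IntermediateDataSet, JobEdge, and the wiring step.

```java
import java.util.ArrayList;
import java.util.List;

class JobGraphModelSketch {

    /** Simplified JobVertex: one or more chained operators, with outputs and inputs. */
    static final class Vertex {
        final String name;
        final List<DataSet> producedDataSets = new ArrayList<>();
        final List<Edge> inputs = new ArrayList<>();

        Vertex(String name) {
            this.name = name;
        }
    }

    /** Simplified IntermediateDataSet: the output of exactly one producing vertex. */
    static final class DataSet {
        final Vertex producer;
        final List<Edge> consumers = new ArrayList<>();

        DataSet(Vertex producer) {
            this.producer = producer;
        }
    }

    /** Simplified JobEdge: holds the source data set and the target vertex only. */
    static final class Edge {
        final DataSet source;
        final Vertex target;

        Edge(DataSet source, Vertex target) {
            this.source = source;
            this.target = target;
        }
    }

    /** Wires upstream -> data set -> edge -> downstream and returns the new edge. */
    static Edge connect(Vertex upstream, Vertex downstream) {
        DataSet dataSet = new DataSet(upstream);
        upstream.producedDataSets.add(dataSet);
        Edge edge = new Edge(dataSet, downstream);
        dataSet.consumers.add(edge);
        downstream.inputs.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        Vertex source = new Vertex("Source -> Map");
        Vertex sink = new Vertex("Sink");
        Edge edge = connect(source, sink);
        // The edge has no direct reference to the upstream vertex;
        // it is reachable only through the source data set.
        System.out.println(edge.source.producer.name + " => " + edge.target.name);
    }
}
```

Keeping the data set between the producer and the edge means the producing vertex does not need to know who consumes its output.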
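JobGraph Generation Process
The entry point for JobGraph generation is the static method StreamingJobGraphGenerator.createJobGraph(this, jobID), which ultimately calls the instance method StreamingJobGraphGenerator.createJobGraph().
Entry Function
The call chain from the entry function is: executeAsync (obtains the YarnJobClusterExecutorFactory) -> execute (generates the JobGraph and deploys the job to the cluster) -> getJobGraph (picks the batch plan translator or the streaming StreamGraph translator according to the Pipeline type) -> createJobGraph (creates a StreamingJobGraphGenerator instance and builds the JobGraph).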
@Internal
public JobClient executeAsync(StreamGraph streamGraph) throws Exception {
    checkNotNull(streamGraph, "StreamGraph cannot be null.");
    checkNotNull(configuration.get(DeploymentOptions.TARGET), "No execution.target specified in your configuration file.");
    // Use DefaultExecutorServiceLoader to obtain the YarnJobClusterExecutorFactory
    final PipelineExecutorFactory executorFactory =
            executorServiceLoader.getExecutorFactory(configuration);
    checkNotNull(
            executorFactory,
            "Cannot find compatible factory for specified execution.target (=%s)",
            configuration.get(DeploymentOptions.TARGET));
    // Create the YarnJobClusterExecutor, which generates the JobGraph and then submits the job to the cluster
    CompletableFuture<JobClient> jobClientFuture = executorFactory
            .getExecutor(configuration) // new YarnJobClusterExecutor
            .execute(streamGraph, configuration, userClassloader);
    ........
}
@Override
public CompletableFuture<JobClient> execute(@Nonnull final Pipeline pipeline, @Nonnull final Configuration configuration, @Nonnull final ClassLoader userCodeClassloader) throws Exception {
    // Generate the JobGraph
    final JobGraph jobGraph = PipelineExecutorUtils.getJobGraph(pipeline, configuration);
    try (final ClusterDescriptor<ClusterID> clusterDescriptor = clusterClientFactory.createClusterDescriptor(configuration)) {
        final ExecutionConfigAccessor configAccessor = ExecutionConfigAccessor.fromConfiguration(configuration);
        final ClusterSpecification clusterSpecification = clusterClientFactory.getClusterSpecification(configuration);
        // Deploy the job cluster
        final ClusterClientProvider<ClusterID> clusterClientProvider = clusterDescriptor
                .deployJobCluster(clusterSpecification, jobGraph, configAccessor.getDetachedMode());
        LOG.info("Job has been submitted with JobID " + jobGraph.getJobID());
        // Return an already-completed future that wraps the client of the deployed job
        return CompletableFuture.completedFuture(
                new ClusterClientJobClientAdapter<>(clusterClientProvider, jobGraph.getJobID(), userCodeClassloader));
    }
}
public static JobGraph getJobGraph(
        Pipeline pipeline,
        Configuration optimizerConfiguration,
        int defaultParallelism) {
    // Pick the batch plan translator or the streaming StreamGraph translator according to the Pipeline type
    FlinkPipelineTranslator pipelineTranslator = getPipelineTranslator(pipeline);
    return pipelineTranslator.translateToJobGraph(pipeline,
            optimizerConfiguration,
            defaultParallelism);
}
// Create a StreamingJobGraphGenerator instance and build the JobGraph
public static JobGraph createJobGraph(StreamGraph streamGraph, @Nullable JobID jobID) {
    return new StreamingJobGraphGenerator(streamGraph, jobID).createJobGraph();
}
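The translator selection in getJobGraph is a dispatch on the runtime type of the pipeline. A minimal sketch of that pattern follows; every name in it (`Pipeline`, `BatchPlan`, `StreamingGraph`, `Translator`, `translatorFor`) is a hypothetical stand-in, and the returned strings only mark which branch was taken.

```java
class TranslatorDispatchSketch {

    /** Hypothetical marker interface for anything that can be turned into a JobGraph. */
    interface Pipeline {}

    /** Hypothetical batch pipeline. */
    static final class BatchPlan implements Pipeline {}

    /** Hypothetical streaming pipeline. */
    static final class StreamingGraph implements Pipeline {}

    /** Hypothetical translator; a real one would return a JobGraph. */
    interface Translator {
        String translate(Pipeline pipeline);
    }

    /** Chooses a translator from the runtime type of the pipeline. */
    static Translator translatorFor(Pipeline pipeline) {
        if (pipeline instanceof BatchPlan) {
            return p -> "JobGraph from batch plan";
        }
        if (pipeline instanceof StreamingGraph) {
            return p -> "JobGraph from stream graph";
        }
        throw new IllegalArgumentException(
                "Unsupported pipeline type: " + pipeline.getClass().getName());
    }

    public static void main(String[] args) {
        Pipeline pipeline = new StreamingGraph();
        System.out.println(translatorFor(pipeline).translate(pipeline));
    }
}
```

Failing fast on an unknown pipeline type keeps an unsupported job from reaching the deployment step.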
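The createJobGraph Function
In StreamingJobGraphGenerator, almost all member fields exist to help build the final JobGraph.
createJobGraph works as follows. First it generates a unique hash id for every node; users can also define this hash themselves. As long as a node is unchanged across submissions (for example its group, parallelism, and upstream/downstream relationships), its hash id stays the same, which is mainly used for failure recovery. It then performs chaining and generates the JobVertex, JobEdge, and related objects. Finally it writes the various configuration settings, such as caching and checkpoints.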
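The stable hash id can be illustrated with a deterministic digest over the properties that identify a node. This is only a sketch under stated assumptions: `nodeHash` is a hypothetical helper, SHA-256 is chosen for convenience, and the set of hashed properties is an illustrative guess rather than what Flink's hasher uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class StableHashSketch {

    /** Hashes a node's group, parallelism, and the hashes of its upstream nodes. */
    static String nodeHash(String group, int parallelism, List<String> upstreamHashes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(group.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // separator so adjacent fields cannot run together
            digest.update(Integer.toString(parallelism).getBytes(StandardCharsets.UTF_8));
            for (String upstream : upstreamHashes) {
                digest.update((byte) 0);
                digest.update(upstream.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String source = nodeHash("default", 4, Collections.<String>emptyList());
        String map = nodeHash("default", 4, Arrays.asList(source));
        String mapResubmitted = nodeHash("default", 4, Arrays.asList(source));
        String mapRescaled = nodeHash("default", 8, Arrays.asList(source));
        System.out.println(map.equals(mapResubmitted)); // prints true: nothing changed
        System.out.println(map.equals(mapRescaled));    // prints false: parallelism changed
    }
}
```

Because each hash folds in the upstream hashes, a change anywhere upstream also changes the ids of the nodes below it, which is what makes an id usable for matching a node across submissions.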
public class StreamingJobGraphGenerator {
    private StreamGraph streamGraph;
    private JobGraph jobGraph;
    // id -> JobVertex
    private Map<Integer, JobVertex> jobVertices;
    // ids of the JobVertex instances that have already been built
    private Collection<Integer> builtVertices;
    // physical edges (edges inside a chain are excluded), in creation order
    private List<StreamEdge> physicalEdgesInOrder;
    // chain information, used at deployment to build the OperatorChain: startNodeId -> (currentNodeId -> StreamConfig)
    private Map<Integer, Map<Integer, StreamConfig>> chainedConfigs;
    // configuration of every node: id -> StreamConfig
    private Map<Integer, StreamConfig> vertexConfigs;
/