Spark SQL HadoopMapReduceCommitProtocol: choosing 1 or 2 for mapreduce.fileoutputcommitter.algorithm.version

This article looks at why mapreduce.fileoutputcommitter.algorithm.version defaults to 1 in Spark 3.1.1 and compares v1 and v2 in terms of performance and consistency. Given v1's safety characteristics, the recommendation is to keep the default v1 behavior across Spark distributions, even though v2 performs better in some scenarios but can lead to data-consistency problems.


Background

This article is based on Spark 3.1.1.
For Spark, mapreduce.fileoutputcommitter.algorithm.version defaults to 1.
This can be seen in SparkHadoopUtil.scala:

  private def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
    // Copy any "spark.hadoop.foo=bar" spark properties into conf as "foo=bar"
    for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")
    }
  }
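
Because the default is only applied when that option is absent, a job can still opt into v2 explicitly through the spark.hadoop. prefix, which appendSparkHadoopConfigs copies into the Hadoop Configuration with the prefix stripped. A minimal sketch, with the application name and output path made up for illustration:

  import org.apache.spark.sql.SparkSession

  // Sketch: explicitly choosing the committer algorithm version.
  // "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version" is copied into
  // the Hadoop Configuration as "mapreduce.fileoutputcommitter.algorithm.version"
  // by the code above, so the v1 default is not applied.
  val spark = SparkSession.builder()
    .appName("committer-demo")  // illustrative name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()

  // Any file-based write now commits its tasks with the v2 algorithm.
  spark.range(100).write.mode("overwrite").parquet("/tmp/committer_demo")  // hypothetical path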

Tracing the write path

InsertIntoHadoopFsRelationCommand calls FileFormatWriter.write, which eventually calls sparkSession.sparkContext.runJob:

      sparkSession.sparkContext.runJob(
        rddWithNonEmptyPartitions,
        (taskContext: TaskContext, iter: Iterator[InternalRow]) => {
          executeTask(
            description = description,
            jobIdInstant = jobIdInstant,
            sparkStageId = taskContext.stageId(),
            sparkPartitionId = taskContext.partitionId(),
            sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,
            committer,
            iterator = iter)
        },
        rddWithNonEmptyPartitions.partitions.indices,
        (index, res: WriteTaskResult) => {
          committer.onTaskCommit(res.commitMsg)
          ret(index) = res
        })

executeTask ends up calling dataWriter.write and then commit:

  override def commit(): WriteTaskResult = {
    releaseResources()
    val summary = ExecutedWriteSummary(
      updatedPartitions = updatedPartitions.toSet,
      stats = statsTrackers.map(_.getFinalStats()))
    WriteTaskResult(committer.commitTask(taskAttemptContext), summary)
  }

This ultimately goes through HadoopMapReduceCommitProtocol.commitTask and into FileOutputCommitter.commitTask:

  if (algorithmVersion == 1) {
    Path committedTaskPath = getCommittedTaskPath(context);
    if (fs.exists(committedTaskPath)) {
      if (!fs.delete(committedTaskPath, true)) {
        throw new IOException("Could not delete " + committedTaskPath);
      }
    }
    if (!fs.rename(taskAttemptPath, committedTaskPath)) {
      throw new IOException("Could not rename " + taskAttemptPath + " to "
          + committedTaskPath);
    }
    LOG.info("Saved output of task '" + attemptId + "' to " +
        committedTaskPath);
  } else {
    // directly merge everything from taskAttemptPath to output directory
    mergePaths(fs, taskAttemptDirStatus, outputPath);
    LOG.info("Saved output of task '" + attemptId + "' to " +
        outputPath);
  }

Depending on whether algorithmVersion is 1 or 2, different things happen here (a sketch follows the list):

  • For v1, the files a task produces are renamed into a per-task committed directory, and are only moved into the final output directory when the whole job commits.
  • For v2, the files a task produces are moved directly into the final output directory at task commit.
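
To make the difference concrete, here is a rough sketch of what each version does with a task's files. This is not the real FileOutputCommitter code; the path layout and method shape are simplified for illustration:

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Illustrative path layout (simplified):
  //   taskAttemptPath   = /out/_temporary/0/_temporary/attempt_x  (task scratch dir)
  //   committedTaskPath = /out/_temporary/0/task_x                (v1 staging dir)
  //   outputPath        = /out                                    (final destination)
  def sketchTaskCommit(
      algorithmVersion: Int,
      fs: FileSystem,
      taskAttemptPath: Path,
      committedTaskPath: Path,
      outputPath: Path): Unit = {
    if (algorithmVersion == 1) {
      // v1: a single rename into the per-task committed dir; the files reach the
      // final output dir only later, when job commit merges all task_x dirs.
      fs.rename(taskAttemptPath, committedTaskPath)
    } else {
      // v2: move the task's files straight into the output dir, one at a time.
      // Faster (no second move at job commit) but not atomic.
      fs.listStatus(taskAttemptPath).foreach { status =>
        fs.rename(status.getPath, new Path(outputPath, status.getPath.getName))
      }
    }
  }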

On the trade-off between the two: v2 performs better than v1, while v1 gives stronger consistency than v2. Let's look at what Spark does about this:

How Spark handles this

As SPARK-33019 puts it:

Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from v1 to v2 and now there exists a discussion to remove v2. We had better provide a consistent default behavior of v1 across various Spark distributions

In other words, to keep forward and backward compatible behavior across Spark distributions, Spark pins the default to v1.
The Spark documentation also covers this in "Recommended settings for writing to object stores":

For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety

For more details, see 大数据云上存算分离,我们应该关注什么 (an article on storage-compute separation for big data in the cloud).

How Hadoop handles this

See MAPREDUCE-7282:

The v2 MR commit algorithm moves files from the task attempt dir into the dest dir on task commit -one by one

It is therefore not atomic

if a task commit fails partway through and another task attempt commits -unless exactly the same filenames are used, output of the first attempt may be included in the final result
if a worker partitions partway through task commit, and then continues after another attempt has committed, it may partially overwrite the output -even when the filenames are the same
Both MR and spark assume that task commits are atomic. Either they need to consider that this is not the case, we add a way to probe for a committer supporting atomic task commit, and the engines both add handling for task commit failures (probably fail job)

Better: we remove this as the default, maybe also warn when it is being used

The gist is that, because task commits are assumed to be atomic, the recommendation is to drop v2 as the default and discourage its use.
Later in the discussion, though:

Daryn Sharp Added a comment:
I'm also -1 on changing the default.  It exposes users to new (old but new to them) behavior that may have quirks. This was a 2.7 change from 5 years ago so if it's a high risk issue our customers would have squawked by now. Has this been frequently observed or theorized?

Notably our users won't tolerate the performance regression and SLA misses. I seem to recall jobs that ran for a single-digit minutes followed by a double-digit commit. The v2 commit amortized the commit to under a minute.

I'm not a MR expert. Here's my understanding:

if a task commit fails partway through and another task attempt commits -unless exactly the same filenames are used, output of the first attempt may be included in the final result

Isn't that indicative of a non-deterministic job? Should the risk to a few "bad" jobs outweigh the benefit to the mass majority of jobs? Why not change the committer for at risk jobs?

if a worker partitions partway through task commit, and then continues after another attempt has committed, it may partially overwrite the output -even when the filenames are the same

I don't think this can happen. Tasks request permission from the AM to commit.

---
Steve Loughran added a comment: 

Tasks request permission from the AM to commit.

yes, and then we assume that they continue to completion, rather than pausing for an extended period of time, so by the time the AM/spark driver gets a timeout, it can be assumed to be one of a network failure or the worker has failed/VM/k8s container terminated. The "suspended for a long time and then continues" risk does exist, and is unlikely on a physical cluster, but in a world of VMs, not entirely inconceivable.

I note the MR AM does track its time from last heartbeat to the YARN RM to detect partitions, workers don't.

The interesting part here: ideally, every task attempt asks the driver for permission before committing, so that only one attempt of a given task is allowed to commit (other attempts of the same task will not), which keeps task commit correct. But suppose a network problem makes the driver-executor heartbeat time out while the executor running that task has already been told it may commit: the task will keep committing its output until the driver sends the notification to remove the executor, and during that window data inconsistency is still possible. (The Spark timeouts involved are spark.executor.heartbeatInterval, spark.network.timeout and spark.rpc.askTimeout; a configuration sketch follows.)
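
For reference, here is a minimal sketch of where those timeouts are configured. The values shown are Spark's usual defaults, not tuning recommendations:

  import org.apache.spark.SparkConf

  // Sketch: the timeouts involved in detecting a lost or partitioned executor.
  val conf = new SparkConf()
    .set("spark.executor.heartbeatInterval", "10s") // executor-to-driver heartbeat interval
    .set("spark.network.timeout", "120s")           // default for most network timeouts
    .set("spark.rpc.askTimeout", "120s")            // RPC ask timeout (falls back to spark.network.timeout)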

Conclusion

So the final conclusion: v1 is safe but slower, v2 is faster but potentially unsafe, and v1 is the recommended choice.
