Spark SQL HadoopMapReduceCommitProtocol: choosing 1 or 2 for mapreduce.fileoutputcommitter.algorithm.version

This article looks at why mapreduce.fileoutputcommitter.algorithm.version defaults to 1 in Spark 3.1.1 and compares v1 and v2 in terms of performance and consistency. Given v1's safety characteristics, the recommendation is to keep the default v1 behavior across Spark distributions, even though v2 performs better in some scenarios but can lead to data-consistency problems.


Background

This article is based on Spark 3.1.1.
For Spark, mapreduce.fileoutputcommitter.algorithm.version defaults to 1.
This can be seen in SparkHadoopUtil.scala:

  private def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
    // Copy any "spark.hadoop.foo=bar" spark properties into conf as "foo=bar"
    for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")
    }
  }
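
Because the default is only applied when that option is absent, a job can still opt into v2 explicitly through the spark.hadoop. prefix, which appendSparkHadoopConfigs copies into the Hadoop Configuration with the prefix stripped. A minimal sketch, with the application name and output path made up for illustration:

  import org.apache.spark.sql.SparkSession

  // Sketch: explicitly choosing the committer algorithm version.
  // "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version" is copied into
  // the Hadoop Configuration as "mapreduce.fileoutputcommitter.algorithm.version"
  // by the code above, so the v1 default is not applied.
  val spark = SparkSession.builder()
    .appName("committer-demo")  // illustrative name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()

  // Any file-based write now commits its tasks with the v2 algorithm.
  spark.range(100).write.mode("overwrite").parquet("/tmp/committer_demo")  // hypothetical path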

Tracing the write path

InsertIntoHadoopFsRelationCommand calls FileFormatWriter.write, which eventually calls sparkSession.sparkContext.runJob:

      sparkSession.sparkContext.runJob(
        rddWithNonEmptyPartitions,
        (taskContext: TaskContext, iter: Iterator[InternalRow]) => {
          executeTask(
            description = description,
            jobIdInstant = jobIdInstant,
            sparkStageId = taskContext.stageId(),
            sparkPartitionId = taskContext.partitionId(),
            sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,
            committer,
            iterator = iter)
        },
        rddWithNonEmptyPartitions.partitions.indices,
        (index, res: WriteTaskResult) => {
          committer.onTaskCommit(res.commitMsg)
          ret(index) = res
        })

executeTask ends up calling dataWriter.write and then commit:

  override def commit(): WriteTaskResult = {
    releaseResources()
    val summary = ExecutedWriteSummary(
      updatedPartitions = updatedPartitions.toSet,
      stats = statsTrackers.map(_.getFinalStats()))
    WriteTaskResult(committer.commitTask(taskAttemptContext), summary)
  }

This ultimately goes through HadoopMapReduceCommitProtocol.commitTask and into FileOutputCommitter.commitTask:

  if (algorithmVersion == 1) {
    Path committedTaskPath = getCommittedTaskPath(context);
    if (fs.exists(committedTaskPath)) {
      if (!fs.delete(committedTaskPath, true)) {
        throw new IOException("Could not delete " + committedTaskPath);
      }
    }
    if (!fs.rename(taskAttemptPath, committedTaskPath)) {
      throw new IOException("Could not rename " + taskAttemptPath + " to "
          + committedTaskPath);
    }
    LOG.info("Saved output of task '" + attemptId + "' to " +
        committedTaskPath);
  } else {
    // directly merge everything from taskAttemptPath to output directory
    mergePaths(fs, taskAttemptDirStatus, outputPath);
    LOG.info("Saved output of task '" + attemptId + "' to " +
        outputPath);
  }

Depending on whether algorithmVersion is 1 or 2, different things happen here (a sketch follows the list):

  • For v1, the files a task produces are renamed into a per-task committed directory, and are only moved into the final output directory when the whole job commits.
  • For v2, the files a task produces are moved directly into the final output directory at task commit.
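
To make the difference concrete, here is a rough sketch of what each version does with a task's files. This is not the real FileOutputCommitter code; the path layout and method shape are simplified for illustration:

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Illustrative path layout (simplified):
  //   taskAttemptPath   = /out/_temporary/0/_temporary/attempt_x  (task scratch dir)
  //   committedTaskPath = /out/_temporary/0/task_x                (v1 staging dir)
  //   outputPath        = /out                                    (final destination)
  def sketchTaskCommit(
      algorithmVersion: Int,
      fs: FileSystem,
      taskAttemptPath: Path,
      committedTaskPath: Path,
      outputPath: Path): Unit = {
    if (algorithmVersion == 1) {
      // v1: a single rename into the per-task committed dir; the files reach the
      // final output dir only later, when job commit merges all task_x dirs.
      fs.rename(taskAttemptPath, committedTaskPath)
    } else {
      // v2: move the task's files straight into the output dir, one at a time.
      // Faster (no second move at job commit) but not atomic.
      fs.listStatus(taskAttemptPath).foreach { status =>
        fs.rename(status.getPath, new Path(outputPath, status.getPath.getName))
      }
    }
  }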

On the trade-off between the two: v2 performs better than v1, while v1 gives stronger consistency than v2. Let's look at what Spark does about this:

How Spark handles this

As SPARK-33019 puts it:

Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from v1 to v2 and now there exists a discussion to remove v2. We had better provide a consistent default behavior of v1 across various Spark distributions

In other words, to keep forward and backward compatible behavior across Spark distributions, Spark pins the default to v1.
The Spark documentation also covers this in "Recommended settings for writing to object stores":

For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety

For more details, see 大数据云上存算分离,我们应该关注什么 (an article on storage-compute separation for big data in the cloud).

How Hadoop handles this

See MAPREDUCE-7282:

The v2 MR commit algorithm moves files from the task attempt dir into the dest dir on task commit -one by one

It is therefore not atomic

if a task commit fails partway through and another task attempt commits -unless exactly the same filenames are used, output of the first attempt may be included in the final result
if a worker partitions partway through task commit, and then continues after another attempt has committed, it may partially overwrite the output -even when the filenames are the same
Both MR and spark assume that task commits are atomic. Either they need to consider that this is not the case, we add a way to probe for a committer supporting atomic task commit, and the engines both add handling for task commit failures (probably fail job)

Better: we remove this as the default, maybe also warn when it is being used

The gist is that, because task commits are assumed to be atomic, the recommendation is to drop v2 as the default and discourage its use.
Later in the discussion, though:

Daryn Sharp Added a comment:
I'm also -1 on changing the default.  It exposes users to new (old but new to them) behavior that may have quirks. This was a 2.7 change from 5 years ago so if it's a high risk issue our customers would have squawked by now. Has this been frequently observed or theorized?

Notably our users won't tolerate the performance regression and SLA misses. I seem to recall jobs that ran for a single-digit minutes followed by a double-digit commit. The v2 commit amortized the commit to under a minute.

I'm not a MR expert. Here's my understanding:

if a task commit fails partway through and another task attempt commits -unless exactly the same filenames are used, output of the first attempt may be included in the final result

Isn't that indicative of a non-deterministic job? Should the risk to a few "bad" jobs outweigh the benefit to the mass majority of jobs? Why not change the committer for at risk jobs?

if a worker partitions partway through task commit, and then continues after another attempt has committed, it may partially overwrite the output -even when the filenames are the same

I don't think this can happen. Tasks request permission from the AM to commit.

---
Steve Loughran added a comment: 

Tasks request permission from the AM to commit.

yes, and then we assume that they continue to completion, rather than pausing for an extended period of time, so by the time the AM/spark driver gets a timeout, it can be assumed to be one of a network failure or the worker has failed/VM/k8s container terminated. The "suspended for a long time and then continues" risk does exist, and is unlikely on a physical cluster, but in a world of VMs, not entirely inconceivable.

I note the MR AM does track its time from last heartbeat to the YARN RM to detect partitions, workers don't.

The interesting part here: ideally, every task attempt asks the driver for permission before committing, so that only one attempt of a given task is allowed to commit (other attempts of the same task will not), which keeps task commit correct. But suppose a network problem makes the driver-executor heartbeat time out while the executor running that task has already been told it may commit: the task will keep committing its output until the driver sends the notification to remove the executor, and during that window data inconsistency is still possible. (The Spark timeouts involved are spark.executor.heartbeatInterval, spark.network.timeout and spark.rpc.askTimeout; a configuration sketch follows.)
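
For reference, here is a minimal sketch of where those timeouts are configured. The values shown are Spark's usual defaults, not tuning recommendations:

  import org.apache.spark.SparkConf

  // Sketch: the timeouts involved in detecting a lost or partitioned executor.
  val conf = new SparkConf()
    .set("spark.executor.heartbeatInterval", "10s") // executor-to-driver heartbeat interval
    .set("spark.network.timeout", "120s")           // default for most network timeouts
    .set("spark.rpc.askTimeout", "120s")            // RPC ask timeout (falls back to spark.network.timeout)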

Conclusion

So the final conclusion: v1 is safe but slower, v2 is faster but potentially unsafe, and v1 is the recommended choice.
