MapReduce commit job optimization

This post looks at the optimizations made to the job commit process in Hadoop MapReduce version 2 (MRv2). By improving the FileOutputCommitter component, a large number of file rename operations are removed from the commitJob phase, which significantly speeds up commit for large jobs.

It is common to see a user's job take several more minutes to finish after all of its map and reduce tasks have completed. That time is mostly spent committing the job output.

MRv2 includes optimizations for this phase; the relevant JIRAs are:

https://issues.apache.org/jira/browse/MAPREDUCE-4815

https://issues.apache.org/jira/browse/MAPREDUCE-6275

https://issues.apache.org/jira/browse/MAPREDUCE-6280

From what I can tell, these features are only available in Hadoop 2.7 and later, and there are still some pitfalls.


In the commitJob method of FileOutputCommitter you can see that the processing logic differs depending on the value of mapreduce.fileoutputcommitter.algorithm.version.

org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java


mapreduce.fileoutputcommitter.algorithm.version
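
For reference, a job can opt in to the new algorithm by setting this property on its configuration before submission. Below is a minimal sketch; the class name, identity mapper, and input/output paths are placeholders for illustration, not something from the original post:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommitV2Example {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Switch FileOutputCommitter to the new commit algorithm (available in Hadoop 2.7+).
        // The default value is 1, i.e. the original algorithm.
        conf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);

        Job job = Job.getInstance(conf, "commit-v2-example");
        job.setJarByClass(CommitV2Example.class);
        job.setMapperClass(Mapper.class);  // identity mapper; a map-only job keeps the sketch short
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same value can also be passed per invocation for ToolRunner-based jobs (-Dmapreduce.fileoutputcommitter.algorithm.version=2) or set cluster-wide in mapred-site.xml.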


The description of this property in the official documentation, https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml, is very clear and is quoted in full further below.

To summarize: the new algorithm removes an entire round of renames from commitJob. In the old algorithm (version 1), commitJob renames the large number of task output files one by one in a single thread, which by itself can take a long time. In the new algorithm (version 2), each task renames its output directly into the final output directory when it commits, so commitJob has almost nothing left to do and the job commit becomes much faster.


The file output committer algorithm version. Valid algorithm version numbers: 1 or 2, default to 1, which is the original algorithm.

In algorithm version 1:

1. commitTask will rename directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/
2. recoverTask will also do a rename: $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/
3. commitJob will merge every task output file in $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/, then it will delete $joboutput/_temporary/ and write $joboutput/_SUCCESS

It has a performance regression, which is discussed in MAPREDUCE-4815. If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. The commit is single-threaded and waits until all tasks have completed before commencing.

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob:

1. commitTask will rename all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/
2. recoverTask actually doesn't need to do anything, but for the upgrade-from-version-1-to-version-2 case it will check if there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and rename them to $joboutput/
3. commitJob can simply delete $joboutput/_temporary and write $joboutput/_SUCCESS

This algorithm reduces the output commit time for large jobs by having the tasks commit directly to the final output directory as they complete, so commitJob has very little to do.
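
To make the difference concrete, here is a toy, local-filesystem sketch of where the renames happen in each algorithm. This is not the Hadoop source code: the real FileOutputCommitter deals with nested directories, multiple task attempts and recovery, while the helper names and flat file layout below are simplifications for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Toy model of the two commit algorithms on the local filesystem.
// NOT the Hadoop source; everything here is a simplification.
public class CommitAlgorithmToy {

    // ---- algorithm version 1 ----

    // commitTask v1: a single rename that promotes the task attempt directory
    // to the committed task directory under _temporary.
    static void commitTaskV1(Path taskAttemptDir, Path committedTaskDir) throws IOException {
        Files.move(taskAttemptDir, committedTaskDir);
    }

    // commitJob v1: a single-threaded merge of every task output file into the
    // final output directory -- the step that can take minutes for large jobs.
    static void commitJobV1(Path temporaryDir, Path jobOutput) throws IOException {
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(temporaryDir)) {
            for (Path taskDir : tasks) {
                try (DirectoryStream<Path> files = Files.newDirectoryStream(taskDir)) {
                    for (Path file : files) {
                        Files.move(file, jobOutput.resolve(file.getFileName()));
                    }
                }
            }
        }
        deleteRecursively(temporaryDir);
        Files.createFile(jobOutput.resolve("_SUCCESS"));
    }

    // ---- algorithm version 2 ----

    // commitTask v2: each task renames its files straight into the final output
    // directory as it completes, so the renames run in parallel across tasks.
    static void commitTaskV2(Path taskAttemptDir, Path jobOutput) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(taskAttemptDir)) {
            for (Path file : files) {
                Files.move(file, jobOutput.resolve(file.getFileName()));
            }
        }
    }

    // commitJob v2: nothing left to merge -- just clean up and mark success.
    static void commitJobV2(Path temporaryDir, Path jobOutput) throws IOException {
        deleteRecursively(temporaryDir);
        Files.createFile(jobOutput.resolve("_SUCCESS"));
    }

    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Tiny demo of the version 2 flow.
    public static void main(String[] args) throws IOException {
        Path jobOutput = Files.createTempDirectory("joboutput");
        Path temporary = Files.createDirectories(jobOutput.resolve("_temporary"));
        Path attempt = Files.createDirectories(temporary.resolve("attempt_0_task_0"));
        Files.createFile(attempt.resolve("part-m-00000"));

        commitTaskV2(attempt, jobOutput);   // file lands in jobOutput immediately
        commitJobV2(temporary, jobOutput);  // only cleanup + _SUCCESS remain

        try (Stream<Path> out = Files.list(jobOutput)) {
            out.forEach(System.out::println); // part-m-00000 and _SUCCESS
        }
    }
}
```

The point to notice is that in version 1 every output file is renamed again inside commitJobV1, serially, while in version 2 those renames already happened inside commitTaskV2 as each task finished.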
