A Walkthrough of Spark RDD Partitions and Scheduler Scheduling

Latest recommended article published on 2023-06-06 08:41:58

lilyjoke

Licensed under CC 4.0 BY-SA

Article link: https://blog.csdn.net/lilyjoke/article/details/122284732

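Problem background

A business-side colleague reported an incident: the same Spark job, with an identical data source, identical code, and identical client-side submit configuration, ran for several hours on its first attempt without producing results and was killed. The second run, started after that failure, produced results in 18 minutes. I needed to analyze the cause and provide a solution so that similar problems do not recur.

Note: this record is only a walkthrough of the problem and does not involve any business information.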

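Problem analysis

Comparing the Spark History details and the logs of the two runs showed that, with the same execution plan, the Stage in the failed run had a parallelism of 10, while the successful run had a parallelism of 500.

The driver log shows that before Job 0 started, only 4 executors had been received: one was used to run the driver, and the remaining three had 4, 4, and 2 cores, so only 10 concurrent task slots were available in total. As a result, Stage 0 of Job 0 started by creating just 10 tasks. The queue at that time had a pending:available ratio of 8:1, that is, 8 units of resources were waiting for every 1 unit available, so resources were very tight.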
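The slot arithmetic above can be sketched in a few lines of plain Python. This is only an illustration, not Spark code: the function names `task_slots` and `default_parallelism` are hypothetical, and the sketch assumes `spark.task.cpus` keeps its default of 1 and that `spark.default.parallelism` is not set explicitly, in which case (to my understanding) the YARN scheduler backend falls back to the larger of the total registered cores and 2.

```python
def task_slots(executor_cores, task_cpus=1):
    """Tasks that can run at once: each executor offers cores // task_cpus slots."""
    return sum(cores // task_cpus for cores in executor_cores)


def default_parallelism(executor_cores):
    """Assumed fallback when spark.default.parallelism is unset: max(total cores, 2)."""
    return max(sum(executor_cores), 2)


# Cores of the three executors that had registered before Job 0 (from the post).
registered = [4, 4, 2]
# What the client configuration asked for: 150 executors with 8 cores each.
requested = [8] * 150

slots_at_start = task_slots(registered)
slots_requested = task_slots(requested)
parallelism_at_start = default_parallelism(registered)

print(slots_at_start, slots_requested, parallelism_at_start)
```

With only three executors registered, the job has 10 slots against the 1200 that the requested configuration would eventually provide, which is consistent with a stage planned while the cluster is that small being sized for 10 tasks.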

21/12/30 04:10:07 INFO Utils: Using initial executors = 150, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances 
...
21/12/30 04:10:07 INFO YarnAllocator: Will request 150 executor container(s), each with 8 core(s) and 11264 MB memory (including 1024 MB of overhead) // per the client configuration: 150 executors, each with 8 cores and 10 GB of memory
21/12/30 04:10:07 INFO YarnAllocator: Submitted 150 unlocalized container requests.
21/12/30 04:10:07 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
21/12/30 04:10:34 INFO YarnAllocator: Launching container container_e188_1639619541647_741615_01_000007 on host fs-hiido-dn-12-9-224.hiido.host.yydevops.com for ex