Spark ERROR: org.apache.spark.shuffle.FetchFailedException troubleshooting

This post records a high task-failure-rate problem, in particular FetchFailedException errors, encountered after upgrading to the Spark 2.2.1 thriftserver service. A close read of the error logs showed the cause was not resource contention but a network connection failure: connections to a specific IP address and port failed with an IOException. The fix involves checking network connectivity, the shuffle service configuration, and inter-node communication in the cluster.


#### Background
Since we upgraded and deployed the Spark 2.2.1 thriftserver service over the past two days, I have been watching the service closely, especially failed jobs. Today I found that tasks on one machine had an unusually high failure rate, all reporting FetchFailedException. In the past I would have blamed resource contention: an executor gets killed for running out of memory, so block fetches from it fail. A closer look today revealed a different cause.

Here is the investigation process:

#### 1) First, the Spark web UI shows the error:

```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at ...
```


#### 2) Then open the stderr page and look at the task log:
![stderr log screenshot](https://img-blog.csdn.net/20171211115653128?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQveHdjMzUwNDc=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)

```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
```

The key point in the log: the connection timed out, and it still timed out after multiple retries.
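When a fetch fails like this, a quick first check is whether the shuffle port is even reachable from the host whose tasks are failing. A minimal sketch (the host and port below come from the log above; the helper name is my own):

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Try a TCP connection to host:port; True if it succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timed out, unresolvable host
        return False

# Run from the failing executor's host, e.g.:
# port_reachable("hadoop972.bx.com", 15733)
```

A `False` here distinguishes a genuine connectivity or timeout problem from a shuffle-service-level error on an otherwise reachable port.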

#### 3) Log in to hadoop244.bx and use top to inspect per-process resource usage
![top output screenshot](https://imgconvert.csdnimg.cn/aHR0cDovL2ltZy5ibG9nLmNzZG4ubmV0LzIwMTcxMjExMTE0OTE2ODQz?x-oss-process=image/format,png)
A process with PID 95479 had been using over 100% CPU continuously.

#### 4) Then check the Java processes with jps
![jps output screenshot](https://imgconvert.csdnimg.cn/aHR0cDovL2ltZy5ibG9nLmNzZG4ubmV0LzIwMTcxMjExMTE1MjQ1MzQy?x-oss-process=image/format,png)
It turned out that PID 95479 belonged to a Tez job. That job had been monopolizing CPU for a long time, so other processes (the Spark executor in this case) had to wait a long time for CPU time; their connections timed out and the Spark job failed.
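The two steps above (top to find the busy PID, then jps to identify it) can be sketched as one helper that lists the busiest processes by CPU; the suspect PID can then be matched against `jps -l` output. This assumes a Linux procps `ps` that supports `--sort`; the function name is mine:

```python
import subprocess

def busiest_processes(n=5):
    """Return (pid, %cpu, command) for the n processes using the most CPU."""
    out = subprocess.check_output(
        ["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"], text=True
    )
    rows = []
    for line in out.splitlines()[1 : n + 1]:  # skip the header row
        pid, pcpu, comm = line.split(None, 2)
        rows.append((int(pid), float(pcpu), comm))
    return rows

# for pid, cpu, cmd in busiest_processes():
#     print(pid, cpu, cmd)   # then: jps -l | grep <pid>
```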

**Summary: it is worth enabling cgroup-style resource limits in YARN so that a single process cannot monopolize resources for long periods and hurt other jobs; at the same time, increase the timeout values to make jobs more robust when compute or network conditions are poor.**
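For reference, the second mitigation maps to configuration roughly like this; treat the values as illustrative starting points rather than tuned recommendations:

```properties
# spark-defaults.conf (illustrative values)
spark.network.timeout        600s   # umbrella network timeout, default 120s
spark.shuffle.io.maxRetries  6      # retries per shuffle fetch, default 3
spark.shuffle.io.retryWait   30s    # wait between retries, default 5s
```

For the first mitigation, YARN's LinuxContainerExecutor can enforce CPU limits through cgroups, via `yarn.nodemanager.container-executor.class` and `yarn.nodemanager.linux-container-executor.resources-handler.class` (set to `CgroupsLCEResourcesHandler`) in yarn-site.xml.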

