#### Background
Over the past two days we upgraded and brought the Spark 2.2.1 Thrift Server online, so I have been watching the service closely, especially its failed jobs. Today I noticed that tasks on one machine were failing at an unusually high rate with FetchFailedException. In the past I would have blamed resource contention: an executor runs out of memory and dies, and the subsequent block fetches fail. A closer look today turned up a different cause.
Here is the investigation, step by step:
#### 1) The error shown in the Spark web UI:
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
at
#### 2) The task log on the executor's stderr page:

17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
The key point in the log is the connection timeout, and the fetch still times out after multiple retries.
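On the Spark side, the retry and timeout behavior of the block fetcher is configurable. A minimal sketch of the relevant settings, assuming they go into spark-defaults.conf (or are passed as --conf options when starting the Thrift Server); the values are illustrative, not tuned recommendations:

```
# Number of shuffle fetch retries and the wait between them (defaults: 3 and 5s)
spark.shuffle.io.maxRetries   6
spark.shuffle.io.retryWait    10s

# Fallback timeout for network interactions when no more specific timeout is set (default: 120s)
spark.network.timeout         300s
```

Raising these only buys more patience; it does not explain why the remote node stopped answering, which is what the next steps look at.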
#### 3) Log in to hadoop244.bx and check process resource usage with top

top shows a process with PID 95479 whose CPU usage stays above 100%.
#### 4) Check what the process is with jps

jps shows that PID 95479 belongs to a Tez job. This job had been occupying a large share of CPU for a long time, so other processes on the node (the Spark executor in this case) had to wait a long time to get scheduled, which produced the connection timeouts and ultimately the failed Spark tasks. The commands behind steps 3) and 4) are sketched below.
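A rough sketch of the shell commands used in steps 3) and 4); PID 95479 is the one from this incident, and the exact top options vary with the procps version installed on the node:

```
top                                  # press P to sort by CPU and note the PID that stays above 100%
jps -lvm | grep 95479                # map the JVM PID to its main class and container arguments
ps -o etime=,pcpu=,args= -p 95479    # how long the process has been running and its average CPU usage
```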
**Summary: YARN should be given cgroup-style resource limits so that a single container cannot monopolize CPU for long periods and starve other jobs on the same node; at the same time, the shuffle/network timeouts should be increased so the service is more robust when the compute or network environment is in bad shape.**
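For the first point, a minimal sketch of CPU isolation via cgroups in yarn-site.xml, assuming the NodeManagers run the LinuxContainerExecutor; the property names come from the Hadoop 2.x documentation and the 80% figure is only an example:

```xml
<!-- Launch containers through the LinuxContainerExecutor so cgroups can be applied -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<!-- Cap the total CPU available to YARN containers on the node, e.g. 80% of the physical cores -->
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>80</value>
</property>
<!-- Hard-limit each container to its allocated vcores instead of letting it grab idle CPU -->
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

The LinuxContainerExecutor also requires the container-executor binary and yarn.nodemanager.linux-container-executor.group to be set up on each node, which is omitted here.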
