Why gen_tcp sending processes get suspended: cause analysis and countermeasures

Latest recommended article published on 2024-10-08 22:25:08

孙飞 Sunface

Licensed under CC 4.0 BY-SA

Tags: gen_tcp

Article link: https://blog.csdn.net/erlib/article/details/9046653

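This article analyzes how gen_tcp:send and port_command can suspend the calling process when handling a large number of TCP connections. When the queued data exceeds the buffer's high watermark, the process is suspended and performance degrades. The remedies include forcing the data through with the force flag, setting send_timeout, and avoiding having a single process own too many ports, so as to prevent excessive memory use and process-scheduling problems.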


Recently a reader asked me over Gmail about gen_tcp sending processes being suspended. The problem was described very well, as follows:

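The first question is about port_command and gen_tcp:send. Since the project went live, I have run into problems twice in the TCP sending code, and both were related to port_command.

At first the program performed poorly. I tried to analyze and optimize it from every angle, partly by guesswork, and at that time I switched the server-wide broadcast code to port_command, following the hotwheels code and one of your related blog posts.

According to your analysis, port_command should be more efficient than calling gen_tcp:send directly, and it should not block. Yet this is exactly where I hit blocking. The symptoms are described below (it happened twice, at different stages of the project, so I describe them separately).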

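Early after launch:

At that time player processes sent messages to players with gen_tcp:send, while the broadcast process used port_command for efficiency. Once the number of active players reached a certain level, players could no longer enter the game. The cause turned out to be that the process sending global broadcasts was stuck, which was visible from its message_queue_len. After I changed it so that the broadcast process sends messages to the player processes and each player process then sends to its own player, the problem went away.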

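More recently:

By this time I had already switched TCP sending in the player processes to port_command, and it had run for a while without problems. But on some game servers with heavier traffic, once the number of active players reached a certain level, message latency became very high (5-6 seconds) and every action lagged. While this was happening, the server's CPU, memory, and load metrics were not abnormal, and operating the server over ssh was normal with no lag at all. Other game servers on the same machine were fine too, but the entire Erlang node of the affected game server went into a "very sluggish" state: when I worked in its Erlang shell, typed text appeared with a long delay.

At first I did not suspect port_command, so I looked everywhere for the cause and "optimized" the code, with "optimized" in quotation marks.

In the end, on one occasion when the server was again very sluggish, I changed the TCP sending code back to gen_tcp:send and hot-loaded the relevant modules, and the server recovered immediately.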

I have been puzzling over this without finding an answer. The code I had written is as follows:

tcp_send(Socket, Bin) ->
    try erlang:port_command(Socket, Bin, [force, nosuspend]) of
        false ->
            exit({game_tcp_send_error, busy});
        true ->
            true
    catch
        error:Error ->
            exit({game_tcp_send_error, {error, einval, Error}})
    end.

I hope you can help analyze what caused the whole Erlang node to become sluggish. I think this would help other Erlang programmers as well!

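I wrote an article on this topic earlier that systematically describes the behavior of gen_tcp: "An in-depth dissection and usage guide of gen_tcp:send (first draft)".

Let us analyze the problem further, starting with the code of gen_tcp:send:

%% gen_tcp.erl:L235
send(S, Packet) when is_port(S) ->
    case inet_db:lookup_socket(S) of
        {ok, Mod} ->
            Mod:send(S, Packet);
        Error ->
            Error
    end.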

      
%% inet_tcp.erl:L50
%%
%% Send data on a socket
%%
send(Socket, Packet, Opts) -> prim_inet:send(Socket, Packet, Opts).
send(Socket, Packet) -> prim_inet:send(Socket, Packet, []).

%% prim_inet.erl:L349
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% SEND(insock(), Data) -> ok | {error, Reason}
%%
%% send Data on the socket (io-list)
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% This is a generic "port_command" interface used by TCP, UDP, SCTP, depending
%% on the driver it is mapped to, and the "Data". It actually sends out data,--
%% NOT delegating this task to any back-end.  For SCTP, this function MUST NOT
%% be called directly -- use "sendmsg" instead:
%%
send(S, Data, OptList) when is_port(S), is_list(OptList) ->
    ?DBG_FORMAT("prim_inet:send(~p, ~p)~n", [S,Data]),
    try erlang:port_command(S, Data, OptList) of
        false -> % Port busy and nosuspend option passed
            ?DBG_FORMAT("prim_inet:send() -> {error,busy}~n", []),
            {error,busy};
        true ->
            receive
                {inet_reply,S,Status} ->
                    ?DBG_FORMAT("prim_inet:send() -> ~p~n", [Status]),
                    Status
            end
    catch
        error:_Error ->
            ?DBG_FORMAT("prim_inet:send() -> {error,einval}~n", []),
            {error,einval}
    end.

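We can see that gen_tcp:send consists of two steps: 1. port_command submits the data; 2. the caller waits for the {inet_reply,S,Status} response. This is a typical blocking operation: while waiting, the process is scheduled out.
So if the system has a large number of TCP connections to send data on, this approach is somewhat inefficient. Many systems therefore change it to submit the data in one place and wait for the responses in one place.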
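
The "submit first, collect replies later" idea can be sketched roughly as follows. This is a minimal illustration, not code from any of the projects mentioned here: the module name async_send_sketch and the functions submit_all/2, collect_replies/2, and demo/1 are made up for this example, and it assumes the default port-based inet backend, where a socket is a port and the driver answers each port_command with an {inet_reply, Sock, Status} message. Note that port_command/2 without options can still suspend the caller if the port is busy.

```erlang
-module(async_send_sketch).
-export([demo/1, submit_all/2, collect_replies/2]).

%% Step 1: hand every packet to the port without waiting for a reply.
%% Returns the number of commands submitted.
submit_all(Sock, Packets) ->
    lists:foldl(
      fun(Packet, Count) ->
              true = erlang:port_command(Sock, Packet),
              Count + 1
      end, 0, Packets).

%% Step 2: drain the {inet_reply, Sock, Status} messages in one place.
%% Stops at the first non-ok status and returns it unchanged.
collect_replies(_Sock, 0) ->
    ok;
collect_replies(Sock, Pending) ->
    receive
        {inet_reply, Sock, ok} ->
            collect_replies(Sock, Pending - 1);
        {inet_reply, Sock, Status} ->
            Status
    after 5000 ->
            {error, timeout}
    end.

%% Loopback demo: send N small packets, then collect N replies.
demo(N) ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port,
                                   [binary, {active, false}]),
    {ok, Server} = gen_tcp:accept(Listen, 5000),
    Packets = [<<"msg", (integer_to_binary(I))/binary>>
               || I <- lists:seq(1, N)],
    Submitted = submit_all(Client, Packets),
    Result = collect_replies(Client, Submitted),
    lists:foreach(fun gen_tcp:close/1, [Client, Server, Listen]),
    Result.
```

Calling async_send_sketch:demo(5) from the shell exercises both steps over a local connection. The point of the split is that the sender is never parked in a receive between two submissions.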

A typical example is rabbitmq:

      
%% rabbit_writer.erl
...
handle_message({inet_reply, _, ok}, State) ->
    State;
handle_message({inet_reply, _, Status}, _State) ->
    exit({writer, send_failed, Status});
handle_message(shutdown, _State) ->
    exit(normal);
...
internal_send_command_async(Sock, Channel, MethodRecord, Content, FrameMax) ->
    true = port_cmd(Sock, assemble_frames(Channel, MethodRecord,
                                          Content, FrameMax)),
    ok.

port_cmd(Sock, Data) ->
    try rabbit_net:port_command(Sock, Data)
    catch error:Error -> exit({writer, send_failed, Error})
    end.

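Its approach is to use one process to send the data and receive the responses centrally. Under normal conditions this greatly reduces process-switching overhead and waiting time. But it also introduces a problem: if port_command blocks unexpectedly, message sending for the whole system gets stuck, whereas previously, with each handler process calling gen_tcp:send itself, only individual processes would be blocked.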

Let us look carefully at the port_command documentation:

port_command(Port, Data, OptionList) -> true | false

Types:

    Port = port() | atom()
    Data = iodata()
    OptionList = [Option]
    Option = force | nosuspend

Sends data to a port. port_command(Port, Data, []) equals port_command(Port, Data).

If the port command is aborted false is returned; otherwise, true is returned.

If the port is busy, the calling process will be suspended until the port is not busy anymore.

Currently the following Options are valid:

force
    The calling process will not be suspended if the port is busy; instead, the port command is forced through. The call will fail with a notsup exception if the driver of the port does not support this. For more information see the ERL_DRV_FLAG_SOFT_BUSY driver flag.
nosuspend
    The calling process will not be suspended if the port is busy; instead, the port command is aborted and false is returned.

Note
    More options may be added in the future.

Failures:

badarg
    If Port is not an open port or the registered name of an open port.
badarg
    If Data is not a valid io list.
badarg
    If OptionList is not a valid option list.
notsup
    If the force option has been passed, but the driver of the port does not allow forcing through a busy port.
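
To make the return values in the documentation above concrete, here is a small sketch of a non-suspending send wrapper. The module nosuspend_send_sketch and the function try_send/2 are hypothetical names used only for illustration. The sketch maps the documented outcomes of port_command/3 with nosuspend onto tagged tuples, in the same way prim_inet:send does, but without waiting for the reply.

```erlang
-module(nosuspend_send_sketch).
-export([try_send/2]).

%% Hypothetical helper: never suspends the caller on a busy port.
%% Returns ok | {error, busy} | {error, einval}.
try_send(Sock, Data) ->
    try erlang:port_command(Sock, Data, [nosuspend]) of
        false ->
            %% Port busy: the command was aborted, nothing was queued.
            {error, busy};
        true ->
            %% Command accepted. For an inet socket the driver will
            %% later send {inet_reply, Sock, Status} to this process,
            %% which the caller still has to consume.
            ok
    catch
        error:_ ->
            %% For example badarg when Sock is not an open port.
            {error, einval}
    end.
```

With this shape the caller decides what to do on {error, busy}, for example dropping the packet, queueing it in its own state, or closing a slow connection, instead of being suspended inside the call.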

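Calling port_command can cause the calling process to be suspended. Under what conditions? For performance reasons, inet enables a send buffer in the gen_tcp driver port, and when the data we queue exceeds the buffer's high watermark, the process is suspended by default.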
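
The watermarks and the send timeout mentioned here are ordinary socket options, so they can be inspected and adjusted through inet:getopts/2 and inet:setopts/2. The following is a minimal sketch under stated assumptions: the module name watermark_sketch, the function tune/1, and the concrete sizes and timeout are made-up illustration values, not recommendations from the original article.

```erlang
-module(watermark_sketch).
-export([tune/1]).

%% Reads the current watermarks of a TCP socket, then raises them and
%% sets a send timeout so a stuck peer cannot block a sender forever.
%% The numbers below are arbitrary example values.
tune(Sock) ->
    {ok, Before} = inet:getopts(Sock, [high_watermark, low_watermark]),
    ok = inet:setopts(Sock, [{high_watermark, 131072},
                             {low_watermark, 65536},
                             {send_timeout, 5000},
                             {send_timeout_close, true}]),
    {ok, After} = inet:getopts(Sock, [high_watermark, low_watermark,
                                      send_timeout]),
    {Before, After}.
```

Raising the high watermark lets more data sit in the driver queue before senders are suspended, at the cost of more memory per connection, so it trades one of the problems discussed in this article for the other.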

So what are the high and low watermarks of the send buffer? Let us look at the code:

      
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% SETOPT(insock(), Opt, Value) -> ok | {error, Reason}
%% SETOPTS(insock(), [{Opt,Value}]) -> ok | {error, Reason}
      %%                                                                      &nb