[Computer Networks] How TCP Congestion Control Is Implemented in the Linux Kernel

Article link: https://blog.csdn.net/unoreason/article/details/145353421

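TCP congestion control keeps a sender from injecting too much data into the network, which avoids congestion and improves transfer efficiency. Based on the Linux 5.19.0 kernel source, this article walks through how TCP congestion control is implemented in the kernel. We first look at how TCP performs congestion control from a high-level view, and then examine the details.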

1. What TCP does after receiving an ACK

TCP infers network congestion from the feedback carried in ACKs. So what happens between the arrival of an ACK and the congestion control decision?

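The kernel first calls tcp_ack to process the ACK. The congestion-control-related functions called from tcp_ack are mainly tcp_ack_is_dubious, tcp_fastretrans_alert and tcp_cong_control. tcp_ack_is_dubious is called first and, as its name suggests, checks whether the ACK is "dubious". A dubious ACK indicates that the network may not be in good shape and that packets may have been lost. In that case the kernel calls tcp_fastretrans_alert, which mainly detects whether loss has occurred and updates the loss counters in the tcp structure for later use by congestion control. Finally, the kernel calls tcp_cong_control to perform the actual congestion control.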
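The ordering of these three steps can be modeled in a few lines of user-space C. This is only a sketch of the control flow described above: the struct conn_model type and the model_* functions are hypothetical names invented for illustration, and the real tcp_ack has additional early-exit paths that are not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-connection state; not a kernel type. */
struct conn_model {
	bool ca_open;   /* true while the congestion state is Open */
	int  alerts;    /* how many times the loss-detection step ran */
	int  cc_runs;   /* how many times the congestion-control step ran */
};

/* Models tcp_ack_is_dubious(): duplicate ACK, alert flag, or non-Open state. */
static bool model_ack_is_dubious(const struct conn_model *c,
				 bool is_dup, bool ca_alert)
{
	return is_dup || ca_alert || !c->ca_open;
}

/* Models the order of the steps on the normal path through tcp_ack(). */
static void model_process_ack(struct conn_model *c, bool is_dup, bool ca_alert)
{
	assert(c);
	if (model_ack_is_dubious(c, is_dup, ca_alert))
		c->alerts++;    /* stands in for tcp_fastretrans_alert() */
	c->cc_runs++;           /* stands in for tcp_cong_control() */
}
```

In this model, an ordinary ACK in the Open state only reaches the congestion-control step, while a duplicate ACK runs the loss-detection step first.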

That is the overall framework of congestion control in the kernel. The following sections go through the process function by function.

2. How TCP detects network congestion

As mentioned above, the kernel first decides whether the ACK is "dubious". The check looks like this:

/net/ipv4/tcp_input.c

static inline bool tcp_ack_is_dubious(const struct sock *sk, const int flag)
{
	return !(flag & FLAG_NOT_DUP) || (flag & FLAG_CA_ALERT) ||
		inet_csk(sk)->icsk_ca_state != TCP_CA_Open;
}

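Put simply, a "dubious ACK" is one of the following:
(1) A duplicate ACK, i.e. one for which none of the FLAG_NOT_DUP bits is set (no data, no window update, nothing newly acknowledged)
(2) An ACK flagged with FLAG_CA_ALERT (for example, one carrying new SACK information or an ECN echo)
(3) Any ACK received while TCP is in a congestion state other than Open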
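To make the three cases concrete, here is a small user-space model of the check. It is a sketch under stated assumptions: the M_-prefixed names are hypothetical, the bit values are illustrative placeholders (the authoritative definitions are in net/ipv4/tcp_input.c), and the alert mask is reduced to two bits for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values for this sketch only; check tcp_input.c for the real ones. */
enum {
	M_FLAG_DATA        = 0x01,  /* ACK arrived with payload */
	M_FLAG_WIN_UPDATE  = 0x02,  /* ACK changed the advertised window */
	M_FLAG_DATA_ACKED  = 0x04,  /* ACK acknowledged new data */
	M_FLAG_DATA_SACKED = 0x20,  /* ACK carried new SACK information */
	M_FLAG_ECE         = 0x40,  /* ACK carried an ECN echo */
};
#define M_FLAG_NOT_DUP  (M_FLAG_DATA | M_FLAG_WIN_UPDATE | M_FLAG_DATA_ACKED)
#define M_FLAG_CA_ALERT (M_FLAG_DATA_SACKED | M_FLAG_ECE)

/* Hypothetical mirror of the congestion states. */
enum m_ca_state { M_CA_OPEN, M_CA_DISORDER, M_CA_CWR, M_CA_RECOVERY, M_CA_LOSS };

/* Same boolean structure as tcp_ack_is_dubious(), on plain values. */
static bool m_ack_is_dubious(enum m_ca_state state, int flag)
{
	return !(flag & M_FLAG_NOT_DUP) || (flag & M_FLAG_CA_ALERT) ||
	       state != M_CA_OPEN;
}

/* Usage: one check per case in the list above, plus the non-dubious case. */
static void m_dubious_examples(void)
{
	assert(m_ack_is_dubious(M_CA_OPEN, 0));                           /* (1) duplicate */
	assert(m_ack_is_dubious(M_CA_OPEN,
				M_FLAG_DATA_ACKED | M_FLAG_DATA_SACKED));  /* (2) alert */
	assert(m_ack_is_dubious(M_CA_RECOVERY, M_FLAG_DATA_ACKED));       /* (3) not Open */
	assert(!m_ack_is_dubious(M_CA_OPEN, M_FLAG_DATA_ACKED));          /* ordinary ACK */
}
```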

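If the received ACK is dubious, tcp_fastretrans_alert is called to signal that packets may have been lost. It uses a loss detection algorithm (RACK, for example) to determine whether loss has occurred. If loss is detected, it updates the TCP loss counters and moves the TCP congestion state to Recovery.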

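RACK is a time-based loss detection algorithm. Every time the sender receives an ACK, it updates the RACK time window. A simple way to think about it: since the current skb has been acknowledged, any skb sent more than one RACK time window before it should already have been acknowledged as well. If such an skb is still unacknowledged, it is marked as lost and the TCP loss counter, tp->lost_out, is updated.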
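The time-based idea can be sketched as follows. This is a simplified illustration, not the kernel's RACK code: struct m_seg and m_rack_detect_loss are hypothetical names, the retransmit queue is modeled as a plain array, and the rule used is "sent before the most recently delivered segment, and at least RTT plus the reordering window has elapsed since it was sent".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one segment in the retransmit queue. */
struct m_seg {
	uint64_t xmit_us;  /* time the segment was last sent, in microseconds */
	bool     acked;    /* already cumulatively ACKed or SACKed */
	bool     lost;     /* marked lost by this sketch */
};

/*
 * Marks unacked segments as lost when they were sent before the most
 * recently delivered segment (rack_xmit_us) and at least
 * rtt_us + reo_wnd_us has elapsed since they were sent.
 * Assumes now_us is not earlier than any xmit_us.
 * Returns the number of segments newly marked lost.
 */
static unsigned int m_rack_detect_loss(struct m_seg *segs, size_t n,
				       uint64_t now_us, uint64_t rack_xmit_us,
				       uint64_t rtt_us, uint64_t reo_wnd_us)
{
	unsigned int newly_lost = 0;

	assert(segs || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct m_seg *s = &segs[i];

		if (s->acked || s->lost)
			continue;
		if (s->xmit_us >= rack_xmit_us)
			continue;       /* sent later: no evidence of loss yet */
		if (now_us - s->xmit_us >= rtt_us + reo_wnd_us) {
			s->lost = true; /* analogous to incrementing tp->lost_out */
			newly_lost++;
		}
	}
	return newly_lost;
}
```

The reordering window is what separates this from a pure "anything older is lost" rule: it gives reordered segments a grace period before they are declared lost.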

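After loss detection, the kernel calls tcp_time_to_recover to decide, based on whether loss was found, if the TCP congestion state should switch to Recovery. Note that this function only decides whether it is time to enter the recovery phase. It does not perform the actual cwnd reduction or retransmission.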

/net/ipv4/tcp_input.c

/* Really tricky (and requiring careful tuning) part of algorithm
 * is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
 * The first determines the moment _when_ we should reduce CWND and,
 * hence, slow down forward transmission. In fact, it determines the moment
 * when we decide that hole is caused by loss, rather than by a reorder.
 *
 * tcp_xmit_retransmit_queue() decides, _what_ we should retransmit to fill
 * holes, caused by lost packets.
 */
static bool tcp_time_to_recover(struct sock *sk, int flag)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* Trick#1: The loss is proven. */
	if (tp->lost_out)
		return true;

	/* Not-A-Trick#2 : Classic rule... */
	if (!tcp_is_rack(sk) && tcp_dupack_heuristics(tp) > tp->reordering)
		return true;

	return false;
}

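As we can see, if tp->lost_out is non-zero (loss has already been detected), the function returns true, meaning that the TCP congestion state should now switch to Recovery. The second rule applies only when RACK is not in use: it returns true once the duplicate-ACK heuristic exceeds tp->reordering. When tcp_time_to_recover returns true, tcp_fastretrans_alert goes on to call tcp_enter_recovery: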

/net/ipv4/tcp_input.c

void tcp_enter_recovery(struct sock *sk, bool ece_ack)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int mib_idx;

	if (tcp_is_reno(tp))
		mib_idx = LINUX_MIB_TCPRENORECOVERY;
	else
		mib_idx = LINUX_MIB_TCPSACKRECOVERY;

	NET_INC_STATS(sock_net(sk), mib_idx);

	tp->prior_ssthresh = 0;
	tcp_init_undo(tp);

	if (!tcp_in_cwnd_reduction(sk)) {
		if (!ece_ack)
			tp->prior_ssthresh = tcp_current_ssthresh(sk);
		tcp_init_cwnd_reduction(sk);
	}
	tcp_set_ca_state(sk, TCP_CA_Recovery);
}

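tcp_enter_recovery first bumps the matching MIB counter, then saves and initializes the state needed for a possible undo (tcp_init_undo and prior_ssthresh). If a cwnd reduction is not already in progress, it starts one with tcp_init_cwnd_reduction. Finally it calls tcp_set_ca_state to switch the TCP congestion state to Recovery.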

3. Avoiding network congestion with a congestion control algorithm

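As noted earlier, the kernel handles loss detection and congestion control as separate steps. So far we have covered the functions involved in detecting congestion. Next we look at how the kernel avoids it.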

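Once the number of lost packets and the delivery rate sample have been computed (the latter is used by bandwidth-estimation-based algorithms such as BBR), the kernel starts the congestion control step proper.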

/net/ipv4/tcp_input.c

delivered = tcp_newly_delivered(sk, delivered, flag);
lost = tp->lost - lost;			/* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);

The kernel adjusts the congestion window or the sending rate mainly in tcp_cong_control:

/net/ipv4/tcp_input.c

/* The "ultimate" congestion control function that aims to replace the rigid
 * cwnd increase and decrease control (tcp_cong_avoid,tcp_*cwnd_reduction).
 * It's called toward the end of processing an ACK with precise rate
 * information. All transmission or retransmission are delayed afterwards.
 */
static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
			     int flag, const struct rate_sample *rs)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);

	if (icsk->icsk_ca_ops->cong_control) {
		icsk->icsk_ca_ops->cong_control(sk, rs);
		return;
	}

	if (tcp_in_cwnd_reduction(sk)) {
		/* Reduce cwnd if state mandates */
		tcp_cwnd_reduction(sk, acked_sacked, rs->losses, flag);
	} else if (tcp_may_raise_cwnd(sk, flag)) {
		/* Advance cwnd if state allows */
		tcp_cong_avoid(sk, ack, acked_sacked);
	}
	tcp_update_pacing_rate(sk);
}

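If the congestion control algorithm in use on the socket has registered a cong_control callback, the kernel hands congestion control over to that callback entirely. Otherwise there are two cases. While a cwnd reduction is in progress, the kernel shrinks the congestion window with the PRR algorithm (tcp_cwnd_reduction). When no reduction is in progress and the state allows the window to grow, it calls tcp_cong_avoid, which invokes the algorithm's cong_avoid hook. With the kernel's default algorithm, CUBIC, that hook is cubictcp_cong_avoid. In both of these cases the pacing rate is then refreshed by tcp_update_pacing_rate.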
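The dispatch logic is a plain function-pointer pattern, which the following user-space sketch tries to capture. All names here (struct m_conn, struct m_ca_ops, the m_* functions) are hypothetical, the PRR step is replaced by a trivial placeholder, and the tcp_may_raise_cwnd check and pacing update are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection state for this sketch. */
struct m_conn {
	unsigned int cwnd;
	bool in_cwnd_reduction;  /* models tcp_in_cwnd_reduction() */
};

/* Hypothetical, trimmed-down analogue of struct tcp_congestion_ops. */
struct m_ca_ops {
	void (*cong_control)(struct m_conn *c);                   /* optional full takeover */
	void (*cong_avoid)(struct m_conn *c, unsigned int acked); /* window growth hook */
};

/* Placeholder for the PRR computation: shrink by one segment per ACK. */
static void m_cwnd_reduction(struct m_conn *c)
{
	if (c->cwnd > 1)
		c->cwnd--;
}

/* Same branching order as tcp_cong_control(), minus pacing and cwnd checks. */
static void m_cong_control(struct m_conn *c, const struct m_ca_ops *ops,
			   unsigned int acked)
{
	assert(c && ops);
	if (ops->cong_control) {
		ops->cong_control(c);   /* algorithm takes over completely */
		return;
	}
	if (c->in_cwnd_reduction)
		m_cwnd_reduction(c);
	else if (ops->cong_avoid)
		ops->cong_avoid(c, acked);
}

/* Two toy algorithms: one sets cwnd itself, one grows it per ACKed segment. */
static void m_rate_based_main(struct m_conn *c) { c->cwnd = 42; }
static void m_additive_increase(struct m_conn *c, unsigned int acked) { c->cwnd += acked; }
```

An ops table that sets cong_control bypasses both the reduction and the growth branch, which is why such an algorithm has to manage the window in every state by itself.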

For an algorithm that registers cong_control (taking BBR as the example):

/net/ipv4/tcp_bbr.c

static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
	.flags		= TCP_CONG_NON_RESTRICTED,
	.name		= "bbr",
	.owner		= THIS_MODULE,
	.init		= bbr_init,
	.cong_control	= bbr_main,
	.sndbuf_expand	= bbr_sndbuf_expand,
	.undo_cwnd	= bbr_undo_cwnd,
	.cwnd_event	= bbr_cwnd_event,
	.ssthresh	= bbr_ssthresh,
	.min_tso_segs	= bbr_min_tso_segs,
	.get_info	= bbr_get_info,
	.set_state	= bbr_set_state,
};

/net/ipv4/tcp_bbr.c

static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
	struct bbr *bbr = inet_csk_ca(sk);
	u32 bw;

	bbr_update_model(sk, rs);

	bw = bbr_bw(sk);
	bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
	bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}

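As we can see, BBR sets the cong_control field, so the kernel's congestion control step is taken over entirely by BBR, which uses bbr_main to avoid congestion. CUBIC, on the other hand, does not set cong_control. For CUBIC, while a cwnd reduction is in progress the kernel uses PRR to set the congestion window to the number of packets currently in flight plus a small per-ACK send quota, so that the window moves toward ssthresh. Outside of a reduction, CUBIC's cong_avoid hook (cubictcp_cong_avoid) lets the congestion window converge again.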

/net/ipv4/tcp_cubic.c

static struct tcp_congestion_ops cubictcp __read_mostly = {
	.init		= cubictcp_init,
	.ssthresh	= cubictcp_recalc_ssthresh,
	.cong_avoid	= cubictcp_cong_avoid,
	.set_state	= cubictcp_state,
	.undo_cwnd	= tcp_reno_undo_cwnd,
	.cwnd_event	= cubictcp_cwnd_event,
	.pkts_acked     = cubictcp_acked,
	.owner		= THIS_MODULE,
	.name		= "cubic",
};
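The PRR step that applies to CUBIC during a reduction can be sketched in user space as below. This is a simplified model loosely following the structure of tcp_cwnd_reduction, not a copy of it: struct m_prr and m_prr_on_ack are hypothetical names, in_flight is passed in by the caller, and the branch for "in flight already at or below ssthresh" keeps only the conservative variant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the per-connection fields that PRR needs. */
struct m_prr {
	uint32_t ssthresh;       /* target window after the reduction */
	uint32_t prior_cwnd;     /* cwnd when the reduction started */
	uint32_t prr_delivered;  /* segments delivered since the reduction started */
	uint32_t prr_out;        /* segments sent since the reduction started */
	uint32_t cwnd;           /* resulting congestion window */
};

/* Called once per ACK during the reduction; updates p->cwnd. */
static void m_prr_on_ack(struct m_prr *p, uint32_t in_flight,
			 uint32_t newly_delivered)
{
	int64_t delta = (int64_t)p->ssthresh - in_flight;
	int64_t floor_cnt, sndcnt;

	assert(p->prior_cwnd > 0);
	if (newly_delivered == 0)
		return;

	p->prr_delivered += newly_delivered;
	if (delta < 0) {
		/* Proportional part: ceil(ssthresh * delivered / prior_cwnd) - sent */
		uint64_t dividend = (uint64_t)p->ssthresh * p->prr_delivered +
				    p->prior_cwnd - 1;
		sndcnt = (int64_t)(dividend / p->prior_cwnd) - p->prr_out;
	} else {
		/* At or below ssthresh: send no more than was just delivered */
		sndcnt = delta < (int64_t)newly_delivered ? delta
							   : (int64_t)newly_delivered;
	}
	/* Allow at least one segment at the very start so a retransmit can go out */
	floor_cnt = p->prr_out ? 0 : 1;
	if (sndcnt < floor_cnt)
		sndcnt = floor_cnt;

	p->cwnd = in_flight + (uint32_t)sndcnt;
}
```

For example, with ssthresh 7, prior_cwnd 10 and 9 segments in flight, the first ACK that delivers one segment yields a send quota of 1, so cwnd becomes 10. Later ACKs release roughly 7 segments for every 10 delivered, which is what spreads the reduction over the recovery period instead of applying it in one step.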