This analysis is based on Linux kernel 4.19.195.

Software implementation principle

In software, implementing RCU boils down to a large state machine: recognize the start of a grace period (GP), detect quiescent states (QS), determine that the grace period has ended, handle the end-of-grace-period work, and start the next grace period. The Linux kernel mainly relies on whether a CPU can schedule at the tick to decide whether that CPU has left its read-side critical section (i.e., passed through a quiescent state), and it considers a grace period over once every CPU has passed through at least one quiescent state. After a grace period ends, deferred memory can be freed and the next grace period can be started. For the general principle, several other blog posts already cover it well.
Hardware prerequisite

At first I did not notice that the kernel's RCU implementation rests on a hardware prerequisite. To borrow a sentence from Wikipedia: "RCU-based updaters typically take advantage of the fact that writes to single aligned pointers are atomic on modern CPUs, allowing atomic insertion, removal, and replacement of data in a linked structure without disrupting readers."

Why is this prerequisite needed? Looking at the RCU version of the kernel's doubly linked list makes it obvious: if the writer's update were not atomic, a reader, which takes no lock, could observe a value that is neither the old version nor the new one, and the reader would be broken.

On modern 64-bit CPUs, writes to aligned variables of 8 bytes or less are generally atomic, so in principle any such variable could be protected by RCU. Yet we rarely (in fact, as far as I know, never; corrections welcome) see RCU protect a non-pointer variable. I see two reasons. First, a non-pointer variable is too small to carry much information. Second, whether data is still valid generally has to be indicated by a reference count, which is why so many RCU-protected data structures embed one: without it we cannot tell whether the data is current or stale, and when we obtain the data we also need to take a reference to keep its memory from being freed. A reference count alone occupies 4 bytes or more, so it is entirely natural that RCU protects data by way of pointers.
Of course, this hardware prerequisite also benefits some kernel code that merely displays data: code running on CPU 1 reads a percpu variable belonging to CPU 2 and prints it. Strictly speaking, using CPU 2's percpu variable from CPU 1 is improper, but if it is a read-only access in display code where precision does not matter, the worst outcome is reading a slightly stale value, with no serious consequences.

A brief look at the RCU implementation

The following uses Tree RCU, specifically rcu_sched, to illustrate the Linux kernel's RCU implementation.

Basic data structures
rcu_state describes the global state of one RCU flavor; the kernel defines one rcu_state instance per flavor.
rcu_node describes the RCU state of a group of processors; it is the organizational node of the Tree RCU hierarchy.
rcu_data describes the RCU state of a single processor, one instance per processor. Each CPU has its own view of the grace period it is currently going through, which is not necessarily globally consistent: a CPU may stop its tick, idle for a long time, or even go offline/online, all of which create divergence from other CPUs. Some CPUs may also experience longer grace periods, leaving their grace-period number behind other CPUs'.
rcu_segcblist is the data structure behind RCU's callback list. It divides the callback list into four segments; the exact meaning of each segment is described in the code comment below.
The relationships among these data structures can be referenced here.
struct rcu_state {
	/* fields omitted */
};
struct rcu_node {
	/* fields omitted */
};
struct rcu_data {
	/* fields omitted */
};
/* Complicated segmented callback lists. ;-) */
/*
* Index values for segments in rcu_segcblist structure.
*
* The segments are as follows:
*
* [head, *tails[RCU_DONE_TAIL]):
* Callbacks whose grace period has elapsed, and thus can be invoked.
* [*tails[RCU_DONE_TAIL], *tails[RCU_WAIT_TAIL]):
* Callbacks waiting for the current GP from the current CPU's viewpoint.
* [*tails[RCU_WAIT_TAIL], *tails[RCU_NEXT_READY_TAIL]):
* Callbacks that arrived before the next GP started, again from
* the current CPU's viewpoint. These can be handled by the next GP.
* [*tails[RCU_NEXT_READY_TAIL], *tails[RCU_NEXT_TAIL]):
* Callbacks that might have arrived after the next GP started.
* There is some uncertainty as to when a given GP starts and
* ends, but a CPU knows the exact times if it is the one starting
* or ending the GP. Other CPUs know that the previous GP ends
* before the next one starts.
*
* Note that RCU_WAIT_TAIL cannot be empty unless RCU_NEXT_READY_TAIL is also
* empty.
*
* The ->gp_seq[] array contains the grace-period number at which the
* corresponding segment of callbacks will be ready to invoke. A given
* element of this array is meaningful only when the corresponding segment
* is non-empty, and it is never valid for RCU_DONE_TAIL (whose callbacks
* are already ready to invoke) or for RCU_NEXT_TAIL (whose callbacks have
* not yet been assigned a grace-period number).
*/
#define RCU_DONE_TAIL 0 /* Also RCU_WAIT head. */
#define RCU_WAIT_TAIL 1 /* Also RCU_NEXT_READY head. */
#define RCU_NEXT_READY_TAIL 2 /* Also RCU_NEXT head. */
#define RCU_NEXT_TAIL 3
#define RCU_CBLIST_NSEGS 4
struct rcu_segcblist {
	struct rcu_head *head;
	struct rcu_head **tails[RCU_CBLIST_NSEGS];
	unsigned long gp_seq[RCU_CBLIST_NSEGS];
	long len;
	long len_lazy;
};
Background kthread

Each RCU flavor has a background kthread that starts and ends grace periods. This grace-period kthread is implemented by rcu_gp_kthread(), and it performs three actions, matching the three parts of the function body: 1) wait for and start a new grace period (GP); 2) wait for forced quiescent states with a timeout, where being woken early means every processor has passed through a quiescent state; 3) handle the end of the grace period.
The first part waits for the RCU_GP_FLAG_INIT flag in rsp->gp_flags. Once this flag is set, a new grace period needs to be started; the flag is set in rcu_gp_cleanup() and rcu_start_this_gp().
The second part waits for every CPU to pass through a quiescent state. If all CPUs do so before the timeout, execution proceeds smoothly to the third part; otherwise, once the deadline expires, the CPUs that have not yet passed through a quiescent state are forced to produce one.
The third part performs the cleanup work after the grace period has ended.
/*
 * Body of kthread that handles grace periods.
 */
static int __noreturn rcu_gp_kthread(void *arg)
{
	bool first_gp_fqs;
	int gf;
	unsigned long j;
	int ret;
	struct rcu_state *rsp = arg;
	struct rcu_node *rnp = rcu_get_root(rsp);

	rcu_bind_gp_kthread();
	for (;;) {

		/* Handle grace-period start. */
		for (;;) {
			/* this loop starts a new grace period */
			trace_rcu_grace_period(rsp->name,
					       READ_ONCE(rsp->gp_seq),
					       TPS("reqwait"));
			rsp->gp_state = RCU_GP_WAIT_GPS;
			swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
						     RCU_GP_FLAG_INIT); /* wait until a new GP is requested */
			rsp->gp_state = RCU_GP_DONE_GPS;
			/* Locking provides needed memory barrier. */
			if (rcu_gp_init(rsp)) /* start and initialize the new GP */
				break;
			cond_resched_tasks_rcu_qs();
			WRITE_ONCE(rsp->gp_activity, jiffies);
			WARN_ON(signal_pending(current));
			trace_rcu_grace_period(rsp->name,
					       READ_ONCE(rsp->gp_seq),
					       TPS("reqwaitsig"));
		}

		/* Handle quiescent-state forcing. */
		first_gp_fqs = true;
		j = jiffies_till_first_fqs;
		ret = 0;
		for (;;) {
			if (!ret) {
				rsp->jiffies_force_qs = jiffies + j;
				WRITE_ONCE(rsp->jiffies_kick_kthreads,
					   jiffies + 3 * j);
			}
			trace_rcu_grace_period(rsp->name,
					       READ_ONCE(rsp->gp_seq),
					       TPS("fqswait"));
			rsp->gp_state = RCU_GP_WAIT_FQS; /* waiting to force quiescent states */
			ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, /* wait for timeout or GP completion */
					rcu_gp_fqs_check_wake(rsp, &gf), j);
			rsp->gp_state = RCU_GP_DOING_FQS;
			/* Locking provides needed memory barriers. */
			/* If grace period done, leave loop. */
			if (!READ_ONCE(rnp->qsmask) &&
			    !rcu_preempt_blocked_readers_cgp(rnp))
				break;
			/* If time for quiescent-state forcing, do it. */
			if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
			    (gf & RCU_GP_FLAG_FQS)) {
				trace_rcu_grace_period(rsp->name,
						       READ_ONCE(rsp->gp_seq),
						       TPS("fqsstart"));
				rcu_gp_fqs(rsp, first_gp_fqs);
				first_gp_fqs = false;
				trace_rcu_grace_period(rsp->name,
						       READ_ONCE(rsp->gp_seq),
						       TPS("fqsend"));
				cond_resched_tasks_rcu_qs();
				WRITE_ONCE(rsp->gp_activity, jiffies);
				ret = 0; /* Force full wait till next FQS. */
				j = jiffies_till_next_fqs;
			} else {
				/* Deal with stray signal. */
				cond_resched_tasks_rcu_qs();
				WRITE_ONCE(rsp->gp_activity, jiffies);
				WARN_ON(signal_pending(current));
				trace_rcu_grace_period(rsp->name,
						       READ_ONCE(rsp->gp_seq),
						       TPS("fqswaitsig"));
				ret = 1; /* Keep old FQS timing. */
				j = jiffies;
				if (time_after(jiffies, rsp->jiffies_force_qs))
					j = 1;
				else
					j = rsp->jiffies_force_qs - j;
			}
		}

		/* Handle grace-period end. */
		rsp->gp_state = RCU_GP_CLEANUP;
		rcu_gp_cleanup(rsp);
		rsp->gp_state = RCU_GP_CLEANED;
	}
}
The first part is straightforward and will not be analyzed further.
The second part is not complicated either. The kthread first sleeps until rcu_gp_fqs_check_wake() evaluates to true, which happens in two cases: a quiescent state needs to be forced, or the current grace period has already completed. If no forcing is needed, the grace period can simply end. Otherwise the big loop of the second part keeps iterating and periodically (first after jiffies_till_first_fqs, then every jiffies_till_next_fqs) calls rcu_gp_fqs() to force the processors that have not yet passed through a quiescent state to experience one. On the first forcing pass of a grace period, rcu_gp_fqs() calls force_qs_rnp(rsp, dyntick_save_progress_counter); on later passes it calls force_qs_rnp(rsp, rcu_implicit_dynticks_qs). Let's look at this part of the implementation.
/*
 * Do one round of quiescent-state forcing.
 */
static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
{
	struct rcu_node *rnp = rcu_get_root(rsp);

	WRITE_ONCE(rsp->gp_activity, jiffies);
	rsp->n_force_qs++;
	if (first_time) {
		/* Collect dyntick-idle snapshots. */
		force_qs_rnp(rsp, dyntick_save_progress_counter);
	} else {
		/* Handle dyntick-idle and offline CPUs. */
		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
	}
	/* Clear flag to prevent immediate re-entry. */
	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
		raw_spin_lock_irq_rcu_node(rnp);
		WRITE_ONCE(rsp->gp_flags,
			   READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
		raw_spin_unlock_irq_rcu_node(rnp);
	}
}
/*
 * Scan the leaf rcu_node structures, processing dyntick state for any that
 * have not yet encountered a quiescent state, using the function specified.
 * Also initiate boosting for any threads blocked on the root rcu_node.
 *
 * The caller must have suppressed start of new grace periods.
 */
static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp))
{
	int cpu;
	unsigned long flags;
	unsigned long mask;
	struct rcu_node *rnp;

	rcu_for_each_leaf_node(rsp, rnp) {
		cond_resched_tasks_rcu_qs();
		mask = 0;
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		if (rnp->qsmask == 0) {
			if (rcu_state_p == &rcu_sched_state ||
			    rsp != rcu_state_p ||
			    rcu_preempt_blocked_readers_cgp(rnp)) {
				/*
				 * No point in scanning bits because they
				 * are all zero.  But we might need to
				 * priority-boost blocked readers.
				 */
				rcu_initiate_boost(rnp, flags);
				/* rcu_initiate_boost() releases rnp->lock */
				continue;
			}
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
			continue;
		}
		for_each_leaf_node_possible_cpu(rnp, cpu) {
			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);

			if (