torchvision Faster-RCNN ResNet-50 FPN Code Walkthrough (ROI)

This article walks through the ROI stage of the FPN-based Faster R-CNN detection pipeline, covering the ROI pooling, the Box Head, the Box Predictor, and the detection postprocessing, with particular attention to the FPN level mapping and the RoIAlign implementation.


Overall Architecture [1]

In the ROI stage, the 1000 proposal boxes selected by the RPN, together with the multi-level feature maps output by the FPN, go through ROI pooling; the object inside each box is classified, another round of box offset (delta) regression is performed to refine the proposal boxes, producing new scores, refined boxes, and labels, and finally non-maximum suppression (NMS) is run once more:
[figure: overall architecture of the ROI stage]
ROI processing based on FPN has a few more steps than classic Faster R-CNN and is somewhat more complex.
It consists of the following main steps (a minimal sketch of the whole flow follows the list):

  1. Box ROI Pool: based on the area of each of the 1000 proposal boxes, decide which feature-map level the ROI pooling is performed on
  2. Box Head: two fully connected layers that further process the 7x7 per-box features produced by RoIAlign
  3. Box Predictor: classifies the Box Head output and regresses the box position offsets (deltas) once more
  4. Postprocess Detection: applies Softmax for the final classification, merges the regressed box deltas into the proposal boxes to obtain the adjusted detection boxes, and finally runs non-maximum suppression (NMS) to filter out the valid detections (scores, boxes, and labels).
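
A minimal sketch of this flow, loosely modeled on torchvision's RoIHeads (names simplified; training branches and sanity checks omitted):

def roi_heads_forward(features, proposals, image_shapes,
                      box_roi_pool, box_head, box_predictor, postprocess):
    # features: dict of FPN level name -> feature map
    # proposals: List[Tensor[N, 4]], one tensor of proposal boxes per image
    box_features = box_roi_pool(features, proposals, image_shapes)  # RoIAlign per level
    box_features = box_head(box_features)                           # TwoMLPHead
    class_logits, box_regression = box_predictor(box_features)      # FastRCNNPredictor
    # decode deltas, softmax, score threshold, per-class NMS
    boxes, scores, labels = postprocess(class_logits, box_regression,
                                        proposals, image_shapes)
    return boxes, scores, labels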

Box ROI Pool

The model can process several images at once and detect the objects in each of them, so convert_to_roi_format first merges the proposal boxes of all images into a single tensor (tagging each box with its image index), so that the RoIAlign processing can be done in one batch.
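
A simplified sketch of what convert_to_roi_format does (close to, but not byte-for-byte identical with, the torchvision implementation): each box gets its image index prepended, producing the (batch_idx, x1, y1, x2, y2) layout that the RoIAlign op expects.

import torch
from typing import List
from torch import Tensor

def convert_to_roi_format(boxes: List[Tensor]) -> Tensor:
    # boxes: one (N_i, 4) tensor of proposal boxes per image
    concat_boxes = torch.cat(boxes, dim=0)
    # prepend the image index i to every box of image i
    ids = torch.cat(
        [torch.full_like(b[:, :1], i) for i, b in enumerate(boxes)], dim=0
    )
    # rois: (sum(N_i), 5) as (batch_idx, x1, y1, x2, y2)
    return torch.cat([ids, concat_boxes], dim=1)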

setup_scales then configures the mapper object for the first four output feature maps (the smallest, pooled level is not used). The mapper is a concept introduced with FPN: it computes the area of each proposal box and uses it to decide on which feature-map level the RoIAlign is performed; see the FPN paper:
$$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$$
The corresponding implementation is:

import torch
from torchvision.ops.boxes import box_area


class LevelMapper(object):
    """Determine which FPN level each RoI in a set of RoIs should map to based
    on the heuristic in the FPN paper.

    Arguments:
        k_min (int)
        k_max (int)
        canonical_scale (int)
        canonical_level (int)
        eps (float)
    """

    def __init__(self, k_min, k_max, canonical_scale=224, canonical_level=4, eps=1e-6):
        # type: (int, int, int, int, float) -> None
        self.k_min = k_min
        self.k_max = k_max
        self.s0 = canonical_scale
        self.lvl0 = canonical_level
        self.eps = eps

    def __call__(self, boxlists):
        # type: (List[Tensor]) -> Tensor
        """
        Arguments:
            boxlists (list[BoxList])
        """
        # Compute level ids
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))

        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)

For example, for a proposal box with width 100 and height 120, the formula above gives:
$$
\begin{aligned}
k &= \lfloor k_0 + \log_2(\sqrt{100\times120}/224) \rfloor\\
  &= \lfloor 4 + \log_2(109.5445/224) \rfloor\\
  &= \lfloor 2.9680183191660503 \rfloor = 2
\end{aligned}
$$
The table below shows how ResNet-50 levels map to FPN levels; for details see libtorch学习笔记(17)- ResNet50 FPN以及如何应用于Faster-RCNN:

| ResNet Layer Name | ResNet Level (k) | FPN Level | Minimum Area (w × h) |
| --- | --- | --- | --- |
| conv1 | 1 | n/a | n/a |
| conv2_x | 2 | 0 | 56² |
| conv3_x | 3 | 1 | 112² |
| conv4_x | 4 | 2 | 224² |
| conv5_x | 5 | 3 | 448² |
| n/a | n/a | pool | n/a |

So this proposal box takes its features from feature-map level #0 (2 − 2 = 0) for the RoIAlign [2] processing.
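
This can be checked directly against the LevelMapper above (the coordinates below are made up; only the 100x120 size matters, and k_min=2, k_max=5 matches the default four-level FPN setup):

import torch

# one 100x120 proposal box in (x1, y1, x2, y2) format
boxes = torch.tensor([[10.0, 20.0, 110.0, 140.0]])
mapper = LevelMapper(k_min=2, k_max=5)
print(mapper([boxes]))  # tensor([0]) -> FPN feature map level #0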

Box Head

This part consists of two fully connected layers, whose output is used by the subsequent prediction module for classification and bounding-box delta regression:

import torch.nn as nn
import torch.nn.functional as F


class TwoMLPHead(nn.Module):
    """
    Standard heads for FPN-based models

    Arguments:
        in_channels (int): number of input channels
        representation_size (int): size of the intermediate representation
    """

    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()

        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)

        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))

        return x
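
With the default torchvision configuration, the input is the flattened 256-channel 7x7 RoIAlign output and representation_size is 1024, so for the 1000 proposals:

import torch

head = TwoMLPHead(in_channels=256 * 7 * 7, representation_size=1024)
x = torch.randn(1000, 256, 7, 7)  # 1000 proposals after RoIAlign
print(head(x).shape)              # torch.Size([1000, 1024])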

Box Predictor

This part classifies the 1000 proposal boxes and regresses one more adjustment to obtain more accurate boxes.

import torch.nn as nn


class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.

    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)

        return scores, bbox_deltas
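
For the default COCO-trained model num_classes is 91 (including background), so the output shapes for the 1000 proposals are:

import torch

predictor = FastRCNNPredictor(in_channels=1024, num_classes=91)
x = torch.randn(1000, 1024)  # output of the Box Head
scores, bbox_deltas = predictor(x)
print(scores.shape)          # torch.Size([1000, 91])
print(bbox_deltas.shape)     # torch.Size([1000, 364]), 4 values per class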

Postprocess Detection

First, the bbox delta values predicted by the Box Predictor are merged with the proposal boxes to obtain the top-left and bottom-right coordinates of each box. The algorithm is the same as in the RPN; see Box-Coder.Decode, which covers it in detail. (A sketch of the decode math follows the snippet below.)

    def postprocess_detections(self,
                               class_logits,    # type: Tensor
                               box_regression,  # type: Tensor
                               proposals,       # type: List[Tensor]
                               image_shapes     # type: List[Tuple[int, int]]
                               ):
        # type: (...) -> Tuple[List[Tensor], List[Tensor], List[Tensor]]
        device = class_logits.device
        num_classes = class_logits.shape[-1]

        boxes_per_image = [boxes_in_image.shape[0] for boxes_in_image in proposals]
        pred_boxes = self.box_coder.decode(box_regression, proposals)
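
The core of box_coder.decode is the standard Fast R-CNN box transform. A simplified sketch (the real BoxCoder additionally divides the deltas by per-coordinate weights and clamps dw/dh before the exp):

import torch

def decode_sketch(deltas, proposals):
    # proposals: (N, 4) in (x1, y1, x2, y2); deltas: (N, 4) as (dx, dy, dw, dh)
    widths = proposals[:, 2] - proposals[:, 0]
    heights = proposals[:, 3] - proposals[:, 1]
    ctr_x = proposals[:, 0] + 0.5 * widths
    ctr_y = proposals[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas.unbind(dim=1)
    pred_ctr_x = dx * widths + ctr_x   # shift the center
    pred_ctr_y = dy * heights + ctr_y
    pred_w = widths * torch.exp(dw)    # scale the size
    pred_h = heights * torch.exp(dh)

    # back to corner format
    return torch.stack([pred_ctr_x - 0.5 * pred_w,
                        pred_ctr_y - 0.5 * pred_h,
                        pred_ctr_x + 0.5 * pred_w,
                        pred_ctr_y + 0.5 * pred_h], dim=1)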

Then a softmax is applied to the class logits to obtain the per-class scores:

        pred_scores = F.softmax(class_logits, -1)

Next, the boxes, scores, and image_shape of each image are extracted:

        pred_boxes_list = pred_boxes.split(boxes_per_image, 0)
        pred_scores_list = pred_scores.split(boxes_per_image, 0)

        all_boxes = []
        all_scores = []
        all_labels = []
        for boxes, scores, image_shape in zip(pred_boxes_list, pred_scores_list, image_shapes):

The coordinates, which live in the padded 800x1216 image, are then clipped to the 800x1202 image area; for details see torchvision Faster-RCNN ResNet-50 FPN代码解析(图片转换和坐标):

            boxes = box_ops.clip_boxes_to_image(boxes, image_shape)

A labels tensor is created to hold the label index of each detection bbox that will be kept:

            # create labels for each prediction
            labels = torch.arange(num_classes, device=device)
            labels = labels.view(1, -1).expand_as(scores)

Remove the background labels, scores, and boxes (class index 0 is the background):

            # remove predictions with the background label
            boxes = boxes[:, 1:]
            scores = scores[:, 1:]
            labels = labels[:, 1:]

Remove low-scoring detections; here score_thresh is 0.05:

            # batch everything, by making every class prediction be a separate instance
            boxes = boxes.reshape(-1, 4)
            scores = scores.reshape(-1)
            labels = labels.reshape(-1)

            # remove low scoring boxes
            inds = torch.nonzero(scores > self.score_thresh).squeeze(1)
            boxes, scores, labels = boxes[inds], scores[inds], labels[inds]

Remove empty boxes:

            # remove empty boxes
            keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
            boxes, scores, labels = boxes[keep], scores[keep], labels[keep]

After these steps the remaining boxes and labels look roughly like this:
[figure: example boxes and labels after background, score, and size filtering]
Finally, non-maximum suppression (NMS [3]) removes the overlapping boxes:

            # non-maximum suppression, independently done per class
            keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
            # keep only topk scoring predictions
            keep = keep[:self.detections_per_img]
            boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
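
batched_nms runs NMS independently per class without a Python loop: it shifts each class's boxes by an offset larger than any coordinate, so boxes of different classes can never overlap, and then performs a single plain NMS. A sketch of the idea (essentially what torchvision does for moderate box counts):

import torch
from torchvision.ops import nms

def batched_nms_sketch(boxes, scores, idxs, iou_threshold):
    # shift each class into its own disjoint region of the plane so a
    # single NMS call never suppresses boxes across different classes
    max_coordinate = boxes.max()
    offsets = idxs.to(boxes) * (max_coordinate + 1)
    return nms(boxes + offsets[:, None], scores, iou_threshold)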

The resulting boxes and labels are:
[figure: final boxes and labels after NMS]

Conclusion

After the ROI stage, the detected objects are already fairly accurate, and each detection carries a score, for example:

[
	0.9996865, 0.999302, 0.9909377, 
	0.964582, 0.8458481, 0.79095364, 
	0.3160024, 0.16850659, 0.16231589, 
	0.106609166, 0.07780073, 0.07285354, 0.06343418
]

These can be filtered further by dropping low-scoring detections, e.g. with a threshold of 0.5: every detection scoring above the threshold is kept as a final detected object.
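
For example (the label values here are made up for illustration):

import torch

scores = torch.tensor([0.9996865, 0.9909377, 0.3160024, 0.0778007])
labels = torch.tensor([1, 18, 62, 67])  # hypothetical COCO label ids
keep = scores > 0.5
print(scores[keep], labels[keep])       # keeps only the first two detections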


  1. Assuming the original image is 599x900, the resized input image is 800x1202, and after padding the size fed into the backbone network is 800x1216.

  2. The RoIAlign algorithm in torchvision is implemented as a Python C extension.

  3. The NMS in torchvision is also implemented as a Python C extension.
