torchvision Faster-RCNN ResNet-50 FPN Code Walkthrough (ROI)

This article walks through the ROI stage of the FPN-based Faster R-CNN detection pipeline, covering the ROI pooling, the Box Head, the Box Predictor, and the detection postprocessing, with particular attention to the FPN level mapping and the RoIAlign implementation.


Overall Architecture [1]

In the ROI stage, the 1000 proposal boxes selected by the RPN, together with the multi-level feature maps output by the FPN, go through ROI pooling; the object inside each box is classified, another round of box offset (delta) regression is performed to refine the proposal boxes, producing new scores, refined boxes, and labels, and finally non-maximum suppression (NMS) is run once more:
[figure: overall architecture of the ROI stage]
ROI processing based on FPN has a few more steps than classic Faster R-CNN and is somewhat more complex.
It consists of the following main steps (a minimal sketch of the whole flow follows the list):

  1. Box ROI Pool: based on the area of each of the 1000 proposal boxes, decide which feature-map level the ROI pooling is performed on
  2. Box Head: two fully connected layers that further process the 7x7 per-box features produced by RoIAlign
  3. Box Predictor: classifies the Box Head output and regresses the box position offsets (deltas) once more
  4. Postprocess Detection: applies Softmax for the final classification, merges the regressed box deltas into the proposal boxes to obtain the adjusted detection boxes, and finally runs non-maximum suppression (NMS) to filter out the valid detections (scores, boxes, and labels).
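
A minimal sketch of this flow, loosely modeled on torchvision's RoIHeads (names simplified; training branches and sanity checks omitted):

def roi_heads_forward(features, proposals, image_shapes,
                      box_roi_pool, box_head, box_predictor, postprocess):
    # features: dict of FPN level name -> feature map
    # proposals: List[Tensor[N, 4]], one tensor of proposal boxes per image
    box_features = box_roi_pool(features, proposals, image_shapes)  # RoIAlign per level
    box_features = box_head(box_features)                           # TwoMLPHead
    class_logits, box_regression = box_predictor(box_features)      # FastRCNNPredictor
    # decode deltas, softmax, score threshold, per-class NMS
    boxes, scores, labels = postprocess(class_logits, box_regression,
                                        proposals, image_shapes)
    return boxes, scores, labels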

Box ROI Pool

The model can process several images at once and detect the objects in each of them, so convert_to_roi_format first merges the proposal boxes of all images into a single tensor (tagging each box with its image index), so that the RoIAlign processing can be done in one batch.
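
A simplified sketch of what convert_to_roi_format does (close to, but not byte-for-byte identical with, the torchvision implementation): each box gets its image index prepended, producing the (batch_idx, x1, y1, x2, y2) layout that the RoIAlign op expects.

import torch
from typing import List
from torch import Tensor

def convert_to_roi_format(boxes: List[Tensor]) -> Tensor:
    # boxes: one (N_i, 4) tensor of proposal boxes per image
    concat_boxes = torch.cat(boxes, dim=0)
    # prepend the image index i to every box of image i
    ids = torch.cat(
        [torch.full_like(b[:, :1], i) for i, b in enumerate(boxes)], dim=0
    )
    # rois: (sum(N_i), 5) as (batch_idx, x1, y1, x2, y2)
    return torch.cat([ids, concat_boxes], dim=1)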

setup_scales then configures the mapper object for the first four output feature maps (the smallest, pooled level is not used). The mapper is a concept introduced with FPN: it computes the area of each proposal box and uses it to decide on which feature-map level the RoIAlign is performed; see the FPN paper:
$$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$$
The corresponding implementation is:

import torch
from torchvision.ops.boxes import box_area


class LevelMapper(object):
    """Determine which FPN level each RoI in a set of RoIs should map to based
    on the heuristic in the FPN paper.

    Arguments:
        k_min (int)
        k_max (int)
        canonical_scale (int)
        canonical_level (int)
        eps (float)
    """

    def __init__(self, k_min, k_max, canonical_scale=224, canonical_level=4, eps=1e-6):
        # type: (int, int, int, int, float) -> None
        self.k_min = k_min
        self.k_max = k_max
        self.s0 = canonical_scale
        self.lvl0 = canonical_level
        self.eps = eps

    def __call__(self, boxlists):
        # type: (List[Tensor]) -> Tensor
        """
        Arguments:
            boxlists (list[BoxList])
        """
        # Compute level ids
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))

        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)

For example, for a proposal box with width 100 and height 120, the formula above gives:
$$
\begin{aligned}
k &= \lfloor k_0 + \log_2(\sqrt{100\times120}/224) \rfloor\\
  &= \lfloor 4 + \log_2(109.5445/224) \rfloor\\
  &= \lfloor 2.9680183191660503 \rfloor = 2
\end{aligned}
$$
The table below shows how ResNet-50 levels map to FPN levels; for details see libtorch学习笔记(17)- ResNet50 FPN以及如何应用于Faster-RCNN:

| ResNet Layer Name | ResNet Level (k) | FPN Level | Minimum Area (w × h) |
| --- | --- | --- | --- |
| conv1 | 1 | n/a | n/a |
| conv2_x | 2 | 0 | 56² |
| conv3_x | 3 | 1 | 112² |
| conv4_x | 4 | 2 | 224² |
| conv5_x | 5 | 3 | 448² |
| n/a | n/a | pool | n/a |

So this proposal box takes its features from feature-map level #0 (2 − 2 = 0) for the RoIAlign [2] processing.
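
This can be checked directly against the LevelMapper above (the coordinates below are made up; only the 100x120 size matters, and k_min=2, k_max=5 matches the default four-level FPN setup):

import torch

# one 100x120 proposal box in (x1, y1, x2, y2) format
boxes = torch.tensor([[10.0, 20.0, 110.0, 140.0]])
mapper = LevelMapper(k_min=2, k_max=5)
print(mapper([boxes]))  # tensor([0]) -> FPN feature map level #0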

Box Head

This part consists of two fully connected layers, whose output is used by the subsequent prediction module for classification and bounding-box delta regression:

import torch.nn as nn
import torch.nn.functional as F


class TwoMLPHead(nn.Module):
    """
    Standard heads for FPN-based models

    Arguments:
        in_channels (int): number of input channels
        representation_size (int): size of the intermediate representation
    """

    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()

        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)

        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))

        return x
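
With the default torchvision configuration, the input is the flattened 256-channel 7x7 RoIAlign output and representation_size is 1024, so for the 1000 proposals:

import torch

head = TwoMLPHead(in_channels=256 * 7 * 7, representation_size=1024)
x = torch.randn(1000, 256, 7, 7)  # 1000 proposals after RoIAlign
print(head(x).shape)              # torch.Size([1000, 1024])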

Box Predictor

This part classifies the 1000 proposal boxes and regresses one more adjustment to obtain more accurate boxes.

import torch.nn as nn


class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.

    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)

        return scores, bbox_deltas
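
For the default COCO-trained model num_classes is 91 (including background), so the output shapes for the 1000 proposals are:

import torch

predictor = FastRCNNPredictor(in_channels=1024, num_classes=91)
x = torch.randn(1000, 1024)  # output of the Box Head
scores, bbox_deltas = predictor(x)
print(scores.shape)          # torch.Size([1000, 91])
print(bbox_deltas.shape)     # torch.Size([1000, 364]), 4 values per class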

Postprocess Detection

First, the bbox delta values predicted by the Box Predictor are merged with the proposal boxes to obtain the top-left and bottom-right coordinates of each box. The algorithm is the same as in the RPN; see Box-Coder.Decode, which covers it in detail. (A sketch of the decode math follows the snippet below.)

    def postprocess_detections(self,
                               class_logits,    # type: Tensor
                               box_regression,  # type: Tensor
                               proposals,       # type: List[Tensor]
                               image_shapes     # type: List[Tuple[int, int]]
                               ):
        # type: (...) -> Tuple[List[Tensor], List[Tensor], List[Tensor]]
        device = class_logits.device
        num_classes = class_logits.shape[-1]

        boxes_per_image = [boxes_in_image.shape[0] for boxes_in_image in proposals]
        pred_boxes = self.box_coder.decode(box_regression, proposals)
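
The core of box_coder.decode is the standard Fast R-CNN box transform. A simplified sketch (the real BoxCoder additionally divides the deltas by per-coordinate weights and clamps dw/dh before the exp):

import torch

def decode_sketch(deltas, proposals):
    # proposals: (N, 4) in (x1, y1, x2, y2); deltas: (N, 4) as (dx, dy, dw, dh)
    widths = proposals[:, 2] - proposals[:, 0]
    heights = proposals[:, 3] - proposals[:, 1]
    ctr_x = proposals[:, 0] + 0.5 * widths
    ctr_y = proposals[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas.unbind(dim=1)
    pred_ctr_x = dx * widths + ctr_x   # shift the center
    pred_ctr_y = dy * heights + ctr_y
    pred_w = widths * torch.exp(dw)    # scale the size
    pred_h = heights * torch.exp(dh)

    # back to corner format
    return torch.stack([pred_ctr_x - 0.5 * pred_w,
                        pred_ctr_y - 0.5 * pred_h,
                        pred_ctr_x + 0.5 * pred_w,
                        pred_ctr_y + 0.5 * pred_h], dim=1)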

Then a softmax is applied to the class logits to obtain the per-class scores:

        pred_scores = F.softmax(class_logits, -1)

Next, the boxes, scores, and image_shape of each image are extracted:

        pred_boxes_list = pred_boxes.split(boxes_per_image, 0)
        pred_scores_list = pred_scores.split(boxes_per_image, 0)

        all_boxes = []
        all_scores = []
        all_labels = []
        for boxes, scores, image_shape in zip(pred_boxes_list, pred_scores_list, image_shapes):

The coordinates, which live in the padded 800x1216 image, are then clipped to the 800x1202 image area; for details see torchvision Faster-RCNN ResNet-50 FPN代码解析(图片转换和坐标):

            boxes = box_ops.clip_boxes_to_image(boxes, image_shape)

A labels tensor is created to hold the label index of each detection bbox that will be kept:

            # create labels for each prediction
            labels = torch.arange(num_classes, device=device)
            labels = labels.view(1, -1).expand_as(scores)

Remove the background labels, scores, and boxes (class index 0 is the background):

            # remove predictions with the background label
            boxes = boxes[:, 1:]
            scores = scores[:, 1:]
            labels = labels[:, 1:]

Remove low-scoring detections; here score_thresh is 0.05:

            # batch everything, by making every class prediction be a separate instance
            boxes = boxes.reshape(-1, 4)
            scores = scores.reshape(-1)
            labels = labels.reshape(-1)

            # remove low scoring boxes
            inds = torch.nonzero(scores > self.score_thresh).squeeze(1)
            boxes, scores, labels = boxes[inds], scores[inds], labels[inds]

Remove empty boxes:

            # remove empty boxes
            keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
            boxes, scores, labels = boxes[keep], scores[keep], labels[keep]

After these steps the remaining boxes and labels look roughly like this:
[figure: example boxes and labels after background, score, and size filtering]
Finally, non-maximum suppression (NMS [3]) removes the overlapping boxes:

            # non-maximum suppression, independently done per class
            keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
            # keep only topk scoring predictions
            keep = keep[:self.detections_per_img]
            boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
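
batched_nms runs NMS independently per class without a Python loop: it shifts each class's boxes by an offset larger than any coordinate, so boxes of different classes can never overlap, and then performs a single plain NMS. A sketch of the idea (essentially what torchvision does for moderate box counts):

import torch
from torchvision.ops import nms

def batched_nms_sketch(boxes, scores, idxs, iou_threshold):
    # shift each class into its own disjoint region of the plane so a
    # single NMS call never suppresses boxes across different classes
    max_coordinate = boxes.max()
    offsets = idxs.to(boxes) * (max_coordinate + 1)
    return nms(boxes + offsets[:, None], scores, iou_threshold)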

The resulting boxes and labels are:
[figure: final boxes and labels after NMS]

Conclusion

After the ROI stage, the detected objects are already fairly accurate, and each detection carries a score, for example:

[
	0.9996865, 0.999302, 0.9909377, 
	0.964582, 0.8458481, 0.79095364, 
	0.3160024, 0.16850659, 0.16231589, 
	0.106609166, 0.07780073, 0.07285354, 0.06343418
]

These can be filtered further by dropping low-scoring detections, e.g. with a threshold of 0.5: every detection scoring above the threshold is kept as a final detected object.
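
For example (the label values here are made up for illustration):

import torch

scores = torch.tensor([0.9996865, 0.9909377, 0.3160024, 0.0778007])
labels = torch.tensor([1, 18, 62, 67])  # hypothetical COCO label ids
keep = scores > 0.5
print(scores[keep], labels[keep])       # keeps only the first two detections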


  1. Assuming the original image is 599x900, the resized input image is 800x1202, and after padding the size fed into the backbone network is 800x1216.

  2. The RoIAlign algorithm in torchvision is implemented as a Python C extension.

  3. The NMS in torchvision is also implemented as a Python C extension.
