Paper Reading Notes: Swin Transformer

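Swin Transformer is a Transformer variant for image recognition proposed in 2021. By introducing a hierarchical structure and window-based attention, it addresses the high computational cost of global self-attention in ViT. The feature map is divided into non-overlapping windows and self-attention is computed inside each window; information flows between neighboring windows through shifted-window multi-head self-attention (SW-MSA). The implementation also uses techniques such as DropPath and relative position bias to improve performance and reduce overfitting. The core component is SwinTransformerBlock, in which W-MSA and SW-MSA alternate; BasicLayer and PatchMerging layers stack these blocks to learn multi-scale features. The model keeps computation efficient while improving Transformer results on vision tasks.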

1. Swin-Transformer

Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

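This paper helped establish the Transformer as a backbone for vision tasks. Unlike ViT (Vision Transformer), it proposes a hierarchical structure. ViT fixes the patch partition at the input, so every layer works at a single scale and the receptive field never changes. Swin Transformer instead borrows the downsampling design of conventional CNNs and works at a different receptive-field scale in each stage, and it ultimately achieves better performance than ViT.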
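To make the hierarchy concrete, the sketch below tabulates the feature-map size at each stage. The input size of 224, the four stages, and the helper name `stage_shapes` are assumptions made for illustration; it also assumes that every stage after the first halves the height and width and doubles the channel count.

```python
def stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (height, width, channels) of the feature map at every stage."""
    size = image_size // patch_size  # resolution after the initial patch partition
    channels = embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            size //= 2      # downsampling halves height and width
            channels *= 2   # and doubles the channel count
        shapes.append((size, size, channels))
    return shapes


for i, shape in enumerate(stage_shapes(), start=1):
    print(f"stage {i}: {shape}")
```

A ViT-style model under the same assumptions would keep one fixed resolution for all layers, which is the contrast the paragraph above draws.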

1.1 Patch partition

The paper's code performs the initial patch partition with a convolution whose kernel_size and stride are both equal to patch_size.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):

    def __init__(self, patch_size=4, in_c=1, embed_dim=96, norm_layer=None):
        super(PatchEmbed, self).__init__()
        patch_size = (patch_size, patch_size)
        self.patch_size = patch_size
        self.in_channels = in_c
        self.embed_dim = embed_dim
        self.proj = nn.Conv2d(
            in_channels=in_c,
            out_channels=embed_dim,
            kernel_size=patch_size,
            stride=patch_size
        )  # partition into patches with a convolution: kernel_size equals stride
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        # x: [batch_size, c, h, w]
        _, _, H, W = x.shape
        # If H or W is not a multiple of patch_size, pad the input.
        # F.pad(input, pad) reads pad from the last dimension backwards:
        # (1, 2) pads the last dim by 1 on the left and 2 on the right;
        # (1, 2, 3, 4) pads W by 1 left / 2 right and H by 3 top / 4 bottom.
        pad_w = (-W) % self.patch_size[1]  # 0 when W is already a multiple
        pad_h = (-H) % self.patch_size[0]  # 0 when H is already a multiple
        if pad_h or pad_w:
            x = F.pad(x, (0, pad_w, 0, pad_h, 0, 0))

        x = self.proj(x)  # [batch_size, embed_dim, ceil(h/patch_size), ceil(w/patch_size)]
        _, _, H, W = x.shape  # H, W: height and width of the feature map
        # flatten: [B, C, H, W] -> [B, C, HW]
        # transpose: [B, C, HW] -> [B, HW, C]
        x = x.flatten(2).transpose(1, 2)
        x = self.norm(x)
        # The output has HW patches, each with embed_dim channels.
        return x, H, W
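The shape bookkeeping of the patch partition can be checked without PyTorch. The following is a minimal sketch, assuming each dimension is padded only up to the next multiple of patch_size; the helper name `patch_grid` and the example image size are hypothetical.

```python
def patch_grid(height, width, patch_size=4):
    """Return (pad_h, pad_w, grid_h, grid_w) for an image of the given size."""
    pad_h = (-height) % patch_size  # rows added at the bottom
    pad_w = (-width) % patch_size   # columns added on the right
    grid_h = (height + pad_h) // patch_size
    grid_w = (width + pad_w) // patch_size
    return pad_h, pad_w, grid_h, grid_w


pad_h, pad_w, grid_h, grid_w = patch_grid(224, 226)
print("padding:", pad_h, pad_w)
print("patch grid:", grid_h, grid_w, "tokens:", grid_h * grid_w)
```

The number of tokens passed to the first stage is grid_h * grid_w, which corresponds to the HW dimension of the flattened output.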

1.2 DropPath

http://www.manongjc.com/detail/24-jjgknxdkdzormze.html

To reduce overfitting, the paper's code uses DropPath (stochastic depth); see the link above for details. The code is as follows:

class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    DropPath randomly "drops" branches of a multi-branch structure in a deep model.
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path_f(x, self.drop_prob, self.training)

def drop_path_f(x, drop_prob: float = 0., training: bool = False):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
    the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
    changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
    'survival rate' as the argument.
    """
    if not drop_prob or not training:  # also handles drop_prob=None, the DropPath default
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output
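Two details of the code above are easy to miss: floor(keep_prob + U[0, 1)) is a Bernoulli(keep_prob) mask, and dividing by keep_prob keeps the expected value of the branch unchanged. The pure-Python sketch below checks both for a single scalar; the helper name `drop_path_scalar`, the seed, and the sample count are assumptions for illustration only.

```python
import math
import random


def drop_path_scalar(x, drop_prob, training, rng):
    """Apply the drop-path rule to one sample's residual-branch value x."""
    if not drop_prob or not training:
        return x
    keep_prob = 1 - drop_prob
    # keep_prob + U[0, 1) lies in [keep_prob, 1 + keep_prob); its floor is 1
    # with probability keep_prob and 0 otherwise.
    mask = math.floor(keep_prob + rng.random())
    return x / keep_prob * mask


rng = random.Random(0)
samples = [drop_path_scalar(1.0, 0.2, True, rng) for _ in range(100_000)]
print(sorted(set(samples)))         # the branch is either dropped or rescaled
print(sum(samples) / len(samples))  # stays close to the original value 1.0
```

Because the expectation is preserved during training, nothing needs to be rescaled at inference time, which is why the function returns x unchanged when training is False.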

1.3 Window partition and reverse

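The biggest difference between Swin Transformer and ViT is the notion of a window: self-attention is computed over the pixels/patches inside one window rather than over all pixels/patches of the whole image, so it is more efficient.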
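The efficiency claim can be made quantitative with the complexity estimates given in the Swin Transformer paper: for an h x w feature map with C channels and window size M, global MSA costs 4hwC^2 + 2(hw)^2 C, while W-MSA costs 4hwC^2 + 2M^2 hwC. The sketch below evaluates both; the function names and the example sizes (a 56 x 56 map, C=96, M=7) are illustrative choices.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention: quadratic in the number of tokens h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, M):
    """Window attention: linear in h*w for a fixed window size M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


h = w = 56
C, M = 96, 7
print("MSA  :", msa_flops(h, w, C))
print("W-MSA:", wmsa_flops(h, w, C, M))
print("ratio:", msa_flops(h, w, C) / wmsa_flops(h, w, C, M))
```

The two costs coincide when a single window covers the whole feature map (M*M == h*w), and the gap widens as the resolution grows.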

def window_partition(x, window_size):
    """
    Partition the feature map into non-overlapping windows.
    The approach mirrors how ViT splits an image into patches:
    several patches are grouped into one window.
    :param x: (B, H, W, C)
    :param window_size: (M)
    :return: windows: (num_windows*B, window_size, window_size, C)
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # permute: [B, H//Mh, Mh, W//Mw, Mw, C] -> [B, H//Mh, W//Mw, Mh, Mw, C]
    # view: [B, H//Mh, W//Mw, Mh, Mw, C] -> [B*num_windows, Mh, Mw, C]
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)  # MSA is then computed over the patches inside each window
    return windows

def window_reverse(windows, window_size: int, H: int, W: int):
    """
    Merge the windows back into a single feature map.
    Args:
        windows: (num_windows*B, window_size, window_size, C)
        window_size (int): Window size(M)
        H (int): Height of image
        W (int): Width of image
    Returns:
        x: (B, H, W, C)
    """
    B = windows.shape[0] // ((H // window_size) * (W // window_size))  # integer arithmetic avoids float rounding
    # view: [B*num_windows, Mh, Mw, C] -> [B, H//Mh, W//Mw, Mh, Mw, C]
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    # permute: [B, H//Mh, W//Mw, Mh, Mw, C] -> [B, H//Mh, Mh, W//Mw, Mw, C]
    # view: [B, H//Mh, Mh, W//Mw, Mw, C] -> [B, H, W, C]
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x
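The view/permute bookkeeping is easier to follow on a tiny grid. Below is a pure-Python sketch of the same idea for one image and one channel, using nested lists instead of tensors; the function names `partition` and `reverse` are hypothetical stand-ins, not the functions above.

```python
def partition(grid, M):
    """Split an H x W grid (list of rows) into M x M windows, in row-major window order."""
    H, W = len(grid), len(grid[0])
    assert H % M == 0 and W % M == 0, "H and W must be multiples of the window size"
    windows = []
    for wi in range(H // M):
        for wj in range(W // M):
            windows.append([grid[wi * M + r][wj * M:(wj + 1) * M] for r in range(M)])
    return windows


def reverse(windows, M, H, W):
    """Inverse of partition: place every window back at its original position."""
    grid = [[None] * W for _ in range(H)]
    windows_per_row = W // M
    for idx, window in enumerate(windows):
        wi, wj = divmod(idx, windows_per_row)
        for r in range(M):
            grid[wi * M + r][wj * M:(wj + 1) * M] = window[r]
    return grid


grid = [[4 * r + c for c in range(4)] for r in range(4)]  # 4 x 4 grid holding 0..15
windows = partition(grid, 2)
print(windows)
print(reverse(windows, 2, 4, 4) == grid)
```

Each window holds spatially adjacent positions, so attention inside a window only mixes neighboring patches; the reverse step restores the original layout exactly.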

1.4 PatchMerging

https://blog.csdn.net/qq_37541097/article/details/121119988?spm=1001.2014.3001.5502

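The blog post linked above summarizes PatchMerging well. It does what downsampling does in a CNN: the channel count doubles while the height and width are halved, so it can play the role of a CNN downsampling layer. The patch operations in the paper mostly refer to this downsampling, so the patch size must not be confused with window_size.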
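The spatial part of this operation can be sketched in pure Python: every 2 x 2 neighborhood is gathered into one position, which halves the height and width and yields 4C channels; a learned linear layer (omitted here) would then reduce 4C to 2C. The function name `merge_2x2` and the channel ordering are assumptions for illustration.

```python
def merge_2x2(grid):
    """grid[h][w] is a list of C channel values; returns an (H/2) x (W/2) grid with 4C channels."""
    H, W = len(grid), len(grid[0])
    assert H % 2 == 0 and W % 2 == 0, "pad the feature map first if H or W is odd"
    merged = []
    for i in range(0, H, 2):
        row = []
        for j in range(0, W, 2):
            # Concatenate the channels of the four neighbors in the order
            # (even row, even col), (odd, even), (even, odd), (odd, odd).
            row.append(grid[i][j] + grid[i + 1][j] + grid[i][j + 1] + grid[i + 1][j + 1])
        merged.append(row)
    return merged


# 4 x 4 feature map with C = 3 channels per position
feature_map = [[[r, c, r * c] for c in range(4)] for r in range(4)]
merged = merge_2x2(feature_map)
print(len(merged), len(merged[0]), len(merged[0][0]))  # height, width, channels after merging
```

No values are discarded by the gather itself; the reduction in information happens only in the linear projection that follows it.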


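class PatchMerging(nn.Module):
    """
    Downsamples at the start of each stage: reduces the resolution and adjusts the
    channel count, which makes the model hierarchical and efficient.
    - Similar to lowering the resolution with a larger stride in a CNN.
    Step 1: sample elements at every second row and column
    Step 2: concatenate them into one tensor (4x the channels)
    Step 3: adjust the channel count with an FC layer
    """
    def __init__(self, dim, norm_layer=nn.LayerNorm):
        # dim: number of input channels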
        