[Experiment] ViT code

Latest recommended article published on 2024-10-16 11:13:27

盏云

Licensed under CC 4.0 BY-SA

Category: experiment code. Tags: pytorch, deep learning, python

Article link: https://blog.csdn.net/zhe470719/article/details/124907059

References
Walkthrough 1: code + theory
Version 1: lucidrains
Version 2: rwightman (source code runs as-is)

References

霹雳吧啦Wz: pytorch_classification/vision_transformer
Video:
霹雳吧啦Wz

Notes:
ViT (Vision Transformer) model introduction + in-depth PyTorch code walkthrough

Visual Transformer (ViT) code implementation, PyTorch version
Detailed: Vision Transformer, a ViT code walkthrough

Walkthrough 1: code + theory

Very detailed, theory plus code: "Vision Transformer (ViT) PyTorch code fully explained (with diagrams)"

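Version 1: lucidrains

This version uses einops. See "einops and einsum: powerful tools for manipulating tensors directly".
Code:
Reimplementation by a well-known developer, PyTorch version
This version is very popular and easy to use; when I read it, the Git repo had already been starred 5.7k times. You can simply run pip install vit-pytorch.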
So for readers meeting ViT for the first time, this is the version I recommend reading: its structure is clear and easy to follow.
Notes:
Highly recommended, very detailed: walkthrough of the lucidrains version

1. Usage example from the reimplementation

You can copy and paste this code into PyCharm and step through it with the debugger to watch each operation ViT performs.

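import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,    # input image height and width
    patch_size = 32,     # side length of each square patch
    num_classes = 1000,  # 1000-way classification (ImageNet)
    dim = 1024,          # token embedding dimension used throughout the Transformer
    depth = 6,           # number of Transformer encoder blocks (ViT has no decoder)
    heads = 16,          # number of heads in multi-head attention
    mlp_dim = 2048,      # hidden dimension of the FeedForward (MLP) layer
    dropout = 0.1,       # dropout rate inside the attention and FeedForward blocks
    emb_dropout = 0.1    # dropout rate applied to the patch + position embeddings
)

img = torch.randn(1, 3, 256, 256)  # a batch of 1 RGB image

preds = v(img)      # (1, 1000)
print(preds.shape)  # torch.Size([1, 1000])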
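
To see where the sequence length 65 in the attention shapes comes from, here is a minimal sketch in plain PyTorch. It assumes the settings from the example above (256x256 image, 32x32 patches, 3 channels) and one prepended class token; the variable names are made up for illustration and are not taken from vit-pytorch.

```python
import torch

image_size, patch_size, channels = 256, 32, 3

# Number of patches along each side, and in total.
patches_per_side = image_size // patch_size      # 8
num_patches = patches_per_side ** 2              # 64
patch_dim = channels * patch_size * patch_size   # 3072 values per flattened patch

img = torch.randn(1, channels, image_size, image_size)

# Cut the image into non-overlapping patches and flatten each one.
# (b, c, H, W) -> (b, c, 8, 32, 8, 32) -> (b, 8, 8, 32, 32, c) -> (b, 64, 3072)
patches = img.reshape(1, channels, patches_per_side, patch_size,
                      patches_per_side, patch_size)
patches = patches.permute(0, 2, 4, 3, 5, 1).reshape(1, num_patches, patch_dim)

num_tokens = num_patches + 1  # one extra class token is prepended
print(patches.shape, num_tokens)
```

Each flattened patch is then projected from 3072 to dim = 1024, which gives the (1, 65, 1024) tensor that enters the Transformer.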

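2. Transformer structure

The for loop runs depth times (6 here), giving 6 encoder layers.
Each iteration applies multi-head attention and then FeedForward,
which correspond to the multi-head self-attention module and the MLP module in the Transformer encoder. A residual connection follows both the self-attention and the feed forward.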

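import torch
from torch import nn

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        # One (attention, feed-forward) pair per encoder layer.
        # PreNorm, Attention and FeedForward are defined elsewhere in this file.
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x   # self-attention plus residual connection
            x = ff(x) + x     # feed-forward plus residual connection
        return x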

The PreNorm class is shown below. It passes the input through LayerNorm before multi-head attention or FeedForward runs.

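class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # normalizes over the last (embedding) dimension
        self.fn = fn                    # wrapped module: Attention or FeedForward
    def forward(self, x, **kwargs):
        # normalize first, then call the wrapped module
        return self.fn(self.norm(x), **kwargs)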
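
As a quick check of what one pre-norm residual step computes, here is a small sketch. It repeats the PreNorm class so that it runs on its own, and uses an nn.Linear as a hypothetical stand-in for Attention or FeedForward; the tensor sizes are arbitrary.

```python
import torch
from torch import nn

class PreNorm(nn.Module):
    """Apply LayerNorm to the input, then call the wrapped module."""
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

torch.manual_seed(0)
dim = 8
sublayer = nn.Linear(dim, dim)   # hypothetical stand-in for Attention / FeedForward
block = PreNorm(dim, sublayer)

x = torch.randn(2, 5, dim)       # (batch, tokens, dim)
y = block(x) + x                 # pre-norm residual: x + fn(LayerNorm(x))

# The same computation written out by hand.
y_manual = sublayer(block.norm(x)) + x
print(torch.allclose(y, y_manual), y.shape)
```

Note that the residual adds the un-normalized x, so the normalization only affects what the sublayer sees.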

3. Attention

torch.chunk(tensor, chunks, dim) does the opposite of torch.cat(): it splits tensor along dimension dim (for example rows or columns) into chunks pieces and returns them as a tuple.
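
A tiny sketch of that split-and-rejoin behaviour, using an arbitrary 2x6 tensor chosen only for illustration:

```python
import torch

x = torch.arange(12).reshape(2, 6)   # 2 rows, 6 columns

# Split the last dimension into 3 equal pieces; the result is a tuple of tensors.
pieces = x.chunk(3, dim=-1)
a, b, c = pieces
print(type(pieces), a.shape, b.shape, c.shape)   # each piece has shape (2, 2)

# Concatenating the pieces along the same dimension restores the original tensor.
restored = torch.cat(pieces, dim=-1)
print(torch.equal(restored, x))
```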
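Overall flow of the attention operation:

1. Generate the query, key and value from the input. The "input" may be the input to the whole network or the output of a hidden layer. Here qkv is a tuple of length 3, and each element has shape (1, 65, 1024).
2. Rearrange the dimensions of qkv so that q, k and v each have shape (1, 16, 65, 64).
3. Multiply q by the transpose of k and scale by dim_head ** -0.5, giving dots with shape (1, 16, 65, 65).
4. Apply softmax over the last dimension of dots to get each patch token's attention scores over all tokens.
5. Multiply the attention weights by value, giving shape (1, 16, 65, 64).
6. Rearrange the dimensions to get an output with the same shape as the input, (1, 65, 1024).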
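
The shapes in the six steps above can be traced with a short sketch. It is written with plain reshape/transpose instead of einops so that it runs with PyTorch alone, and it assumes the example settings (batch 1, 65 tokens, dim 1024, 16 heads of size 64); the helper split_heads is a name invented for this sketch.

```python
import torch

torch.manual_seed(0)
b, n, dim, heads, dim_head = 1, 65, 1024, 16, 64
inner_dim = heads * dim_head                      # 1024

x = torch.randn(b, n, dim)
to_qkv = torch.nn.Linear(dim, inner_dim * 3, bias=False)

# Step 1: one projection, then split into q, k, v of shape (1, 65, 1024) each.
qkv = to_qkv(x).chunk(3, dim=-1)

# Step 2: split the last dimension into heads -> (1, 16, 65, 64).
def split_heads(t):
    return t.reshape(b, n, heads, dim_head).transpose(1, 2)

q, k, v = (split_heads(t) for t in qkv)

# Step 3: scaled dot product of q and k -> (1, 16, 65, 65).
dots = (q @ k.transpose(-1, -2)) * dim_head ** -0.5

# Step 4: softmax over the last dimension, so each row sums to 1.
attn = dots.softmax(dim=-1)

# Step 5: weight the values -> (1, 16, 65, 64).
out = attn @ v

# Step 6: merge the heads back -> (1, 65, 1024).
out = out.transpose(1, 2).reshape(b, n, inner_dim)
print(qkv[0].shape, q.shape, dots.shape, out.shape)
```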

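import torch
from torch import nn
from einops import rearrange

class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        # skip the output projection only when a single head already has size dim
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5   # 1 / sqrt(dim_head)

        self.attend = nn.Softmax(dim = -1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim = -1)    # first generate q, k, v
        # split the last dimension into heads: (b, n, h*d) -> (b, h, n, d)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        # scaled dot product between every pair of tokens: (b, h, n, n)
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)        # softmax over the last dimension

        out = torch.matmul(attn, v)     # weighted sum of the values: (b, h, n, d)
        out = rearrange(out, 'b h n d -> b n (h d)')   # merge heads: (b, n, h*d)
        return self.to_out(out)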
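
For comparison, the matmul, scale, softmax, matmul sequence in forward matches what PyTorch's built-in torch.nn.functional.scaled_dot_product_attention computes with its default scaling. The sketch below assumes PyTorch 2.0 or newer (where that function exists) and uses random q, k, v tensors of the example shape; it is a sanity check, not part of the vit-pytorch code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 16, 65, 64)
k = torch.randn(1, 16, 65, 64)
v = torch.randn(1, 16, 65, 64)

# Manual version, following the same steps as Attention.forward.
scale = q.shape[-1] ** -0.5
attn = ((q @ k.transpose(-1, -2)) * scale).softmax(dim=-1)
manual = attn @ v

# Built-in version (assumes PyTorch >= 2.0); default scale is 1 / sqrt(head size).
fused = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(manual, fused, atol=1e-4))
```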

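4. FeedForward

The FeedForward module has 2 fully connected layers. The full structure is:

1. A fully connected layer
2. The GELU() activation function
3. nn.Dropout(), which randomly zeroes some activations with a given probability to reduce overfitting
4. A second fully connected layer
5. nn.Dropout()
Note: GELU(x) = x * Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.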
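
The formula in the note can be checked numerically. This sketch writes Φ(x) with the error function, Φ(x) = 0.5 * (1 + erf(x / sqrt(2))), and compares the result with nn.GELU() in its default (exact, non-tanh) mode; the sample points are arbitrary.

```python
import math
import torch
from torch import nn

x = torch.linspace(-3.0, 3.0, steps=13)

# Standard normal CDF written with the error function.
phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
gelu_manual = x * phi

gelu_builtin = nn.GELU()(x)   # default form, no tanh approximation
print(torch.allclose(gelu_manual, gelu_builtin, atol=1e-6))
```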

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout =
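
The listing is cut off at this point in the source. What follows is only a sketch of what such a module could look like, reconstructed from the five steps listed in this section. The default dropout = 0. is an assumption copied from the other classes, and the attribute name net is made up for this sketch.

```python
import torch
from torch import nn

class FeedForward(nn.Module):
    # Sketch reconstructed from the five steps in the text, not the original listing.
    # Assumption: dropout defaults to 0., as in Attention and Transformer.
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),   # 1. first fully connected layer
            nn.GELU(),                    # 2. GELU activation
            nn.Dropout(dropout),          # 3. dropout
            nn.Linear(hidden_dim, dim),   # 4. second fully connected layer
            nn.Dropout(dropout)           # 5. dropout
        )
    def forward(self, x):
        return self.net(x)

ff = FeedForward(dim = 1024, hidden_dim = 2048, dropout = 0.1)
x = torch.randn(1, 65, 1024)
print(ff(x).shape)
```

With dim = 1024 and mlp_dim = 2048 from the usage example, the module expands each token to 2048 features and projects it back to 1024, so the output shape equals the input shape.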
