January 2025 GPU VRAM AI Performance Ladder (Running a 70B Model, Including the 50 Series)

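Using the kcores VRAM ladder for large language model inference as a reference, this post computes how many tokens each GPU generates per second when running the llama-3.1-70b-instruct-4bit model, then ranks the results. The figures are theoretical performance with no losses accounted for, and are for reference only.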
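A common rule of thumb for this kind of theoretical figure is that generating one token requires reading all model weights from memory once, so decode speed is bounded by memory bandwidth divided by the bytes read per token. The sketch below illustrates that bound. Both inputs are assumptions and are not stated in the source: 1008 GB/s is the published memory bandwidth of the RTX 4090, and the 48 GB footprint is only inferred from the listed figures. Whether kcores computes its ladder this way is not confirmed here.

```python
def bandwidth_bound_tokens_per_second(bandwidth_gb_s: float, bytes_read_gb: float) -> float:
    """Upper bound on decode speed, assuming each token reads all weights once.

    bandwidth_gb_s: memory bandwidth of one device in GB/s.
    bytes_read_gb:  data read per generated token in GB (assumed model footprint).
    """
    if bandwidth_gb_s <= 0 or bytes_read_gb <= 0:
        raise ValueError("bandwidth and footprint must be positive")
    return bandwidth_gb_s / bytes_read_gb


# Assumed inputs (not from the source): RTX 4090 bandwidth and an inferred 48 GB footprint.
estimate = bandwidth_bound_tokens_per_second(1008.0, 48.0)
print(f"{estimate:.2f} tokens/s")  # prints "21.00 tokens/s"
```

Under these assumptions the result is close to the per-card figure listed for the RTX 4090 (21.04). Real throughput is lower, since the bound ignores compute, KV-cache reads, and inter-GPU communication.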

Theoretical tokens per second per GPU (or cluster), ranked:
| GPU | GPU count | Total tokens/s | Average tokens/s per GPU | Rank |
| --- | --- | --- | --- | --- |
| NVIDIA GB200 NVL72 | 1 | 12000 | 12000 | 1 |
| NVIDIA GB200 Grace Blackwell Superchip | 1 | 333.33 | 333.33 | 2 |
| NVIDIA B200 SXM 192GB | 1 | 170.83 | 170.83 | 3 |
| NVIDIA H100 PCIe/SXM5 96GB | 1 | 70 | 70 | 4 |
| NVIDIA H100 SXM5 80GB | 1 | 70 | 70 | 5 |
| NVIDIA H800 SXM5 80GB | 1 | 70 | 70 | 6 |
| NVIDIA H100 PCIe/CNX 80GB | 1 | 42.5 | 42.5 | 7 |
| NVIDIA A100/A100X SXM4 80GB | 1 | 42.5 | 42.5 | 8 |
| NVIDIA A800 SXM4 80GB | 1 | 42.5 | 42.5 | 9 |
| NVIDIA H800 PCIe 80GB | 1 | 42.5 | 42.5 | 10 |
| NVIDIA H100 SXM5 64GB | 1 | 42.08 | 42.08 | 11 |
| NVIDIA A100 PCIe 80GB | 1 | 40.42 | 40.42 | 12 |
| NVIDIA A800 80GB Active Ampere | 1 | 40.42 | 40.42 | 13 |
| NVIDIA GRID/DRIVE A100A 32GB | 2 | 77.92 | 38.96 | 14 |
| NVIDIA GRID A100B 48GB | 1 | 38.96 | 38.96 | 15 |
| NVIDIA GeForce RTX 5090 32GB (Preliminary) | 2 | 74.67 | 37.335 | 16 |
| NVIDIA A100 PCIe/SXM4 40GB | 2 | 65 | 32.5 | 17 |
| NVIDIA A800 40GB Active Ampere | 2 | 65 | 32.5 | 18 |
| NVIDIA A30X 24GB | 2 | 50.83 | 25.415 | 19 |
| NVIDIA Tesla V100 SXM2 16GB (version 2019) | 4 | 94.17 | 23.5425 | 20 |
| NVIDIA Tesla V100S PCIe 32GB | 2 | 47.08 | 23.54 | 21 |
| NVIDIA GeForce RTX 5080 16GB (Preliminary) | 4 | 85.33 | 21.3325 | 22 |
| NVIDIA GeForce RTX 4090 24GB | 2 | 42.08 | 21.04 | 23 |
| NVIDIA GeForce RTX 3090 Ti 24GB | 2 | 42.08 | 21.04 | 24 |
| NVIDIA Tesla V100 SXM3 32GB | 2 | 40.88 | 20.44 | 25 |
| NVIDIA RTX 6000 Ada 48GB | 1 | 20 | 20 | 26 |
| NVIDIA GeForce RTX 3090 24GB | 2 | 39.01 | 19.505 | 27 |
| NVIDIA A30 PCIe 24GB | 2 | 38.88 | 19.44 | 28 |
| NVIDIA GeForce RTX 3080 12GB (Ti 12GB) | 4 | 76.03 | 19.0075 | 29 |
| NVIDIA Tesla V100 PCIe/SXM2/DGXS 32GB | 2 | 37.42 | 18.71 | 30 |
| NVIDIA Tesla V100 PCIe/SXM2 16GB | 4 | 74.75 | 18.6875 | 31 |
| NVIDIA GeForce RTX 5070 Ti 16GB (Preliminary) | 4 | 74.67 | 18.6675 | 32 |
| NVIDIA Quadro GV100 32GB | 2 | 36.18 | 18.09 | 33 |
| NVIDIA TITAN V CEO Edition 32GB | 2 | 36.18 | 18.09 | 34 |
| NVIDIA L40/L40G 24GB | 2 | 36 | 18 | 35 |
| NVIDIA L20 48GB | 1 | 18 | 18 | 36 |
| NVIDIA L40/L40S 48GB | 1 | 18 | 18 | 37 |
| NVIDIA RTX 5880 Ada 48GB | 1 | 18 | 18 | 38 |
| Apple MacStudio M1 Ultra 64GB | 1 | 17.07 | 17.07 | 39 |
| Apple MacStudio M1 Ultra 128GB | 1 | 17.07 | 17.07 | 40 |
| Apple MacStudio M2 Ultra 64GB | 1 | 17.07 | 17.07 | 41 |
| Apple MacStudio M2 Ultra 128GB | 1 | 17.07 | 17.07 | 42 |
| Apple MacStudio M2 Ultra 192GB | 1 | 17.07 | 17.07 | 43 |
| DDR6 12 Channel 8400 512GB | 1 | 16.8 | 16.8 | 44 |
| NVIDIA A16 PCIe 64GB | 1 | 16.68 | 16.68 | 45 |
| NVIDIA RTX A5500 Ampere 24GB | 2 | 32 | 16 | 46 |
| NVIDIA RTX A5000 Ampere 24GB | 2 | 32 | 16 | 47 |
| NVIDIA RTX A6000 Ampere 48GB | 1 | 16 | 16 | 48 |
| NVIDIA GeForce RTX 3080 10GB | 8 | 126.72 | 15.84 | 49 |
| NVIDIA GeForce RTX 3080 Ti 20GB | 4 | 63.36 | 15.84 | 50 |
| NVIDIA GeForce RTX 4080 SUPER 16GB | 4 | 61.36 | 15.34 | 51 |
| NVIDIA Quadro GP100 16GB | 4 | 61.02 | 15.255 | 52 |
| NVIDIA Tesla P100 SXM2/DGXS 16GB | 4 | 61.02 | 15.255 | 53 |
| NVIDIA GeForce RTX 4080 16GB | 4 | 59.73 | 14.9325 | 54 |
| NVIDIA A40 PCIe 48GB | 1 | 14.5 | 14.5 | 55 |
| NVIDIA Tesla P10 24GB | 2 | 28.93 | 14.465 | 56 |
| NVIDIA GeForce RTX 4070 Ti SUPER 16GB | 4 | 56.03 | 14.0075 | 57 |
| NVIDIA GeForce RTX 5070 12GB (Preliminary) | 4 | 56 | 14 | 58 |
| NVIDIA Quadro RTX 6000 24GB | 2 | 28 | 14 | 59 |
| NVIDIA TITAN RTX 24GB | 2 | 28 | 14 | 60 |
| NVIDIA Quadro RTX 8000 48GB | 1 | 14 | 14 | 61 |
| NVIDIA TITAN V 12GB | 4 | 54.28 | 13.57 | 62 |
| NVIDIA RTX A4500 Ampere 20GB | 4 | 53.33 | 13.3325 | 63 |
| NVIDIA GeForce RTX 2080 Ti 11GB | 8 | 102.67 | 12.83375 | 64 |
| NVIDIA GeForce RTX 3070 Ti 8GB | 8 | 101.38 | 12.6725 | 65 |
| NVIDIA GeForce RTX 3060 Ti GDDR6X 8GB | 8 | 101.38 | 12.6725 | 66 |
| NVIDIA GeForce RTX 3070 Ti 16GB | 4 | 50.69 | 12.6725 | 67 |
| NVIDIA A10M 24GB | 2 | 25.01 | 12.505 | 68 |
| NVIDIA A10/A10G PCIe 24GB | 2 | 25.01 | 12.505 | 69 |
| NVIDIA RTX 5000 Ada 32GB | 2 | 24 | 12 | 70 |
| Intel Arc A770 16GB | 4 | 46.67 | 11.6675 | 71 |
| NVIDIA Tesla P100 PCIe 12GB | 4 | 45.76 | 11.44 | 72 |
| NVIDIA TITAN Xp 12GB | 4 | 45.63 | 11.4075 | 73 |
| Apple MacBook Pro M4 Max 64GB | 1 | 11.38 | 11.38 | 74 |
| Apple MacBook Pro M4 Max 128GB | 1 | 11.38 | 11.38 | 75 |
| Apple MacBook Pro M4 Max 48GB | 2 | 22.75 | 11.375 | 76 |
| NVIDIA Project DIGITS 128GB | 1 | 10.67 | 10.67 | 77 |
| Intel Arc A750 8GB | 8 | 85.33 | 10.66625 | 78 |
| Intel Arc A580 8GB | 8 | 85.33 | 10.66625 | 79 |
| NVIDIA GeForce RTX 4080 12GB | 4 | 42.02 | 10.505 | 80 |
| NVIDIA GeForce RTX 4070 Ti 12GB | 4 | 42.02 | 10.505 | 81 |
| NVIDIA GeForce RTX 4070 SUPER 12GB | 4 | 42.02 | 10.505 | 82 |
| NVIDIA GeForce RTX 4070 12GB | 4 | 42.02 | 10.505 | 83 |
| NVIDIA Tesla K80 24GB | 2 | 20.05 | 10.025 | 84 |
| Intel Arc B580 12GB | 4 | 38 | 9.5 | 85 |
| DDR5 12 Channel 4800 512GB | 1 | 9.38 | 9.38 | 86 |
| NVIDIA GeForce RTX 3070 8GB | 8 | 74.67 | 9.33375 | 87 |
| NVIDIA GeForce RTX 3060 Ti 8GB | 8 | 74.67 | 9.33375 | 88 |
| NVIDIA GeForce RTX 2070 8GB | 8 | 74.67 | 9.33375 | 89 |
| NVIDIA GeForce RTX 2080 8GB | 8 | 74.67 | 9.33375 | 90 |
| NVIDIA GeForce RTX 5060 8GB (Preliminary) | 8 | 74.67 | 9.33375 | 91 |
| NVIDIA RTX A4000 Ampere 16GB | 4 | 37.33 | 9.3325 | 92 |
| NVIDIA Quadro RTX 5000 16GB | 4 | 37.33 | 9.3325 | 93 |
| NVIDIA Quadro P6000 24GB | 2 | 18.03 | 9.015 | 94 |
| NVIDIA GeForce RTX 4080 Mobile 12GB | 4 | 36 | 9 | 95 |
| NVIDIA RTX 4500 Ada 24GB | 2 | 18 | 9 | 96 |
| NVIDIA Tesla T10 16GB | 4 | 35.85 | 8.9625 | 97 |
| NVIDIA Quadro RTX 4000 8GB | 8 | 69.33 | 8.66625 | 98 |
| Apple MacStudio M1 Max 32GB | 2 | 17.07 | 8.535 | 99 |
| Apple MacStudio M2 Max 32GB | 2 | 17.07 | 8.535 | 100 |
| Apple MacBook Pro M3 Max 48GB | 2 | 17.07 | 8.535 | 101 |
| Apple MacStudio M1 Max 64GB | 1 | 8.53 | 8.53 | 102 |
| Apple MacStudio M2 Max 64GB | 1 | 8.53 | 8.53 | 103 |
| Apple MacStudio M2 Max 96GB | 1 | 8.53 | 8.53 | 104 |
| Apple MacBook Pro M3 Max 64GB | 1 | 8.53 | 8.53 | 105 |
| Apple MacBook Pro M3 Max 128GB | 1 | 8.53 | 8.53 | 106 |
| Intel Arc B570 10GB | 8 | 63.33 | 7.91625 | 107 |
| NVIDIA GeForce RTX 3060 12GB | 4 | 30 | 7.5 | 108 |
| NVIDIA RTX 4000 Ada 20GB | 4 | 30 | 7.5 | 109 |
| NVIDIA Tesla P40 24GB | 2 | 14.46 | 7.23 | 110 |
| NVIDIA Tesla M10 32GB | 2 | 13.87 | 6.935 | 111 |
| NVIDIA Tesla M60 16GB | 4 | 26.73 | 6.6825 | 112 |
| NVIDIA RTX 4000 SFF Ada 20GB | 4 | 26.67 | 6.6675 | 113 |
| NVIDIA Tesla T4/T4G 16GB | 4 | 26.67 | 6.6675 | 114 |
| NVIDIA L4 24GB | 2 | 12.5 | 6.25 | 115 |
| DDR5 8 Channel 4800 512GB | 1 | 6.25 | 6.25 | 116 |
| NVIDIA Tesla M40 24GB | 2 | 12.02 | 6.01 | 117 |
| NVIDIA Tesla M40 12GB | 4 | 24.03 | 6.0075 | 118 |
| NVIDIA GeForce RTX 4060 Ti 8GB | 8 | 48 | 6 | 119 |
| NVIDIA RTX A2000 12GB Ampere | 4 | 24 | 6 | 120 |
| NVIDIA GeForce RTX 4060 Ti 16GB | 4 | 24 | 6 | 121 |
| Apple MacMini M4 Pro 48GB | 2 | 11.38 | 5.69 | 122 |
| Apple MacMini M4 Pro 64GB | 1 | 5.69 | 5.69 | 123 |
| Apple MacBook Pro M4 Pro 64GB | 1 | 5.69 | 5.69 | 124 |
| Apple MacMini M4 Pro 24GB | 4 | 22.75 | 5.6875 | 125 |
| NVIDIA GeForce RTX 4060 8GB | 8 | 45.33 | 5.66625 | 126 |
| NVIDIA Quadro P4000 8GB | 8 | 40.55 | 5.06875 | 127 |
| NVIDIA GeForce RTX 3060 8GB | 8 | 40 | 5 | 128 |
| NVIDIA RTX 2000 Ada 16GB | 4 | 18.67 | 4.6675 | 129 |
| NVIDIA GeForce RTX 3050 8GB | 8 | 37.33 | 4.66625 | 130 |
| NVIDIA Jetson AGX Orin 64GB | 1 | 4.27 | 4.27 | 131 |
| NVIDIA Jetson AGX Orin 32GB | 2 | 8.53 | 4.265 | 132 |
| NVIDIA A2 16GB | 4 | 16.68 | 4.17 | 133 |
| DDR4 8 Channel 3200 512GB (EPYC SP3 LGA-4189) | 1 | 4.17 | 4.17 | 134 |
| Apple MacMini M2 Pro 16GB | 8 | 33.33 | 4.16625 | 135 |
| Apple MacMini M2 Pro 32GB | 2 | 8.33 | 4.165 | 136 |
| NVIDIA Tesla P4 8GB | 8 | 32.05 | 4.00625 | 137 |
| NVIDIA RTX A1000 Ampere 8GB | 8 | 32 | 4 | 138 |
| NVIDIA T1000 8GB Turing | 8 | 26.67 | 3.33375 | 139 |
| Apple MacBook Pro M3 Pro 36GB | 2 | 6.4 | 3.2 | 140 |
| DDR4 6 Channel 2933 384GB (LGA-3647) | 1 | 2.85 | 2.85 | 141 |
| NVIDIA Jetson AGX Xavier 16GB | 4 | 11.38 | 2.845 | 142 |
| NVIDIA Jetson AGX Xavier 32GB | 2 | 5.69 | 2.845 | 143 |
| Apple MacMini M4 16GB | 8 | 20 | 2.5 | 144 |
| Apple MacMini M4 24GB | 4 | 10 | 2.5 | 145 |
| Apple MacMini M4 32GB | 2 | 5 | 2.5 | 146 |
| Apple MacBook Pro M4 32GB | 2 | 5 | 2.5 | 147 |
| NVIDIA Jetson Orin NX 8GB | 8 | 17.07 | 2.13375 | 148 |
| NVIDIA Jetson Orin NX 16GB | 4 | 8.53 | 2.1325 | 149 |
| Apple MacBook Pro M3 24GB | 4 | 8.53 | 2.1325 | 150 |
| Jetson Orin Nano Super 8GB | 8 | 17 | 2.125 | 151 |
| Apple MacMini M2 16GB | 8 | 16.67 | 2.08375 | 152 |
| Apple MacMini M2 24GB | 4 | 8.33 | 2.0825 | 153 |
| DDR4 4 Channel 3200 256GB (LGA-2011-3) | 1 | 1.56 | 1.56 | 154 |
| NVIDIA Jetson Orin Nano 8GB | 8 | 11.38 | 1.4225 | 155 |
| Apple MacMini M1 16GB | 8 | 11.11 | 1.38875 | 156 |
| NVIDIA Jetson Xavier NX 16GB | 4 | 4.98 | 1.245 | 157 |
| NVIDIA Jetson Xavier NX 8GB | 8 | 9.95 | 1.24375 | 158 |
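
The per-GPU average and the rank in the list above follow from the GPU count and the total tokens per second. A minimal sketch of that arithmetic, using three rows taken from the list, is shown below. The `Entry` type and the `rank_by_per_card_average` function are hypothetical names introduced here for illustration, not part of the kcores data.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    name: str         # GPU or system name
    cards: int        # number of GPUs used to hold the model
    total_tps: float  # theoretical total tokens per second across all GPUs


def rank_by_per_card_average(entries: List[Entry]) -> List[Tuple[int, str, int, float, float]]:
    """Return (rank, name, cards, total, per-card average), best average first."""
    rows = [(e.name, e.cards, e.total_tps, e.total_tps / e.cards) for e in entries]
    # Sort by per-card average, descending; ties keep their input order.
    rows.sort(key=lambda row: row[3], reverse=True)
    return [(rank, *row) for rank, row in enumerate(rows, start=1)]


sample = [
    Entry("NVIDIA GeForce RTX 4090 24GB", 2, 42.08),
    Entry("NVIDIA GeForce RTX 5090 32GB (Preliminary)", 2, 74.67),
    Entry("NVIDIA RTX 6000 Ada 48GB", 1, 20.0),
]

ranked = rank_by_per_card_average(sample)
for rank, name, cards, total, average in ranked:
    print(f"{rank}. {name}: {cards} x {average:.3f} = {total:.2f} tokens/s")
```

Ranking by the per-card average rather than the total is what places a single 48GB card above multi-card setups of slower GPUs, even when those setups have a higher combined throughput.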

The original ladder is as follows:

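### RT-DETRv3 Network Structure Analysis

RT-DETRv3 is a Transformer-based, real-time, end-to-end object detection algorithm. Its core idea is hierarchical dense positive supervision, combined with a series of new training strategies, to address the slow convergence and insufficient decoder training of traditional DETR models. Its main structural features are as follows.

#### 1. **CNN-based auxiliary branch**

To strengthen the encoder's feature representation, RT-DETRv3 introduces an auxiliary branch based on a convolutional neural network (CNN)[^3]. This branch provides dense supervision signals and works together with the original decoder to improve overall performance.

```python
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryBranch(nn.Module):
    """Illustrative conv + batch-norm + ReLU block for an auxiliary branch."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # A 3x3 convolution with padding=1 keeps the spatial size unchanged.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))
```

This part of the design draws on conventional CNN architectures, such as the CSPNet and PAN structures used in the YOLO series[^2], which serve to improve feature-extraction efficiency and reduce computational overhead.

---

#### 2. **Self-attention perturbation learning strategy**

To address insufficient decoder training, RT-DETRv3 proposes a new learning strategy called *self-attention perturbation*. It diversifies label assignment for positive samples across multiple query groups, which increases the number of positive samples and in turn improves the model's learning capacity and generalization.

Concretely, the attention weight distribution is adjusted dynamically during training so that more high-quality queries can be matched with the ground truth.

---

#### 3. **Shared-weight decoder branch**

Beyond the improvements above, RT-DETRv3 also introduces a shared-weight decoder branch dedicated to providing dense positive supervision. Because the branch shares its weights with the main decoder, it adds no separate set of decoder parameters. According to the RT-DETRv3 paper, these supervision modules are used only during training, so they do not increase inference time, which keeps the model suitable for real-time use.

```python
import torch.nn as nn


class SharedDecoderEncoder(nn.Module):
    """Illustrative Transformer decoder stack used as the shared branch."""

    def __init__(self, d_model, nhead, num_layers):
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

    def forward(self, tgt, memory):
        # With the default batch_first=False, inputs are (sequence, batch, d_model).
        return self.decoder(tgt=tgt, memory=memory)
```

In this way, RT-DETRv3 keeps the detection pipeline efficient: it improves accuracy without adding inference latency.

---

#### 4. **Relationship to other models**

Notably, RT-DETRv3 does not abandon classic CNN techniques; it combines them with a Transformer to form a hybrid architecture[^4]. For example, it adopts the RepNCSP module from the YOLO series in place of redundant multi-scale self-attention layers, which reduces unnecessary computation.

In addition, RT-DETRv3 builds on DETR's one-to-one matching strategy and optimizes it further, improving small-object detection.

---

### Summary

In summary, the key components of the RT-DETRv3 network structure are the CNN-based auxiliary branch, the self-attention perturbation learning strategy, the shared-weight decoder branch, and the hybrid encoder design. Together, these contributions improve real-time object detection, particularly in complex scenes.

---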