Sebastian Raschka's Latest Blog Post: Building Llama 3.2 From Scratch, Starting From Llama 2

Article link: https://blog.csdn.net/2401_86435672/article/details/142979742


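Machine learning researcher Sebastian Raschka recently published, at remarkable speed, a long-form tutorial titled "Converting Llama 2 to Llama 3.2 From Scratch".

Tutorial link: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb

This post is a follow-up to "Converting a From-Scratch GPT Architecture to Llama 2". It covers how to convert Meta's Llama 2 architecture step by step into Llama 3, Llama 3.1, and Llama 3.2. To avoid unnecessary verbosity, the explanations are deliberately kept to a minimum and the focus is on the main code.

Jiqizhixin (机器之心) has compiled the article below without changing its original meaning: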

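1 Converting the Llama model implementation step by step

If you are new to implementing LLM architectures, we recommend starting with chapter 4 of "Build a Large Language Model From Scratch" (https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch04/01_main-chapter-code/ch04.ipynb), which walks you through implementing the original GPT architecture step by step.

Then see "Converting a From-Scratch GPT Architecture to Llama 2" (https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb), which implements the Llama-specific components: RMSNorm layers, SiLU and SwiGLU activations, RoPE (rotary position embeddings), and the SentencePiece tokenizer.

This notebook takes the Llama 2 architecture and converts it to the Llama 3 architecture by:

Modifying the rotary embeddings
Implementing grouped-query attention
Using a customized version of the GPT-4 tokenizer

Afterwards, we load the original Llama 3 weights shared by Meta into the architecture.

1.1 Reusing Llama 2 components

Llama 2 is actually very similar to Llama 3, as described above and shown in the figure at the beginning of this article.

This means we can import several building blocks from the Llama 2 notebook using the following code: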

import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "converting-gpt-to-llama2"
    names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]

    return import_definitions_from_notebook(fullname, names)

imported_module = import_from_notebook()
# We need to redefine precompute_rope_params
# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)
compute_rope = getattr(imported_module, "compute_rope", None)
SiLU = getattr(imported_module, "SiLU", None)
FeedForward = getattr(imported_module, "FeedForward", None)
RMSNorm = getattr(imported_module, "RMSNorm", None)
# MultiHeadAttention only for comparison purposes
MultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)

1.2 Modified RoPE

Llama 3 uses RoPE much like Llama 2 does; see the RoPE paper (https://arxiv.org/abs/2104.09864) for details.

There are, however, some subtle differences in the RoPE settings. Llama 3 now supports up to 8192 tokens, twice as many as Llama 2 (4096).

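The base value of RoPE (see the formula below) was increased from 10,000 (Llama 2) to 500,000 (Llama 3). The formula, adapted from the RoPE paper, is:

$$\Theta = \left\{\theta_i = \text{base}^{\frac{-2(i-1)}{d}},\ i \in \left[1, 2, \ldots, d/2\right]\right\}$$

These values are a set of predefined parameters that determine the rotation angles in the rotation matrix, where d is the dimensionality of the embedding space.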

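Raising the base from 10,000 to 500,000 makes every frequency (rotation angle per position) beyond the first dimension pair smaller: the angle of the last pair falls toward 1/500,000 instead of 1/10,000. The higher dimensions therefore rotate more slowly and have longer wavelengths, so the frequencies are spread over a wider range and the positional signal stretches over a longer context.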
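
The effect of the base can be checked numerically. The sketch below evaluates the inverse frequencies for both bases with the same expression used in precompute_rope_params. The value head_dim = 16 and the helper name inverse_frequencies are illustrative assumptions, not part of the notebook.

```python
import torch

def inverse_frequencies(head_dim, theta_base):
    # Hypothetical helper: theta_base^(-2i/head_dim) for i = 0 .. head_dim/2 - 1
    return 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

head_dim = 16  # illustrative size only
freq_llama2 = inverse_frequencies(head_dim, theta_base=10_000)
freq_llama3 = inverse_frequencies(head_dim, theta_base=500_000)

# Wavelength: number of positions needed for one full rotation of a dimension pair
wavelen_llama2 = 2 * torch.pi / freq_llama2
wavelen_llama3 = 2 * torch.pi / freq_llama3

rows = zip(freq_llama2.tolist(), freq_llama3.tolist(),
           wavelen_llama2.tolist(), wavelen_llama3.tolist())
for i, (f2, f3, w2, w3) in enumerate(rows):
    print(f"pair {i}: angle per position {f2:.2e} -> {f3:.2e}, wavelength {w2:.1f} -> {w3:.1f}")
```

With this expression, the first pair rotates by 1 radian per position for either base, while every later pair gets a smaller angle and a longer wavelength under the larger base.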

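In addition, the code below introduces a freq_config section that adjusts the frequencies. Llama 3 does not need it (only Llama 3.1 and Llama 3.2 do), so we will revisit freq_config later; it defaults to None and is ignored.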

import torch

def precompute_rope_params(head_dim, theta_base=10000, context_length=4096, freq_config=None):
    assert head_dim % 2 == 0, "Embedding dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

    ################################ NEW ###############################################
    # Frequency adjustments
    if freq_config is not None:
        low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
        high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]

        wavelen = 2 * torch.pi / inv_freq

        inv_freq_llama = torch.where(
            wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
        )

        smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
            freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
        )

        smoothed_inv_freq = (
            (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
        )

        is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
        inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
        inv_freq = inv_freq_llama
    ####################################################################################

    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Expand angles to match the head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin
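
The frequency-adjustment branch can be read as a three-band rule applied to each frequency. The sketch below is a scalar rewrite of that branch for illustration: the function name adjust_inv_freq is hypothetical, and the values in example_config are assumptions chosen to exercise all three bands, not settings quoted from this article.

```python
import math

def adjust_inv_freq(inv_freq, freq_config):
    # Hypothetical scalar version of the freq_config branch in precompute_rope_params
    low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
    high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]
    wavelen = 2 * math.pi / inv_freq

    if wavelen > low_freq_wavelen:
        # Long wavelength (low frequency): divide by the full scaling factor
        return inv_freq / freq_config["factor"]
    if wavelen < high_freq_wavelen:
        # Short wavelength (high frequency): leave unchanged
        return inv_freq
    # In between: blend linearly between the scaled and the unscaled value
    smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
        freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
    )
    return (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq

# Assumed example values; the keys match the ones precompute_rope_params reads
example_config = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_context_length": 8192,
}

mid_freq = 2 * math.pi / 4096  # wavelength of 4096 positions, between the two thresholds
for name, freq in [("high", 1.0), ("medium", mid_freq), ("low", 1e-4)]:
    adjusted = adjust_inv_freq(freq, example_config)
    print(f"{name} frequency: scaled by {adjusted / freq:.4f}")
```

Under these assumed values the thresholds sit at wavelengths of 2048 and 8192 positions, so only frequencies whose wavelength exceeds 2048 positions are modified.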

In summary, what is new in Llama 3 compared to Llama 2 are the context length and the theta base parameter:

# Instantiate RoPE parameters
llama_2_context_len = 4096
llama_3_context_len = 8192
llama_2_theta_base = 10_000
llama_3_theta_base = 500_000

The usage is the same as before with Llama 2:

# Settings
batch_size = 2
num_heads = 4
head_dim = 16

# Instantiate RoPE parameters
cos, sin = precompute_rope_params(
    head_dim=head_dim,
    theta_base=llama_3_theta_base,
    context_length=llama_3_context_len
)

# Dummy query and key tensors
# Shape: (batch_size, num_heads, context_len, head_dim), the layout compute_rope receives in the attention module
torch.manual_seed(123)
queries = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)
keys = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)

# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)

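1.3 Grouped-query attention

This section replaces multi-head attention (MHA) with an alternative mechanism called grouped-query attention (GQA). In short, GQA can be seen as a more compute- and parameter-efficient version of MHA.

In GQA, the number of key and value projections is reduced by sharing them among multiple attention heads. Each attention head still has its own query, but these queries attend to the same group of keys and values.

Below is an illustration of GQA with 2 key-value groups: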
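
As a small numeric stand-in for that picture, the sketch below builds keys for 2 key-value groups and expands them to 4 query heads with repeat_interleave. All sizes and tensor contents are illustrative assumptions chosen to make the sharing visible.

```python
import torch

num_heads, num_kv_groups, head_dim, num_tokens = 4, 2, 3, 1  # illustrative sizes
group_size = num_heads // num_kv_groups  # query heads per key-value group

# One key tensor per group: group 0 holds all 0s, group 1 holds all 1s
keys = torch.stack([torch.full((num_tokens, head_dim), float(g)) for g in range(num_kv_groups)])
keys = keys.unsqueeze(0)  # Shape: (1, num_kv_groups, num_tokens, head_dim)

# Repeat each group so that every query head has a matching key tensor
keys_expanded = keys.repeat_interleave(group_size, dim=1)  # Shape: (1, num_heads, num_tokens, head_dim)

# Each head still has its own queries
torch.manual_seed(123)
queries = torch.randn(1, num_heads, num_tokens, head_dim)
attn_scores = queries @ keys_expanded.transpose(2, 3)  # Shape: (1, num_heads, num_tokens, num_tokens)

print(keys_expanded[0, :, 0, 0].tolist())  # [0.0, 0.0, 1.0, 1.0]: heads 0-1 use group 0, heads 2-3 use group 1
```

Heads 0 and 1 read the same keys, as do heads 2 and 3, while all four heads keep distinct queries.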

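The main idea behind GQA is to reduce the number of unique query groups that attend to the key-value pairs. This shrinks some of the matrix multiplications and the parameter count of MHA without significantly reducing modeling performance.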
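
To put a number on the parameter savings, the sketch below counts the weights in the key and value projections. The helper name kv_projection_params and the sizes (model width 4096, 32 heads, 8 key-value groups) are illustrative assumptions; MHA is treated as the special case with one key-value group per head.

```python
def kv_projection_params(d_in, d_out, num_heads, num_kv_groups):
    # Hypothetical helper: weights in the key and value projections (no bias terms)
    head_dim = d_out // num_heads
    return 2 * d_in * num_kv_groups * head_dim

d_model, num_heads = 4096, 32  # illustrative sizes

# MHA: every head has its own keys and values, so num_kv_groups == num_heads
mha_params = kv_projection_params(d_model, d_model, num_heads, num_kv_groups=num_heads)
# GQA: 8 key-value groups, each shared by 4 query heads
gqa_params = kv_projection_params(d_model, d_model, num_heads, num_kv_groups=8)

print(f"MHA: {mha_params:,}  GQA: {gqa_params:,}  ratio: {mha_params / gqa_params:.1f}x")
```

With these sizes the key and value projections shrink by a factor of 4 per attention layer, while the query and output projections stay the same.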

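In short, the main change in GQA is that the keys and values of each group are repeated to match the number of query heads that share them, as implemented below: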

import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(
            self, d_in, d_out, context_length, num_heads,
            num_kv_groups,       # NEW
            rope_base=10_000,    # NEW
            rope_config=None,    # NEW
            dtype=None
        ):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        ############################# NEW  #############################
        # self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        # self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        ################################################################

        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)

        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
        cos, sin = precompute_rope_params(
            head_dim=self.head_dim,
            theta_base=rope_base,      # NEW
            freq_config=rope_config,   # NEW
            context_length=context_length
        )
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)
        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)

        # Reshape queries, keys, and values
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        ##################### NEW  #####################
        # keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        # values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        ################################################

        # Transpose keys, values, and queries
        keys = keys.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        values = values.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Apply RoPE
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)

        ##################### NEW  #####################
        # Expand keys and values to match the number of heads
        # Shape: (b, num_heads, num_tokens, head_dim)

        keys = keys.repeat_interleave