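Machine learning researcher Sebastian Raschka recently released a long-form tutorial, "Converting Llama 2 to Llama 3.2 From Scratch".
- Post link: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb
It is a follow-up to "Converting a From-Scratch GPT Architecture to Llama 2" and shows how to convert Meta's Llama 2 architecture step by step into Llama 3, Llama 3.1, and Llama 3.2. To avoid unnecessary length, the explanations are deliberately kept to a minimum and the focus is on the main code.
机器之心 (Jiqizhixin) compiled the article without changing its original meaning: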
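1 Converting the Llama model implementation step by step
If you are new to implementing LLM architectures, start with chapter 4 of "Build a Large Language Model From Scratch" (https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch04/01_main-chapter-code/ch04.ipynb), which walks through implementing the original GPT architecture step by step.
Then see "Converting a From-Scratch GPT Architecture to Llama 2" (https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb), which implements the Llama-specific components: RMSNorm layers, SiLU and SwiGLU activations, RoPE (rotary position embeddings), and the SentencePiece tokenizer.
This notebook takes the Llama 2 architecture and converts it to the Llama 3 architecture by:
- modifying the rotary embeddings
- implementing grouped-query attention
- using a customized version of the GPT-4 tokenizer
Afterwards, we load the original Llama 3 weights shared by Meta into the architecture.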
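1.1 Reusing Llama 2 components
Llama 2 is in fact very similar to Llama 3, as noted above and shown in the figure at the top of this article.
This means we can import several building blocks from the Llama 2 notebook with the following code: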
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "converting-gpt-to-llama2"
    names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]

    return import_definitions_from_notebook(fullname, names)
imported_module = import_from_notebook()
# We need to redefine precompute_rope_params
# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)
compute_rope = getattr(imported_module, "compute_rope", None)
SiLU = getattr(imported_module, "SiLU", None)
FeedForward = getattr(imported_module, "FeedForward", None)
RMSNorm = getattr(imported_module, "RMSNorm", None)
# MultiHeadAttention only for comparison purposes
MultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)
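Because `getattr(..., None)` quietly falls back to `None`, a definition that is renamed or missing in the Llama 2 notebook would only surface later, as a confusing error at the point of use. One way to catch this earlier is sketched below. `require_attrs` is a hypothetical helper that is not part of the original notebook, and it is shown here on a stand-in module rather than on the real `imported_module`.

```python
import types


def require_attrs(module, names):
    """Return {name: attribute}, raising if any requested name is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    missing = [name for name in names if not hasattr(module, name)]
    if missing:
        raise ImportError(f"Missing definitions in {module.__name__}: {missing}")
    return {name: getattr(module, name) for name in names}


# Stand-in for the module returned by import_from_notebook()
demo_module = types.ModuleType("demo_module")
demo_module.compute_rope = lambda x, cos, sin: x
demo_module.RMSNorm = type("RMSNorm", (), {})

# All requested names exist, so a dictionary of attributes comes back
found = require_attrs(demo_module, ["compute_rope", "RMSNorm"])

# A missing name is reported immediately instead of becoming None
try:
    require_attrs(demo_module, ["FeedForward"])
    missing_detected = False
except ImportError:
    missing_detected = True
```

In the notebook setting, the same call could be made on `imported_module` with the list of names that are actually used.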
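1.2 Modified RoPE
Llama 3 uses RoPE in much the same way as Llama 2; see the RoPE paper (https://arxiv.org/abs/2104.09864).
There are, however, some subtle differences in the RoPE settings. Llama 3 now supports up to 8192 tokens, twice as many as Llama 2 (4096).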
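The base value of the RoPE $\theta$ was increased from 10,000 (Llama 2) to 500,000 (Llama 3). The equation, adapted from the RoPE paper, is:

$$\Theta = \left\{\theta_i = \text{base}^{\frac{-2(i-1)}{d}},\ i \in \left[1, 2, \ldots, d/2\right]\right\}$$

These $\theta$ values are a set of predefined parameters that determine the rotation angles in the rotary matrix, where $d$ is the dimensionality of the embedding space.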
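Raising the base from 10,000 to 500,000 lowers the frequency (the rotation angle per position) of every dimension pair after the first, and the reduction grows with the dimension index. The highest dimensions therefore rotate far more slowly, and their wavelengths stretch across a much longer range of positions.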
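The effect of the base can be checked numerically. The sketch below uses plain Python instead of PyTorch and a small `head_dim` of 16 purely for illustration. `rope_inv_freq` is a hypothetical helper that mirrors the inverse-frequency line of `precompute_rope_params` further down.

```python
import math


def rope_inv_freq(head_dim, theta_base):
    """Inverse frequencies base^(-j / (d/2)) for j = 0 .. d/2 - 1."""
    half = head_dim // 2
    return [theta_base ** (-j / half) for j in range(half)]


head_dim = 16  # small illustrative value, not a real model setting
freq_10k = rope_inv_freq(head_dim, 10_000)
freq_500k = rope_inv_freq(head_dim, 500_000)

for j, (f_small, f_large) in enumerate(zip(freq_10k, freq_500k)):
    # Wavelength = number of positions needed for one full rotation of this pair
    print(
        f"pair {j}: base=10k -> {f_small:.2e} (wavelength {2 * math.pi / f_small:,.0f}), "
        f"base=500k -> {f_large:.2e} (wavelength {2 * math.pi / f_large:,.0f})"
    )
```

The first pair is identical for both bases (the exponent is zero). For every later pair the larger base gives a smaller inverse frequency and hence a longer wavelength.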
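In addition, the code below introduces a `freq_config` section that adjusts the frequencies. It is not needed for Llama 3 (only for Llama 3.1 and Llama 3.2), so we will revisit `freq_config` later; it defaults to `None` and is then ignored.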
import torch

def precompute_rope_params(head_dim, theta_base=10000, context_length=4096, freq_config=None):
    assert head_dim % 2 == 0, "Embedding dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

    ################################ NEW ###############################################
    # Frequency adjustments
    if freq_config is not None:
        low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
        high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]

        wavelen = 2 * torch.pi / inv_freq

        inv_freq_llama = torch.where(
            wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
        )

        smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
            freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
        )

        smoothed_inv_freq = (
            (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
        )

        is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
        inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
        inv_freq = inv_freq_llama
    ####################################################################################

    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Expand angles to match the head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin
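The `freq_config` branch is easier to follow for a single frequency. It sorts each wavelength into one of three bands: long wavelengths are scaled down by `factor`, short ones are left alone, and those in between are blended. The scalar sketch below is a plain-Python restatement of that branch. `adjust_inv_freq` is a hypothetical name, and the configuration values are assumed for illustration only.

```python
import math


def adjust_inv_freq(inv_freq, cfg):
    """Scalar version of the freq_config branch for one inverse frequency."""
    low_freq_wavelen = cfg["original_context_length"] / cfg["low_freq_factor"]
    high_freq_wavelen = cfg["original_context_length"] / cfg["high_freq_factor"]
    wavelen = 2 * math.pi / inv_freq

    if wavelen > low_freq_wavelen:
        # Long wavelength (low frequency): scale down by the full factor
        return inv_freq / cfg["factor"]
    if wavelen < high_freq_wavelen:
        # Short wavelength (high frequency): leave unchanged
        return inv_freq
    # Medium band: interpolate between the scaled and the unscaled value
    smooth = (cfg["original_context_length"] / wavelen - cfg["low_freq_factor"]) / (
        cfg["high_freq_factor"] - cfg["low_freq_factor"]
    )
    return (1 - smooth) * (inv_freq / cfg["factor"]) + smooth * inv_freq


# Illustrative values only (assumed for this example)
example_cfg = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_context_length": 8192,
}

high = adjust_inv_freq(1.0, example_cfg)                   # wavelength ~6.28: unchanged
low = adjust_inv_freq(1e-5, example_cfg)                   # wavelength ~628,319: divided by 8
medium = adjust_inv_freq(2 * math.pi / 4096, example_cfg)  # wavelength 4096: blended
```

With these assumed values the band edges sit at wavelengths 2048 and 8192, so a wavelength of 4096 gets a smoothing weight of 1/3.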
In summary, what changes in Llama 3 compared with Llama 2 is the context length and the theta base parameter:
# Instantiate RoPE parameters
llama_2_context_len = 4096
llama_3_context_len = 8192
llama_2_theta_base = 10_000
llama_3_theta_base = 500_000
The usage is the same as before in the Llama 2 notebook:
# Settings
batch_size = 2
num_heads = 4
head_dim = 16

# Instantiate RoPE parameters
cos, sin = precompute_rope_params(
    head_dim=head_dim,
    theta_base=llama_3_theta_base,
    context_length=llama_3_context_len
)

# Dummy query and key tensors, shaped (batch_size, num_heads, num_tokens, head_dim),
# which is the layout in which compute_rope is called inside the attention module
torch.manual_seed(123)
queries = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)
keys = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)

# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)
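`compute_rope` is imported from the Llama 2 notebook and is not shown here. As a rough picture of what such a rotation does, the sketch below rotates a single head vector in plain Python. It assumes the common convention of pairing element `j` with element `j + d/2`, which is consistent with the `torch.cat([angles, angles])` step in `precompute_rope_params`. `rope_rotate` and `dot` are hypothetical names, not the notebook's implementation.

```python
import math


def rope_rotate(vec, pos, theta_base=10_000):
    """Rotate one head vector at position `pos`, pairing element j with j + d/2."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for j in range(half):
        angle = pos * theta_base ** (-j / half)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[j], vec[j + half]
        out[j] = x1 * c - x2 * s
        out[j + half] = x2 * c + x1 * s
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The score between a query at position m and a key at position n
# depends only on the offset m - n, not on the absolute positions.
score_near_start = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_shifted = dot(rope_rotate(q, 105), rope_rotate(k, 103))
```

Both scores use an offset of 2 and agree up to floating-point error. This relative-position property is what makes rotary embeddings attractive for attention.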
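1.3 Grouped-query attention
This section replaces multi-head attention (MHA) with an alternative mechanism called grouped-query attention (GQA). In short, GQA can be thought of as a more compute- and parameter-efficient version of MHA.
In GQA, the number of key and value projections is reduced by sharing them across multiple attention heads. Each attention head still has its own query, but these queries attend to the same group of keys and values.
Below is an illustration of GQA with 2 key-value groups:
The main idea behind GQA is to reduce the number of unique key-value heads by letting several query heads share each one. This shrinks some of the matrix multiplications and the parameter count of MHA without significantly hurting modeling performance.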
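A quick back-of-the-envelope calculation makes the saving concrete. The sketch below counts only the key and value projection weights and shows which key-value group each query head would read from, assuming groups are repeated for consecutive heads (the behaviour of `repeat_interleave`). Both helper names and the sizes are assumptions chosen for illustration.

```python
def kv_projection_params(d_in, d_out, num_heads, num_kv_groups):
    """Weight counts (no bias) for the key and value projections combined."""
    head_dim = d_out // num_heads
    mha = 2 * d_in * d_out                     # one key/value head per query head
    gqa = 2 * d_in * num_kv_groups * head_dim  # one key/value head per group
    return mha, gqa


def kv_group_for_head(head_idx, num_heads, num_kv_groups):
    """Key-value group read by a query head when groups repeat for consecutive heads."""
    group_size = num_heads // num_kv_groups
    return head_idx // group_size


# Illustrative sizes (assumed for this example)
mha_params, gqa_params = kv_projection_params(
    d_in=4096, d_out=4096, num_heads=32, num_kv_groups=8
)
print(f"MHA key+value weights: {mha_params:,}")
print(f"GQA key+value weights: {gqa_params:,}")

# With 8 query heads and 2 key-value groups, heads 0-3 share group 0 and heads 4-7 share group 1
head_to_group = [kv_group_for_head(h, num_heads=8, num_kv_groups=2) for h in range(8)]
```

With these sizes the key and value projections shrink by a factor of four, equal to `num_heads / num_kv_groups`, while the query and output projections are unchanged.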
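In short, the main change in GQA is that each key-value group has to be repeated to match the number of query heads that share it, as implemented below: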
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(
            self, d_in, d_out, context_length, num_heads,
            num_kv_groups,       # NEW
            rope_base=10_000,    # NEW
            rope_config=None,    # NEW
            dtype=None
    ):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        ############################# NEW #############################
        # self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        # self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        ################################################################

        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)

        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
        cos, sin = precompute_rope_params(
            head_dim=self.head_dim,
            theta_base=rope_base,      # NEW
            freq_config=rope_config,   # NEW
            context_length=context_length  # follow the constructor argument instead of a hard-coded 8192
        )

        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)
        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)

        # Reshape queries, keys, and values
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        ##################### NEW #####################
        # keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        # values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        ################################################

        # Transpose keys, values, and queries
        keys = keys.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        values = values.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Apply RoPE
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)

        ##################### NEW #####################
        # Expand keys and values to match the number of heads
        # Shape: (b, num_heads, num_tokens, head_dim)
        keys = keys.repeat_interleave