I. Configuring accelerate config
You can run accelerate config for an interactive setup, or edit the generated config file directly.
accelerate config
The interactive session looks roughly like this:
(Qwen2-5-vl) zx@s-System-Product-Name:/whut_data/Yujg$ accelerate config
----------------------------------------
In which compute environment are you running?
This machine
----------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
----------------------------------------
What should be your DeepSpeed's ZeRO optimization stage?
2
----------------------------------------
Where to offload optimizer states?
none
----------------------------------------
Where to offload parameters?
none
How many gradient accumulation steps you're passing in your script? [1]: 1
Do you want to use gradient clipping? [yes/NO]: yes
What is the gradient clipping value? [1.0]: 1
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
----------------------------------------
Do you wish to use mixed precision?
bf16
accelerate configuration saved at /home/zx/.cache/huggingface/accelerate/default_config.yaml
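If you prefer to skip the prompts entirely, recent accelerate versions also ship a helper that writes a basic config from Python. A minimal sketch (note it only writes a plain single-machine config; the deepspeed_config section would still need to be added by hand or via DeepSpeedPlugin):

from accelerate.utils import write_basic_config

# Writes a basic single-machine config (GPU count auto-detected) to the
# default location, ~/.cache/huggingface/accelerate/default_config.yaml
write_basic_config(mixed_precision="bf16")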
When configuration finishes, a default_config.yaml file is generated; you can open it and edit the settings there directly:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config: # present only if you enabled DeepSpeed interactively; can also be set in code via DeepSpeedPlugin
  gradient_accumulation_steps: 1 # number of gradient accumulation steps
  gradient_clipping: 1.0
  offload_optimizer_device: none # offload optimizer states to CPU: cuts GPU memory use but slows training
  offload_param_device: none # offload model parameters to CPU: cuts GPU memory use but slows training
  zero3_init_flag: false
  zero_stage: 2 # ZeRO optimization stage: 1, 2, or 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16 # matches the bf16 choice made during the interactive setup
num_machines: 1 # number of machines: 1 = single node
num_processes: 2 # number of processes: one per GPU, so 2 for 2 GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
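Before a long run, it can be worth a quick sanity check that Accelerate actually picked up this file. A minimal sketch (run it with accelerate launch so the distributed fields are populated):

from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.state)          # distributed type, num processes, device, mixed precision
print(accelerator.num_processes)  # should be 2 for the config above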
There are two ways to supply the DeepSpeed part of the configuration:
(1) Enable it in the interactive accelerate config session and answer the DeepSpeed-related prompts.
Do you want to use DeepSpeed? [yes/NO]: yes
(2) Pass a DeepSpeedPlugin to the Accelerator in your code:
from accelerate import Accelerator, DeepSpeedPlugin
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_clipping=1.0)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
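If you answered "yes" to the JSON-file prompt instead, the equivalent in code is to hand the plugin a DeepSpeed JSON config. A sketch (ds_config.json is a placeholder path to your own DeepSpeed JSON file):

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)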
II. Hands-on: training a model with Accelerator
Let's use a simple neural network as the example.
1. Import the required packages
from accelerate import Accelerator, DeepSpeedPlugin
import torch
from torch.utils.data import TensorDataset, DataLoader
2. Define a simple network
class SimpleNet(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleNet, self).__init__()
        self.fc1 = torch.nn.Linear(input_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
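As a quick illustrative sanity check, the network maps batches of input_dim-sized vectors to output_dim-sized outputs:

net = SimpleNet(10, 20, 2)
print(net(torch.randn(4, 10)).shape)  # torch.Size([4, 2])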
3. The main function
def main():
    # Network hyperparameters
    input_dim = 10
    hidden_dim = 20
    output_dim = 2
    batch_size = 64
    data_size = 10000

    # Randomly generate inputs and labels
    input_data = torch.randn(data_size, input_dim)
    labels = torch.randn(data_size, output_dim)

    # Wrap the data in a DataLoader; shuffle=True randomizes sample order
    dataset = TensorDataset(input_data, labels)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Create the model
    model = SimpleNet(input_dim, hidden_dim, output_dim)

    # DeepSpeed is already configured in the YAML, so no plugin is passed here
    # deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1, gradient_clipping=1.0)
    # accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
    accelerator = Accelerator()

    # Create the optimizer and loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=0.00015)
    criterion = torch.nn.MSELoss()

    # The essential Accelerate step: wrap the model, dataloader, and optimizer with .prepare
    model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

    # Distributed training loop
    for epoch in range(1000):
        model.train()
        for batch in dataloader:
            inputs, targets = batch
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()       # clear gradients
            accelerator.backward(loss)  # must use accelerator.backward, not loss.backward
            optimizer.step()
        print(f"Epoch{epoch} loss{loss.item()}")

    # Unwrap the model before saving so the state dict carries no wrapper prefixes
    accelerator.wait_for_everyone()
    accelerator.save(accelerator.unwrap_model(model).state_dict(), "model.pth")

if __name__ == '__main__':
    main()
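The YAML above uses gradient_accumulation_steps: 1. If you raise it, the idiomatic Accelerate pattern is the accumulate() context manager, which handles gradient syncing across steps for you. A sketch of the inner loop under that assumption:

for batch in dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()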
4. Run:
accelerate launch --config_file acc_dp_cfg.yaml acc_dp.py
The script must be started with accelerate launch; --config_file points at your own YAML config file.
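Most YAML fields can also be overridden on the command line, for example the process count (same file names as above; see accelerate launch --help for the full list of flags):

accelerate launch --config_file acc_dp_cfg.yaml --num_processes 2 acc_dp.py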
Output:
Epoch0 loss0.9848306179046631
Epoch0 loss0.9311917424201965
Epoch1 loss1.0081628561019897
Epoch1 loss0.8537667989730835
Epoch2 loss1.153598666191101
Epoch2 loss1.3423147201538086
Epoch3 loss0.8436083197593689
Epoch3 loss0.9412739276885986
Epoch4 loss0.8727619647979736
Epoch4 loss0.9575796127319336
Epoch5 loss1.285043478012085
Epoch5 loss1.1377384662628174
Epoch6 loss1.0347694158554077
Epoch6 loss1.0022046566009521
Epoch7 loss1.0058305263519287
Epoch7 loss1.0705955028533936
Epoch8 loss1.0044686794281006
Epoch8 loss1.0007789134979248
With 2-GPU distributed training, two loss values are printed per epoch, one from each process.
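If you would rather log a single averaged loss per epoch, one option (a sketch) is to gather the per-process losses and print only on the main process:

avg_loss = accelerator.gather(loss.detach().unsqueeze(0)).mean()
if accelerator.is_main_process:
    print(f"Epoch{epoch} avg loss {avg_loss.item()}")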