'''
Reference notes
1. Distributed training: parallelize computation across processes and across machines.
2. torch.distributed.init_process_group() initializes the process group; you must specify the
   communication backend for the workers, usually 'nccl' (NVIDIA Collective Communications
   Library). The convention is one process per GPU.
   NCCL handles GPU-to-GPU communication and is what carries model parameters and gradients
   between the training nodes.
3. DDP makes all workers in the process group communicate with each other (gradients are
   averaged across ranks during backward()).
4. The batch_size in torch.utils.data.DataLoader is the batch size per process, so the
   effective global batch size is this batch_size multiplied by the number of parallel
   processes (world_size) -- see the sketch below.
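
A minimal sketch of per-process data loading (train_dataset is a placeholder for whatever
Dataset you use; the numbers are only illustrative):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # DistributedSampler gives each process a disjoint shard of the dataset
    sampler = DistributedSampler(train_dataset)
    # batch_size=32 is per process; with world_size=8 the global batch size is 8 * 32 = 256
    loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
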
#########################################################################
rank: the index of a process within the whole distributed job; local_rank: the index of a
process within its own node.
nnodes: the number of physical nodes (a node can be a machine or a container and may hold
several GPUs).
node_rank: the index of a physical node.
nproc_per_node: the number of processes launched on each physical node.
########################################
Quick exercise: each node has 16 GPUs, nproc_per_node=8, nnodes=3, and this machine has
node_rank=2. What is world_size?
Answer: world_size = 3 * 8 = 24 (the GPU count per node and the node_rank do not matter here).
Conclusion: world_size = nproc_per_node * nnodes
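
Inside a worker these values are usually read from environment variables exported by the
launcher (a sketch, assuming the job was started with torchrun, which sets these variables):

    import os

    rank       = int(os.environ['RANK'])        # global index of this process
    local_rank = int(os.environ['LOCAL_RANK'])  # index of this process on its own node
    world_size = int(os.environ['WORLD_SIZE'])  # nproc_per_node * nnodes
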
##########################################################################
How to launch the processes: torch.distributed.launch or torchrun (torchrun is the newer,
recommended entry point).
Typical single-node launch: python3 -m torch.distributed.launch --nproc_per_node 2 main.py
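
torchrun equivalents (a sketch; the master address and port below are placeholders):

    # single node, 2 processes (one per GPU)
    torchrun --nproc_per_node 2 main.py

    # 3 nodes x 8 processes each; run this on the node whose node_rank is 0
    torchrun --nnodes 3 --node_rank 0 --nproc_per_node 8 \
             --master_addr 192.168.0.1 --master_port 29500 main.py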
'''
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
class ToyModel(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(10, 10)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(10, 5)
def forward(self, x):
return self.fc2(self.relu(self.fc1(x)))
def demo_basic():
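    # With no explicit init_method, init_process_group uses the env:// defaults:
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are read from the environment
    # set by the launcher (torchrun / torch.distributed.launch)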
dist.init_process_group('nccl')
rank = dist.get_rank()
print(f'running on {rank}')
# get the number of GPUs available
n_gpus = torch.cuda.device_count()
    # map this process's rank to a GPU on its node
    device_id = rank % n_gpus
model = ToyModel().to(device_id)
    '''
    With multiple GPUs and multiple worker processes, setting broadcast_buffers=False raised:
    'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
    (when checking argument for argument mat1 in method wrapper_CUDA_addmm)'
    Example:
        ddp_model = DDP(model, broadcast_buffers=False)
    '''
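    # DDP wraps the local model; during backward() it all-reduces (averages) gradients
    # across every rank so the model replicas stay in sync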
ddp_model = DDP(model, device_ids=[device_id])
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
optimizer.zero_grad()
    inputs = torch.randn(3, 10).to(device_id)  # inputs must be on the same device as the model
    labels = torch.randn(3, 5).to(device_id)   # fc2 outputs 5 features, so labels are (3, 5)
output = ddp_model(inputs)
loss = loss_fn(output, labels)
loss.backward()
optimizer.step()
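    # release the process group once training is done
    dist.destroy_process_group()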
if __name__ == '__main__':
demo_basic()