Symptom
The following code:
import torch
# NCCL_NVLS_ENABLE=0
# https://huggingface.co/blog/huseinzol05/tensor-parallelism
# https://github.com/pytorch/elastic/tree/master/examples
# https://docs.pytorch.org/docs/stable/distributed.html#torch.distributed.gather
import torch.nn as nn
import torch.distributed as dist
import os
import like_logger

logger = like_logger.init_logger(__name__)


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_device = f'cuda:{local_rank}'
    logger.info(f'local_device:{local_device},world size: {world_size}')
    #dist.init_process_group(backend='nccl', init_method="env://", rank=local_rank, world_size=world_size)
    dist.init_process_group(backend='nccl')

    tensor_size = 2
    output_tensor = torch.zeros(tensor_size, device=local_device)
    logger.info(f'output tensor dev:{output_tensor.device}')
    if dist.get_rank() == 0:
        scatter_list = []
        for i in range(world_size):
            scatter_list.append(torch.ones(tensor_size, device=f'cuda:{local_rank}') * i * 10)
        all_dev = [elem.device for elem in scatter_list]
        logger.info(f"all dev:{all_dev}")
    else:
        scatter_list = None
    dist.scatter(output_tensor, scatter_list, src=0)
    logger.info(f'local rank: {local_rank}, output_tensor:{output_tensor}')
    output_tensor += 1
    logger.info(f"local rank {local_rank}, dist.get_rank():{dist.get_rank()}, output_tensor:{output_tensor}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
Run it on an H100 node:
torchrun --nproc-per-node=4 ../torchmo/temp/h3_tp.py
It fails with:
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 1 'invalid argument'
After repeated troubleshooting, the answer came from a Zhihu post (https://zhuanlan.zhihu.com/p/29263848323):
the environment variable NCCL_NVLS_ENABLE=0 must be set:
NCCL_NVLS_ENABLE=0 CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc-per-node=4 ../torchmo/temp/h3_tp.py
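If you would rather not prepend it to every launch command, the variable can also be set from inside the script; this is a minimal sketch of my own (not what the Zhihu post prescribes), relying on the fact that NCCL reads its environment variables when the communicator is created, which happens after process startup:

import os
# Hedged sketch: disable NVLS (NVLink SHARP) from within the script itself.
# This must run before the NCCL communicator is created (i.e. before
# init_process_group / the first collective); setdefault lets a value
# exported on the command line still take precedence.
os.environ.setdefault("NCCL_NVLS_ENABLE", "0")

import torch.distributed as dist
dist.init_process_group(backend="nccl")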
Multi-node, multi-GPU scatter
like@JYTFY-D1-308-H100-D01-4:~/package/torchmo$ cat temp/h20_scater_multi_gpu.py
import torch
import torch.distributed as dist
import os
import like_logger

logger = like_logger.init_logger(__name__)


def main():
    # Get distributed environment variables
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # Set the GPU for this process
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    log_info = []
    interest = ['LOCAL_RANK', 'RANK', 'GROUP_RANK', 'ROLE_RANK', 'LOCAL_WORLD_SIZE', 'WORLD_SIZE', 'GROUP_WORLD_SIZE', 'ROLE_WORLD_SIZE',]
    for elem in interest:
        env_value = os.getenv(elem)
        log_info.append(f"{elem}:{env_value}")
    log_line = ",".join(log_info)
    logger.info(log_line)
    #logger.info(f"envs:{os.environ.keys()}")

    # Initialize process group
    dist.init_process_group(backend="nccl", init_method="env://")

    tensor_size = 10
    recv_tensor = torch.empty(tensor_size, dtype=torch.float32, device=device)
    if rank == 0:
        # Create a (world_size x tensor_size) tensor
        #full_tensor = torch.randn(world_size, tensor_size, dtype=torch.float32, device=device)
        full_tensor = torch.arange(world_size * tensor_size).reshape((world_size, tensor_size)).float().to(device)
        scatter_list = [full_tensor[i].contiguous() for i in range(world_size)]
        print(f"[Rank {rank}] Scattering tensor:\n{full_tensor}")
        dist.scatter(recv_tensor, scatter_list=scatter_list, src=0)
    else:
        dist.scatter(recv_tensor, src=0)

    logger.info(f"[Rank {rank}] received tensor: {recv_tensor}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
Launch on the master node:
(torch2) like@JYTFY-D1-308-H100-D07-2:~/package/torchmo$ NCCL_NVLS_ENABLE=0 CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nnodes=2 --nproc-per-node=4 --node-rank=0 --rdzv-id=123 --rdzv-backend=c10d --rdzv-endpoint=10.157.101.103:29500 temp/h20_scater_multi_gpu.py
W0621 13:44:54.167000 733495 site-packages/torch/distributed/run.py:792]
W0621 13:44:54.167000 733495 site-packages/torch/distributed/run.py:792] *****************************************
W0621 13:44:54.167000 733495 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0621 13:44:54.167000 733495 site-packages/torch/distributed/run.py:792] *****************************************
733628 INFO 06-21 13:45:01 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:0,RANK:0,GROUP_RANK:0,ROLE_RANK:0,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
733631 INFO 06-21 13:45:02 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:3,RANK:3,GROUP_RANK:0,ROLE_RANK:3,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
733629 INFO 06-21 13:45:02 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:1,RANK:1,GROUP_RANK:0,ROLE_RANK:1,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
733630 INFO 06-21 13:45:02 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:2,RANK:2,GROUP_RANK:0,ROLE_RANK:2,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
[Rank 0] Scattering tensor:
tensor([[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[10., 11., 12., 13., 14., 15., 16., 17., 18., 19.],
[20., 21., 22., 23., 24., 25., 26., 27., 28., 29.],
[30., 31., 32., 33., 34., 35., 36., 37., 38., 39.],
[40., 41., 42., 43., 44., 45., 46., 47., 48., 49.],
[50., 51., 52., 53., 54., 55., 56., 57., 58., 59.],
[60., 61., 62., 63., 64., 65., 66., 67., 68., 69.],
[70., 71., 72., 73., 74., 75., 76., 77., 78., 79.]], device='cuda:0')
733628 INFO 06-21 13:45:04 [h20_scater_multi_gpu.py:42:main] [Rank 0] received tensor: tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.], device='cuda:0')
733631 INFO 06-21 13:45:05 [h20_scater_multi_gpu.py:42:main] [Rank 3] received tensor: tensor([30., 31., 32., 33., 34., 35., 36., 37., 38., 39.], device='cuda:3')
733630 INFO 06-21 13:45:05 [h20_scater_multi_gpu.py:42:main] [Rank 2] received tensor: tensor([20., 21., 22., 23., 24., 25., 26., 27., 28., 29.], device='cuda:2')
733629 INFO 06-21 13:45:05 [h20_scater_multi_gpu.py:42:main] [Rank 1] received tensor: tensor([10., 11., 12., 13., 14., 15., 16., 17., 18., 19.], device='cuda:1')
Launch on the second node:
(torch2) like@JYTFY-D1-308-H100-C09-4:~/package/torchmo$ NCCL_NVLS_ENABLE=0 CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nnodes=2 --nproc-per-node=4 --node-rank=1 --rdzv-id=123 --rdzv-backend=c10d --rdzv-endpoint=10.157.101.103:29500 temp/h20_scater_multi_gpu.py
W0621 13:44:53.817000 4157747 site-packages/torch/distributed/run.py:792]
W0621 13:44:53.817000 4157747 site-packages/torch/distributed/run.py:792] *****************************************
W0621 13:44:53.817000 4157747 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0621 13:44:53.817000 4157747 site-packages/torch/distributed/run.py:792] *****************************************
4157822 INFO 06-21 13:45:00 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:0,RANK:4,GROUP_RANK:1,ROLE_RANK:4,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
4157824 INFO 06-21 13:45:01 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:2,RANK:6,GROUP_RANK:1,ROLE_RANK:6,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
4157825 INFO 06-21 13:45:01 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:3,RANK:7,GROUP_RANK:1,ROLE_RANK:7,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
4157823 INFO 06-21 13:45:01 [h20_scater_multi_gpu.py:23:main] LOCAL_RANK:1,RANK:5,GROUP_RANK:1,ROLE_RANK:5,LOCAL_WORLD_SIZE:4,WORLD_SIZE:8,GROUP_WORLD_SIZE:2,ROLE_WORLD_SIZE:8
4157823 INFO 06-21 13:45:04 [h20_scater_multi_gpu.py:42:main] [Rank 5] received tensor: tensor([50., 51., 52., 53., 54., 55., 56., 57., 58., 59.], device='cuda:1')
4157825 INFO 06-21 13:45:04 [h20_scater_multi_gpu.py:42:main] [Rank 7] received tensor: tensor([70., 71., 72., 73., 74., 75., 76., 77., 78., 79.], device='cuda:3')
4157822 INFO 06-21 13:45:04 [h20_scater_multi_gpu.py:42:main] [Rank 4] received tensor: tensor([40., 41., 42., 43., 44., 45., 46., 47., 48., 49.], device='cuda:0')
4157824 INFO 06-21 13:45:04 [h20_scater_multi_gpu.py:42:main] [Rank 6] received tensor: tensor([60., 61., 62., 63., 64., 65., 66., 67., 68., 69.], device='cuda:2')
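For reference, the inverse collective is dist.gather (the doc link in the first script points at it). A hedged sketch of collecting each rank's chunk back onto rank 0, reusing rank, world_size, recv_tensor and logger from the script above; as with scatter, the list argument is only meaningful on the root rank:

    # Hedged sketch: gather each rank's (locally modified) chunk back onto rank 0.
    recv_tensor += 1  # stand-in for some per-rank work
    if rank == 0:
        gather_list = [torch.empty_like(recv_tensor) for _ in range(world_size)]
        dist.gather(recv_tensor, gather_list=gather_list, dst=0)
        logger.info(f"[Rank 0] gathered:\n{torch.stack(gather_list)}")
    else:
        dist.gather(recv_tensor, dst=0)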
Environment variable | Meaning |
---|---|
WORLD_SIZE | Total number of processes (one per GPU) across all nodes |
RANK | Global rank of this process across all nodes |
GROUP_WORLD_SIZE | Number of nodes |
GROUP_RANK | Index of the node this process runs on |
LOCAL_WORLD_SIZE | Number of processes (GPUs) on this node |
LOCAL_RANK | Rank of this process within its own node (the local GPU index) |
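These fit together as follows when every node runs the same number of processes (as in the launch commands above); a small sanity check, my own addition:

import os

# Hedged sketch: with homogeneous nodes, torchrun assigns
#   RANK = GROUP_RANK * LOCAL_WORLD_SIZE + LOCAL_RANK
# e.g. rank 5 in the logs above = 1 * 4 + 1.
rank = int(os.environ["RANK"])
group_rank = int(os.environ["GROUP_RANK"])
local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])
assert rank == group_rank * local_world_size + local_rank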
That's all.