[Bug] misaligned address in SyncBuffersHook all_reduce when using bf16 with deepspeed #1557

Open
2 tasks done
SCZwangxiao opened this issue Jun 17, 2024 · 1 comment
Labels
bug Something isn't working

SCZwangxiao (Contributor) commented Jun 17, 2024

Prerequisite

Environment

  • Env in logs:
System environment:
    sys.platform: linux
    Python: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0]
    CUDA available: True
    numpy_random_seed: 950529031
    GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.1, V12.1.105
    GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    PyTorch: 2.1.2+cu121
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.5  (built against CUDA 11.7)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.16.2+cu118
    OpenCV: 4.8.1
    MMEngine: 0.10.2

Runtime environment:
    launcher: pytorch
    randomness: {'seed': None}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8
  • Output of python -c "from mmengine.utils.dl_utils import collect_env; print(collect_env())":
OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1,2,3,4,5,6,7', 'NVIDIA H100 80GB HBM3'), ('CUDA_HOME', '/usr/local/cuda'), ('NVCC', 'Cuda compilation tools, release 12.1, V12.1.105'), ('GCC', 'x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0'), ('PyTorch', '2.1.2+cu121'), ('PyTorch compiling details', 'PyTorch built with:\n  - GCC 9.3\n  - C++ Version: 201703\n  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n  - LAPACK is enabled (usually provided by MKL)\n  - NNPACK is enabled\n  - CPU capability usage: AVX512\n  - CUDA Runtime 12.1\n  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n  - CuDNN 8.5  (built against CUDA 11.7)\n    - Built with CuDNN 8.9.2\n  - Magma 2.6.1\n  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic 
-Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.16.2+cu118'), ('OpenCV', '4.8.1'), ('MMEngine', '0.10.2')])

Reproduces the problem - code sample

This bug is strange, and I have not yet found a minimal reproducible example. Here is what I have observed so far:

  1. The bug appears consistently after I switched machines from A800 to H800. The Docker image is unchanged.
  2. It occurs only under bfloat16.
  3. The misaligned address error occurs only at the end of an epoch, when SyncBuffersHook runs.
  4. The bug disappears when I remove SyncBuffersHook.

I was fine-tuning LLaVA. The model's buffers include the RoPE embeddings.
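Since the failure only shows up when the buffers are all-reduced, one way to narrow it down might be to inspect each buffer on every rank before the hook fires. The following is a minimal diagnostic sketch, not a fix. The helper names `describe_buffers` and `suspicious_buffers` are hypothetical, and the 16-byte alignment threshold is an assumption picked for illustration rather than a documented NCCL requirement.

```python
def describe_buffers(named_buffers, alignment=16):
    """Summarize each buffer's dtype, size, contiguity and address alignment.

    `named_buffers` is any iterable of (name, tensor) pairs, for example
    `model.named_buffers()`. The default `alignment` of 16 bytes is an
    assumption for illustration, not a documented requirement.
    """
    rows = []
    for name, buf in named_buffers:
        ptr = buf.data_ptr()  # address of the first element
        rows.append({
            "name": name,
            "dtype": str(buf.dtype),
            "numel": buf.numel(),
            "nbytes": buf.numel() * buf.element_size(),
            "contiguous": buf.is_contiguous(),
            "aligned": ptr % alignment == 0,
        })
    return rows


def suspicious_buffers(rows):
    """Return the names of buffers that are non-contiguous or misaligned."""
    return [r["name"] for r in rows if not (r["contiguous"] and r["aligned"])]
```

A possible use is to call `suspicious_buffers(describe_buffers(runner.model.named_buffers()))` just before the epoch ends and compare the result between the A800 and H800 machines. Only tensor methods that exist on `torch.Tensor` (`data_ptr`, `numel`, `element_size`, `is_contiguous`) are used, so the helper itself needs no imports.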

Reproduces the problem - command or script

See above

Reproduces the problem - error message

06/17 19:51:14 - mmengine - INFO - Epoch(train) [1][ 8/10]  base_lr: 1.5433e-03 lr: 1.5433e-03  eta: 0:00:03  time: 1.6954  data_time: 0.0657  memory: 20203  image/loss: 8.3296
06/17 19:51:15 - mmengine - INFO - Epoch(train) [1][ 9/10]  base_lr: 7.3223e-04 lr: 7.3223e-04  eta: 0:00:01  time: 1.6479  data_time: 0.0588  memory: 20213  image/loss: 8.2626
06/17 19:51:16 - mmengine - INFO - Exp name: llava_vitl-14_336_7b_pt_CSC_16xb16_zero2_20240617_194817
06/17 19:51:16 - mmengine - INFO - Epoch(train) [1][10/10]  base_lr: 1.9030e-04 lr: 1.9030e-04  eta: 0:00:00  time: 1.6133  data_time: 0.0533  memory: 20310  image/loss: 8.2316

aiplatform-wlf2-ge11-33:19124:19124 [0] enqueue.cc:1087 NCCL WARN Cuda failure 'misaligned address'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 143, in <module>
    main()
  File "tools/train.py", line 139, in main
    runner.train()
  File "/home/wangxiao24/dev_videochat/kvchat/engine/runner/kvchat_runner.py", line 260, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/home/wangxiao24/dev_videochat/kvchat/engine/runner/video_pt_loop.py", line 80, in run_epoch
    self.runner.call_hook('after_train_epoch')
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/hooks/sync_buffer_hook.py", line 42, in after_train_epoch
    all_reduce_params(runner.model.buffers(), op='mean')
  File "/usr/local/lib/python3.8/dist-packages/mmengine/dist/dist.py", line 1160, in all_reduce_params
    _all_reduce_coalesced(params_data, bucket_size_mb, op=op, group=group)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/dist/dist.py", line 1108, in _all_reduce_coalesced
    all_reduce(flat_tensors, op=op, group=group)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/dist/dist.py", line 98, in all_reduce
    torch_dist.all_reduce(data_on_device, _get_reduce_op('sum'), group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 52, in wrapper
    "args": f"{args}, {kwargs}",
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 431, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 664, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 595, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 329, in _tensor_str
    self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
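
The second traceback above appears to be a follow-on failure: it is raised while c10d_logger.py formats the tensor arguments for its own error report, after the CUDA context has already hit the misaligned address. To get closer to the original fault, the NCCL message itself suggests NCCL_DEBUG=INFO, and CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so that errors surface at the offending call. A minimal sketch of setting both from Python follows; the helper name `enable_cuda_debug_env` is hypothetical, and it assumes it runs before the first CUDA call in each worker process.

```python
import os

# Both variables are read when CUDA / NCCL initialize, so they must be set
# before the first CUDA call (simplest: before importing torch).
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",         # suggested by the NCCL error message itself
    "CUDA_LAUNCH_BLOCKING": "1",  # make kernel launches synchronous
}


def enable_cuda_debug_env(environ=os.environ):
    """Set the debug variables without overriding values already present.

    Returns a dict of the variables that were actually added.
    """
    applied = {}
    for key, value in DEBUG_ENV.items():
        if key not in environ:
            environ[key] = value
            applied[key] = value
    return applied
```

Calling `enable_cuda_debug_env()` at the very top of tools/train.py would be one option; exporting the same two variables in the shell before launching has the same effect. Synchronous launches slow training down noticeably, so this is only meant for a short reproduction run.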

Additional information

No response

SCZwangxiao added the bug label Jun 17, 2024
@fangchuan

Hi Xiao, have you fixed this error? I hit the same error message when using Accelerate integrated with DeepSpeed, and I could not find a working solution. Do you have any update?
